Golang Core Programming (9): Crawling CSDN Home Page Articles with net/http and goquery


For more articles on Golang core programming, see: Golang Core Programming (0): Table of Contents


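goquery is a third-party library commonly used for writing crawlers in Go. Its main job is to parse an HTML document and filter out the content we need. goquery is the jQuery of the Go world: its selectors work much like jQuery's, so if you have used jQuery you will find it easy to pick up.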

1. Installing the goquery library

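The installation steps are well documented online, but you may run into the following problem:
Error: package golang.org/x/net/websocket: unrecognized import path
The cause is that the golang.org/x/net package is missing locally. The post below describes a fix: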

https://blog.csdn.net/qq_31967569/article/details/81060525

2. Using goquery

Two articles online already explain this clearly, so I won't repeat them here; readers can look them up.

3. Crawling the CSDN home page articles

3.1 Requirements analysis

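First, open the CSDN home page and decide which data we want to crawl.
This time I plan to collect four fields for each article on the home page: its title, URL, author, and read count.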

3.2 Analyzing the HTML of the current page

Right-click the page and view its source, then copy the source into a tool that makes it easier to read. Here I use Notepad++ to analyze the HTML document.

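  • 1. First, locate the article title

It quickly becomes clear that the article title is wrapped in an a tag, and that every article entry is rendered in a loop as an element with the class list_con. So the goquery selector for the article title can be written like this: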

document.Find("#feedlist_id").Find("div.list_con").Each(func(i int, selection *goquery.Selection) {
	name1 := selection.Find("div.title h2 a")
	fmt.Printf("name is :%v\n", name1.Text())
})

This prints the titles of all the articles, which completes the first step.

  • 2. Locate the article URL

document.Find("#feedlist_id").Find("div.list_con").Each(func(i int, selection *goquery.Selection) {
	url, _ := selection.Find(".read_num").Find("a").Attr("href")
	fmt.Printf("url is :%v\n", url)
})

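Here, once the a tag is located, the Attr method reads the URL from its href attribute. Attr also returns a boolean reporting whether the attribute exists, which this snippet discards.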

  • 3. Locate the article author

document.Find("#feedlist_id").Find("div.list_con").Each(func(i int, selection *goquery.Selection) {
	author := selection.Find(".name").Find("a")
	fmt.Printf("author is :%v\n", author.Text())
})


  • 4. Locate the article read count

document.Find("#feedlist_id").Find("div.list_con").Each(func(i int, selection *goquery.Selection) {
	numbers := selection.Find(".read_num").Find("a span.num")
	fmt.Printf("numbers is :%v\n", numbers.Text())
})


4. The complete crawler program

package main

import (
	"fmt"
	"log"
	"net/http"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// Fetch the CSDN home page.
	response, err := http.Get("https://www.csdn.net/")
	if err != nil {
		log.Printf("request failed: %v", err)
		return
	}
	// Register the close right away so the body is released on every
	// return path, including a parse failure below.
	defer response.Body.Close()
	if response.StatusCode != http.StatusOK {
		log.Printf("unexpected status: %s", response.Status)
		return
	}

	// Parse the response body into a goquery document.
	document, err := goquery.NewDocumentFromReader(response.Body)
	if err != nil {
		log.Printf("parsing HTML failed: %v", err)
		return
	}

	// Walk every article entry and print its fields.
	document.Find("#feedlist_id").Find("div.list_con").Each(func(i int, selection *goquery.Selection) {
		name1 := selection.Find("div.title h2 a")
		url, _ := selection.Find(".read_num").Find("a").Attr("href")
		author := selection.Find(".name").Find("a")
		numbers := selection.Find(".read_num").Find("a span.num")

		fmt.Printf("index is :%d|name is :%v|author is :%v|numbers is :%v|url is :%v \n", i, strings.TrimSpace(name1.Text()), strings.TrimSpace(author.Text()), numbers.Text(), strings.TrimSpace(url))
	})
}

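The program successfully crawls the article data from the CSDN home page. From here the data can be written to a file or a database, and if performance needs improving the crawler can be changed to use multiple goroutines. Interested readers can explore this further!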

index is :0|name is :Python爬取抖音APP,竟然只需要十行代碼|author is :嬌兮心有之|numbers is :2262|url is :https://blog.csdn.net/qq_40925239/article/details/83786958 
index is :1|name is :千萬別做老闆最不能容忍的三種人 z|author is :這個也很漂亮|numbers is :1712|url is :https://blog.csdn.net/hdfghh/article/details/83955147 
index is :2|name is :程序員曬出小學兒子滿分作文《我的爸爸》,真實的讓人心疼|author is :taya_a|numbers is :1642|url is :https://blog.csdn.net/taya_a/article/details/83958356 
index is :3|name is :騰訊 阿里 華爲的崗位薪資情況概述|author is :小風花|numbers is :2114|url is :https://blog.csdn.net/hdfyhf/article/details/83931804 
index is :4|name is :震驚,20年開發經驗的技術總監不會搭建Java開發環境|author is :Java填坑之路|numbers is :3733|url is :https://blog.csdn.net/yelvgou9995/article/details/83961061 
index is :5|name is :在操作系統、芯片領域跌倒的中國程序員,如何崛起?|author is :殘留的淡影|numbers is :835|url is :https://blog.csdn.net/weixin_43587861/article/details/83958910 
index is :6|name is :剛寫完排序算法,就被開除了…|author is :Java技術棧|numbers is :605|url is :https://blog.csdn.net/youanyyou/article/details/84026290 
index is :7|name is :有個程序員男友是什麼感覺?女網友:連約個會都要處理BUG!|author is :不玩代碼的一鳴|numbers is :2051|url is :https://blog.csdn.net/weixin_43338842/article/details/83932502 
index is :8|name is :程序員吐槽阿里加班文化上班太累,網友:做程序員這也算高強度?|author is :不玩代碼的一鳴|numbers is :1502|url is :https://blog.csdn.net/weixin_43338842/article/details/83932471 
index is :9|name is :sql 存儲過程|author is :樹葉子hza|numbers is :1712|url is :https://blog.csdn.net/hza419763578/article/details/83961826 
index is :10|name is :【軟件設計師】——總結|author is :邢美玲|numbers is :233|url is :https://blog.csdn.net/xml1996/article/details/83959290 
index is :11|name is :虛擬機和Docker的最大區別|author is :JerryWangSAP|numbers is :467|url is :https://blog.csdn.net/i042416/article/details/84034510 
index is :12|name is :快進來看程序員風格的修真小說!|author is :Java填坑之路|numbers is :485|url is :https://blog.csdn.net/yelvgou9995/article/details/84067063 
index is :13|name is :ORACLE/MYSQL數據庫的常用SQL命令|author is :SunJW_2017|numbers is :399|url is :https://blog.csdn.net/SunJW_2017/article/details/84023425 
index is :14|name is :ERP工程師的職責是什麼|author is :這個也很漂亮|numbers is :432|url is :https://blog.csdn.net/hdfghh/article/details/84059994 
index is :15|name is :springMVC學習心得及手寫springMVC簡單實現|author is :棒叔叔|numbers is :245|url is :https://blog.csdn.net/qq_41785135/article/details/83781493 
index is :16|name is :自動化運維一體化|author is :Stestack|numbers is :585|url is :https://blog.csdn.net/Stestack/article/details/83963083 
index is :17|name is :#程序員式幽默趣圖!從高的職業,現實的殘酷!|author is :javam16|numbers is :240|url is :https://blog.csdn.net/javam16/article/details/83957962 
index is :18|name is :Springboot實現用戶登錄|author is :HOWSUNSHINE|numbers is :421|url is :https://blog.csdn.net/HOWSUNSHINE/article/details/83988456 
index is :19|name is :多臺SQLServer數據實時同步|author is :weixin_37691493|numbers is :296|url is :https://blog.csdn.net/weixin_37691493/article/details/83960586 
index is :20|name is :一名年薪百萬阿里P8架構師寫給Java程序員一些建議(架構師必備)|author is :M阿|numbers is :420|url is :https://blog.csdn.net/yupi1057/article/details/84068697 
index is :21|name is :在 Java 中初始化 List 的五種方法|author is :Java填坑之路|numbers is :310|url is :https://blog.csdn.net/yelvgou9995/article/details/83933095 
index is :22|name is :徐小平 不做人生規劃,你離捱餓只有三天|author is :這個也很漂亮|numbers is :274|url is :https://blog.csdn.net/hdfghh/article/details/83955208 
index is :23|name is :Spring 的體系結構|author is :mukes|numbers is :180|url is :https://blog.csdn.net/mukes/article/details/84071658 
index is :24|name is :淺淡XSS跨站腳本攻擊的防禦方法|author is :白帽夢想家|numbers is :176|url is :https://blog.csdn.net/sdb5858874/article/details/84033195 
index is :25|name is :是否在公司裏 老闆叫你做什麼 就做什麼的總結|author is :牛仔褲新的|numbers is :204|url is :https://blog.csdn.net/jgfyyfd/article/details/83935051 
index is :26|name is :長相一般的普通程序員怎麼找到喜歡程序員的妹子做女友?|author is :北辰丶|numbers is :175|url is :https://blog.csdn.net/qq_43093708/article/details/83933576 
index is :27|name is :棧的基本函數C++實現|author is :liaolian1|numbers is :154|url is :https://blog.csdn.net/liaolian1/article/details/84074829 
index is :28|name is :Java併發——阻塞隊列|author is :Crazy_CMT|numbers is :168|url is :https://blog.csdn.net/qq_38386085/article/details/84035841 