Java Spring + MyBatis crawler: scraping funny animated images from Toutiao (detailed)

First, the results

The scraped animated images:

[Image]

The database:

[Image]

1. About this crawler

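Toutiao is itself essentially a crawler: it collects images and text from major websites, aggregates them, and pushes the result to its users. The animated images in particular are a lot of fun. Searching online, I found that most Toutiao crawlers are written in Python. My background is Java web development and I am not very comfortable with regular expressions, so I wondered whether I could write one with tools I already know. This crawler is built on Spring integrated with MyBatis and stores the scraped data in MySQL. It uses jsoup to walk the HTML tag nodes (which neatly avoids regular expressions) and collect the links to the animated images in each article, determines each image's format from the "Content-Type" response header, and then saves the image locally. The article text, such as funny off-color jokes, can be scraped as well with a few small changes. This crawler is only meant as an entry-level starting point; there are plenty of more interesting things to build on top of it.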

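2. Technology stack

  1. Core language: Java;
  2. Core framework: Spring;
  3. Persistence framework: MyBatis;
  4. Database connection pool: Alibaba Druid;
  5. Logging: Log4j;
  6. Dependency (jar) management: Maven; and so on.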

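3. Finding the pattern

Open the Toutiao home page, click the "Funny" section, and press F12 to open the developer tools. Scroll down until the next page loads: the data is fetched through an AJAX request to an API, as shown below:

[Image]

This is the JSON response; the parameter names and values are self-explanatory.

[Image]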

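Since the page loads its data through AJAX, the problem becomes much easier. After a fair amount of searching and experimenting, I found that the first three parameters of the request never change. The category parameter selects the section to request; this example requests the funny section, so its value is funny. The max_behot_time and max_behot_time_tmp parameters are timestamps: they are 0 on the first request, and on every later request they take the value found under next in the previous JSON response. The as and cp values are generated by a piece of JavaScript. They are essentially just an obfuscated timestamp: the current time in seconds, written in hexadecimal and interleaved with characters from its MD5 hash. The JavaScript code is listed further down.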
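To make the parameter layout concrete, here is a minimal sketch that assembles the request URL from the pieces described above. The class and method names (FeedUrlBuilder, build) are hypothetical and only for illustration; the base address and parameter names are the ones used in this article, and the as/cp strings passed in main are placeholders rather than real values.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Sketch (hypothetical helper): assembles the feed API URL from its parameters. */
final class FeedUrlBuilder {

    // Base address with the fixed parameters, as used by the crawler in this article.
    static final String FEED_API =
            "http://www.toutiao.com/api/pc/feed/?utm_source=toutiao&widen=1";

    /**
     * @param category     section to request, e.g. "funny"
     * @param maxBehotTime 0 for the first request, afterwards the value under "next"
     * @param as           value produced by the site's JavaScript
     * @param cp           value produced by the site's JavaScript
     */
    static String build(String category, long maxBehotTime, String as, String cp) {
        return FEED_API
                + "&max_behot_time=" + maxBehotTime
                + "&max_behot_time_tmp=" + maxBehotTime
                + "&as=" + encode(as)
                + "&cp=" + encode(cp)
                + "&category=" + encode(category);
    }

    // URL-encodes a query value; UTF-8 is always supported by the JVM.
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // First request: the timestamp cursor is 0. The as/cp values are placeholders.
        System.out.println(build("funny", 0L, "AS_PLACEHOLDER", "CP_PLACEHOLDER"));
    }
}
```

Keeping the URL assembly in one place makes it easy to switch sections (for example __all__ or news_hot) without touching the crawling loop.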

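4. Setting up the project and writing the code

Once the project is set up, the file structure looks like the figure below. If anything is unclear, a quick search will explain it.

[Image]

Without further ado, here is the core code: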

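import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Date;
import java.util.UUID;
import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

// Assumption: the JSON calls used below match Alibaba fastjson.
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;
// Funny and FunnyMapper are this project's own entity and MyBatis mapper; import them from your own packages.

public class TouTiaoCrawler {

	// API address used for the funny section (the category parameter is appended later)
	public static final String FUNNY = "http://www.toutiao.com/api/pc/feed/?utm_source=toutiao&widen=1";

	// Toutiao home page address
	public static final String TOUTIAO = "http://www.toutiao.com";

	// Create the Spring context from the "spring-mybatis.xml" configuration file
	static ApplicationContext ac = new ClassPathXmlApplicationContext(
			"spring-mybatis.xml");

	// Look up the funnyMapper bean by its id in the Spring container
	static FunnyMapper funnyMapper = (FunnyMapper) ac.getBean("funnyMapper");

	// Number of API requests made so far
	private static int refreshCount = 0;

	// Timestamp cursor for the next request
	private static long time = 0;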

	public static void main(String[] args) {
		System.out.println("----------Starting work!-----------------");
		while (true) {
			crawler(time);
		}
	}

	// Fetches one page of the feed for the given timestamp cursor and stores its articles
	public static void crawler(long hottime) {
		refreshCount++;
		System.out.println("----------Refresh #" + refreshCount
				+ "------request timestamp: " + hottime + "----------");
		String url = FUNNY + "&max_behot_time=" + hottime
				+ "&max_behot_time_tmp=" + hottime;
		JSONObject param = getUrlParam(); // as and cp values computed by the JavaScript
		if (param == null) {
			// Fail with a clear message instead of a bare NullPointerException
			throw new IllegalStateException("Could not compute the as/cp parameters");
		}
		// Section to request
		/*
		 * __all__ : recommended, news_hot : trending, funny : funny
		 */
		String module = "funny";
		url += "&as=" + param.get("as") + "&cp=" + param.get("cp")
				+ "&category=" + module;
		JSONObject json = null;
		try {
			json = getReturnJson(url); // fetch the JSON response
		} catch (Exception e) {
			e.printStackTrace();
		}
		if (json != null) {
			// Remember the cursor for the next request
			JSONObject next = json.getJSONObject("next");
			if (next != null) {
				time = next.getLongValue("max_behot_time");
			}
			JSONArray data = json.getJSONArray("data");
			for (int i = 0; data != null && i < data.size(); i++) {
				try {
					JSONObject obj = data.getJSONObject(i);
					// getString also works when group_id arrives as a number
					String groupId = obj.getString("group_id");
					// Skip articles that have already been crawled
					if (funnyMapper.selectByGroupId(groupId) != null) {
						System.out
								.println("----------Article already crawled!-----------------");
						continue;
					}
					// Fetch the article page as a jsoup Document
					String url1 = TOUTIAO + "/a" + groupId;
					Document document = getArticleInfo(url1);
					if (document == null) {
						continue; // the failure was already logged by getArticleInfo
					}
					System.out.println("----------Fetched article: " + url1
							+ "-----------------");
					// Store the page HTML alongside the other fields
					obj.put("document", document.toString());
					// Convert the JSON object into the Funny entity
					Funny funny = JSON.parseObject(obj.toString(), Funny.class);
					// Save to the database
					funny.setBehotTime(new Date());
					funnyMapper.insertSelective(funny);
				} catch (Exception e) {
					e.printStackTrace();
				}
			}
		} else {
			System.out.println("----------The returned JSON was empty----------");
		}
	}

	// Calls the API and returns the response parsed as JSON (null on failure)
	public static JSONObject getReturnJson(String url) {
		// try-with-resources closes the reader even when reading fails
		try (BufferedReader in = new BufferedReader(new InputStreamReader(
				new URL(url).openStream(), "UTF-8"))) {
			StringBuilder content = new StringBuilder();
			String line;
			while ((line = in.readLine()) != null) {
				content.append(line);
			}
			return JSONObject.parseObject(content.toString());
		} catch (Exception e) {
			System.err.println("Request failed: " + url);
			e.printStackTrace();
		}
		return null;
	}

	// Fetches an article page, downloads its images, and returns the jsoup Document
	public static Document getArticleInfo(String url) {
		try {
			Connection connect = Jsoup.connect(url);
			Document document = connect.get();
			Elements article = document.getElementsByClass("article-content");
			if (article.size() > 0) {
				Elements images = article.get(0).getElementsByTag("img");
				for (Element img : images) {
					// absUrl resolves relative and protocol-relative src values
					String url2 = img.absUrl("src");
					// Download the image referenced by the img tag
					saveToFile(url2);
				}
			}
			return document;
		} catch (IOException e) {
			System.err.println("Failed to fetch article page: " + url
					+ "  reason: " + e.getMessage());
			return null;
		}
	}

	// Runs toutiao.js and returns the as and cp values as JSON (null on failure)
	public static JSONObject getUrlParam() {
		JSONObject jsonObject = null;
		String jsFileName = "toutiao.js"; // script file, resolved against the working directory
		// try-with-resources closes the reader even if the script throws
		try (FileReader reader = new FileReader(jsFileName)) {
			ScriptEngineManager manager = new ScriptEngineManager();
			// Note: the JDK's built-in JavaScript engine (Nashorn) was removed in Java 15;
			// on newer JDKs this lookup returns null unless an engine is added as a dependency.
			ScriptEngine engine = manager.getEngineByName("javascript");
			if (engine == null) {
				System.err.println("No JavaScript engine available");
				return null;
			}
			engine.eval(reader); // load the script

			if (engine instanceof Invocable) {
				Invocable invoke = (Invocable) engine;
				Object obj = invoke.invokeFunction("getParam");
				jsonObject = JSONObject.parseObject(obj != null ? obj
						.toString() : null);
			}
		} catch (Exception e) {
			e.printStackTrace();
		}
		return jsonObject;
	}

	// Downloads the image at destUrl and saves it to the local disk
	public static void saveToFile(String destUrl) {
		String uuid = UUID.randomUUID().toString();
		String fileAddress = "d:\\imag/" + uuid; // local file path; the directory must already exist
		byte[] buf = new byte[1024];
		HttpURLConnection httpUrl = null;
		try {
			httpUrl = (HttpURLConnection) new URL(destUrl).openConnection();
			httpUrl.connect();
			// Choose the file extension from the Content-Type response header
			String type = httpUrl.getHeaderField("Content-Type"); // may be null
			if ("image/gif".equals(type)) {
				fileAddress += ".gif";
			} else if ("image/png".equals(type)) {
				fileAddress += ".png";
			} else if ("image/jpeg".equals(type)) {
				fileAddress += ".jpg";
			} else {
				System.err.println("Unknown image format: " + type);
				return;
			}
			// try-with-resources closes both streams, even on failure
			try (InputStream in = new BufferedInputStream(httpUrl.getInputStream());
					FileOutputStream fos = new FileOutputStream(fileAddress)) {
				int size;
				while ((size = in.read(buf)) != -1) {
					fos.write(buf, 0, size);
				}
			}
			System.out.println("Image saved! Path: " + fileAddress);
		} catch (IOException | ClassCastException e) {
			e.printStackTrace();
		} finally {
			if (httpUrl != null) {
				httpUrl.disconnect();
			}
		}
	}
}
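
The saveToFile method above picks the file extension by comparing the Content-Type header against exact strings, so a header that carries extra parameters or different casing would be rejected. The following is a minimal sketch of a more tolerant mapping. The class and method names (ImageTypes, extensionFor) are hypothetical, and the set of supported types is simply the three formats the crawler already handles.

```java
import java.util.Locale;

/** Sketch (hypothetical helper): maps a Content-Type header value to a file extension. */
final class ImageTypes {

    /** Returns ".gif", ".png" or ".jpg", or null when the type is missing or unsupported. */
    static String extensionFor(String contentType) {
        if (contentType == null) {
            return null; // the server sent no Content-Type header
        }
        // Drop parameters such as "; charset=UTF-8", then normalise whitespace and case.
        int semicolon = contentType.indexOf(';');
        String mediaType = (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        switch (mediaType) {
            case "image/gif":
                return ".gif";
            case "image/png":
                return ".png";
            case "image/jpeg":
                return ".jpg";
            default:
                return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(extensionFor("image/gif"));
        System.out.println(extensionFor("IMAGE/JPEG; charset=binary"));
        System.out.println(extensionFor("text/html"));
    }
}
```

A caller would skip the download when the method returns null, which covers both a missing header and a non-image response.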

The JavaScript code that produces the as and cp parameters:

function getParam(){
    var asas;
    var cpcp;
    var t = Math.floor((new Date).getTime() / 1e3)
      , e = t.toString(16).toUpperCase()
      , i = md5(t).toString().toUpperCase();
    if (8 != e.length){
        asas = "479BB4B7254C150";
        cpcp = "7E0AC8874BB0985";
    }else{
        for (var n = i.slice(0, 5), o = i.slice(-5), a = "", s = 0; 5 > s; s++){
            a += n[s] + e[s];
        }
        for (var r = "", c = 0; 5 > c; c++){
            r += e[c + 3] + o[c];
        }
        asas = "A1" + a + e.slice(-3);
        cpcp= e.slice(0, 3) + r + "E1";
    }
    return '{"as":"'+asas+'","cp":"'+cpcp+'"}';
}

!function(e) {
    "use strict";
    function t(e, t) {
        var n = (65535 & e) + (65535 & t)
          , r = (e >> 16) + (t >> 16) + (n >> 16);
        return r << 16 | 65535 & n
    }
    function n(e, t) {
        return e << t | e >>> 32 - t
    }
    function r(e, r, o, i, a, u) {
        return t(n(t(t(r, e), t(i, u)), a), o)
    }
    function o(e, t, n, o, i, a, u) {
        return r(t & n | ~t & o, e, t, i, a, u)
    }
    function i(e, t, n, o, i, a, u) {
        return r(t & o | n & ~o, e, t, i, a, u)
    }
    function a(e, t, n, o, i, a, u) {
        return r(t ^ n ^ o, e, t, i, a, u)
    }
    function u(e, t, n, o, i, a, u) {
        return r(n ^ (t | ~o), e, t, i, a, u)
    }
    function s(e, n) {
        e[n >> 5] |= 128 << n % 32,
        e[(n + 64 >>> 9 << 4) + 14] = n;
        var r, s, c, l, f, p = 1732584193, d = -271733879, h = -1732584194, m = 271733878;
        for (r = 0; r < e.length; r += 16)
            s = p,
            c = d,
            l = h,
            f = m,
            p = o(p, d, h, m, e[r], 7, -680876936),
            m = o(m, p, d, h, e[r + 1], 12, -389564586),
            h = o(h, m, p, d, e[r + 2], 17, 606105819),
            d = o(d, h, m, p, e[r + 3], 22, -1044525330),
            p = o(p, d, h, m, e[r + 4], 7, -176418897),
            m = o(m, p, d, h, e[r + 5], 12, 1200080426),
            h = o(h, m, p, d, e[r + 6], 17, -1473231341),
            d = o(d, h, m, p, e[r + 7], 22, -45705983),
            p = o(p, d, h, m, e[r + 8], 7, 1770035416),
            m = o(m, p, d, h, e[r + 9], 12, -1958414417),
            h = o(h, m, p, d, e[r + 10], 17, -42063),
            d = o(d, h, m, p, e[r + 11], 22, -1990404162),
            p = o(p, d, h, m, e[r + 12], 7, 1804603682),
            m = o(m, p, d, h, e[r + 13], 12, -40341101),
            h = o(h, m, p, d, e[r + 14], 17, -1502002290),
            d = o(d, h, m, p, e[r + 15], 22, 1236535329),
            p = i(p, d, h, m, e[r + 1], 5, -165796510),
            m = i(m, p, d, h, e[r + 6], 9, -1069501632),
            h = i(h, m, p, d, e[r + 11], 14, 643717713),
            d = i(d, h, m, p, e[r], 20, -373897302),
            p = i(p, d, h, m, e[r + 5], 5, -701558691),
            m = i(m, p, d, h, e[r + 10], 9, 38016083),
            h = i(h, m, p, d, e[r + 15], 14, -660478335),
            d = i(d, h, m, p, e[r + 4], 20, -405537848),
            p = i(p, d, h, m, e[r + 9], 5, 568446438),
            m = i(m, p, d, h, e[r + 14], 9, -1019803690),
            h = i(h, m, p, d, e[r + 3], 14, -187363961),
            d = i(d, h, m, p, e[r + 8], 20, 1163531501),
            p = i(p, d, h, m, e[r + 13], 5, -1444681467),
            m = i(m, p, d, h, e[r + 2], 9, -51403784),
            h = i(h, m, p, d, e[r + 7], 14, 1735328473),
            d = i(d, h, m, p, e[r + 12], 20, -1926607734),
            p = a(p, d, h, m, e[r + 5], 4, -378558),
            m = a(m, p, d, h, e[r + 8], 11, -2022574463),
            h = a(h, m, p, d, e[r + 11], 16, 1839030562),
            d = a(d, h, m, p, e[r + 14], 23, -35309556),
            p = a(p, d, h, m, e[r + 1], 4, -1530992060),
            m = a(m, p, d, h, e[r + 4], 11, 1272893353),
            h = a(h, m, p, d, e[r + 7], 16, -155497632),
            d = a(d, h, m, p, e[r + 10], 23, -1094730640),
            p = a(p, d, h, m, e[r + 13], 4, 681279174),
            m = a(m, p, d, h, e[r], 11, -358537222),
            h = a(h, m, p, d, e[r + 3], 16, -722521979),
            d = a(d, h, m, p, e[r + 6], 23, 76029189),
            p = a(p, d, h, m, e[r + 9], 4, -640364487),
            m = a(m, p, d, h, e[r + 12], 11, -421815835),
            h = a(h, m, p, d, e[r + 15], 16, 530742520),
            d = a(d, h, m, p, e[r + 2], 23, -995338651),
            p = u(p, d, h, m, e[r], 6, -198630844),
            m = u(m, p, d, h, e[r + 7], 10, 1126891415),
            h = u(h, m, p, d, e[r + 14], 15, -1416354905),
            d = u(d, h, m, p, e[r + 5], 21, -57434055),
            p = u(p, d, h, m, e[r + 12], 6, 1700485571),
            m = u(m, p, d, h, e[r + 3], 10, -1894986606),
            h = u(h, m, p, d, e[r + 10], 15, -1051523),
            d = u(d, h, m, p, e[r + 1], 21, -2054922799),
            p = u(p, d, h, m, e[r + 8], 6, 1873313359),
            m = u(m, p, d, h, e[r + 15], 10, -30611744),
            h = u(h, m, p, d, e[r + 6], 15, -1560198380),
            d = u(d, h, m, p, e[r + 13], 21, 1309151649),
            p = u(p, d, h, m, e[r + 4], 6, -145523070),
            m = u(m, p, d, h, e[r + 11], 10, -1120210379),
            h = u(h, m, p, d, e[r + 2], 15, 718787259),
            d = u(d, h, m, p, e[r + 9], 21, -343485551),
            p = t(p, s),
            d = t(d, c),
            h = t(h, l),
            m = t(m, f);
        return [p, d, h, m]
    }
    function c(e) {
        var t, n = "";
        for (t = 0; t < 32 * e.length; t += 8)
            n += String.fromCharCode(e[t >> 5] >>> t % 32 & 255);
        return n
    }
    function l(e) {
        var t, n = [];
        for (n[(e.length >> 2) - 1] = void 0,
        t = 0; t < n.length; t += 1)
            n[t] = 0;
        for (t = 0; t < 8 * e.length; t += 8)
            n[t >> 5] |= (255 & e.charCodeAt(t / 8)) << t % 32;
        return n
    }
    function f(e) {
        return c(s(l(e), 8 * e.length))
    }
    function p(e, t) {
        var n, r, o = l(e), i = [], a = [];
        for (i[15] = a[15] = void 0,
        o.length > 16 && (o = s(o, 8 * e.length)),
        n = 0; 16 > n; n += 1)
            i[n] = 909522486 ^ o[n],
            a[n] = 1549556828 ^ o[n];
        return r = s(i.concat(l(t)), 512 + 8 * t.length),
        c(s(a.concat(r), 640))
    }
    function d(e) {
        var t, n, r = "0123456789abcdef", o = "";
        for (n = 0; n < e.length; n += 1)
            t = e.charCodeAt(n),
            o += r.charAt(t >>> 4 & 15) + r.charAt(15 & t);
        return o
    }
    function h(e) {
        return unescape(encodeURIComponent(e))
    }
    function m(e) {
        return f(h(e))
    }
    function g(e) {
        return d(m(e))
    }
    function v(e, t) {
        return p(h(e), h(t))
    }
    function y(e, t) {
        return d(v(e, t))
    }
    function b(e, t, n) {
        return t ? n ? v(t, e) : y(t, e) : n ? m(e) : g(e)
    }
    "function" == typeof define && define.amd ? define("static/js/lib/md5", ["require"], function() {
        return b
    }) : "object" == typeof module && module.exports ? module.exports = b : e.md5 = b
}(this)
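
Because getParam() only needs a timestamp and an MD5 hash, the same values can be computed in plain Java without a script engine. The sketch below is my own port of the getParam() function above, not something taken from the original project; the class name AsCpParams is hypothetical, and it has not been checked against the live service.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

/** Sketch (hypothetical class): a plain-Java port of the getParam() script above. */
final class AsCpParams {

    final String as;
    final String cp;

    private AsCpParams(String as, String cp) {
        this.as = as;
        this.cp = cp;
    }

    /** Computes as/cp for a timestamp given in seconds since the epoch. */
    static AsCpParams forEpochSeconds(long epochSeconds) {
        String e = Long.toHexString(epochSeconds).toUpperCase(Locale.ROOT);
        if (e.length() != 8) {
            // Same fixed fallback values as the JavaScript version.
            return new AsCpParams("479BB4B7254C150", "7E0AC8874BB0985");
        }
        // The script hashes the decimal string form of the timestamp.
        String i = md5HexUpper(Long.toString(epochSeconds));
        String n = i.substring(0, 5);            // first five hash characters
        String o = i.substring(i.length() - 5);  // last five hash characters
        StringBuilder a = new StringBuilder();
        StringBuilder r = new StringBuilder();
        for (int s = 0; s < 5; s++) {
            a.append(n.charAt(s)).append(e.charAt(s));      // hash char, then hex char
            r.append(e.charAt(s + 3)).append(o.charAt(s));  // hex char, then hash char
        }
        return new AsCpParams("A1" + a + e.substring(5), e.substring(0, 3) + r + "E1");
    }

    // Upper-case hexadecimal MD5 digest of the UTF-8 bytes of the text.
    private static String md5HexUpper(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02X", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException("MD5 is not available", ex);
        }
    }

    public static void main(String[] args) {
        AsCpParams p = forEpochSeconds(System.currentTimeMillis() / 1000L);
        System.out.println("as=" + p.as + " cp=" + p.cp);
    }
}
```

With a helper like this, getUrlParam() would no longer need to read toutiao.js from disk, which also sidesteps the absence of a built-in JavaScript engine on newer JDKs.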

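5. Finally

I also found that Toutiao has a "lite" version, and after looking into it, it seems easier to crawl.

[Image]

Its URLs take the form p plus the page number. You can read the links on each page directly and crawl them, with no need to get article addresses from the JSON response or to pass any restricting parameters. Only small changes to this project are needed.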

[Image]

[Image]

6. JUST DO IT

......
