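I recently joined an education company and was asked to scrape news from several schools into a database, to give the pages of the company's public-facing system richer content. What follows is a simple tutorial.
1. Configuration file
The applicationContext file looks like this (the enclosing <beans> root element is omitted):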
<?xml version="1.0" encoding="UTF-8"?>
<bean id="tellingTheTimeService" class="service.impl.TellingTheTimeServiceImpl"/>
<!-- Configure a Job -->
<bean id="tellTheTimeJob" class="org.springframework.scheduling.quartz.JobDetailBean">
<property name="jobClass" value="jobs.TellingTheTimeJob"/>
<property name="jobDataAsMap">
<map>
<entry key="tellingTheTimeService" value-ref="tellingTheTimeService"></entry>
</map>
</property>
</bean>
<!-- Simple trigger -->
<bean id="simpleTellTheTimeTrigger" class="org.springframework.scheduling.quartz.SimpleTriggerBean">
<property name="jobDetail">
<ref bean="tellTheTimeJob" />
</property>
<!-- In milliseconds: the value below (1000) fires 1 second after startup; use 60000 for one minute -->
<property name="startDelay">
<value>1000</value>
</property>
<!-- Also in milliseconds: the value below (600000) repeats every 10 minutes; use 60000 for one minute -->
<property name="repeatInterval">
<value>600000</value>
</property>
</bean>
<!-- Complex (cron-based) trigger -->
<bean id="complexTellTheTimeTrigger" class="org.springframework.scheduling.quartz.CronTriggerBean">
<property name="jobDetail">
<ref bean="tellTheTimeJob"/>
</property>
<property name="cronExpression">
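<!-- This is the key part: a custom cron expression controls when the job fires. The expression below fires every 5 seconds; for once a minute use 0 * * * * ? -->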
<value>*/5 * * * * ?</value>
</property>
</bean>
<!-- Spring scheduler factory -->
<bean class="org.springframework.scheduling.quartz.SchedulerFactoryBean">
<property name="triggers">
<list>
<!--<ref bean="complexTellTheTimeTrigger"/>
--><ref bean="simpleTellTheTimeTrigger"/>
<!-- ... more triggers can be added below -->
</list>
</property>
</bean>
The rest of the file is database-connection configuration; set it up to suit your own environment.
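Because startDelay and repeatInterval are plain millisecond counts, it is easy to be off by a factor of 10 or 60. The following is a minimal sketch, using only the JDK, of deriving those numbers from readable units. The class name TriggerIntervals is hypothetical and not part of the project.

```java
import java.time.Duration;

// Hypothetical helper, not part of the original project.
final class TriggerIntervals {

    /** Converts a Duration to the millisecond value used by startDelay / repeatInterval. */
    static long toMillis(Duration d) {
        return d.toMillis();
    }

    public static void main(String[] args) {
        // The two values used in the configuration above
        System.out.println("1 second   = " + toMillis(Duration.ofSeconds(1)) + " ms");
        System.out.println("10 minutes = " + toMillis(Duration.ofMinutes(10)) + " ms");
        // The value to use for a one-minute delay or interval
        System.out.println("1 minute   = " + toMillis(Duration.ofMinutes(1)) + " ms");
    }
}
```

So the 1000 and 600000 in the configuration correspond to 1 second and 10 minutes respectively.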
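2. Understand the structure of the page to scrape
Page elements are located with jsoup, so the selectors have to be adjusted for each site.
Suppose we want to scrape the college news module of this page. Where do we start?
Elements on a page usually carry an id, which makes them easy to locate. This page is more complicated: that approach does not work here, and the class names are repeated across the page. In the end I had no choice but to start from the individual news links: doc.getElementsByAttributeValue("class", "c50319")
From each link, .parent() returns its parent element, and .nextElementSibling() on that parent returns the next sibling element, which holds the date. (To scrape fluently you need to be familiar with jsoup's API.)
The complete code is below.
Core code: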
package utils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import entity.News;
public class JsoupTest
{
// public static String url1 = "http://www.zjtongji.edu.cn/xyxw.htm?tdsourcetag=s_pcqq_aiomsg";
// public static String url2 = "http://www.zjtongji.edu.cn/ysdwxx.htm?tdsourcetag=s_pcqq_aiomsg";
public static String url2 = "http://www.tjmc.edu.cn/";
public static void getNewsContent()
{
// List<String> urls = new ArrayList<String>();
// urls.add(url2);
// urls.add(url1);
Document doc = null;
try {
doc = Jsoup.connect(url2).timeout(5000).get();
Elements list = doc.getElementsByAttributeValue("class", "c50319");
for(Element e : list) {
Element time = e.parent().nextElementSibling();
String times = time.text();
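System.out.println("News title: " + e.attr("title"));
// The list shows the date as "[MM-dd]"; characters 1-2 are the month
String y = times.substring(1, 3);
// The year is not on the page, so it is hard-coded: months 12 and 08 are treated as 2018, everything else as 2019
if(y.equals("12") || y.equals("08")) {
times = "2018-"+times.substring(1, times.length()-1);
}else {
times = "2019-"+times.substring(1, times.length()-1);
}
System.out.println("News time: " + times);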
String element_a_href = e.attr("href");
doc = Jsoup.connect("http://www.tjmc.edu.cn/" + element_a_href).timeout(5000).get();
Element econtent = doc.getElementById("vsb_content");
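if(null == econtent) {
econtent = doc.getElementById("vsb_content_2");
}
if(null == econtent) {
// Neither container exists on this page: skip it instead of aborting the whole run with a NullPointerException
continue;
}
String content = econtent.html();
// replace() matches literally; replaceAll() would treat each "." as a regex wildcard
content = content.replace("../../images/", "http://www.tjmc.edu.cn/images/");
content = content.replace("/__local/", "http://www.tjmc.edu.cn/__local/");
System.out.println("News content: " + content);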
News entity = new News();
entity.setLlbm(properties.getPropertiesByString("xyxwbbm"));
entity.setTitle(e.attr("title"));
entity.setCreateTime(times);
entity.setAuthor("");
entity.setSource("");
entity.setContent(content);
entity.setRemotelogourl(element_a_href);
// jdbcInsertUtils.insertNews(entity);
}
// for(String url: urls) {
// doc = Jsoup.connect(url).timeout(5000).get();
// Elements listDiv = doc.getElementsByAttributeValue("class", "list");
// System.out.println(listDiv.size());
// for(Element listDivElement:listDiv){
// Elements list_li = listDivElement.getElementsByTag("li");
// System.out.println(list_li.size());
// for (Element element_li : list_li) {
// try {
// // get the title
// Element element_a = element_li.getElementsByTag("a").get(0);
// Element element_span = element_li.getElementsByTag("span").get(0);
// if (element_a.attr("title") != null) if (element_a.attr("title") != "") {
// News entity = new News();
// entity.setTitle(element_a.attr("title"));
// entity.setCreateTime(element_span.text());
// System.out.println("News title: " + element_a.attr("title"));
// System.out.println("News time: " + element_span.text());
//
// String element_a_href = element_a.attr("href");
// entity.setLlbm(properties.getPropertiesByString("xyxwbbm"));
// if(element_a_href.contains("1055")){
// entity.setLlbm(properties.getPropertiesByString("xyxwbbm"));
// }else if(element_a_href.contains("1056")){
// entity.setLlbm(properties.getPropertiesByString("ysdwxxbbm"));
// }
//
// doc = Jsoup.connect("http://www.zjtongji.edu.cn/" + element_a_href).timeout(5000).get();
//
// Element econtent = doc.getElementById("vsb_content");
// if(null == econtent) {
// econtent = doc.getElementById("vsb_content_2");
// }
// String content = econtent.html();
//
// content = content.replaceAll("/images/", "http://www.zjtongji.edu.cn/images/");
// content = content.replaceAll("/__local/", "http://www.zjtongji.edu.cn/__local/");
// System.out.println("News content: " + content);
// entity.setAuthor("");
// entity.setSource("");
// entity.setContent(content);
// entity.setRemotelogourl(element_a.attr("href"));
// jdbcInsertUtils.insertNews(entity);
// }
// } catch (Exception e) {
// System.out.println(e.fillInStackTrace());
// }
// }
// }
// }
}
catch (Exception e)
{
// Print the full stack trace, not just the exception message
e.printStackTrace();
}
}
public static void main(String[] args)
throws Exception
{
getNewsContent();
// String x = "[12-31]";
// System.out.println(x.substring(1, 2));
}
}
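The hard-coded years in the loop above only fit the months that happened to be on the page when the code was written. Here is a minimal sketch of a more general rule, under the assumption that the list never shows a future date and never reaches back more than a year. NewsDates and resolve are hypothetical names, not part of the project.

```java
import java.time.LocalDate;
import java.time.MonthDay;

// Hypothetical helper, not part of the original project.
final class NewsDates {

    /**
     * Turns a bracketed month-day such as "[12-31]" into a full date.
     * Assumption: a news item is never dated in the future, so a month-day
     * later than "today" must belong to the previous year.
     */
    static LocalDate resolve(String bracketed, LocalDate today) {
        String trimmed = bracketed.trim();
        // Strip the surrounding brackets: "[12-31]" -> "12-31"
        String monthDay = trimmed.substring(1, trimmed.length() - 1);
        // MonthDay.parse expects the ISO form "--MM-dd"
        MonthDay md = MonthDay.parse("--" + monthDay);
        LocalDate candidate = md.atYear(today.getYear());
        return candidate.isAfter(today) ? md.atYear(today.getYear() - 1) : candidate;
    }

    public static void main(String[] args) {
        LocalDate crawlDay = LocalDate.of(2019, 3, 1); // example crawl date
        System.out.println(resolve("[12-31]", crawlDay)); // prints 2018-12-31
        System.out.println(resolve("[02-15]", crawlDay)); // prints 2019-02-15
    }
}
```

LocalDate.toString() already produces the yyyy-MM-dd form that the loop passes to setCreateTime.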
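The jdbcInsertUtils.insertNews(entity); call above (commented out in this listing) saves each scraped news item to the database. The two replace calls handle image paths: when scraping, always prepend the domain to relative paths!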
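String replacement works for the path patterns seen on this site. As a more general alternative, a relative link can be resolved against the URL of the page it came from. Below is a minimal sketch using java.net.URI from the JDK. LinkResolver and toAbsolute are hypothetical names, and the article path info/1055/1234.htm is made up for illustration.

```java
import java.net.URI;

// Hypothetical helper, not part of the original project.
final class LinkResolver {

    /** Resolves a possibly relative link against the URL of the page that contains it. */
    static String toAbsolute(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        // Made-up article URL, used only to demonstrate resolution
        String page = "http://www.tjmc.edu.cn/info/1055/1234.htm";
        System.out.println(toAbsolute(page, "../../images/a.png"));
        System.out.println(toAbsolute(page, "/__local/x.jpg"));
    }
}
```

With the made-up page URL above, the two links resolve to http://www.tjmc.edu.cn/images/a.png and http://www.tjmc.edu.cn/__local/x.jpg. jsoup also offers Element.absUrl(attributeName) for attributes on a document that was fetched with a base URL, which may be simpler when working on elements rather than on the HTML string.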
The scheduled-job code, which also records how long the crawl takes:
package jobs;
import java.util.Calendar;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;
import org.springframework.scheduling.quartz.QuartzJobBean;
import service.ITellingTheTimeService;
import utils.JsoupTest;
public class TellingTheTimeJob extends QuartzJobBean
{
private ITellingTheTimeService tellingTheTimeService = null;
protected void executeInternal(JobExecutionContext arg0)
throws JobExecutionException
{
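Calendar now = Calendar.getInstance();
System.out.println("Job started, Beijing time: " + now.getTime());
long start = System.currentTimeMillis();
JsoupTest.getNewsContent();
// Take a fresh timestamp here; reusing "now" would print the start time again
System.out.println("Job finished, Beijing time: " + Calendar.getInstance().getTime() + ", elapsed " + (System.currentTimeMillis() - start) + " ms");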
}
public ITellingTheTimeService getTellingTheTimeService() {
return this.tellingTheTimeService;
}
public void setTellingTheTimeService(ITellingTheTimeService tellingTheTimeService)
{
this.tellingTheTimeService = tellingTheTimeService;
}
}
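The timing part of the job can also be pulled out into a small reusable helper. This is a minimal sketch using System.nanoTime(), which is a monotonic clock and so is not affected by system clock adjustments. CrawlTimer and time are hypothetical names, and the sleep stands in for the real crawl.

```java
import java.time.Duration;

// Hypothetical helper, not part of the original project.
final class CrawlTimer {

    /** Runs the task once and returns how long it took. */
    static Duration time(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return Duration.ofNanos(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        Duration elapsed = time(() -> {
            try {
                Thread.sleep(50); // stand-in for JsoupTest.getNewsContent()
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println("Crawl took " + elapsed.toMillis() + " ms");
    }
}
```

Inside executeInternal this would be called as CrawlTimer.time(JsoupTest::getNewsContent).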
Finally, deploy the project to a Tomcat server. Whenever the target URL publishes new news, it gets inserted into the database. Sweet!