
A Java crawler for fetching the latest chapters of web novels


1. A working example

Preface: I really have to vent about novel websites. I only wanted to follow a few novels, but the pages are stuffed with junk ads that waste mobile data, and on the subway I end up reading on the sly ~^~


This example is implemented with jsoup + Quartz.

2. Code examples

1) Based on the HTML layout shared by most novel sites, extract a common model and design its fields
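The parameters taken by runSearch and parseCont suggest one possible shape for that common model. A minimal sketch, with the caveat that the class and field names here are my own, not the author's actual schema:

```java
// A minimal per-site crawl configuration, mirroring the parameters
// that runSearch/parseCont take. Names are illustrative only.
public class CrawlConfig {
	public final String book;          // novel title
	public final String href;          // URL of the chapter-menu page
	public final String menuSelector;  // CSS selector matching the chapter links
	public final int menuC;            // how many of the latest chapters to fetch
	public final String contentSt;     // CSS selector for the chapter body
	public final String exclude;       // '|'-separated selectors to strip (ads, scripts)
	public final String refer;         // site root, for resolving root-relative links
	public final String cron;          // Quartz cron expression for this site's job

	public CrawlConfig(String book, String href, String menuSelector, int menuC,
			String contentSt, String exclude, String refer, String cron) {
		this.book = book;
		this.href = href;
		this.menuSelector = menuSelector;
		this.menuC = menuC;
		this.contentSt = contentSt;
		this.exclude = exclude;
		this.refer = refer;
		this.cron = cron;
	}
}
```

One row of this configuration per site is enough to drive both the menu scrape and the scheduled job.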

2) Process the data according to the configuration

public String runSearch(boolean isTest, String book, String href, String menuSelector,
		int menuC, String contentSt, String exclude, String refer) {
	try {
		// set a timeout in case the page pulls in too many resources
		Document doc = Jsoup.connect(href).timeout(20000).get();
		// select the chapter-menu entries
		Elements contentsMenu = doc.select(menuSelector);
		StringBuffer sb = new StringBuffer();
		// only visit the last menuC entries, i.e. the latest chapters
		menuC = Math.min(menuC, contentsMenu.size());
		for (int i = menuC; i > 0; i--) {
			Elements a = contentsMenu.eq(contentsMenu.size() - i);
			if (a.size() > 0) {
				String ah = a.attr("href");
				if (ah.startsWith("http://") || ah.startsWith("https://")) {
					// already absolute: use as-is
				} else if (ah.startsWith("/")) {
					// root-relative: prepend the site root
					ah = refer + ah;
				} else {
					// page-relative: resolve against the menu page URL
					ah = href.endsWith("/") ? href + ah : href + "/" + ah;
				}
				// fetch and process the content page
				sb.append(parseCont(isTest, book, a.html(), ah, contentSt, exclude, refer));
			}
		}
		return sb.toString();
	} catch (IOException e) {
		e.printStackTrace();
		return "";
	}
}
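The three link-resolution branches inside the loop can be pulled into a small helper so they are testable on their own. A sketch, where resolveHref is my own name rather than part of the original code:

```java
public class LinkResolver {
	// Resolve a chapter link the same way runSearch does: absolute URLs
	// pass through, root-relative URLs are prefixed with the site root
	// (refer), and page-relative URLs are joined onto the menu page URL.
	public static String resolveHref(String ah, String menuUrl, String refer) {
		if (ah.startsWith("http://") || ah.startsWith("https://")) {
			return ah; // already absolute
		} else if (ah.startsWith("/")) {
			return refer + ah; // root-relative
		} else {
			// page-relative: avoid producing a double slash
			return menuUrl.endsWith("/") ? menuUrl + ah : menuUrl + "/" + ah;
		}
	}
}
```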

	
private String parseCont(Boolean isTest, String novelName, String title, String href,
		String contentSt, String exclude, String refer) throws IOException {
	System.out.println(novelName + ";" + title + ";" + href);
	// fetch the content page
	Document doc = Jsoup.connect(href).timeout(20000).get();
	Elements e = doc.select(contentSt);
	// exclude is a '|'-separated list of selectors for nodes to strip (ads, scripts)
	if (exclude != null && !exclude.equals("")) {
		String[] arr = exclude.split("\\|");
		for (String a : arr) {
			if (!a.trim().equals("")) {
				e.select(a.trim()).remove();
			}
		}
	}
	if (isTest) {
		return "book:" + novelName + "<br/>title:" + title + "<br/>content:" + e.html() + "<br/>";
	} else {
		// save or update in the database as needed
		return "";
	}
}
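The exclude handling above is the part most likely to need per-site tweaking, so it is worth checking in isolation. A sketch of the same split-and-trim logic as a standalone method (ExcludeParser and excludeSelectors are my own names):

```java
import java.util.ArrayList;
import java.util.List;

public class ExcludeParser {
	// Split an exclude string such as "script|div.ad|" into clean selector
	// tokens, skipping blanks, exactly as parseCont does before remove().
	public static List<String> excludeSelectors(String exclude) {
		List<String> out = new ArrayList<>();
		if (exclude != null && !exclude.equals("")) {
			for (String a : exclude.split("\\|")) {
				if (!a.trim().equals("")) {
					out.add(a.trim());
				}
			}
		}
		return out;
	}
}
```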

3. Configure the scheduled task (Quartz cron configuration)

public class TaskManager {
	private static SchedulerFactory gSchedulerFactory = new StdSchedulerFactory();
	private static int index = 1;

	public static void run(String jobname, Class<? extends Job> jobclass, String ruler) {
		try {
			Scheduler sche = gSchedulerFactory.getScheduler();
			JobDetail job = JobBuilder.newJob(jobclass)
					.withIdentity(jobname, "group1")
					.build();
			// ruler is a Quartz cron expression describing when the crawl fires
			CronTrigger trigger = (CronTrigger) TriggerBuilder.newTrigger()
					.withIdentity("trigger" + (index++), "group1")
					.withSchedule(CronScheduleBuilder.cronSchedule(ruler))
					.build();
			sche.scheduleJob(job, trigger);
			sche.start();
		} catch (SchedulerException e) {
			e.printStackTrace();
		}
	}
}
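With TaskManager in place, wiring one site into the scheduler is a single call. A sketch of the glue code, assuming a NovelJob class of my own naming that invokes runSearch from its execute method; the crawler class name and all parameter values here are illustrative, and the Quartz and jsoup jars must be on the classpath:

```java
// Hypothetical Quartz job that performs one configured crawl when triggered.
class NovelJob implements Job {
	@Override
	public void execute(JobExecutionContext ctx) throws JobExecutionException {
		// these values would normally come from the stored per-site configuration
		new NovelCrawler().runSearch(false, "SomeNovel",
				"http://example.com/book/1/", "div.menu a", 5,
				"div#content", "script|div.ad", "http://example.com");
	}
}

public class CrawlerBootstrap {
	public static void main(String[] args) {
		// Quartz cron fields are: sec min hour day-of-month month day-of-week.
		// "0 0/30 * * * ?" fires every 30 minutes, on the hour and half-hour.
		TaskManager.run("novelJob", NovelJob.class, "0 0/30 * * * ?");
	}
}
```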

4. Summary

So far only novel-site configurations have been set up.

If any fellow novel readers would like a book added, leave a comment with its title or URL and I will add it to the task; afterwards you can scan the QR code to read the latest chapters directly.


Title: A Java crawler for fetching the latest chapters of web novels
Author: hugh0524
Link: https://blog.uproject.cn/articles/2016/07/24/1469336068809.html
