Java crawler: fetching the latest chapters of a novel
1、A working example
Preface: I have to vent about novel websites. I just want to follow a few novels, but the sites are stuffed with junk ads that burn through my mobile data, and on the subway I have to read them furtively ~^~
Scan the QR code to access it: (QR code image)
This example is implemented with jsoup + quartz.
2、Code example
1) Based on the HTML layout shared by most novel sites, extract a common model and design its fields (a sketch of such a model follows).
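Here is a minimal sketch of what that common model could look like, reverse-engineered from the parameters of runSearch(...) in the next step; the class name SpiderConfig and the field names are assumptions, not taken from the original post.

// Assumed configuration model; one instance per novel/site
public class SpiderConfig {
    private String book;          // novel name
    private String href;          // URL of the table-of-contents page
    private String menuSelector;  // CSS selector matching the chapter links
    private int menuC;            // how many of the newest chapters to fetch
    private String contentSt;     // CSS selector for the chapter body
    private String exclude;       // pipe-separated selectors to strip (ads, scripts, ...)
    private String refer;         // site root, used to absolutize root-relative links
    // getters and setters omitted for brevity
}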
2) Process the data according to the configuration:
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public String runSearch(boolean isTest, String book, String href, String menuSelector,
        int menuC, String contentSt, String exclude, String refer) {
    String wz = href;
    try {
        // Set a timeout so pages with too many resources do not hang the crawl
        Document doc = Jsoup.connect(wz).timeout(20000).get();
        // Select the table-of-contents links
        Elements contentsMenu = doc.select(menuSelector);
        StringBuffer sb = new StringBuffer();
        if (contentsMenu != null) {
            // Never ask for more chapters than the menu actually has
            menuC = menuC >= contentsMenu.size() ? contentsMenu.size() : menuC;
            // Walk the last menuC links, i.e. the newest chapters, oldest first
            for (int i = menuC; i > 0; i--) {
                Elements a = contentsMenu.eq(contentsMenu.size() - i);
                if (a != null && a.size() > 0) {
                    String ah = a.attr("href");
                    if (ah.startsWith("http://") || ah.startsWith("https://")) {
                        // Already absolute; keep as-is
                    } else if (ah.startsWith("/")) {
                        // Root-relative: prefix the site host
                        ah = refer + ah;
                    } else if (wz.endsWith("/")) {
                        ah = wz + ah;
                    } else {
                        ah = wz + "/" + ah;
                    }
                    // Parse the chapter content page
                    sb.append(parseCont(isTest, book, a.html(), ah, contentSt, exclude, refer));
                }
            }
        }
        return sb.toString();
    } catch (IOException e) {
        e.printStackTrace();
        return "";
    }
}

private String parseCont(Boolean isTest, String novelName, String title, String href,
        String contentSt, String exclude, String refer) throws IOException {
    System.out.println(novelName + ";" + title + ";" + href);
    // Fetch the chapter page and pull out the content block
    Document doc = Jsoup.connect(href).timeout(20000).get();
    Elements e = doc.select(contentSt);
    // Strip excluded selectors (pipe-separated), e.g. ad blocks or inline scripts
    if (exclude != null && !exclude.equals("")) {
        String[] arr = exclude.split("\\|");
        for (String a : arr) {
            if (!a.trim().equals("")) {
                e.select(a.trim()).remove();
            }
        }
    }
    if (isTest) {
        return "book:" + novelName + "<br/>title:" + title + "<br/>content:" + e.html() + "<br/>";
    } else {
        // Persist (save or update) as needed
        return "";
    }
}
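To make the parameters concrete, here is a minimal test-drive sketch. The URL, the novel name, and the CSS selectors (".listmain dd a", "#content") are placeholders for whatever a real target site uses, and NovelSpider is an assumed class wrapping the two methods above.

public static void main(String[] args) {
    NovelSpider spider = new NovelSpider(); // assumed wrapper class for runSearch/parseCont
    String html = spider.runSearch(
            true,                           // isTest: return HTML instead of persisting
            "SomeNovel",                    // book: novel name
            "http://example.com/book/1/",   // href: table-of-contents page
            ".listmain dd a",               // menuSelector: CSS selector for chapter links
            3,                              // menuC: fetch only the 3 newest chapters
            "#content",                     // contentSt: CSS selector for the chapter body
            "script|a",                     // exclude: pipe-separated selectors to strip
            "http://example.com");          // refer: host prefix for root-relative links
    System.out.println(html);
}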
3、Configure the scheduled task (Quartz cron configuration)
import org.quartz.CronScheduleBuilder;
import org.quartz.CronTrigger;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.SchedulerFactory;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class TaskManager {
    private static SchedulerFactory gSchedulerFactory = new StdSchedulerFactory();
    private static int index = 1;

    public static void run(String jobname, Class<? extends Job> jobclass, String ruler) {
        try {
            Scheduler sche = gSchedulerFactory.getScheduler();
            // Build the job under "group1" with the caller-supplied name
            JobDetail job = JobBuilder.newJob(jobclass)
                    .withIdentity(jobname, "group1")
                    .build();
            // "ruler" is a standard Quartz cron expression; each call gets a unique trigger
            CronTrigger trigger = (CronTrigger) TriggerBuilder.newTrigger()
                    .withIdentity("trigger" + (index++), "group1")
                    .withSchedule(CronScheduleBuilder.cronSchedule(ruler))
                    .build();
            sche.scheduleJob(job, trigger);
            sche.start();
        } catch (SchedulerException e) {
            e.printStackTrace();
        }
    }
}
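Putting the pieces together, scheduling the crawl could look like the sketch below. NovelSpiderJob and the argument values are illustrative assumptions; only TaskManager.run and the runSearch signature come from the code above. The cron expression "0 0/30 * * * ?" fires every 30 minutes (Quartz field order: seconds, minutes, hours, day-of-month, month, day-of-week).

import org.quartz.Job;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;

// Hypothetical job that re-runs the crawl each time the trigger fires
public class NovelSpiderJob implements Job {
    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        new NovelSpider().runSearch(false, "SomeNovel",
                "http://example.com/book/1/", ".listmain dd a",
                3, "#content", "script|a", "http://example.com");
    }

    public static void main(String[] args) {
        // Check for new chapters every 30 minutes
        TaskManager.run("novelJob", NovelSpiderJob.class, "0 0/30 * * * ?");
    }
}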
4、Summary
So far I have only set up configurations for novels. If any fellow novel readers are interested, leave a comment with a novel's name or URL and I will add it to the scheduled tasks; next time you can scan the QR code and read the latest chapters directly.
Title: Java crawler: fetching the latest chapters of a novel
Author: hugh0524
Link: https://blog.uproject.cn/articles/2016/07/24/1469336068809.html