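elasticsearch questions

Using a Java crawler to collect Elastic Chinese community posts as ES test data

Elasticsearch • kl posted an article • 1 comment • 8634 views • 2016-03-29 23:10 • from related topics

Preface

To put Elasticsearch through its paces, I used a crawler to collect a large amount of data from the Elastic Chinese community and CSDN as test data. Below is a brief account of the process.

About WebCollector

WebCollector is a Java crawler framework (kernel) that needs no configuration and is easy to build on. It provides a concise API, so a capable crawler takes only a small amount of code. WebCollector-Hadoop is the Hadoop version of WebCollector and supports distributed crawling.

WebCollector aims to maintain a stable, extensible crawler kernel that developers can build on flexibly. The kernel is highly extensible, and users can develop the crawler they need on top of it. The source integrates Jsoup for precise page parsing, and the 2.x releases integrate selenium to handle data generated by JavaScript. Official site: http://crawlscript.github.io/WebCollector/

Usage

Import the jar dependency. My project uses Maven, so I added the WebCollector dependency to pom.xml.

PS: I am using the latest WebCollector release. The newest version currently in the Maven repository is 2.09, so to use the latest one you have to download the source and package it yourself.

Once the environment is ready, create a class that extends BreadthCrawler and override the visit method. All of your processing logic goes in visit. My code follows.

Crawling the Elastic Chinese community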

/**
 * Created by 小陈 on 2016/3/29.
 */
@Component
public class ElasticCrawler extends BreadthCrawler {
    @Autowired
    IpaDao ipaDao;

    public ElasticCrawler() {
        super("crawl", true);
        /* start page (redacted by the author) */
        this.addSeed("xxx");
        /* URL pattern to fetch (redacted by the author) */
        this.addRegex("xxx");
        /* do not fetch jpg|png|gif */
        this.addRegex("-.*\\.(jpg|png|gif).*");
        /* do not fetch URLs containing # */
//        this.addRegex("-.*#.*");
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.getUrl();
        String content = "";
        try {
            // extract the main body text of the page
            content = ContentExtractor.getContentByUrl(url);
        } catch (Exception e) {
            e.printStackTrace();
        }
        /* extract the title */
        String title = page.getDoc().title();
        System.out.println("-------------------->" + title);
        if (!title.isEmpty() && !content.isEmpty()) {
            Pa pa = new Pa(title, content);
            ipaDao.save(pa); // persist to the database
        }
    }
}

Crawling CSDN

/**
 * @author kl by 2016/3/29
 * @boke www.kailing.pub
 */
@Component
public class CSDNCrawler extends BreadthCrawler {
    @Autowired
    IpaDao ipaDao;

    public CSDNCrawler() {
        super("crawl", true);
        /* start page */
        this.addSeed("http://blog.csdn.net/.*"); // add the seed address
        /* fetch CSDN article pages */
        this.addRegex("http://blog.csdn.net/.*/article/details/.*");
        /* do not fetch jpg|png|gif */
        this.addRegex("-.*\\.(jpg|png|gif).*");
        /* do not fetch URLs containing # */
//        this.addRegex("-.*#.*");
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.getUrl();
        String content = "";
        try {
            // extract the main body text of the page
            content = ContentExtractor.getContentByUrl(url);
        } catch (Exception e) {
            e.printStackTrace();
        }
        // only article pages carry a title and an author block
        if (page.matchUrl("http://blog.csdn.net/.*/article/details/.*")) {
            String title = page.select("div[class=article_title]").first().text();
            String author = page.select("div[id=blog_userface]").first().text(); // author name
            System.out.println("title:" + title + "\tauthor:" + author);
            if (!title.isEmpty() && !content.isEmpty()) {
                Pa pa = new Pa(title, content);
                ipaDao.save(pa);
            }
        }
    }
}
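
The addRegex calls in both crawlers mix plain patterns with patterns that start with `-`, which the code comments describe as "do not fetch" rules. The following is a minimal, self-contained sketch of that convention using only `java.util.regex`. The class `UrlRuleFilter` and its methods are hypothetical names made up for illustration; this is not WebCollector's own implementation, only one plausible reading of how include and exclude rules combine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: a URL filter driven by include rules and "-"-prefixed exclude rules. */
final class UrlRuleFilter {
    private final List<Pattern> include = new ArrayList<>();
    private final List<Pattern> exclude = new ArrayList<>();

    /** A rule starting with '-' is an exclusion; anything else is an inclusion. */
    void addRegex(String rule) {
        if (rule.startsWith("-")) {
            exclude.add(Pattern.compile(rule.substring(1)));
        } else {
            include.add(Pattern.compile(rule));
        }
    }

    /** Accept a URL only if it fully matches at least one inclusion and no exclusion. */
    boolean accepts(String url) {
        for (Pattern p : exclude) {
            if (p.matcher(url).matches()) {
                return false;
            }
        }
        for (Pattern p : include) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlRuleFilter filter = new UrlRuleFilter();
        // The same two rules the CSDN crawler registers.
        filter.addRegex("http://blog.csdn.net/.*/article/details/.*");
        filter.addRegex("-.*\\.(jpg|png|gif).*");

        // The user name and article id below are made-up sample values.
        System.out.println(filter.accepts("http://blog.csdn.net/someuser/article/details/12345"));
        System.out.println(filter.accepts("http://blog.csdn.net/someuser/article/details/12345.png"));
    }
}
```

Checking exclusions first means an image URL is rejected even when it also matches the article pattern, which is the behavior the "do not fetch" comments suggest.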

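PS: The crawl rules for the Elastic Chinese community have been redacted, because I care about the community; feel free to crawl CSDN instead. WebCollector is a powerful tool. One key to writing a crawler is knowing the target site's URL patterns, so look into them if you are interested. The Elastic community does not have much data, and crawling it took about a minute. CSDN took 5 or 6 minutes; I did not crawl deep and collected roughly 200,000 to 300,000 records, keeping only the title and body text.

Original post on my blog: http://www.kailing.pub/article/index/arcid/86.html

(Screenshot of the imported data.)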

Wrong field values in aggs aggregation queries caused by a misconfigured ES index template

Elasticsearch • Max posted an article • 0 comments • 9278 views • 2016-03-18 16:51 • from related topics

Today an aggs query on the status code field (status) of our HTTP logs in ES returned wrong values for the field. This is a record of the investigation.

The HTTP logs:

1. When Logstash writes the HTTP logs to ES, the status code is stored in the field status, with values such as 200, 302, 400, and 404.
2. When the index is queried in Kibana, the status values shown on the Discover page match what Logstash sent, so they are correct.

The failing scenario (I ran the queries through Kibana's Sense plugin; using curl or a Python ES client directly gives the same result). Query the index:

POST http-2016.03.18/_search
{
  "fields": ["status"],
  "query": {
    "bool": {
      "must": [
        { "range": { "@timestamp": { "gte": "now-5m" } } }
      ]
    }
  },
  "_source": "false",
  "size": 0,
  "aggs": {
    "status_type": {
      "terms": { "field": "status" }
    }
  }
}

The aggregations part of the response:

"aggregations" : {
  "status_type" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      { "key" : -56, "doc_count" : 376341 },
      { "key" : 46, "doc_count" : 51439 },
      { "key" : 45, "doc_count" : 5543 },
      { "key" : 48, "doc_count" : 1669 },
      { "key" : -108, "doc_count" : 1068 },
      { "key" : -50, "doc_count" : 11 },
      { "key" : -109, "doc_count" : 8 },
      { "key" : -112, "doc_count" : 4 }

Finding the cause

First I removed the aggs part and ran the query part on its own:

POST http-2016.03.18/_search
{
  "fields": ["status"],
  "query": {
    "bool": {
      "must": [
        { "range": { "@timestamp": { "gte": "now-5m" } } }
      ]
    }
  }
}

In this response the status values shown under hits are correct:

"hits": {
  "total": 1242104,
  "max_score": 1,
  "hits": [
    {
      "_index": "http-2016.03.18",
      "_type": "log",
      "_id": "AVOI3EiwidwPAhB1e7gQ",
      "_score": 1,
      "fields": {
        "status": [ "200" ]
      }
    }
    ......

Next I looked up the index information and the template configuration for the http index:

GET /http-2016.03.18/
GET /_template/http

It turned out that the type of the status property is byte:

"properties": {
  "@timestamp": {
    "type": "date",
    "format": "strict_date_optional_time||epoch_millis"
  },
  ......
  ......
  "status": {
    "type": "byte"
  },
  ......
  ......

Cause

The wrong status values appear in the aggs query because the ES template defines status as type byte, so any status value above 127 overflows. For example, 200 wraps around to -56 and 302 wraps around to 46, which are exactly the keys in the buckets above. After changing the type to short, the results were correct again. (For HTTP status codes, short is sufficient. integer, long, or the default string type would also work; the choice only affects how much storage the field takes.)
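
The overflow can be reproduced outside Elasticsearch. The sketch below uses Java's narrowing cast from `int` to `byte`, which keeps only the low 8 bits of the value. It illustrates the arithmetic only and makes no claim about how Elasticsearch stores the field internally. The class name `ByteOverflowDemo` and the list of sample status codes are assumptions chosen for illustration; the post itself only names 200, 302, 400, and 404.

```java
/** Hypothetical demo: what HTTP status codes look like after being squeezed into a signed 8-bit value. */
final class ByteOverflowDemo {

    /** Narrow an int to a signed byte, keeping only the low 8 bits. */
    static byte asStoredByte(int status) {
        return (byte) status;
    }

    /** True if the value can be represented by a signed 8-bit field without wrapping. */
    static boolean fitsInByte(int status) {
        return status >= Byte.MIN_VALUE && status <= Byte.MAX_VALUE;
    }

    /** True if the value can be represented by a signed 16-bit field without wrapping. */
    static boolean fitsInShort(int status) {
        return status >= Short.MIN_VALUE && status <= Short.MAX_VALUE;
    }

    public static void main(String[] args) {
        // Sample status codes (assumed for illustration).
        int[] statuses = {200, 206, 301, 302, 304, 400, 403, 404};
        for (int status : statuses) {
            System.out.println(status
                    + " -> as byte: " + asStoredByte(status)
                    + ", fits in byte: " + fitsInByte(status)
                    + ", fits in short: " + fitsInShort(status));
        }
    }
}
```

Every wrapped value this produces for the sample codes (-56, -50, 45, 46, 48, -112, -109, -108) also appears as a bucket key in the aggregation result, which is consistent with the byte mapping being the cause. All of the sample codes fit in a short, which matches the fix described in the post.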

