elasticsearch search

Monitoring ES with Zabbix

Elasticsearch • Max published an article • 0 comments • 10196 views • 2017-05-06 14:18 • from related topics

https://vastxiao.github.io/zabbixMonitorES/

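Configuring, managing, and optimizing ES indices with curator4, the official ES tool

Elasticsearch • Max published an article • 0 comments • 6056 views • 2017-03-12 17:18 • from related topics

I just posted this on GitHub, so a link will do: https://vastxiao.github.io/art ... ning/

A collection of ElasticSearch plugins

Elasticsearch • kl published an article • 0 comments • 14502 views • 2016-03-30 18:07 • from related topics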

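Much of ElasticSearch's functionality is provided by official or third-party plugins built on ElasticSearch's AbstractPlugin class, so I am recording some common and useful plugins here for future reference.

Analysis plugins:
Combo Analysis Plugin (by Olivier Favre, Yakaz): a combining analyzer that merges the output of several analyzers.
Smart Chinese Analysis Plugin (by the elasticsearch team): Lucene's default Chinese analyzer.
ICU Analysis plugin (by the elasticsearch team): the ICU analysis module that ships with Lucene. ICU is a stable, mature, powerful, easy-to-use, cross-platform library for Unicode support.
Stempel (Polish) Analysis plugin (by the elasticsearch team): a Polish analyzer.
IK Analysis Plugin (by Medcl): the well-known ik Chinese analyzer, which needs no introduction.
Mmseg Analysis Plugin (by Medcl): mmseg Chinese word segmentation.
Hunspell Analysis Plugin (by Jörg Prante): the Hunspell module that ships with Lucene.
Japanese (Kuromoji) Analysis plugin (by the elasticsearch team): a Japanese analyzer.
Japanese Analysis plugin (by suguru): a Japanese analyzer.
Russian and English Morphological Analysis Plugin (by Igor Motov): a Russian and English analyzer.
Pinyin Analysis Plugin (by Medcl): a pinyin analyzer.
String2Integer Analysis Plugin (by Medcl): maps strings to integers. It is mainly used with facets: faceting on a string field is expensive, so mapping the strings to integers and faceting on the integers is much faster.

River (synchronization) plugins:
CouchDB River Plugin (by the elasticsearch team): synchronizes CouchDB with elasticsearch.
Wikipedia River Plugin (by the elasticsearch team): reads Wikipedia dump files. Wikipedia publishes an offline dump of its data from time to time in XML form, and this river reads that file to build an index.
Twitter River Plugin (by the elasticsearch team): a Twitter synchronization plugin that can index your tweets.
RabbitMQ River Plugin (by the elasticsearch team): reads messages from RabbitMQ queues and indexes them.
RSS River Plugin (by David Pilato): periodically indexes data from one or more RSS feeds.
MongoDB River Plugin (by Richard Louapre): a MongoDB synchronization plugin. MongoDB must run as a replica set, because the plugin works by periodically reading MongoDB's oplog.
Open Archives Initiative (OAI) River Plugin (by Jörg Prante): indexes data offered by OAI data providers.
Sofa River Plugin (by adamlofts): synchronizes several CouchDB databases into a single ES index.
JDBC River Plugin (by Jörg Prante): a synchronization plugin for relational databases.
FileSystem River Plugin (by David Pilato): synchronizes files from the local file system. You specify a local directory path, and ES periodically scans and indexes the files under it.
LDAP River Plugin (by Tanguy Leroux): indexes data from an LDAP directory.
Dropbox River Plugin (by David Pilato): indexes files stored in Dropbox, calling the Dropbox API over OAuth.
ActiveMQ River Plugin (by Dominik Dorn): a synchronization plugin for ActiveMQ queues, similar to the RabbitMQ one above.
Solr River Plugin (by Luca Cavanna): a Solr synchronization plugin that copies Solr indices into ES.
CSV River Plugin (by Martin Bednar): indexes CSV files from a specified directory.

Transport plugins:
Servlet transport (by the elasticsearch team): wraps the REST interface in a servlet.
Memcached transport plugin (by the elasticsearch team): lets you call the REST interface over the memcached protocol. Note that this does not use memcached as a cache for ES.
Thrift Transport (by the elasticsearch team): transfers data over Thrift.
ZeroMQ transport layer plugin (by Tanguy Leroux): calls the REST interface over ZeroMQ.
Jetty HTTP transport plugin (by Sonian Inc.): serves the HTTP REST interface with Jetty instead of the default Netty. Its advantage is that you can configure access control on the HTTP interface.

Scripting plugins:
Python language Plugin (by the elasticsearch team): Python script support.
JavaScript language Plugin (by the elasticsearch team): JavaScript script support.
Groovy lang Plugin (by the elasticsearch team): Groovy script support.
Clojure Language Plugin (by Kevin Downey): Clojure script support.

Site plugins (presented as web pages):
BigDesk Plugin (by Lukáš Vlček): monitors ES status. Recommended!
Elasticsearch Head Plugin (by Ben Birch): a convenient client for all kinds of ES operations.
Paramedic Plugin (by Karel Minařík): an ES monitoring plugin.
SegmentSpy Plugin (by Zachary Tong): shows the state of the segments in ES indices.
Inquisitor Plugin (by Zachary Tong): mainly used to debug your queries.

Other plugins:
Mapper Attachments Type plugin (by the elasticsearch team): an attachment field type that uses the Tika library to parse many file formats into strings.
Hadoop Plugin (by the elasticsearch team): integrates Hadoop with elasticsearch. It can build indices in parallel with Hadoop MapReduce and also supports frameworks such as Cascading, Hive, and Pig.
AWS Cloud Plugin (by the elasticsearch team): integrates elasticsearch with Amazon Web Services.
ElasticSearch Mock Solr Plugin (by Matt Weber): a Solr API front end for elasticsearch. With it you can call ES through the Solr API, for example directly from SolrJ. It suits a temporary transition from Solr to ES.
Suggester Plugin (by Alexander Reelsen): search suggestions for ES, although ES has shipped this feature itself since version 0.90.
ElasticSearch PartialUpdate Plugin (by Medcl): partial document updates for elasticsearch.
ZooKeeper Discovery Plugin (by Sonian Inc.): manages the cluster through ZooKeeper. With this plugin, the distributed architecture of ES resembles SolrCloud.
ElasticSearch Changes Plugin (by Thomas Peuss): records index operations, so you can review the additions, deletions, and updates users made to an index.
ElasticSearch View Plugin (by Tanguy Leroux): renders ES documents as HTML, XML, or text, and can also generate web pages from queries.
ElasticSearch New Relic Plugin (by Vinicius Carvalho): integrates elasticsearch with New Relic, a performance monitoring tool. The plugin sends node status data to your New Relic account.

The community editor does not seem to support pasting rich text, so none of the plugins are linked, and there are too many to link one by one. For the addresses, see my blog: http://www.kailing.pub/article/index/arcid/87.html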

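Crawling the Elastic Chinese community with a Java crawler for ES test data

Elasticsearch • kl published an article • 1 comment • 9271 views • 2016-03-29 23:10 • from related topics

Preface

To test ES thoroughly, I used a crawler to collect a large amount of data from the Elastic Chinese community and CSDN as test data. Here is a short account of the process.

About WebCollector

WebCollector is a Java crawler framework (kernel) that needs no configuration and is easy to build on. It offers a compact API, so a capable crawler takes only a small amount of code. WebCollector-Hadoop is the Hadoop version of WebCollector and supports distributed crawling. WebCollector aims to maintain a stable, extensible crawler kernel that developers can adapt freely: the kernel is highly extensible, and users can build the crawler they want on top of it. The source bundles Jsoup for precise page parsing, and the 2.x versions bundle selenium to handle data generated by JavaScript. Official site: http://crawlscript.github.io/WebCollector/

Steps

Import the jar dependency. My project uses Maven, so I added the WebCollector dependency to pom.xml. ps: I am using the latest version, while the newest one in the Maven repository is currently 2.09, so to use the latest you have to download and package it yourself.

Once the environment is ready, create a class that extends BreadthCrawler and override its visit method. All of your processing logic goes in visit. My code follows.

Crawling the Elastic Chinese community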

/**
 * Created by 小陈 on 2016/3/29.
 */
@Component
public class ElasticCrawler extends BreadthCrawler {
    @Autowired
    IpaDao ipaDao;

    public ElasticCrawler() {
        // "crawl" is the directory for crawl state; true enables automatic link parsing
        super("crawl", true);
        /* start page (redacted by the author) */
        this.addSeed("xxx");
        /* pattern of the URLs to fetch (redacted by the author) */
        this.addRegex("xxx");
        /* do not fetch jpg|png|gif */
        this.addRegex("-.*\\.(jpg|png|gif).*");
        /* do not fetch URLs that contain # */
        // this.addRegex("-.*#.*");
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.getUrl();
        String content = "";
        try {
            // extract the main text of the page
            content = ContentExtractor.getContentByUrl(url);
        } catch (Exception e) {
            e.printStackTrace();
        }
        /* extract the title */
        String title = page.getDoc().title();
        System.out.println("-------------------->" + title);
        // keep only pages that have both a title and body text
        if (!title.isEmpty() && !content.isEmpty()) {
            Pa pa = new Pa(title, content);
            ipaDao.save(pa); // persist to the database
        }
    }
}

Crawling CSDN

/**
 * @author kl by 2016/3/29
 * @boke www.kailing.pub
 */
@Component
public class CSDNCrawler extends BreadthCrawler {
    @Autowired
    IpaDao ipaDao;

    public CSDNCrawler() {
        // "crawl" is the directory for crawl state; true enables automatic link parsing
        super("crawl", true);
        /* start page: a seed is a concrete URL, so the regex suffix ".*" is dropped here */
        this.addSeed("http://blog.csdn.net/");
        /* fetch article pages only */
        this.addRegex("http://blog.csdn.net/.*/article/details/.*");
        /* do not fetch jpg|png|gif */
        this.addRegex("-.*\\.(jpg|png|gif).*");
        /* do not fetch URLs that contain # */
        // this.addRegex("-.*#.*");
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.getUrl();
        String content = "";
        try {
            // extract the main text of the page
            content = ContentExtractor.getContentByUrl(url);
        } catch (Exception e) {
            e.printStackTrace();
        }
        if (page.matchUrl("http://blog.csdn.net/.*/article/details/.*")) {
            String title = page.select("div[class=article_title]").first().text();
            String author = page.select("div[id=blog_userface]").first().text(); // author name
            System.out.println("title:" + title + "\tauthor:" + author);
            // keep only pages that have both a title and body text
            if (!title.isEmpty() && !content.isEmpty()) {
                Pa pa = new Pa(title, content);
                ipaDao.save(pa);
            }
        }
    }
}

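ps: I have redacted the crawl rules for the Elastic Chinese community, because I care about the community; feel free to crawl CSDN instead. WebCollector is very capable. The key to writing a crawler is knowing the site's URL rules, so study them if you are interested. The Elastic community does not hold much data, and about a minute was enough. CSDN took 5 or 6 minutes; I did not crawl deeply and collected roughly 200,000 to 300,000 records, keeping only the title and body text. Original post on my blog: http://www.kailing.pub/article/index/arcid/86.html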
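The URL rules used in the crawlers above can be tried out in isolation. The following is a minimal sketch of how such rules might be evaluated, assuming a rule that starts with "-" excludes matching URLs and any other rule includes them. It is not WebCollector's actual implementation, and the class UrlRuleSketch and its methods are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class UrlRuleSketch {
    private final List<Pattern> positive = new ArrayList<>();
    private final List<Pattern> negative = new ArrayList<>();

    /** Register a rule; a leading "-" marks it as an exclusion rule (assumed convention). */
    public void addRegex(String rule) {
        if (rule.startsWith("-")) {
            negative.add(Pattern.compile(rule.substring(1)));
        } else {
            positive.add(Pattern.compile(rule));
        }
    }

    /** A URL is accepted if no exclusion rule matches it and at least one inclusion rule does. */
    public boolean accepts(String url) {
        for (Pattern p : negative) {
            if (p.matcher(url).matches()) {
                return false;
            }
        }
        for (Pattern p : positive) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlRuleSketch rules = new UrlRuleSketch();
        rules.addRegex("http://blog.csdn.net/.*/article/details/.*");
        rules.addRegex("-.*\\.(jpg|png|gif).*");
        // The user name and article id below are made up for the demonstration
        System.out.println(rules.accepts("http://blog.csdn.net/someuser/article/details/123"));
        System.out.println(rules.accepts("http://blog.csdn.net/someuser/article/details/123.png"));
    }
}
```

Checking exclusion rules first means an image URL is dropped even when it also matches the article pattern, which is the behavior the crawler comments describe.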

Connecting to and operating ES from Java with Jest, a powerful HTTP REST client

Elasticsearch • kl published an article • 6 comments • 28185 views • 2016-03-28 23:30 • from related topics

Preface

Before I learned about Jest, I kept trying to connect to ES with the official Elasticsearch Java API, but for some reason it always threw the exception below. I searched Google for a long time, and every answer blamed mismatched JVM versions. I was testing locally, though, so the JVM was certainly the same. The problem is still unsolved, but that was not going to stop me from exploring ES, and so I found Jest.

org.elasticsearch.transport.RemoteTransportException: Failed to deserialize exception response from stream
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize exception response from stream

My test code follows the official API examples (official API docs: Elasticsearch Java API). The code is:

Client client = new TransportClient().addTransportAddress(new InetSocketTransportAddress("127.0.0.1", 9300));
QueryBuilder queryBuilder = QueryBuilders.termQuery("content", "搜");
SearchResponse searchResponse = client.prepareSearch("indexdata").setTypes("fulltext")
        .setQuery(queryBuilder)
        .execute()
        .actionGet();
SearchHits hits = searchResponse.getHits();
System.out.println("Hits found: " + hits.getTotalHits());
SearchHit[] searchHists = hits.getHits();
for (SearchHit sh : searchHists) {
    System.out.println("content:" + sh.getSource().get("content"));
}
client.close();

If anyone knows what is going on, please tell me so I at least understand the pit I fell into. I would be very grateful. My ES version is 2.2.0.

About Jest

Jest is an API toolkit that connects to ES over HTTP REST. It is powerful, it can use the query builders of the ES Java API, and it is open source. GitHub: https://github.com/searchbox-io/Jest

My test case

Analyzer: ik, available at https://github.com/medcl/elasticsearch-analysis-ik . Many ES features are provided by plugins, and since the upgrade to 2.2.0 plugins are installed differently. If you have trouble installing the ik analysis plugin, contact me through my blog.

Create the index:

curl -XPUT http://localhost:9200/indexdata

Create the mapping for the index and specify the analyzer:

curl -XPOST http://localhost:9200/indexdata/fulltext/_mapping -d '
{
  "fulltext": {
    "_all": {
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_max_word",
      "term_vector": "no",
      "store": "false"
    },
    "properties": {
      "content": {
        "type": "string",
        "store": "no",
        "term_vector": "with_positions_offsets",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_max_word",
        "include_in_all": "true",
        "boost": 8
      },
      "description": {
        "type": "string",
        "store": "no",
        "term_vector": "with_positions_offsets",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_max_word",
        "include_in_all": "true",
        "boost": 8
      },
      "title": {
        "type": "string",
        "store": "no",
        "term_vector": "with_positions_offsets",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_max_word",
        "include_in_all": "true",
        "boost": 8
      },
      "keyword": {
        "type": "string",
        "store": "no",
        "term_vector": "with_positions_offsets",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_max_word",
        "include_in_all": "true",
        "boost": 8
      }
    }
  }
}'

The mapping can be inspected with the head plugin.

For importing data and querying, see the code:

@RunWith(SpringJUnit4ClassRunner.class)
@SpringApplicationConfiguration(classes = ElasticSearchTestApplication.class)
public class JestTestApplicationTests {
    @Autowired
    private KlarticleDao klarticleDao;

    // Obtain a JestClient instance
    public JestClient getClient() throws Exception {
        JestClientFactory factory = new JestClientFactory();
        factory.setHttpClientConfig(new HttpClientConfig
                .Builder("http://127.0.0.1:9200")
                .multiThreaded(true)
                .build());
        return factory.getObject();
    }

    /**
     * Import database records into ES
     * @throws Exception
     */
    @Test
    public void contextLoads() throws Exception {
        JestClient client = getClient();
        List<Klarticle> lists = klarticleDao.findAll();
        // Option 1: index the documents one request at a time
        for (Klarticle k : lists) {
            Index index = new Index.Builder(k).index("indexdata").type("fulltext").id(k.getArcid() + "").build();
            System.out.println("Indexing ----> " + k.getTitle());
            client.execute(index);
        }
        // Option 2: bulk indexing, which is more efficient
        Bulk.Builder bulkBuilder = new Bulk.Builder();
        for (Klarticle k : lists) {
            Index index = new Index.Builder(k).index("indexdata").type("fulltext").id(k.getArcid() + "").build();
            bulkBuilder.addAction(index);
        }
        client.execute(bulkBuilder.build());
        client.shutdownClient();
    }

    // Search test
    @Test
    public void JestSearchTest() throws Exception {
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.query(QueryBuilders.matchQuery("content", "搜索"));
        Search search = new Search.Builder(searchSourceBuilder.toString())
                // multiple indices or types can be added
                .addIndex("indexdata")
                .build();
        JestClient client = getClient();
        SearchResult result = client.execute(search);
        // List<SearchResult.Hit<Klarticle, Void>> hits = result.getHits(Klarticle.class);
        List<Klarticle> articles = result.getSourceAsObjectList(Klarticle.class);
        for (Klarticle k : articles) {
            System.out.println("-------> " + k.getTitle());
        }
    }
}

The required jars, for a Maven project:

<dependencies>
    <dependency>
        <groupId>io.searchbox</groupId>
        <artifactId>jest</artifactId>
        <version>2.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.6.1</version>
    </dependency>
    <dependency>
        <groupId>org.elasticsearch</groupId>
        <artifactId>elasticsearch</artifactId>
        <version>2.2.0</version>
    </dependency>
</dependencies>

Original post on my blog: http://www.kailing.pub/article/index/arcid/84.html
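Bulk indexing is faster largely because many documents travel in one HTTP request. As a rough illustration of what such a request carries, the sketch below builds the newline-delimited body that the ES _bulk endpoint expects: one action line followed by one source line per document, with a trailing newline. It shows the wire format only and is not how Jest builds its requests internally; the class BulkBodySketch and its methods are hypothetical, and the sample documents are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BulkBodySketch {
    /** Escape backslashes and double quotes so a value can sit inside a JSON string. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /**
     * Build a _bulk body that indexes each document under the given index and type.
     * Keys of docsById are document ids; values are already-serialized, single-line JSON sources.
     */
    static String bulkIndexBody(String index, String type, Map<String, String> docsById) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> doc : docsById.entrySet()) {
            // action line: which index, type, and id the next line belongs to
            body.append("{\"index\":{\"_index\":\"").append(escape(index))
                .append("\",\"_type\":\"").append(escape(type))
                .append("\",\"_id\":\"").append(escape(doc.getKey()))
                .append("\"}}\n");
            // source line: the document itself, terminated by a newline
            body.append(doc.getValue()).append('\n');
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("1", "{\"title\":\"first\"}");
        docs.put("2", "{\"title\":\"second\"}");
        System.out.print(bulkIndexBody("indexdata", "fulltext", docs));
    }
}
```

A body like this would be sent with a POST to the _bulk endpoint of the ES node, replacing the per-document requests of the first loop in the test above.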

Wrong field values in aggs results caused by a misconfigured ES index template

Elasticsearch • Max published an article • 0 comments • 9902 views • 2016-03-18 16:51 • from related topics

Today a terms aggregation on the HTTP log status code field (status) in ES returned values that were clearly wrong. This is a record of the investigation.

The HTTP logs:
1. When Logstash writes the HTTP logs to ES, the status code is stored in the field status, with values such as 200, 302, 400, and 404.
2. When the index is queried in Kibana, the status values shown on the Discover page match what Logstash sent, so they are correct.

How the problem shows up (I ran the queries with Kibana's Sense plugin; curl or python-ES gives the same result). Query the index:

POST http-2016.03.18/_search
{
  "fields": ["status"],
  "query": {
    "bool": {
      "must": [
        { "range": { "@timestamp": { "gte": "now-5m" } } }
      ]
    }
  },
  "_source": "false",
  "size": 0,
  "aggs": {
    "status_type": {
      "terms": { "field": "status" }
    }
  }
}

The aggregations part of the response:

"aggregations" : {
  "status_type" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      { "key" : -56, "doc_count" : 376341 },
      { "key" : 46, "doc_count" : 51439 },
      { "key" : 45, "doc_count" : 5543 },
      { "key" : 48, "doc_count" : 1669 },
      { "key" : -108, "doc_count" : 1068 },
      { "key" : -50, "doc_count" : 11 },
      { "key" : -109, "doc_count" : 8 },
      { "key" : -112, "doc_count" : 4 }
    ]
  }
}

Finding the cause: I first removed the aggs part and ran the query on its own:

POST http-2016.03.18/_search
{
  "fields": ["status"],
  "query": {
    "bool": {
      "must": [
        { "range": { "@timestamp": { "gte": "now-5m" } } }
      ]
    }
  }
}

In this response, the status field in hits is correct:

"hits": {
  "total": 1242104,
  "max_score": 1,
  "hits": [
    {
      "_index": "http-2016.03.18",
      "_type": "log",
      "_id": "AVOI3EiwidwPAhB1e7gQ",
      "_score": 1,
      "fields": {
        "status": [ "200" ]
      }
    }
    ......

Next I looked at the index settings and the template for the http index:

GET /http-2016.03.18/
GET /_template/http

The type of the status property turned out to be byte:

"properties": {
  "@timestamp": {
    "type": "date",
    "format": "strict_date_optional_time||epoch_millis"
  },
  ......
  "status": {
    "type": "byte"
  },
  ......

Cause: the ES template defined status as byte, a signed 8-bit type that holds values from -128 to 127. Any status value above 127 overflows, which is why the aggregation showed wrong keys. After I changed the type to short, the results were correct again. (For HTTP status codes, short is large enough. integer, long, or the default string type would also work; the choice only affects how much storage the field uses.)
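The overflow can be reproduced outside ES. Narrowing an int to a signed 8-bit value keeps only the low 8 bits, which is what Java's (byte) cast does. The sketch below applies that cast to some common status codes. The class and method names are made up for illustration, and which status code produced which bucket is inferred from the arithmetic, not stated in the original post.

```java
public class StatusByteOverflow {
    /** Narrow an HTTP status code to a signed 8-bit value, keeping only the low 8 bits. */
    static byte asByte(int status) {
        return (byte) status;
    }

    public static void main(String[] args) {
        // Common HTTP status codes, all of them above the byte maximum of 127
        int[] statuses = {200, 206, 301, 302, 304, 400, 403, 404};
        for (int status : statuses) {
            System.out.println(status + " -> " + asByte(status));
        }
    }
}
```

Under this reading, the bucket keys above map back as follows: -56 is 200, 46 is 302, 45 is 301, 48 is 304, -108 is 404, -50 is 206, -109 is 403, and -112 is 400.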

