Logstash 5.X: fixing the 8-hour timestamp offset
filter {
  # Write a copy of @timestamp shifted forward 8 hours (UTC -> UTC+8) into a temporary field
  ruby {
    code => "event.set('timestamp', event.get('@timestamp').time.localtime + 8*60*60)"
  }
  # Overwrite @timestamp with the shifted value
  ruby {
    code => "event.set('@timestamp', event.get('timestamp'))"
  }
  # Drop the temporary field
  mutate {
    remove_field => ["timestamp"]
  }
}
Elastic Daily, Issue 289 (2018-06-01)
1. https://elasticsearch.cn/article/648
2. Elasticsearch + Canal: building a real-time search system at the tens-of-millions scale
http://t.cn/R8vjBwD
3. [Offline event] Agenda for the 2018-06-30 Nanjing Elastic Meetup
https://elasticsearch.cn/article/647
Editor: 铭毅天下
Archive: https://elasticsearch.cn/article/649
Subscribe: https://tinyletter.com/elastic-daily
How to use Elasticsearch snapshots for backup
Most databases provide a backup mechanism so that, if the database becomes unusable, you can spin up a new instance and restore data from the backup to limit the damage. Although Elasticsearch has good fault tolerance, it still needs a backup mechanism, for the following reasons.
- Disaster recovery. When the whole cluster stops working, data can be restored from a backup in time.
- Archiving. As data accumulates (log data, for example), storage pressure on the cluster keeps growing, in memory as well as on disk. At that point we often keep only recent data, say the last month, and delete anything older. If you do not want to delete that older data in case you need it later, you can archive it in the form of a backup.
- Migration. When you need to move data from one cluster to another, backups are one way to do it.
There are two ways to back up Elasticsearch. The first is to export the data as text files, for example using tools such as elasticdump or esm to dump the data stored in Elasticsearch into files. The second is to snapshot the files in the Elasticsearch data directory, which is what the Elasticsearch snapshot API implements. The first approach is relatively simple and practical for small data volumes, but its efficiency drops sharply at large scale. Today we focus on the second approach: using the snapshot API.
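For reference, a minimal sketch of the file-export approach, assuming elasticdump has been installed via npm and the cluster is reachable at localhost:9200 (my_index is a placeholder index name):
# dump the mapping, then the documents, of one index into local JSON files
elasticdump --input=http://localhost:9200/my_index --output=my_index_mapping.json --type=mapping
elasticdump --input=http://localhost:9200/my_index --output=my_index_data.json --type=data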
A backup strategy has to answer four questions: where to store backups, how to take them, when to take them, and how to restore them. Let's address these one by one.
1. Where to store backups
In Elasticsearch, a repository defines the storage type and location for backups. Supported storage types include shared file systems, AWS S3, HDFS, Microsoft Azure storage, Google Cloud Storage and so on; you can also write your own plugin for something like Alibaba Cloud storage. Here we use the simplest option, a shared file system, which you can also try locally.
First, declare the paths that may be used for backups via path.repo in elasticsearch.yml, as shown below:
path.repo: ["/mount/backups", "/mount/longterm_backups"]
Once that is configured, you can use the snapshot API to create a repository. Below we create a repository named my_backup.
PUT /_snapshot/my_backup
{
  "type": "fs",
  "settings": {
    "location": "/mount/backups/my_backup"
  }
}
After that, we can store backups in this repository.
2. How to take a backup
With a repository in place, we can take a backup, also called a snapshot, which records the state of the data at this moment. Below we create a snapshot named snapshot_1.
PUT /_snapshot/my_backup/snapshot_1?wait_for_completion=true
Setting wait_for_completion to true makes the API return only after the backup has finished; by default it runs asynchronously. We set it here so we can see the result immediately; in production you can omit it and let the snapshot run in the background.
On success it returns a result like the following, describing the backup:
{
  "snapshots": [
    {
      "snapshot": "snapshot_1",
      "uuid": "52Lr4aFuQYGjMEv5ZFeFEg",
      "version_id": 6030099,
      "version": "6.3.0",
      "indices": [
        ".monitoring-kibana-6-2018.05.30",
        ".monitoring-es-6-2018.05.28",
        ".watcher-history-7-2018.05.30",
        ".monitoring-beats-6-2018.05.29",
        "metricbeat-6.2.4-2018.05.28",
        ".monitoring-alerts-6",
        "metricbeat-6.2.4-2018.05.30"
      ],
      "include_global_state": true,
      "state": "SUCCESS",
      "start_time": "2018-05-31T12:45:57.492Z",
      "start_time_in_millis": 1527770757492,
      "end_time": "2018-05-31T12:46:15.214Z",
      "end_time_in_millis": 1527770775214,
      "duration_in_millis": 17722,
      "failures": [],
      "shards": {
        "total": 28,
        "failed": 0,
        "successful": 28
      }
    }
  ]
}
The fields in the response are fairly self-explanatory: indices lists the indices included in this backup (since we did not specify any indices, all of them were backed up); state shows the status; duration_in_millis shows how long the backup task took; and so on.
We can check the state of snapshot_1 with:
GET _snapshot/my_backup/snapshot_1
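If a snapshot is still running, the _status endpoint reports per-shard progress, and _all lists every snapshot stored in the repository:
GET /_snapshot/my_backup/snapshot_1/_status
GET /_snapshot/my_backup/_all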
If you now look inside /mount/backups/my_backup you will find many new files. They are compressed backup files generated from the files in the Elasticsearch data directory. Run du -sh . in that directory to note its size for comparison later.
3. When to take backups
With the steps above we have created one backup, but as new data arrives we need to back that up too. How? Simply create another snapshot, snapshot_2.
PUT /_snapshot/my_backup/snapshot_2?wait_for_completion=true
When it finishes, you will see that /mount/backups/my_backup has grown, which means the new data has been backed up. Note that when you take multiple snapshots in the same repository, Elasticsearch checks whether the segment files to be backed up have changed; unchanged segments are skipped and only changed segment files are copied. In effect this gives you incremental backups.
Experienced Elasticsearch users will know the force merge feature, which forcibly merges an index's segment files down to a specified number. Be aware that if you call the force merge API yourself, the incremental nature of snapshots is lost for that index, because after the call all segment files in the data directory have changed.
The other question is timing. Although snapshots do not consume much CPU, disk or network, it is still advisable to take them during off-peak hours.
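As an illustration of scheduling, here is a minimal sketch using cron and curl, assuming the node listens on localhost:9200 and using a date-stamped snapshot name (remember that % must be escaped inside a crontab):
# take a snapshot named by date every night at 01:00
0 1 * * * curl -s -XPUT "localhost:9200/_snapshot/my_backup/snapshot-$(date +\%Y.\%m.\%d)"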
4. How to restore
As the saying goes, you train an army for a thousand days to use it for a single moment: we should rehearse with our backups and actually restore from them. The following API call performs a restore quickly.
POST /_snapshot/my_backup/snapshot_1/_restore?wait_for_completion=true
{
  "indices": "index_1",
  "rename_pattern": "index_1",
  "rename_replacement": "restored_index_1"
}
With this call, the index_1 index is restored as restored_index_1. The restore is entirely file-based, so it is quite efficient.
Although we demonstrated backup and restore within the same cluster, you can also attach the same repository to another cluster and restore there. We will not go into the details here.
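As a minimal sketch of that cross-cluster setup, the target cluster registers the same shared location, ideally marked read-only so it cannot modify the backups (readonly is a supported fs repository setting):
PUT /_snapshot/my_backup
{
  "type": "fs",
  "settings": {
    "location": "/mount/backups/my_backup",
    "readonly": true
  }
}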
5. Other notes
Because Elasticsearch releases move quickly, pay attention to versions when backing up and restoring. Backup and restore within the same major version works fine, for example between 5.1 and 5.6 in either direction. You cannot restore a backup taken on a higher version into a lower version, for example restoring a 6.x backup into 5.x. Restoring a lower-version backup into a higher version is possible with constraints:
1) 5.x backups can be restored into 6.x
2) 2.x backups can be restored into 5.x
3) 1.x backups can be restored into 2.x
Skipping more than one major version does not work; a 1.x backup, for example, cannot be restored into 5.x. The root cause is Lucene versioning: every Elasticsearch major release ships with a new Lucene major version, and while Lucene tries to stay backward compatible (new versions can read files written by older ones), compatibility inevitably breaks when the version gap is too wide.
6. Further reading
This post is only a quick demonstration of the snapshot feature; hopefully it is enough to spark your interest. To dig deeper, for example how to back up only selected indices, how to check backup and restore progress, how to restore across clusters, or how to back up to HDFS, read the official manual at https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html. If you run into problems, feel free to leave a comment.
[Offline event] Agenda for the 2018-06-30 Nanjing Elastic Meetup
Event page: http://meetup.elasticsearch.cn/2018/nanjing.html
Elastic Meetup Nanjing
Host
Elastic Chinese Community, Trend Micro
Co-organizers
IT DaKaShuo, Alibaba Cloud, OSChina
Time and venue
- Time: June 30, 2018, 13:00 - 18:00
- Venue: Trend Micro China R&D Center, Tower B, Suhao International Plaza, 48 Software Avenue, Yuhua District (near Huashenmiao metro station)
Registration
http://elasticsearch.mikecrm.com/fUqiv0T
Seats are limited, so register soon!
Live stream
Talks
Talk 1: Exploring Elastic's hidden gems
Tags: Elastic Stack
Speaker: 曾勇 (Medcl), Chief Evangelist for Elastic in China. Elasticsearch enthusiast, joined Elastic in 2015, founder of the Elastic Chinese Community and Elastic's first employee in China.
Abstract: The Elastic Stack keeps gaining features. Many of them you may only know by name and may never have had a chance to try, which means you could be missing out on quite a few gems, so let's go exploring. This talk walks through some features of the Elastic Stack that look unremarkable but turn out to be very interesting. It is positioned as a light session rather than hardcore material, but newcomers to Elastic should still take something useful away.
Talk 2: A big data analytics platform built on the ELK stack in practice
Tags: Operations, DevOps
Speaker: 涂海波, Nanjing Yunlilai Co., Ltd. Spent years on billing products in AsiaInfo-Linkage's telecom division before joining Nanjing Yunlilai two years ago.
Abstract: This talk shares hands-on experience with building and operating Elasticsearch clusters, the main problems encountered along the way and the approaches used to solve them, in the hope of helping others avoid some detours when using Elasticsearch.
Talk 3: ElasticLog with ES in CloudEdge
Tags: Ops, AWS, Log
Speaker:
赵伟, Trend Micro CloudEdge team, responsible for big data R&D. Broad technical interests, strong in system design and service tuning, currently focused on the big data space.
Abstract: As Trend Micro's next-generation application security gateway, CloudEdge keeps growing its user base. Facing hundreds of millions of records per day, how do you process and query them quickly? This talk covers CloudEdge's big data architecture, how to build a big data system on AWS, how to serve hot-data queries with Elasticsearch, and many practical lessons learned with Elasticsearch.
Talk 4: Elasticsearch in practice at Huatai Securities
Tags: Financial IT, Big Data, DevOps, Logging
Speaker:
李文强, data platform architect at Huatai Securities. Responsible for the management and architecture of the Hadoop platform and Elasticsearch clusters as well as some project management; currently researching how to deliver an AI platform on Kubernetes.
Abstract: After several years, Elasticsearch has taken root at Huatai Securities and is used by quite a few business lines. One very important application is the log search and analytics system, which collects and analyzes logs from every system, improving both operations efficiency and service quality. Along the way we have kept tuning Elasticsearch so that it runs stably over the long term and keeps the business stable.
Talk 5: Elasticsearch in practice at Suning
Tags: Practice, Big Data, Platformization
Speaker:
韩宝君, head of the ES platform group on Suning's big data platform. Has worked in big data research since 2015; currently responsible for Elasticsearch source-code research and customized development, providing technical support and solutions for Suning businesses that use Elasticsearch.
Abstract: The outline of this talk:
- An overview of Suning's ES platform, typical use cases and scale;
- The road to an ES platform: the evolution path and our thinking along the way;
- Lessons from the field: the problems we hit and how we solved them.
Community Daily, Issue 288 (2018-05-31)
- Seven tips for better Elasticsearch benchmarking. http://t.cn/R16rZXS
- Setting up Elasticsearch and using it for data statistics. http://t.cn/R3eBB2S
- Elasticsearch's master election mechanism. http://t.cn/R3COV8R
Community Daily, Issue 287 (2018-05-30)
- Text annotation with ES. http://t.cn/R1Idxmd
- Image similarity search with ES. http://t.cn/Rq9AvuD
- The difference between refresh and flush. http://t.cn/R1Idxmg
Editor: bsll
Archive: https://elasticsearch.cn/article/645
Subscribe: https://tinyletter.com/elastic-daily
Community Daily, Issue 286 (2018-05-29)
1. http://t.cn/R1tuJMq
2. Making Spring Boot work with multiple Elasticsearch versions.
http://t.cn/R3VlVu7
3. A detailed getting-started guide to Elasticsearch Learning to Rank.
http://t.cn/R1tu9Nw
Editor: 叮咚光军
Archive: https://elasticsearch.cn/article/644
Subscribe: https://tinyletter.com/elastic-daily
A simple way to install logstash-filter-elasticsearch
The official installation method needs internet access and requires changing the plugin source, which is cumbersome. For the logstash-filter-elasticsearch plugin, the following approach works instead.
Installing the logstash-filter-elasticsearch plugin
1. Download the logstash-filter-elasticsearch archive (logstash-filter-elasticsearch.zip) from its Git repository.
2. Create a plugins directory under the Logstash installation directory and unzip logstash-filter-elasticsearch.zip into it.
3. Add one line to the Gemfile in the Logstash directory:
gem "logstash-filter-elasticsearch", :path => "./plugins/logstash-filter-elasticsearch"
4. Restart Logstash.
This approach works for logstash-filter-elasticsearch, but not for every Logstash plugin.
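Once installed, the filter can be used in a pipeline like any other plugin. A minimal sketch based on the plugin's documented example (the hosts, query and field names are placeholders you will need to adapt to your own data and plugin version):
filter {
  elasticsearch {
    # look up an earlier "start" event with the same operation id
    hosts => ["localhost:9200"]
    query => "type:start AND operation:%{[opid]}"
    # copy @timestamp of the matched document into a new field on this event
    fields => { "@timestamp" => "started" }
  }
}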
Repost: a great article on designing your shard count
Elasticsearch is a very versatile platform that supports a variety of use cases and provides great flexibility around data organisation and replication strategies. This flexibility can however sometimes make it hard to determine up-front how to best organize your data into indices and shards, especially if you are new to the Elastic Stack. While suboptimal choices will not necessarily cause problems when first starting out, they have the potential to cause performance problems as data volumes grow over time. The more data the cluster holds, the more difficult it also becomes to correct the problem, as reindexing of large amounts of data can sometimes be required.
When we come across users that are experiencing performance problems, it is not uncommon that this can be traced back to issues around how data is indexed and number of shards in the cluster. This is especially true for use-cases involving multi-tenancy and/or use of time-based indices. When discussing this with users, either in person at events or meetings or via our forum, some of the most common questions are “How many shards should I have?” and “How large should my shards be?”.
This blog post aims to help you answer these questions and provide practical guidelines for use cases that involve the use of time-based indices, e.g. logging or security analytics, in a single place.
What is a shard?
Before we start, we need to establish some facts and terminology that we will need in later sections.
Data in Elasticsearch is organized into indices. Each index is made up of one or more shards. Each shard is an instance of a Lucene index, which you can think of as a self-contained search engine that indexes and handles queries for a subset of the data in an Elasticsearch cluster.
As data is written to a shard, it is periodically published into new immutable Lucene segments on disk, and it is at this time it becomes available for querying. This is referred to as a refresh. How this works is described in greater detail in Elasticsearch: the Definitive Guide.
As the number of segments grows, these are periodically consolidated into larger segments. This process is referred to as merging. As all segments are immutable, this means that the disk space used will typically fluctuate during indexing, as new, merged segments need to be created before the ones they replace can be deleted. Merging can be quite resource intensive, especially with respect to disk I/O.
The shard is the unit at which Elasticsearch distributes data around the cluster. The speed at which Elasticsearch can move shards around when rebalancing data, e.g. following a failure, will depend on the size and number of shards as well as network and disk performance.
TIP: Avoid having very large shards as this can negatively affect the cluster's ability to recover from failure. There is no fixed limit on how large shards can be, but a shard size of 50GB is often quoted as a limit that has been seen to work for a variety of use-cases.
Index by retention period
As segments are immutable, updating a document requires Elasticsearch to first find the existing document, then mark it as deleted and add the updated version. Deleting a document also requires the document to be found and marked as deleted. For this reason, deleted documents will continue to tie up disk space and some system resources until they are merged out, which can consume a lot of system resources.
Elasticsearch allows complete indices to be deleted very efficiently directly from the file system, without explicitly having to delete all records individually. This is by far the most efficient way to delete data from Elasticsearch.
TIP: Try to use time-based indices for managing data retention whenever possible. Group data into indices based on the retention period. Time-based indices also make it easy to vary the number of primary shards and replicas over time, as this can be changed for the next index to be generated. This simplifies adapting to changing data volumes and requirements.
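For example, with daily indices, dropping a whole month of expired logs is a single call. A sketch assuming indices named logstash-YYYY.MM.dd and wildcard deletes permitted by the cluster settings:
DELETE /logstash-2018.04.*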
Are indices and shards not free?
For each Elasticsearch index, information about mappings and state is stored in the cluster state. This is kept in memory for fast access. Having a large number of indices in a cluster can therefore result in a large cluster state, especially if mappings are large. This can become slow to update as all updates need to be done through a single thread in order to guarantee consistency before the changes are distributed across the cluster.
TIP: In order to reduce the number of indices and avoid large and sprawling mappings, consider storing data with similar structure in the same index rather than splitting into separate indices based on where the data comes from. It is important to find a good balance between the number of indices and the mapping size for each individual index.
Each shard has data that need to be kept in memory and use heap space. This includes data structures holding information at the shard level, but also at the segment level in order to define where data reside on disk. The size of these data structures is not fixed and will vary depending on the use-case.
One important characteristic of this segment-related overhead, however, is that it is not strictly proportional to the size of the segment. This means that larger segments have less overhead per data volume compared to smaller segments. The difference can be substantial.
In order to be able to store as much data as possible per node, it becomes important to manage heap usage and reduce the amount of overhead as much as possible. The more heap space a node has, the more data and shards it can handle.
Indices and shards are therefore not free from a cluster perspective, as there is some level of resource overhead for each index and shard.
TIP: Small shards result in small segments, which increases overhead. Aim to keep the average shard size between a few GB and a few tens of GB. For use-cases with time-based data, it is common to see shards between 20GB and 40GB in size.
TIP: As the overhead per shard depends on the segment count and size, forcing smaller segments to merge into larger ones through a forcemerge operation can reduce overhead and improve query performance. This should ideally be done once no more data is written to the index. Be aware that this is an expensive operation that should ideally be performed during off-peak hours.
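A minimal sketch of such a call against an index that is no longer written to (the index name is a placeholder):
POST /logstash-2018.04.30/_forcemerge?max_num_segments=1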
TIP: The number of shards you can hold on a node will be proportional to the amount of heap you have available, but there is no fixed limit enforced by Elasticsearch. A good rule-of-thumb is to ensure you keep the number of shards per node below 20 to 25 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600-750 shards, but the further below this limit you can keep it the better. This will generally help the cluster stay in good health.
How does shard size affect performance?
In Elasticsearch, each query is executed in a single thread per shard. Multiple shards can however be processed in parallel, as can multiple queries and aggregations against the same shard.
This means that the minimum query latency, when no caching is involved, will depend on the data, the type of query, as well as the size of the shard. Querying lots of small shards will make the processing per shard faster, but as many more tasks need to be queued up and processed in sequence, it is not necessarily going to be faster than querying a smaller number of larger shards. Having lots of small shards can also reduce the query throughput if there are multiple concurrent queries.
TIP: The best way to determine the maximum shard size from a query performance perspective is to benchmark using realistic data and queries. Always benchmark with a query and indexing load representative of what the node would need to handle in production, as optimizing for a single query might give misleading results.
How do I manage shard size?
When using time-based indices, each index has traditionally been associated with a fixed time period. Daily indices are very common, and often used for holding data with short retention period or large daily volumes. These allow retention period to be managed with good granularity and make it easy to adjust for changing volumes on a daily basis. Data with a longer retention period, especially if the daily volumes do not warrant the use of daily indices, often use weekly or monthly indices in order to keep the shard size up. This reduces the number of indices and shards that need to be stored in the cluster over time.
TIP: If using time-based indices covering a fixed period, adjust the period each index covers based on the retention period and expected data volumes in order to reach the target shard size.
Time-based indices with a fixed time interval work well when data volumes are reasonably predictable and change slowly. If the indexing rate can vary quickly, it is very difficult to maintain a uniform target shard size.
In order to be able to better handle this type of scenario, the Rollover and Shrink APIs were introduced. These add a lot of flexibility to how indices and shards are managed, specifically for time-based indices.
The rollover index API makes it possible to specify the number of documents an index should contain and/or the maximum period documents should be written to it. Once one of these criteria has been exceeded, Elasticsearch can trigger a new index to be created for writing without downtime. Instead of having each index cover a specific time-period, it is now possible to switch to a new index at a specific size, which makes it possible to more easily achieve an even shard size for all indices.
In cases where data might be updated, there is no longer a distinct link between the timestamp of the event and the index it resides in when using this API, which may make updates significantly less efficient as each update may need to be preceded by a search.
TIP: If you have time-based, immutable data where volumes can vary significantly over time, consider using the rollover index API to achieve an optimal target shard size by dynamically varying the time-period each index covers. This gives great flexibility and can help avoid having too large or too small shards when volumes are unpredictable.
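A minimal sketch of the rollover pattern, assuming a write alias logs_write pointing at an initial index logs-000001 (names and thresholds are placeholders):
PUT /logs-000001
{
  "aliases": { "logs_write": {} }
}

POST /logs_write/_rollover
{
  "conditions": {
    "max_age": "7d",
    "max_docs": 100000000
  }
}
When either condition is met, a new index (logs-000002) is created and the alias switches to it.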
The shrink index API allows you to shrink an existing index into a new index with fewer primary shards. If an even spread of shards across nodes is desired during indexing, but this will result in too small shards, this API can be used to reduce the number of primary shards once the index is no longer indexed into. This will result in larger shards, better suited for longer term storage of data.
TIP: If you need to have each index cover a specific time period but still want to be able to spread indexing out across a large number of nodes, consider using the shrink API to reduce the number of primary shards once the index is no longer indexed into. This API can also be used to reduce the number of shards in case you have initially configured too many shards.
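A minimal sketch of the shrink flow (node and index names are placeholders; the target shard count must be a factor of the original): first make the index read-only and collocate a complete copy of its shards on one node, then shrink it:
PUT /logs-000001/_settings
{
  "settings": {
    "index.routing.allocation.require._name": "shrink-node-1",
    "index.blocks.write": true
  }
}

POST /logs-000001/_shrink/logs-000001-shrunk
{
  "settings": {
    "index.number_of_shards": 1
  }
}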
Conclusions
This blog post has provided tips and practical guidelines around how to best manage data in Elasticsearch. If you are interested in learning more, "Elasticsearch: the definitive guide" contains a section about designing for scale, which is well worth reading even though it is a bit old.
A lot of the decisions around how to best distribute your data across indices and shards will however depend on the use-case specifics, and it can sometimes be hard to determine how to best apply the advice available. For more in-depth and personal advice you can engage with us commercially through a subscription and let our Support and Consulting teams help accelerate your project. If you are happy to discuss your use-case in the open, you can also get help from our community and through our public forum.
Community Daily, Issue 285 (2018-05-28)
1. http://t.cn/R1G3z3Q
2. Qunar's ELK security monitoring center: pitfalls and practice
http://t.cn/R1qhAYL
3. Elasticsearch internals: the write path
http://t.cn/R1q5Y5u
Editor: cyberdak
Archive: https://elasticsearch.cn/article/641
Subscribe: https://tinyletter.com/elastic-daily
Community Daily, Issue 284 (2018-05-27)
1. http://t.cn/R12Q3zm
2. An open-source, X/MIT-licensed translator library for raster and vector geospatial data formats.
http://t.cn/R128rZU
3. (VPN required) About data.
http://t.cn/R12ETtD
Editor: 至尊宝
Archive: https://elasticsearch.cn/article/640
Subscribe: https://tinyletter.com/elastic-daily
Community Daily, Issue 283 (2018-05-26)
- Postmark's experience with Curator. http://t.cn/R1wYJxL
- kreuzwerker's experience migrating data from SQL Server to ES. http://t.cn/R1wYJxZ
- Drawing regions and coordinates on custom base maps in Kibana. http://t.cn/R1wYJx2
Community Daily, Issue 282 (2018-05-25)
1. http://t.cn/R17PZJv
2. An overview of Elasticsearch architecture and source code
http://t.cn/R17PGhf
3. Image retrieval with Elasticsearch in practice
http://t.cn/R17PVoX
Editor: 铭毅天下
Archive: https://elasticsearch.cn/article/638
Subscribe: https://tinyletter.com/elastic-daily
[Tech discussion] Elasticsearch on the rise: a comparison of several open-source data engines
Elasticsearch has charged into the top ten and is a rising star.
Many of us regularly face the question of which data engine to pick. Here we compare several popular open-source data engines for reference.
From the comparison:
MySQL: stores data in separate tables and is queried with SQL; currently a very popular relational database management system. If you need a traditional database, MySQL is a solid choice.
MongoDB: a NoSQL database built on distributed file storage, whose data structures are made up of key-value pairs, and which scales well. If you need to store and query unstructured data, with little need for analytics or full-text search, MongoDB is a good choice.
Redis: a high-performance key-value database with a degree of transaction support. It is mainly used as a cache and is a good fit when reads and writes must be extremely fast. Because it relies entirely on memory for speed, hardware requirements become demanding at large data volumes.
Elasticsearch: a distributed search engine built on Lucene, offering full-text search, synonym handling, relevance ranking and complex analytics. If you need text retrieval with relevance ranking, or metrics-style analytics, Elasticsearch fits very well. It has a natural edge in full-text document search (website search, in-app search) and log analytics (operations and DevOps).
This post is only a starting point; feel free to comment on which scenarios suit each data engine.
If you would like to try Elasticsearch quickly, see the link below.
Huawei Cloud Search Service is Elasticsearch in the cloud: easy to use, worry-free to operate, elastic and flexible, with reliable data. You are welcome to give it a try.
Community Daily, Issue 281 (2018-05-24)
- Big news: the Kibana Chinese manual has been released. http://t.cn/R3eoVvc
- How to implement SQL GROUP BY and LIMIT semantics in Elasticsearch. http://t.cn/R3k85NN
- Using Elasticsearch with Laravel. http://t.cn/R3k8V48