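Implementing Same-Paragraph and Same-Sentence Search with Elasticsearch
Elasticsearch • trycatchfinal posted an article • 4 comments • 6483 views • 2020-03-07 10:38
Same-sentence search means that, when searching for several keywords, the returned articles must not only contain the keywords: the keywords must also appear in the same sentence.
Same-paragraph search is analogous, except that the scope is a single paragraph.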
SpanQuery
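I found no way to implement same-paragraph or same-sentence search with the commonly used term and match queries.
Elasticsearch provides SpanQuery, which the official documentation introduces as follows: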
Span queries are low-level positional queries which provide expert control over the order and proximity of the specified terms. These are typically used to implement very specific queries on legal documents or patents.
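As mentioned above, SpanQuery is typically used for specialized search over legal documents or patents, and these domains often offer same-paragraph and same-sentence search.
Let's look at three types of SpanQuery and see whether they can meet our requirement:
Preparing the data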
```
PUT article

POST article/_mapping
{
  "properties": {
    "maincontent": {
      "type": "text"
    }
  }
}

POST article/_doc/1
{
  "maincontent": "the quick red fox jumps over the sleepy cat"
}

POST article/_doc/2
{
  "maincontent": "the quick brown fox jumps over the lazy dog"
}
```
SpanTermQuery
SpanTermQuery is similar to Term Query. The query below returns the doc with _id 1.
the quick red fox jumps over the sleepy cat
```
POST article/_search
{
  "profile": "true",
  "query": {
    "span_term": {
      "maincontent": {
        "value": "red"
      }
    }
  }
}
```
SpanNearQuery
SpanNearQuery performs a proximity search: it checks whether several terms occur near each other. slop sets the maximum allowed distance; with slop set to 0 the terms must be adjacent, which (when in_order is true) is equivalent to match_phrase.
The in_order parameter requires the terms to appear in the document in the same order as in the query.
```
POST article/_search
{
  "query": {
    "span_near": {
      "clauses": [
        {
          "span_term": {
            "maincontent": {
              "value": "quick"
            }
          }
        },
        {
          "span_term": {
            "maincontent": {
              "value": "brown"
            }
          }
        }
      ],
      "slop": 0,
      "in_order": true
    }
  }
}
```
The query above returns the doc with _id 2.
the quick brown fox jumps over the lazy dog
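To make the roles of slop and in_order concrete, here is a toy Python model of the matching rule for two single-term clauses. It is only a sketch that works on whitespace-split tokens, not Lucene's actual implementation, and the function name span_near_match is made up for illustration.

```python
from typing import List


def span_near_match(tokens: List[str], first: str, second: str,
                    slop: int, in_order: bool) -> bool:
    """Toy model of span_near for two single-term clauses.

    Matches when at most `slop` other tokens sit between the two terms.
    With in_order=True, `first` must appear before `second`.
    """
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    for p1 in first_positions:
        for p2 in second_positions:
            if p1 == p2:
                continue
            if in_order and p2 < p1:
                continue
            gap = abs(p2 - p1) - 1  # tokens strictly between the two terms
            if gap <= slop:
                return True
    return False


doc1 = "the quick red fox jumps over the sleepy cat".split()
doc2 = "the quick brown fox jumps over the lazy dog".split()

print(span_near_match(doc2, "quick", "brown", slop=0, in_order=True))   # True
print(span_near_match(doc1, "quick", "brown", slop=0, in_order=True))   # False
print(span_near_match(doc2, "brown", "quick", slop=0, in_order=True))   # False
print(span_near_match(doc2, "brown", "quick", slop=0, in_order=False))  # True
```

Only doc 2 contains quick immediately followed by brown, which mirrors the result of the span_near query above.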
SpanNotQuery
SpanNotQuery is the key piece: it requires that the spans of two SpanQueries do not overlap.
Consider the following example:
- include: the SpanQuery to match. In the example, this is a proximity search that must contain both quick and fox.
- exclude: another SpanQuery. A span matched by include must not contain a match of this SpanQuery.
```
POST article/_search
{
  "query": {
    "span_not": {
      "include": {
        "span_near": {
          "clauses": [
            {
              "span_term": {
                "maincontent": {
                  "value": "quick"
                }
              }
            },
            {
              "span_term": {
                "maincontent": {
                  "value": "fox"
                }
              }
            }
          ],
          "slop": 1,
          "in_order": true
        }
      },
      "exclude": {
        "span_term": {
          "maincontent": {
            "value": "red"
          }
        }
      }
    }
  }
}
```
The query above returns the doc with _id 2.
The doc with _id 1 is excluded: although quick red fox satisfies the include SpanQuery, red also matches the exclude SpanQuery and lies inside that span.
the quick red fox jumps over the sleepy cat
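The include/exclude logic can be sketched in a few lines of Python. This is a simplified model under stated assumptions (whitespace tokens, two ordered single-term clauses, a single-term exclude); near_spans and span_not_match are hypothetical helper names, not part of any Elasticsearch client.

```python
from typing import List, Tuple


def near_spans(tokens: List[str], first: str, second: str,
               slop: int) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) positions where `first` is followed by
    `second` with at most `slop` tokens in between (in_order=True)."""
    spans = []
    for i, token in enumerate(tokens):
        if token != first:
            continue
        for j in range(i + 1, min(i + slop + 2, len(tokens))):
            if tokens[j] == second:
                spans.append((i, j))
    return spans


def span_not_match(tokens: List[str], first: str, second: str,
                   slop: int, exclude: str) -> bool:
    """Toy span_not: an include span counts only if no `exclude` token
    lies inside it."""
    for start, end in near_spans(tokens, first, second, slop):
        if exclude not in tokens[start:end + 1]:
            return True
    return False


doc1 = "the quick red fox jumps over the sleepy cat".split()
doc2 = "the quick brown fox jumps over the lazy dog".split()

print(span_not_match(doc1, "quick", "fox", slop=1, exclude="red"))  # False
print(span_not_match(doc2, "quick", "fox", slop=1, exclude="red"))  # True
```

In doc 1 the only quick ... fox span covers red, so it is rejected; in doc 2 the span covers brown instead and survives.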
How same-sentence / same-paragraph search works
Put the other way around, same-sentence search means the search terms must not cross a sentence boundary. Going one step further, there must be no sentence-ending punctuation such as 。, ?, or ! between the search terms.
The corresponding query looks roughly like this:
```
POST article/_search
{
  "query": {
    "span_not": {
      "include": {
        "span_near": {
          "clauses": [
            {
              "span_term": {
                "maincontent": {
                  "value": "word1"
                }
              }
            },
            {
              "span_term": {
                "maincontent": {
                  "value": "word2"
                }
              }
            }
          ],
          "slop": 1,
          "in_order": true
        }
      },
      "exclude": {
        "span_term": {
          "maincontent": {
            "value": "。/?/!"
          }
        }
      }
    }
  }
}
```
Same-paragraph search works the same way, with the separator changed to \n, or to <p> and </p>.
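Since the query shape is the same for both scopes, it can be generated by a small helper. The sketch below is hypothetical: build_same_scope_query is a name invented here, and it assumes the boundary has been indexed as one single marker token (for example a token literally named sentence), because a span_term matches exactly one term rather than a set of punctuation marks.

```python
import json
from typing import Dict, List


def build_same_scope_query(field: str, terms: List[str], separator: str,
                           slop: int, in_order: bool) -> Dict:
    """Build a span_not request body: all `terms` near each other, with no
    `separator` token inside the matched span."""
    if len(terms) < 2:
        raise ValueError("need at least two terms")
    clauses = [{"span_term": {field: {"value": term}}} for term in terms]
    return {
        "query": {
            "span_not": {
                "include": {
                    "span_near": {
                        "clauses": clauses,
                        "slop": slop,
                        "in_order": in_order,
                    }
                },
                "exclude": {"span_term": {field: {"value": separator}}},
            }
        }
    }


# "sentence" is an assumed marker token, not something the index has by default.
body = build_same_scope_query("maincontent", ["word1", "word2"],
                              separator="sentence", slop=1, in_order=True)
print(json.dumps(body, indent=2, ensure_ascii=False))
```

The printed body can be sent as the payload of a _search request; switching the separator argument is all it takes to change the scope from sentence to paragraph.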
Implementing same-paragraph / same-sentence search
Text in HTML format
Create the index
```
PUT sample1
{
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "maincontent_analyzer": {
          "type": "custom",
          "char_filter": [
            "sentence_paragrah_mapping",
            "html_strip"
          ],
          "tokenizer": "ik_max_word"
        }
      },
      "char_filter": {
        "sentence_paragrah_mapping": {
          "type": "mapping",
          "mappings": [
            """<h1> => \u0020paragraph\u0020""",
            """</h1> => \u0020sentence\u0020paragraph\u0020 """,
            """<h2> => \u0020paragraph\u0020""",
            """</h2> => \u0020sentence\u0020paragraph\u0020 """,
            """<p> => \u0020paragraph\u0020""",
            """</p> => \u0020sentence\u0020paragraph\u0020 """,
            """! => \u0020sentence\u0020 """,
            """? => \u0020sentence\u0020 """,
            """。 => \u0020sentence\u0020 """,
            """? => \u0020sentence\u0020 """,
            """! => \u0020sentence\u0020"""
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "mainContent": {
        "type": "text",
        "analyzer": "maincontent_analyzer",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
```
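We created a char filter named sentence_paragrah_mapping. It serves two purposes:
- replace the p, h1, and h2 tags with a unified paragraph separator: paragraph;
- replace the Chinese and English !, ?, and 。 punctuation marks with a unified sentence separator: sentence.
A few details need explaining:
- paragraph and sentence must be surrounded by spaces, and the space has to be written as the Unicode escape \u0020.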
```
Expected:
hello world! => hello world sentence
With a misconfigured mapping (no surrounding spaces), you may get:
hello world! => hello worldsentence
```
- The closing tags </p>, </h1>, and </h2> need both separators, sentence and paragraph, to cover text that ends without a punctuation mark.
```
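# Expected
<p>hello world</p><p>hello china</p>
=> paragraph hello world sentence paragraph hello china sentence

# Result when </p>, </h1>, </h2> are replaced with paragraph only
# Here "hello world" and "hello china" end up in the same sentence
<p>hello world</p><p>hello china</p>
=> paragraph hello world paragraph hello china sentence

# The configuration above is slightly redundant: it produces two consecutive paragraph tokens
# If the HTML is guaranteed to be well-formed, you can map only </p>, </h1>, </h2> and leave <p>, <h1>, <h2> unmapped
<p>hello world</p><p>hello china</p>
=> paragraph hello world sentence paragraph paragraph hello china sentence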
```
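- Mind the order of sentence_paragrah_mapping and html_strip: char filters run in the order listed, so the mapping must come first. Otherwise html_strip removes the tags before they can be turned into separators.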
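The effect of the filter order can be reproduced with a rough Python stand-in for the two char filters. This is only an approximation under stated assumptions: the regex-based strip_tags is not the real html_strip, whitespace splitting replaces the ik_max_word tokenizer, only a subset of the mappings is included, and all function names are invented for this sketch.

```python
import re
from typing import List

# Subset of the mappings from sentence_paragrah_mapping (\u0020 is a space).
MAPPINGS = {
    "<p>": " paragraph ",
    "</p>": " sentence paragraph ",
    "!": " sentence ",
    "?": " sentence ",
    "。": " sentence ",
}

# Single pass over the input, longest key first, so that text produced by a
# replacement is never mapped again.
_MAPPING_RE = re.compile("|".join(
    re.escape(key) for key in sorted(MAPPINGS, key=len, reverse=True)))
_TAG_RE = re.compile(r"<[^>]+>")


def mapping_filter(text: str) -> str:
    """Replace every mapped key with its separator text."""
    return _MAPPING_RE.sub(lambda m: MAPPINGS[m.group(0)], text)


def strip_tags(text: str) -> str:
    """Very rough stand-in for html_strip: replace any <...> tag with a space."""
    return _TAG_RE.sub(" ", text)


def analyze(text: str, mapping_first: bool) -> List[str]:
    """Apply the two filters in the chosen order, then split on whitespace."""
    if mapping_first:
        text = strip_tags(mapping_filter(text))
    else:
        text = mapping_filter(strip_tags(text))
    return text.split()


html = "<p>hello world</p><p>hello china!</p>"
print(analyze(html, mapping_first=True))
print(analyze(html, mapping_first=False))
```

With the mapping first, the token stream keeps its paragraph and sentence markers. With the order reversed, every paragraph marker is gone and only the sentence marker produced by the exclamation mark remains, so same-paragraph search would have nothing to exclude on.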
Insert test data
```
POST sample1/_doc/1
{
  "mainContent": "<p>java python javascript</p><p>oracle mysql sqlserver</p>"
}
```
Test the analyzer
```
POST sample1/_analyze
{
  "text": ["<p>java python javascript</p><p>oracle mysql sqlserver</p>"],
  "analyzer": "maincontent_analyzer"
}
```
Result
```
{
"tokens" : [
{
"token" : "paragraph",
"start_offset" : 1,
"end_offset" : 2,
"type" : "ENGLISH",
"position" : 0
},
{
"token" : "java",
"start_offset" : 3,
"end_offset" : 7,
"type" : "ENGLISH",
"position" : 1
},
{
"token" : "python",
"start_offset" : 8,
"end_offset" : 14,
"type" : "ENGLISH",
"position" : 2
},
{
"token" : "javascript",
"start_offset" : 15,
"end_offset" : 25,
"type" : "ENGLISH",
"position" : 3
},
{
"token" : "sentence",
"start_offset" : 26,
"end_offset" : 28,
"type" : "ENGLISH",
"position" : 4
},
{
"token" : "paragraph",
"start_offset" : 28,
"end_offset" : 28,
"type" : "ENGLISH",
"position" : 5
},
{
"token" : "paragraph",
"start_offset" : 30,
"end_offset" : 31,
"type" : "ENGLISH",
"position" : 6
},
{
"token" : "oracle",
"start_offset" : 32,
"end_offset" : 38,
"type" : "ENGLISH",
"position" : 7
},
{
"token" : "mysql",
"start_offset" : 39,
"end_offset" : 44,
"type" : "ENGLISH",
"position" : 8
},
{
"token" : "sqlserver",
"start_offset" : 45,
"end_offset" : 54,
"type" : "ENGLISH",
"position" : 9
},
{
"token" : "sentence",
"start_offset" : 55,
"end_offset" : 57,
"type" : "ENGLISH",
"position" : 10
},
{
"token" : "paragraph",
"start_offset" : 57,
"end_offset" : 57,
"type" : "ENGLISH",
"position" : 11
}
]
}
```
Test queries
- Same-paragraph query: java python
```
GET sample1/_search
{
  "query": {
    "span_not": {
      "include": {
        "span_near": {
          "clauses": [
            {
              "span_term": {
                "mainContent": {
                  "value": "java"
                }
              }
            },
            {
              "span_term": {
                "mainContent": {
                  "value": "python"
                }
              }
            }
          ],
          "slop": 12,
          "in_order": false
        }
      },
      "exclude": {
        "span_term": {
          "mainContent": {
            "value": "paragraph"
          }
        }
      }
    }
  }
}
```
Result:
```
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.1655603,
    "hits" : [
      {
        "_index" : "sample1",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.1655603,
        "_source" : {
          "mainContent" : "<p>java python javascript</p><p>oracle mysql sqlserver</p>"
        }
      }
    ]
  }
}
```
- Same-paragraph query: java oracle
```
GET sample1/_search
{
"query": {
"span_not": {
"include": {
"span_near": {
"clauses": [
{
"span_term": {
"mainContent": {
"value": "java"
}
}
},
{
"span_term": {
"mainContent": {
"value": "oracle"
}
}
}
],
"slop": 12,
"in_order": false
}
},
"exclude": {
"span_term": {
"mainContent": {
"value": "paragraph"
}
}
}
}
}
}
```
Result: no documents are returned.
```
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
```
Plain-text format
Plain text differs from HTML only in the paragraph separator, which is \n.
Create the index
```
PUT sample2
{
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "maincontent_analyzer": {
          "type": "custom",
          "char_filter": [
            "sentence_paragrah_mapping"
          ],
          "tokenizer": "ik_max_word"
        }
      },
      "char_filter": {
        "sentence_paragrah_mapping": {
          "type": "mapping",
          "mappings": [
            """\n => \u0020sentence\u0020paragraph\u0020 """,
            """! => \u0020sentence\u0020 """,
            """? => \u0020sentence\u0020 """,
            """。 => \u0020sentence\u0020 """,
            """? => \u0020sentence\u0020 """,
            """! => \u0020sentence\u0020"""
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "mainContent": {
        "type": "text",
        "analyzer": "maincontent_analyzer",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
```
Test the analyzer
```
POST sample2/_analyze
{
"text": ["java python javascript\noracle mysql sqlserver"],
"analyzer": "maincontent_analyzer"
}
```
Result:
```
{
"tokens" : [
{
"token" : "java",
"start_offset" : 0,
"end_offset" : 4,
"type" : "ENGLISH",
"position" : 0
},
{
"token" : "python",
"start_offset" : 5,
"end_offset" : 11,
"type" : "ENGLISH",
"position" : 1
},
{
"token" : "javascript",
"start_offset" : 12,
"end_offset" : 22,
"type" : "ENGLISH",
"position" : 2
},
{
"token" : "sentence",
"start_offset" : 22,
"end_offset" : 22,
"type" : "ENGLISH",
"position" : 3
},
{
"token" : "paragraph",
"start_offset" : 22,
"end_offset" : 22,
"type" : "ENGLISH",
"position" : 4
},
{
"token" : "oracle",
"start_offset" : 23,
"end_offset" : 29,
"type" : "ENGLISH",
"position" : 5
},
{
"token" : "mysql",
"start_offset" : 30,
"end_offset" : 35,
"type" : "ENGLISH",
"position" : 6
},
{
"token" : "sqlserver",
"start_offset" : 36,
"end_offset" : 45,
"type" : "ENGLISH",
"position" : 7
}
]
}
```

