1. Simulating the stored string data
localhost:9200/yigo-redist.1/_analyze?analyzer=default&text=全能片(前)---TRW-GDB7891AT刹车片自带报警线,无单独报警线号码,卡仕欧,卡仕欧,乘用车,刹车片
The URL above means:
- the index is `yigo-redist.1`
- the analyzer (`analyzer`) used is `default`, as defined in the index `yigo-redist.1`
- the string to analyze (`text`) is "全能片(前)---TRW-GDB7891AT刹车片自带报警线,无单独报警线号码,卡仕欧,卡仕欧,乘用车,刹车片"
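If you want to reproduce this from a shell, the same call can be made with curl. A minimal sketch, assuming an older Elasticsearch (1.x/2.x era) where `_analyze` still accepts URL parameters; newer versions require a JSON request body instead, which is exactly the error discussed in the comments below:
# -G turns the urlencoded data into query-string parameters of a GET request
curl -G 'localhost:9200/yigo-redist.1/_analyze' \
  --data-urlencode 'analyzer=default' \
  --data-urlencode 'text=全能片(前)---TRW-GDB7891AT刹车片自带报警线,无单独报警线号码,卡仕欧,卡仕欧,乘用车,刹车片'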
The result is:
{
"tokens" : [ {
"token" : "全能",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 1
}, {
"token" : "片",
"start_offset" : 2,
"end_offset" : 3,
"type" : "CN_CHAR",
"position" : 2
}, {
"token" : "前",
"start_offset" : 4,
"end_offset" : 5,
"type" : "CN_CHAR",
"position" : 3
}, {
"token" : "trw-gdb7891at",
"start_offset" : 9,
"end_offset" : 22,
"type" : "LETTER",
"position" : 4
}, {
"token" : "刹车片",
"start_offset" : 22,
"end_offset" : 25,
"type" : "CN_WORD",
"position" : 5
}, {
"token" : "自带",
"start_offset" : 25,
"end_offset" : 27,
"type" : "CN_WORD",
"position" : 6
}, {
"token" : "报警",
"start_offset" : 27,
"end_offset" : 29,
"type" : "CN_WORD",
"position" : 7
}, {
"token" : "线",
"start_offset" : 29,
"end_offset" : 30,
"type" : "CN_CHAR",
"position" : 8
}, {
"token" : "无",
"start_offset" : 31,
"end_offset" : 32,
"type" : "CN_WORD",
"position" : 9
}, {
"token" : "单独",
"start_offset" : 32,
"end_offset" : 34,
"type" : "CN_WORD",
"position" : 10
}, {
"token" : "报警",
"start_offset" : 34,
"end_offset" : 36,
"type" : "CN_WORD",
"position" : 11
}, {
"token" : "线",
"start_offset" : 36,
"end_offset" : 37,
"type" : "CN_CHAR",
"position" : 12
}, {
"token" : "号码",
"start_offset" : 37,
"end_offset" : 39,
"type" : "CN_WORD",
"position" : 13
}, {
"token" : "卡",
"start_offset" : 40,
"end_offset" : 41,
"type" : "CN_CHAR",
"position" : 14
}, {
"token" : "仕",
"start_offset" : 41,
"end_offset" : 42,
"type" : "CN_WORD",
"position" : 15
}, {
"token" : "欧",
"start_offset" : 42,
"end_offset" : 43,
"type" : "CN_WORD",
"position" : 16
}, {
"token" : "卡",
"start_offset" : 44,
"end_offset" : 45,
"type" : "CN_CHAR",
"position" : 17
}, {
"token" : "仕",
"start_offset" : 45,
"end_offset" : 46,
"type" : "CN_WORD",
"position" : 18
}, {
"token" : "欧",
"start_offset" : 46,
"end_offset" : 47,
"type" : "CN_WORD",
"position" : 19
}, {
"token" : "乘用车",
"start_offset" : 48,
"end_offset" : 51,
"type" : "CN_WORD",
"position" : 20
}, {
"token" : "刹车片",
"start_offset" : 52,
"end_offset" : 55,
"type" : "CN_WORD",
"position" : 21
} ]
}
2. Keyword query
localhost:9200/yigo-redist.1/_analyze?analyzer=default_search&text=gdb7891
- the index is `yigo-redist.1`
- the analyzer (`analyzer`) used is `default_search`, as defined in the index `yigo-redist.1`
- the string to analyze (`text`) is "gdb7891"
Returned result:
{
"tokens" : [ {
"token" : "gdb7891",
"start_offset" : 0,
"end_offset" : 7,
"type" : "LETTER",
"position" : 1
} ]
}
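This result explains why a plain match query fails here: at search time `default_search` emits the single token "gdb7891", and no such token was stored in step 1 (the closest stored token is "trw-gdb7891at"). A minimal sketch of such a failing query, assuming the text was indexed into a field named `description` as in the wildcard example in the summary:
# match query analyzed with default_search: the only lookup token is "gdb7891",
# which does not exist in the index, so this returns no hits
curl -XPOST 'localhost:9200/yigo-redist.1/_search' -d '
{ "query": { "match": { "description": { "query": "gdb7891", "analyzer": "default_search" } } } }'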
3. Querying the keyword with the stored (index-time) analyzer
localhost:9200/yigo-redist.1/_analyze?analyzer=default&text=gdb7891
- the index is `yigo-redist.1`
- the analyzer (`analyzer`) used is `default`, as defined in the index `yigo-redist.1`, i.e. the same analyzer used at index time
- the string to analyze (`text`) is "gdb7891"
Returned result:
{
"tokens" : [ {
"token" : "gdb7891",
"start_offset" : 0,
"end_offset" : 7,
"type" : "LETTER",
"position" : 1
}, {
"token" : "",
"start_offset" : 0,
"end_offset" : 7,
"type" : "LETTER",
"position" : 1
}, {
"token" : "gdb7891",
"start_offset" : 0,
"end_offset" : 7,
"type" : "LETTER",
"position" : 1
}, {
"token" : "",
"start_offset" : 0,
"end_offset" : 3,
"type" : "ENGLISH",
"position" : 2
}, {
"token" : "gdb",
"start_offset" : 0,
"end_offset" : 3,
"type" : "ENGLISH",
"position" : 2
}, {
"token" : "gdb",
"start_offset" : 0,
"end_offset" : 3,
"type" : "ENGLISH",
"position" : 2
}, {
"token" : "7891",
"start_offset" : 3,
"end_offset" : 7,
"type" : "ARABIC",
"position" : 3
}, {
"token" : "7891",
"start_offset" : 3,
"end_offset" : 7,
"type" : "ARABIC",
"position" : 3
}, {
"token" : "",
"start_offset" : 3,
"end_offset" : 7,
"type" : "ARABIC",
"position" : 3
} ]
}
Summary
- Step 1 shows that the stored string "全能片(前)---TRW-GDB7891AT刹车片自带报警线,无单独报警线号码,卡仕欧,卡仕欧,乘用车,刹车片" is split into many token fragments, and those fragments are what actually gets stored in the index.
- Step 2 shows that under the search analyzer (`default_search`) the keyword "gdb7891" is not split further; the only queryable fragment is "gdb7891" itself. But no "gdb7891" fragment exists among the tokens produced in step 1 (the closest is "trw-gdb7891at"), so an ordinary match query cannot hit the data indexed in step 1.
- Step 3 shows that with the index-time analyzer, "gdb7891" is split into "gdb", "7891", and so on, and those fragments can reach the data indexed in step 1. But because the keyword itself has been split, the query also matches more documents than intended: anything matching "gdb", "7891", or "gdb7891".
- If you want to retrieve the step-1 data while searching with `default_search`, use a wildcard query; the pattern "*gdb7891*" matches:
{ "query": { "wildcard" : { "description" : "*gdb7891*" } } }
Original post: http://searchkit.cn/article/533
4 comments
May I ask why I can't run your analyze test examples? How should I modify them? I get:
{
"error": {
"root_cause": [
{
"type": "parse_exception",
"reason": "request body or source parameter is required"
}
],
"type": "parse_exception",
"reason": "request body or source parameter is required"
},
"status": 400
}
I was accessing the elasticsearch service directly (I have never used kibana); please test against the elasticsearch service directly.
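That error suggests a newer Elasticsearch version: from roughly 5.x onward, `_analyze` no longer accepts `analyzer` and `text` as URL parameters and expects them in a JSON request body instead ("request body or source parameter is required"). A sketch of the equivalent body-based call:
# body-based form of the _analyze call from step 2;
# the Content-Type header is mandatory from Elasticsearch 6.0 onward
curl -XGET 'localhost:9200/yigo-redist.1/_analyze' -H 'Content-Type: application/json' -d '
{ "analyzer": "default_search", "text": "gdb7891" }'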
"Step 3 shows that with the index-time analyzer, "gdb7891" is split into "gdb", "7891", and so on, and those fragments can reach the data indexed in step 1. But because the keyword itself has been split, the query also matches more documents than intended: anything matching "gdb", "7891", or "gdb7891"."
I don't quite follow this. Isn't the step-1 analysis result the following?
{
"token" : "trw-gdb7891at",
"start_offset" : 9,
"end_offset" : 22,
"type" : "LETTER",
"position" : 4
},
Here "trw-gdb7891at" was not split apart, yet in step 3 the keyword was split. How can they match?