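While using Elasticsearch, I found that Chinese word segmentation sometimes gives unexpected results.
For example, analyzing “莫西” and “莫西林” produces the following output.
Result for “莫西”: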
{
  "tokens": [
    {
      "token": "莫西",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "莫",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "西",
      "start_offset": 1,
      "end_offset": 2,
      "type": "CN_CHAR",
      "position": 2
    }
  ]
}
Result for “莫西林”:
{
  "tokens": [
    {
      "token": "莫西",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "莫",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "西林",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 2
    }
  ]
}
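Why doesn't the result for “莫西林” fully contain the tokens produced for “莫西”? The single-character token “西” (offsets 1 to 2) is missing from it.
I also tried “中华人民共和国” and “中华人民”, and there the longer text's tokens do include all of the shorter text's tokens.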
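To make the difference concrete, here is a minimal Python sketch that compares the two outputs above. The helper names `token_set` and `missing_tokens` are made up for illustration, and the two dictionaries are the responses from this post trimmed to the fields the comparison needs. It only diffs the results; it does not explain why the analyzer behaves this way.

```python
def token_set(analyze_response):
    """Collect (token, start_offset, end_offset) triples from an analyze-style response."""
    return {
        (t["token"], t["start_offset"], t["end_offset"])
        for t in analyze_response["tokens"]
    }


def missing_tokens(shorter, longer):
    """Tokens produced for the shorter text that are absent from the longer text's output."""
    return sorted(token_set(shorter) - token_set(longer))


# Responses copied from the post, trimmed to the fields used here.
moxi = {"tokens": [
    {"token": "莫西", "start_offset": 0, "end_offset": 2},
    {"token": "莫", "start_offset": 0, "end_offset": 1},
    {"token": "西", "start_offset": 1, "end_offset": 2},
]}
moxilin = {"tokens": [
    {"token": "莫西", "start_offset": 0, "end_offset": 2},
    {"token": "莫", "start_offset": 0, "end_offset": 1},
    {"token": "西林", "start_offset": 1, "end_offset": 3},
]}

print(missing_tokens(moxi, moxilin))
```

Comparing by offset as well as by token text works here because “莫西” is a prefix of “莫西林”, so shared tokens sit at the same positions in both inputs.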
@medcl