elasticsearch-analysis-pinyin updated for es 2.4.1 and 5.0.0-rc1
Elasticsearch | by medcl | published October 13, 2016 | views: 4995
Support has been updated to the latest es v2.4.1 and es v5.0.0-rc1 respectively.
Several new features have been added: multiple configuration options are now supported, including pinyin splitting, which is more accurate than the previous approach of combining with ngram.
For example: liudehuaalibaba13zhuanghan -> liu,de,hua,a,li,ba,ba,13,zhuang,han
See the documentation for configuration details:
https://github.com/medcl/elast ... inyin
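The splitting shown above can be sketched as a greedy longest-match over a table of known pinyin syllables. This is only a toy illustration, not the plugin's actual implementation, and the `SYLLABLES` set below is deliberately tiny:

```python
# Toy sketch of pinyin-aware splitting (NOT the plugin's actual code):
# greedily match the longest known syllable; pass digit runs through whole.
SYLLABLES = {"liu", "de", "hua", "a", "li", "ba", "zhuang", "han"}

def split_pinyin(s):
    tokens, i = [], 0
    while i < len(s):
        if s[i].isdigit():                          # keep digit runs intact, e.g. "13"
            j = i
            while j < len(s) and s[j].isdigit():
                j += 1
            tokens.append(s[i:j])
            i = j
            continue
        for j in range(min(len(s), i + 6), i, -1):  # try the longest syllable first
            if s[i:j] in SYLLABLES:
                tokens.append(s[i:j])
                i = j
                break
        else:
            i += 1                                  # skip a char we cannot match
    return tokens

print(split_pinyin("liudehuaalibaba13zhuanghan"))
# -> ['liu', 'de', 'hua', 'a', 'li', 'ba', 'ba', '13', 'zhuang', 'han']
```

With the small table above this reproduces the example from the announcement; a real implementation would use a full syllable table.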
 
Download:
https://github.com/medcl/elast ... eases
 
Testing is welcome:
 
                                
                                
                                
                                
curl -XPUT http://localhost:9200/medcl/ -d'
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin"
                }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "keep_separate_first_letter" : false,
                    "keep_full_pinyin" : true,
                    "keep_original" : false,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true
                }
            }
        }
    }
}'
curl http://localhost:9200/medcl/_a ... lyzer
{
  "tokens" : [ {
    "token" : "liu",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "de",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "hua",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "a",
    "start_offset" : 2,
    "end_offset" : 31,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "b",
    "start_offset" : 2,
    "end_offset" : 31,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "c",
    "start_offset" : 2,
    "end_offset" : 31,
    "type" : "word",
    "position" : 5
  }, {
    "token" : "d",
    "start_offset" : 2,
    "end_offset" : 31,
    "type" : "word",
    "position" : 6
  }, {
    "token" : "liu",
    "start_offset" : 2,
    "end_offset" : 31,
    "type" : "word",
    "position" : 7
  }, {
    "token" : "de",
    "start_offset" : 2,
    "end_offset" : 31,
    "type" : "word",
    "position" : 8
  }, {
    "token" : "hua",
    "start_offset" : 2,
    "end_offset" : 31,
    "type" : "word",
    "position" : 9
  }, {
    "token" : "wo",
    "start_offset" : 2,
    "end_offset" : 31,
    "type" : "word",
    "position" : 10
  }, {
    "token" : "bu",
    "start_offset" : 2,
    "end_offset" : 31,
    "type" : "word",
    "position" : 11
  }, {
    "token" : "zhi",
    "start_offset" : 2,
    "end_offset" : 31,
    "type" : "word",
    "position" : 12
  }, {
    "token" : "dao",
    "start_offset" : 2,
    "end_offset" : 31,
    "type" : "word",
    "position" : 13
  }, {
    "token" : "shi",
    "start_offset" : 2,
    "end_offset" : 31,
    "type" : "word",
    "position" : 14
  }, {
    "token" : "shui",
    "start_offset" : 2,
    "end_offset" : 31,
    "type" : "word",
    "position" : 15
  }, {
    "token" : "ldhabcdliudehuaw",
    "start_offset" : 0,
    "end_offset" : 16,
    "type" : "word",
    "position" : 16
  } ]
}
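The last token, "ldhabcdliudehuaw", is the extra first-letter token controlled by limit_first_letter_length. A hedged reconstruction of how it plausibly arises (this is my reading of the output above, not the plugin's code): each Chinese character contributes the first letter of its pinyin, latin runs in the input appear to pass through verbatim, and the result is truncated to 16 characters. The input is assumed to have been 刘德华 + "abcd" + "liudehua" + 我不知道是谁:

```python
# Hedged reconstruction of the first-letter token (NOT the plugin's code).
# Chinese runs (given here as pinyin syllable lists) contribute one letter
# per syllable; latin runs are assumed to pass through verbatim; the result
# is truncated to limit_first_letter_length.
def first_letter_token(parts, limit=16):
    out = []
    for part in parts:
        if isinstance(part, list):      # a Chinese run, as pinyin syllables
            out.extend(p[0] for p in part)
        else:                           # a latin run, kept as-is
            out.append(part)
    return "".join(out)[:limit]

# Assumed input: 刘德华 + "abcd" + "liudehua" + 我不知道是谁
parts = [["liu", "de", "hua"], "abcd", "liudehua",
         ["wo", "bu", "zhi", "dao", "shi", "shui"]]
print(first_letter_token(parts))  # -> "ldhabcdliudehuaw"
```

Note how the truncation at 16 characters cuts the token off at the "w" of 我.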
[Please respect original community content; keep or credit the source when reposting]
Original post: http://searchkit.cn/article/105
3 comments
Manual upvote.
I tested medcl's pinyin tokenizer and found an issue. For example, searching for "zhanguang" (the pinyin of 沾光) tokenizes as follows:
http://172.19.22.124:9200/_analyze?analyzer=pinyin&pretty&text=zhanguang
{
  "tokens" : [ {
    "token" : "zhang",
    "start_offset" : 0,
    "end_offset" : 8,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "u",
    "start_offset" : 0,
    "end_offset" : 8,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "ang",
    "start_offset" : 0,
    "end_offset" : 8,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "zhanguang",
    "start_offset" : 0,
    "end_offset" : 9,
    "type" : "word",
    "position" : 3
  } ]
}
There are two issues here:
1. It seems to use only forward maximum matching, without considering the optimal pinyin split, so the split can be inaccurate.
2. The pinyin offsets also look slightly off; for example, the first token "zhang" should not span offsets 0 to 8.
My lc-pinyin implementation uses a "shortest pinyin split is the optimal split" strategy, which avoids this problem:
http://172.19.22.124:9200/_analyze?analyzer=lc_search&pretty&text=zhanguang
{
  "tokens" : [ {
    "token" : "zhan",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "guang",
    "start_offset" : 4,
    "end_offset" : 9,
    "type" : "word",
    "position" : 1
  } ]
}
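The difference between the two strategies described in this comment can be sketched as follows. This is a toy illustration with a tiny hypothetical syllable table, not either plugin's real code: greedy forward maximum matching grabs "zhang" first and is left with "u" + "ang", while a fewest-tokens split found by dynamic programming yields "zhan" + "guang":

```python
SYLLABLES = {"zhan", "zhang", "guang", "u", "ang"}  # tiny toy table

def greedy_split(s):
    # Forward maximum matching: always take the longest syllable available.
    out, i = [], 0
    while i < len(s):
        for j in range(min(len(s), i + 6), i, -1):
            if s[i:j] in SYLLABLES:
                out.append(s[i:j])
                i = j
                break
        else:
            i += 1
    return out

def min_split(s):
    # Fewest-tokens ("shortest") split via dynamic programming.
    INF = float("inf")
    best = [0] + [INF] * len(s)   # best[i] = fewest tokens covering s[:i]
    back = [None] * (len(s) + 1)
    for i in range(len(s)):
        if best[i] == INF:
            continue
        for j in range(i + 1, min(len(s), i + 6) + 1):
            if s[i:j] in SYLLABLES and best[i] + 1 < best[j]:
                best[j], back[j] = best[i] + 1, i
    if best[len(s)] == INF:
        return None               # no full cover with known syllables
    out, i = [], len(s)
    while i > 0:
        out.append(s[back[i]:i])
        i = back[i]
    return out[::-1]

print(greedy_split("zhanguang"))  # -> ['zhang', 'u', 'ang']
print(min_split("zhanguang"))     # -> ['zhan', 'guang']
```

The greedy result mirrors the plugin output quoted above, and the minimal split mirrors the lc-pinyin output.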
 
                                        
medcl replied to chennanlcy:
PRs are welcome.
                                    


