News search with free-form query text

Elasticsearch | Author: God_lockin | Posted 2018-12-20 | Views: 4820

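Hi all, I've run into a case and would appreciate ideas and suggestions.

The fields currently stored in ES are title (the raw headline), content (the raw article body), keyword (keywords extracted from the article by an algorithm engine), and publish_time (when the article was published).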

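text=${user input}
// This may be part of a title, or just arbitrary keywords

if the match between text and title exceeds 80%
then
force that article to be returned first
end

Sort everything else by publish_time, newest first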

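The order of the recalled results may also be fine-tuned based on what matches in title, content, and keyword.

The only approach I've come up with so far is to recall documents with multi_match using weights on title, content, and keyword, sort by publish_time, and then iterate over all the titles in application code, moving any title that matches more than 80% to the front. That feels crude, and it is also inefficient (computing the 80% match rate is slow). Any suggestions?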
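
For reference, the application-side pass described above might look roughly like the sketch below. The post does not say how the 80% match is computed, so `title_match_ratio` (based on `difflib.SequenceMatcher`) is only a stand-in, and the hit dictionaries are a hypothetical shape rather than a real Elasticsearch response.

```python
from difflib import SequenceMatcher


def title_match_ratio(text: str, title: str) -> float:
    """Stand-in similarity in [0, 1]; the original post does not define one."""
    return SequenceMatcher(None, text, title).ratio()


def promote_title_matches(hits, text, threshold=0.8):
    """Sort by publish_time (newest first), then move strong title matches to the front."""
    by_time = sorted(hits, key=lambda h: h["publish_time"], reverse=True)
    promoted, rest = [], []
    for hit in by_time:
        # One similarity computation per hit; this is the slow part the post mentions.
        if title_match_ratio(text, hit["title"]) > threshold:
            promoted.append(hit)
        else:
            rest.append(hit)
    return promoted + rest
```

Every recalled title is compared against the input on each request, which is why pushing the check into the query itself is attractive.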

4 replies

rochy - rochy_he

Upvoted by: God_lockin

{
  "dis_max" : {
    "tie_breaker" : 0.0,
    "queries" : [
      {
        "function_score" : {
          "query" : {
            "multi_match" : {
              "query" : "华为发布会",
              "fields" : [
                "content^1.0",
                "title.text^1.0"
              ],
              "type" : "best_fields",
              "operator" : "OR",
              "slop" : 0,
              "prefix_length" : 0,
              "max_expansions" : 50,
              "minimum_should_match" : "80%",
              "zero_terms_query" : "NONE",
              "auto_generate_synonyms_phrase_query" : true,
              "fuzzy_transpositions" : true,
              "boost" : 1.0
            }
          },
          "functions" : [
            {
              "filter" : {
                "match_all" : {
                  "boost" : 1.0
                }
              },
              "script_score" : {
                "script" : {
                  "source" : "return _score * 10000",
                  "lang" : "painless"
                }
              }
            }
          ],
          "score_mode" : "multiply",
          "boost_mode" : "replace",
          "max_boost" : 3.4028235E38,
          "boost" : 1.0
        }
      },
      {
        "function_score" : {
          "query" : {
            "multi_match" : {
              "query" : "华为发布会",
              "fields" : [
                "content^1.0",
                "keywords^1.0",
                "title.text^1.0"
              ],
              "type" : "best_fields",
              "operator" : "OR",
              "slop" : 0,
              "prefix_length" : 0,
              "max_expansions" : 50,
              "zero_terms_query" : "NONE",
              "auto_generate_synonyms_phrase_query" : true,
              "fuzzy_transpositions" : true,
              "boost" : 1.0
            }
          },
          "functions" : [
            {
              "filter" : {
                "match_all" : {
                  "boost" : 1.0
                }
              },
              "script_score" : {
                "script" : {
                  "source" : "return Math.log10(doc['pDateTime'].value.toInstant().toEpochMilli())",
                  "lang" : "painless"
                }
              }
            }
          ],
          "score_mode" : "multiply",
          "boost_mode" : "replace",
          "max_boost" : 3.4028235E38,
          "boost" : 1.0
        }
      }
    ],
    "boost" : 1.0
  }
}

Give this a try.
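
To reuse the query above for arbitrary user input, one option is to build it in code. The following is a minimal sketch, not the poster's implementation: the helper names `scored_branch` and `build_news_query` are made up for illustration, the default-valued options are omitted, and the recency script assumes that date doc values expose `toInstant()` in your Elasticsearch version and that `Math.log10` is the intended logarithm.

```python
import json

# Assumption: date doc values expose toInstant(); adjust for your ES version.
RECENCY_SCRIPT = "return Math.log10(doc['pDateTime'].value.toInstant().toEpochMilli())"
TITLE_BOOST_SCRIPT = "return _score * 10000"


def scored_branch(multi_match: dict, script_source: str) -> dict:
    """Wrap a multi_match in a function_score whose script output replaces the score."""
    return {
        "function_score": {
            "query": {"multi_match": multi_match},
            "functions": [
                {"script_score": {"script": {"source": script_source, "lang": "painless"}}}
            ],
            "boost_mode": "replace",
        }
    }


def build_news_query(text: str) -> dict:
    """Build the two-branch dis_max request body for the given user input."""
    # Branch 1: at least 80% of the query terms must match; score is pushed far up.
    strict = {"query": text, "fields": ["content", "title.text"], "minimum_should_match": "80%"}
    # Branch 2: any match; score depends only on the publish time.
    loose = {"query": text, "fields": ["content", "keywords", "title.text"]}
    return {
        "query": {
            "dis_max": {
                "tie_breaker": 0.0,
                "queries": [
                    scored_branch(strict, TITLE_BOOST_SCRIPT),
                    scored_branch(loose, RECENCY_SCRIPT),
                ],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_news_query("华为发布会"), ensure_ascii=False, indent=2))
```

Because dis_max with a tie_breaker of 0 keeps only the highest-scoring branch per document, a document that satisfies the strict branch gets the boosted score, and everything else falls back to the time-based score.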

zz_hello

rescore can re-score the documents returned by a query; it's worth a look:
https://www.elastic.co/guide/c ... .html
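
As a rough illustration of the rescore idea, a request body could be assembled as below. This is a sketch under assumptions: the field names come from the mapping posted later in this thread, while `build_rescore_request`, the window size, and the weights are placeholder choices. Note also that, per the Elasticsearch documentation, rescore cannot be combined with an explicit sort other than `_score` descending, so ordering by publish time would still have to be expressed through the score.

```python
def build_rescore_request(text: str, window_size: int = 100) -> dict:
    """Recall with multi_match, then re-score the top hits on each shard by title match."""
    return {
        "query": {
            "multi_match": {"query": text, "fields": ["title.text", "content", "keywords"]}
        },
        "rescore": {
            # Only the top `window_size` hits per shard are re-scored.
            "window_size": window_size,
            "query": {
                # Hits whose title matches at least 80% of the query terms get a large bonus.
                "rescore_query": {
                    "match": {"title.text": {"query": text, "minimum_should_match": "80%"}}
                },
                "query_weight": 1.0,
                "rescore_query_weight": 10000.0,
            },
        },
    }
```

With the default score mode, the final score is the weighted sum of the original score and the rescore query's score, so hits that do not match the rescore query keep their original score.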

rochy - rochy_he

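The "more than 80% match" condition can be controlled with the minimum_should_match parameter.

I recommend function_score_query: if the match exceeds 80%, use replace to set the score to _score * 10000;
otherwise use replace to set the score to log10(time).

Then just rank by the resulting score.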
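
One thing worth checking before relying on log10(time) for ordering: Elasticsearch document scores are 32-bit floats, so publish times that are close together can end up with the same score. The sketch below estimates that granularity; it assumes time is expressed in epoch milliseconds, and the helper names are invented for this example.

```python
import math
import struct


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def next_float32(x: float) -> float:
    """Return the smallest 32-bit float greater than a positive 32-bit float x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0]


def recency_resolution_minutes(epoch_millis: int) -> float:
    """Approximate time gap that moves log10(epoch_millis) by one float32 step."""
    score = to_float32(math.log10(epoch_millis))
    step = next_float32(score) - score
    # d(log10 t) = dt / (t * ln 10), hence dt = step * t * ln 10
    return step * epoch_millis * math.log(10) / 60_000


if __name__ == "__main__":
    # 2018-12-20 00:00:00 UTC in epoch milliseconds
    print(recency_resolution_minutes(1_545_264_000_000))
```

If that granularity is too coarse for your data, subtracting a fixed reference time before scoring, or using a decay function on the date field, are possible alternatives.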

God_lockin

The current mapping is:

{
  "title": {
    "type": "text",
    "fields": {
      "text": {
        "type": "text",
        "search_analyzer": "ik_smart",
        "analyzer": "ik_max_word"
      },
      "keyword": {
        "type": "keyword"
      }
    }
  },
  "content": {
    "type": "text",
    "search_analyzer": "ik_smart",
    "analyzer": "ik_max_word"
  },
  "keywords": {
    "type": "text",
    "analyzer": "whitespace"
  },
  "pDateTime": {
    "type": "date",
    "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
  }
}

I still don't quite understand how the function_score_query request should be written. Is it like this? It seems to throw an error when I run it…

{
  "from": 0,
  "query": {
    "function_score": {
      "script_score": {
        "script": {
          "source": "if (_score > 0.8) { _score *= 1000} else { _score = Math.log(10, doc['pDateTime'].value}"
        }
      },
      "query": {
        "bool": {
          "must": {
            "exists": {
              "field": "pDateTime"
            }
          },
          "should": [
            {
              "match": {
                "title.keyword": {
                  "query": "华为发布会"
                }
              }
            },
            {
              "match": {
                "title.text": {
                  "query": "华为发布会"
                }
              }
            },
            {
              "match": {
                "content": {
                  "query": "华为发布会"
                }
              }
            },
            {
              "match": {
                "keywords": {
                  "query": "华为 发布会"
                }
              }
            }
          ]
        }
      }
    }
  }
}
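
A few things in the script above look like probable causes of the error: the parenthesis after `Math.log(` is never closed, `Math.log` takes a single argument (base 10 is `Math.log10`), and script_score uses the value the script returns rather than an assignment to `_score`. The sketch below builds a variant with those points addressed. It is an assumption-laden illustration: `build_function_score_request` is a hypothetical helper, the date access assumes doc values expose `toInstant()` in your version, and `_score > 0.8` still compares a relevance score rather than a match percentage, which is why minimum_should_match was suggested for the 80% condition.

```python
def build_function_score_request(text: str, spaced_keywords: str, threshold: float = 0.8) -> dict:
    """Build a function_score request whose script returns the new score."""
    script = {
        "lang": "painless",
        # The threshold is a relevance-score cutoff, not a percentage of matched terms.
        "params": {"threshold": threshold},
        "source": (
            "if (_score > params.threshold) { return _score * 1000; } "
            "return Math.log10(doc['pDateTime'].value.toInstant().toEpochMilli());"
        ),
    }
    should = [
        {"match": {"title.keyword": {"query": text}}},
        {"match": {"title.text": {"query": text}}},
        {"match": {"content": {"query": text}}},
        # keywords uses the whitespace analyzer, so terms are passed space-separated.
        {"match": {"keywords": {"query": spaced_keywords}}},
    ]
    return {
        "from": 0,
        "query": {
            "function_score": {
                "query": {
                    "bool": {"must": {"exists": {"field": "pDateTime"}}, "should": should}
                },
                "script_score": {"script": script},
                # replace: use the script's value as the final score instead of multiplying.
                "boost_mode": "replace",
            }
        },
    }
```

For example, `build_function_score_request("华为发布会", "华为 发布会")` mirrors the request above with the script rewritten.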
