
Controlling Stemming

Out-of-the-box stemming solutions are never perfect. Algorithmic stemmers, especially, will blithely apply their rules to any words they encounter, perhaps conflating words that you would prefer to keep separate. Maybe, for your use case, it is important to keep skies and skiing as distinct words rather than stemming them both down to ski (as would happen with the english analyzer).

The {ref}/analysis-keyword-marker-tokenfilter.html[keyword_marker] and {ref}/analysis-stemmer-override-tokenfilter.html[stemmer_override] token filters allow us to customize the stemming process.

Preventing Stemming

The stem_exclusion parameter for language analyzers (see [configuring-language-analyzers]) allowed us to specify a list of words that should not be stemmed. Internally, these language analyzers use the {ref}/analysis-keyword-marker-tokenfilter.html[keyword_marker token filter] to mark the listed words as keywords, which prevents subsequent stemming token filters from touching those words.
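As a reminder of what that looks like, here is a minimal sketch of an english analyzer configured with stem_exclusion. The index name my_index and the analyzer name my_english are simply the names used elsewhere in this section; treat the request as illustrative, not as something to run against an index that already exists.

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stem_exclusion": [ "skies" ]
        }
      }
    }
  }
}
```

With this configuration, skies should pass through unstemmed, while all other words are stemmed by the english analyzer as usual.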

For instance, we can create a simple custom analyzer that uses the {ref}/analysis-porterstem-tokenfilter.html[porter_stem] token filter, but prevents the word skies from being stemmed:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "no_stem": {
          "type": "keyword_marker",
          "keywords": [ "skies" ] (1)
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "no_stem",
            "porter_stem"
          ]
        }
      }
    }
  }
}
  1. The keywords parameter can accept multiple words.

Testing it with the analyze API shows that just the word skies has been excluded from stemming:

GET /my_index/_analyze?analyzer=my_english
sky skies skiing skis (1)
  1. Returns: sky, skies, ski, ski
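The request above passes the analyzer in the query string and the text as a raw body, which matches the Elasticsearch versions this section was written for. On more recent releases the analyze API expects a JSON body instead; assuming you are on such a version, an equivalent request would look roughly like this:

```json
GET /my_index/_analyze
{
  "analyzer": "my_english",
  "text": "sky skies skiing skis"
}
```

The response lists one token object per term, so the terms to compare against are the values of each token field.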

Tip

While the language analyzers allow us only to specify an array of words in the stem_exclusion parameter, the keyword_marker token filter also accepts a keywords_path parameter that allows us to store all of our keywords in a file. The file should contain one word per line, and must be present on every node in the cluster. See [updating-stopwords] for tips on how to update this file.
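A minimal sketch of the file-based variant follows. The path analysis/no_stem.txt is a hypothetical example, assumed here to be relative to the Elasticsearch config directory; the file itself would contain one keyword per line (for example, a single line reading skies).

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "no_stem": {
          "type": "keyword_marker",
          "keywords_path": "analysis/no_stem.txt"
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "no_stem",
            "porter_stem"
          ]
        }
      }
    }
  }
}
```

The rest of the analyzer is unchanged from the inline version; only the source of the keyword list differs.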

Customizing Stemming

In the preceding example, we prevented skies from being stemmed, but perhaps we would prefer it to be stemmed to sky instead. The {ref}/analysis-stemmer-override-tokenfilter.html[stemmer_override] token filter allows us to specify our own custom stemming rules. At the same time, we can handle some irregular forms like stemming mice to mouse and feet to foot:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_stem": {
          "type": "stemmer_override",
          "rules": [ (1)
            "skies=>sky",
            "mice=>mouse",
            "feet=>foot"
          ]
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "custom_stem", (2)
            "porter_stem"
          ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_english
The mice came down from the skies and ran over my feet (3)
  1. Rules take the form original=>stem.

  2. The stemmer_override filter must be placed before the stemmer. It marks the terms it rewrites as keywords, so the stemmer that follows leaves them alone.

  3. Returns: the, mouse, came, down, from, the, sky, and, ran, over, my, foot

Tip
Just as for the keyword_marker token filter, rules can be stored in a file whose location should be specified with the rules_path parameter.
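As a rough sketch, the filter definition below loads its rules from a file instead of listing them inline. The path analysis/stem_rules.txt is a hypothetical example, assumed to be relative to the Elasticsearch config directory; the file would hold one rule per line in the same form as above, such as skies => sky, mice => mouse, and feet => foot.

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_stem": {
          "type": "stemmer_override",
          "rules_path": "analysis/stem_rules.txt"
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "custom_stem",
            "porter_stem"
          ]
        }
      }
    }
  }
}
```

As with keywords_path, the rules file would need to be present on every node in the cluster.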