1 Star 0 Fork 0

缠中说禅/elasticsearch-definitive-guide

加入 Gitee
与超过 1200万 开发者一起发现、参与优秀开源项目,私有仓库也完全免费 :)
免费加入
文件
克隆/下载
40_Case_folding.asciidoc 3.00 KB
一键复制 编辑 原始数据 按行查看 历史

Unicode Case Folding

Humans are nothing if not inventive, and human language reflects that. Changing the case of a word seems like such a simple task, until you have to deal with multiple languages.

Take, for example, the lowercase German letter ß. Converting that to upper case gives you SS, which converted back to lowercase gives you ss. Or consider the Greek letter ς (sigma, when used at the end of a word). Converting it to uppercase results in Σ, which converted back to lowercase, gives you σ.

The whole point of lowercasing terms is to make them more likely to match, not less! In Unicode, this job is done by case folding rather than by lowercasing. Case folding is the act of converting words into a (usually lowercase) form that does not necessarily result in the correct spelling, but does allow case-insensitive comparisons.

For instance, the letter ß, which is already lowercase, is folded to ss. Similarly, the lowercase ς is folded to σ, to make σ, ς, and Σ comparable, no matter where the letter appears in a word.

The default normalization form that the icu_normalizer token filter uses is nfkc_cf. Like the nfkc form, this does the following:

  • Composes characters into the shortest byte representation

  • Uses compatibility mode to convert characters like into the simpler ffi

But it also does this:

  • Case-folds characters into a form suitable for case comparison

In other words, nfkc_cf is the equivalent of the lowercase token filter, but suitable for use with all languages. The on-steroids equivalent of the standard analyzer would be the following:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_lowercaser": {
          "tokenizer": "icu_tokenizer",
          "filter":  [ "icu_normalizer" ] (1)
        }
      }
    }
  }
}
  1. The icu_normalizer defaults to the nfkc_cf form.

We can compare the results of running Weißkopfseeadler and WEISSKOPFSEEADLER (the uppercase equivalent) through the standard analyzer and through our Unicode-aware analyzer:

GET /_analyze?analyzer=standard (1)
Weißkopfseeadler WEISSKOPFSEEADLER

GET /my_index/_analyze?analyzer=my_lowercaser (2)
Weißkopfseeadler WEISSKOPFSEEADLER
  1. Emits tokens weißkopfseeadler, weisskopfseeadler

  2. Emits tokens weisskopfseeadler, weisskopfseeadler

The standard analyzer emits two different, incomparable tokens, while our custom analyzer produces tokens that are comparable, regardless of the original case.

Loading...
马建仓 AI 助手
尝试更多
代码解读
代码找茬
代码优化
1
https://gitee.com/SFAC_hds/elasticsearch-definitive-guide.git
git@gitee.com:SFAC_hds/elasticsearch-definitive-guide.git
SFAC_hds
elasticsearch-definitive-guide
elasticsearch-definitive-guide
master

搜索帮助

371d5123 14472233 46e8bd33 14472233