1 Star 0 Fork 0

缠中说禅/elasticsearch-definitive-guide

加入 Gitee
与超过 1200万 开发者一起发现、参与优秀开源项目,私有仓库也完全免费 :)
免费加入
文件
克隆/下载
50_Character_folding.asciidoc 2.97 KB
一键复制 编辑 原始数据 按行查看 历史
Clinton Gormley 提交于 2015-01-05 19:05 +08:00 . Update 50_Character_folding.asciidoc

Unicode Character Folding

In the same way as the lowercase token filter is a good starting point for many languages but falls short when exposed to the entire tower of Babel, so the asciifolding token filter requires a more effective Unicode character-folding counterpart for dealing with the many languages of the world.

The icu_folding token filter (provided by the icu plug-in) does the same job as the asciifolding filter, but extends the transformation to scripts that are not ASCII-based, such as Greek, Hebrew, Han, conversion of numbers in other scripts into their Latin equivalents, plus various other numeric, symbolic, and punctuation transformations.

The icu_folding token filter applies Unicode normalization and case folding from nfkc_cf automatically, so the icu_normalizer is not required:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_folder": {
          "tokenizer": "icu_tokenizer",
          "filter":  [ "icu_folding" ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_folder
١٢٣٤٥ (1)
  1. The Arabic numerals ١٢٣٤٥ are folded to their Latin equivalent: 12345.

If there are particular characters that you would like to protect from folding, you can use a UnicodeSet (much like a character class in regular expressions) to specify which Unicode characters may be folded. For instance, to exclude the Swedish letters å, ä, ö, Å, Ä, and Ö from folding, you would specify a character class representing all Unicode characters, except for those letters: [^åäöÅÄÖ] (^ means everything except).

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "swedish_folding": { (1)
          "type": "icu_folding",
          "unicodeSetFilter": "[^åäöÅÄÖ]"
        }
      },
      "analyzer": {
        "swedish_analyzer": { (2)
          "tokenizer": "icu_tokenizer",
          "filter":  [ "swedish_folding", "lowercase" ]
        }
      }
    }
  }
}
  1. The swedish_folding token filter customizes the icu_folding token filter to exclude Swedish letters, both uppercase and lowercase.

  2. The swedish analyzer first tokenizes words, then folds each token by using the swedish_folding filter, and then lowercases each token in case it includes some of the uppercase excluded letters: Å, Ä, or Ö.

Loading...
马建仓 AI 助手
尝试更多
代码解读
代码找茬
代码优化
1
https://gitee.com/SFAC_hds/elasticsearch-definitive-guide.git
git@gitee.com:SFAC_hds/elasticsearch-definitive-guide.git
SFAC_hds
elasticsearch-definitive-guide
elasticsearch-definitive-guide
master

搜索帮助