Just as the lowercase token filter is a good starting point for many
languages but falls short when exposed to the entire Tower of Babel, the
asciifolding token filter needs a more effective Unicode character-folding
counterpart to deal with the many languages of the world.
The icu_folding token filter (provided by the icu plug-in) does the same job
as the asciifolding filter, but extends the transformation to scripts that
are not ASCII-based, such as Greek, Hebrew, and Han. It also converts numbers
in other scripts into their Latin equivalents and applies various other
numeric, symbolic, and punctuation transformations.
The icu_folding token filter automatically applies the Unicode normalization
and case folding of nfkc_cf, so the icu_normalizer is not required:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_folder": {
          "tokenizer": "icu_tokenizer",
          "filter": [ "icu_folding" ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_folder
١٢٣٤٥ (1)
(1) The Arabic numerals ١٢٣٤٥ are folded to their Latin equivalent: 12345.
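
The digit folding described above can be illustrated outside Elasticsearch. The following is a minimal sketch in Python that uses only the standard unicodedata module; it is not how the icu_folding filter is implemented, and the helper name fold_digits is hypothetical. It only shows the idea of mapping decimal digits from any script to their Latin equivalents.

```python
import unicodedata


def fold_digits(text: str) -> str:
    """Replace every Unicode decimal digit with its Latin (ASCII) digit.

    Hypothetical helper for illustration only; the real icu_folding
    filter does much more than digit conversion.
    """
    out = []
    for ch in text:
        # unicodedata.decimal returns the value 0-9 for decimal digits
        # in any script, or the supplied default for other characters.
        value = unicodedata.decimal(ch, None)
        out.append(str(value) if value is not None else ch)
    return "".join(out)


if __name__ == "__main__":
    print(fold_digits("١٢٣٤٥"))
```

Characters that are not decimal digits pass through unchanged, so the helper can be applied to whole tokens.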
If there are particular characters that you would like to protect from
folding, you can use a UnicodeSet (much like a character class in regular
expressions) to specify which Unicode characters may be folded. For instance,
to exclude the Swedish letters å, ä, ö, Å, Ä, and Ö from folding, you would
specify a character class representing all Unicode characters except for
those letters: [^åäöÅÄÖ] (^ means everything except).
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "swedish_folding": { (1)
          "type": "icu_folding",
          "unicodeSetFilter": "[^åäöÅÄÖ]"
        }
      },
      "analyzer": {
        "swedish_analyzer": { (2)
          "tokenizer": "icu_tokenizer",
          "filter": [ "swedish_folding", "lowercase" ]
        }
      }
    }
  }
}
(1) The swedish_folding token filter customizes the icu_folding token filter
to exclude Swedish letters, both uppercase and lowercase.
(2) The swedish_analyzer analyzer first tokenizes words, then folds each
token by using the swedish_folding filter, and then lowercases each token in
case it includes some of the uppercase excluded letters: Å, Ä, or Ö.
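
To see why the extra lowercase step matters, the two-stage behavior can be approximated in plain Python. This is a rough sketch under stated assumptions: it stands in for ICU folding with NFKD decomposition, removal of combining marks, and str.casefold, which covers far fewer cases than the real icu_folding filter. The names PROTECTED, fold_token, and swedish_analyze are hypothetical.

```python
import unicodedata

# Assumption: this set plays the role of the characters excluded by the
# unicodeSetFilter "[^åäöÅÄÖ]".
PROTECTED = frozenset("åäöÅÄÖ")


def fold_token(token: str, protected=PROTECTED) -> str:
    """Approximate folding that leaves protected characters untouched."""
    # Compose first so that "a" + combining ring is seen as the single
    # character "å" and can be matched against the protected set.
    token = unicodedata.normalize("NFC", token)
    out = []
    for ch in token:
        if ch in protected:
            out.append(ch)  # excluded from folding, case is preserved
            continue
        decomposed = unicodedata.normalize("NFKD", ch)
        # Drop combining marks (accents) and case-fold what remains.
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        out.append(base.casefold())
    return "".join(out)


def swedish_analyze(token: str) -> str:
    """Fold first, then lowercase, mirroring the filter order above."""
    return fold_token(token).lower()


if __name__ == "__main__":
    print(fold_token("Åre"))       # protected uppercase letter survives folding
    print(swedish_analyze("Åre"))  # the lowercase step then handles it
```

Because protected characters skip the folding step entirely, an uppercase Å would otherwise remain uppercase in the token, which is the reason the analyzer lists lowercase after swedish_folding.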