Humans are nothing if not inventive, and human language reflects that. Changing the case of a word seems like such a simple task, until you have to deal with multiple languages.
Take, for example, the lowercase German letter ß. Converting that to uppercase gives you SS, which converted back to lowercase gives you ss. Or consider the Greek letter ς (sigma, when used at the end of a word). Converting it to uppercase results in Σ, which, converted back to lowercase, gives you σ.
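The round trip described above can be reproduced outside Elasticsearch. The following is a minimal sketch that assumes Python 3 and its built-in Unicode case mappings; the helper name round_trip is hypothetical and is not part of any Elasticsearch API.

```python
def round_trip(text: str) -> str:
    """Uppercase the text, then lowercase the result again."""
    return text.upper().lower()


# Compare each letter with what comes back after the round trip.
for letter in ["ß", "ς"]:
    result = round_trip(letter)
    print(letter, "->", letter.upper(), "->", result, "| unchanged:", result == letter)
```

Neither letter survives the round trip, which is why plain lowercasing is not a reliable basis for case-insensitive matching.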
The whole point of lowercasing terms is to make them more likely to match, not less! In Unicode, this job is done by case folding rather than by lowercasing. Case folding is the act of converting words into a (usually lowercase) form that does not necessarily result in the correct spelling, but does allow case-insensitive comparisons.
For instance, the letter ß, which is already lowercase, is folded to ss. Similarly, the lowercase ς is folded to σ, to make σ, ς, and Σ comparable, no matter where the letter appears in a word.
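The same folding behavior can be observed with Python's str.casefold(), which applies Unicode full case folding. This is only an illustration of the concept, assuming Python 3; the helper same_ignoring_case and the sample Greek word are hypothetical choices for the example and are not taken from Elasticsearch.

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings after Unicode case folding."""
    return a.casefold() == b.casefold()


# The already-lowercase ß is folded to ss.
print("ß".casefold())

# All three sigma forms fold to the same character.
print([s.casefold() for s in ["σ", "ς", "Σ"]])

# An uppercase word and its lowercase spelling with a final sigma compare equal.
print(same_ignoring_case("ΣΟΦΟΣ", "σοφος"))
```

The folded strings are meant for comparison only; as the text notes, they are not necessarily correctly spelled words.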
The default normalization form that the icu_normalizer token filter uses is nfkc_cf. Like the nfkc form, this does the following:
Composes characters into the shortest byte representation
Uses compatibility mode to convert characters like ﬃ (the single-character ligature, U+FB03) into the simpler ffi
But it also does this:
Case-folds characters into a form suitable for case comparison
In other words, nfkc_cf is the equivalent of the lowercase token filter, but suitable for use with all languages. The on-steroids equivalent of the standard analyzer would be the following:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_lowercaser": {
          "tokenizer": "icu_tokenizer",
          "filter": [ "icu_normalizer" ] (1)
        }
      }
    }
  }
}
(1) The icu_normalizer defaults to the nfkc_cf form.
We can compare the results of running Weißkopfseeadler and WEISSKOPFSEEADLER (the uppercase equivalent) through the standard analyzer and through our Unicode-aware analyzer:
GET /_analyze?analyzer=standard (1)
Weißkopfseeadler WEISSKOPFSEEADLER
GET /my_index/_analyze?analyzer=my_lowercaser (2)
Weißkopfseeadler WEISSKOPFSEEADLER
(1) Emits tokens weißkopfseeadler, weisskopfseeadler
(2) Emits tokens weisskopfseeadler, weisskopfseeadler
The standard analyzer emits two different, incomparable tokens, while our custom analyzer produces tokens that are comparable, regardless of the original case.
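The difference between the two analyzers can be approximated in plain Python. The sketch below is not the ICU implementation: it assumes that NFKC normalization combined with str.casefold() is close enough to nfkc_cf for these inputs (the real form also removes default-ignorable code points, which this sketch does not), and it splits on whitespace instead of using a real tokenizer. The function names approx_nfkc_cf, lowercase_tokens, and folded_tokens are hypothetical.

```python
import unicodedata


def approx_nfkc_cf(text: str) -> str:
    """Rough stand-in for nfkc_cf: NFKC, case-fold, then NFKC again."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    # Folding can produce sequences that need recomposing, so normalize once more.
    return unicodedata.normalize("NFKC", folded)


def lowercase_tokens(text: str) -> list[str]:
    """Whitespace tokens with simple lowercasing only."""
    return [token.lower() for token in text.split()]


def folded_tokens(text: str) -> list[str]:
    """Whitespace tokens passed through the nfkc_cf approximation."""
    return [approx_nfkc_cf(token) for token in text.split()]


sample = "Weißkopfseeadler WEISSKOPFSEEADLER"
print(lowercase_tokens(sample))  # two different tokens
print(folded_tokens(sample))     # two identical tokens

# Compatibility mapping: the single-character ligature U+FB03 becomes three letters.
print(approx_nfkc_cf("\ufb03"))
```

For production indexing, the icu_normalizer filter remains the right tool; this sketch only makes the effect of the normalization form visible.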