Out-of-the-box stemming solutions are never perfect. Algorithmic stemmers,
especially, will blithely apply their rules to any words they encounter,
perhaps conflating words that you would prefer to keep separate. Maybe, for
your use case, it is important to keep skies and skiing as distinct words
rather than stemming them both down to ski (as would happen with the english
analyzer).
The {ref}/analysis-keyword-marker-tokenfilter.html[keyword_marker] and
{ref}/analysis-stemmer-override-tokenfilter.html[stemmer_override] token filters
allow us to customize the stemming process.
The stem_exclusion parameter for language analyzers (see
[configuring-language-analyzers]) allowed us to specify a list of words that
should not be stemmed. Internally, these language analyzers use the
{ref}/analysis-keyword-marker-tokenfilter.html[keyword_marker token filter]
to mark the listed words as keywords, which prevents subsequent stemming
token filters from touching those words.
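As a reminder of what that looks like, here is a minimal sketch of index settings that configure the english analyzer to leave skies unstemmed. It is meant to be sent as the body of a PUT /my_index request; the index name my_index and the analyzer name my_english simply follow the names used in the examples in this section.

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stem_exclusion": [ "skies" ]
        }
      }
    }
  }
}
```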
For instance, we can create a simple custom analyzer that uses the
{ref}/analysis-porterstem-tokenfilter.html[porter_stem] token filter,
but prevents the word skies from being stemmed:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "no_stem": {
          "type": "keyword_marker",
          "keywords": [ "skies" ] (1)
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "no_stem",
            "porter_stem"
          ]
        }
      }
    }
  }
}
(1) The keywords parameter can accept multiple words.
Testing it with the analyze API shows that just the word skies has
been excluded from stemming:
GET /my_index/_analyze?analyzer=my_english
sky skies skiing skis (1)
(1) Returns: sky, skies, ski, ski
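Tip: While the language analyzers allow us only to specify an array of words in the
stem_exclusion parameter, the keyword_marker token filter also accepts a
keywords_path parameter that allows us to store all of our keywords in a file.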
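A sketch of the file-based variant: the keyword_marker filter accepts a keywords_path parameter in place of keywords, pointing at a file that lists one word per line. The settings below could be sent as the body of a PUT /my_index request. The file name analysis/no_stem.txt is hypothetical; a relative path is resolved against the Elasticsearch config directory, and the file needs to be present on every node in the cluster.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "no_stem": {
          "type": "keyword_marker",
          "keywords_path": "analysis/no_stem.txt"
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "no_stem",
            "porter_stem"
          ]
        }
      }
    }
  }
}
```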
In the preceding example, we prevented skies from being stemmed, but perhaps we
would prefer it to be stemmed to sky instead. The
{ref}/analysis-stemmer-override-tokenfilter.html[stemmer_override] token
filter allows us to specify our own custom stemming rules. At the same time,
we can handle some irregular forms like stemming mice to mouse and feet
to foot:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_stem": {
          "type": "stemmer_override",
          "rules": [ (1)
            "skies=>sky",
            "mice=>mouse",
            "feet=>foot"
          ]
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "custom_stem", (2)
            "porter_stem"
          ]
        }
      }
    }
  }
}
GET /my_index/_analyze?analyzer=my_english
The mice came down from the skies and ran over my feet (3)
(1) Rules take the form original=>stem.
(2) The stemmer_override filter must be placed before the stemmer.
(3) Returns the, mouse, came, down, from, the, sky, and, ran, over, my, foot.
Tip: Just as for the keyword_marker token filter, rules can be stored
in a file whose location should be specified with the rules_path
parameter.
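A sketch of how the file-based rules might be configured, again as the body of a PUT /my_index request. The file name analysis/custom_stem.txt is hypothetical; it would hold one rule per line in the same original=>stem form (for example, skies => sky), and a relative path is resolved against the Elasticsearch config directory.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_stem": {
          "type": "stemmer_override",
          "rules_path": "analysis/custom_stem.txt"
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "custom_stem",
            "porter_stem"
          ]
        }
      }
    }
  }
}
```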