Most of the stemmers available in Elasticsearch are algorithmic in that they
apply a series of rules to a word in order to reduce it to its root form, such
as stripping the final `s` or `es` from plurals. They don’t have to know
anything about individual words in order to stem them.
These algorithmic stemmers have the advantage that they are available out of
the box, are fast, use little memory, and work well for regular words. The
downside is that they don’t cope well with irregular words like `be`, `are`,
and `am`, or `mice` and `mouse`.
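One way to see both sides of this trade-off is to run a few words through an algorithmic stemmer with the `_analyze` API. The request below is only a sketch: it assumes an Elasticsearch version whose `_analyze` endpoint accepts a JSON body, and the sample text is made up for illustration. Regular forms such as `foxes` and `fox` should come back as the same token, while an irregular pair such as `mice` and `mouse` is not conflated, because no suffix rule connects the two.

```json
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase", "porter_stem" ],
  "text": "foxes fox mice mouse"
}
```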
One of the earliest stemming algorithms is the Porter stemmer for English, which is still the recommended English stemmer today. Martin Porter subsequently went on to create the Snowball language for creating stemming algorithms, and a number of the stemmers available in Elasticsearch are written in Snowball.
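Tip: The {ref}/analysis-kstem-tokenfilter.html[`kstem` token filter] is an alternative stemmer for English that tends to stem less aggressively than the Porter stemmer.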
While you can use the {ref}/analysis-porterstem-tokenfilter.html[`porter_stem`] or
{ref}/analysis-kstem-tokenfilter.html[`kstem`] token filter directly, or
create a language-specific Snowball stemmer with the
{ref}/analysis-snowball-tokenfilter.html[`snowball`] token filter, all of the
algorithmic stemmers are exposed via a single unified interface:
the {ref}/analysis-stemmer-tokenfilter.html[`stemmer` token filter], which
accepts the `language` parameter.
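As a rough illustration of that unified interface, the request below defines a `stemmer` filter inline and selects the algorithm through `language`. It is a sketch under assumptions: the inline filter definition relies on a version of the `_analyze` API that accepts custom filters in the request body, and the sample text is invented.

```json
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "stemmer", "language": "light_english" }
  ],
  "text": "The foxes were jumping quickly"
}
```

Changing `light_english` to `english` in the same request is a quick way to compare how two stemmers treat the same words.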
For instance, perhaps you find the default stemmer used by the `english`
analyzer to be too aggressive and you want to make it less aggressive.
The first step is to look up the configuration for the `english` analyzer
in the {ref}/analysis-lang-analyzer.html[language analyzers]
documentation, which shows the following:
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_"
        },
        "english_keywords": {
          "type":       "keyword_marker", (1)
          "keywords":   []
        },
        "english_stemmer": {
          "type":       "stemmer",
          "language":   "english" (2)
        },
        "english_possessive_stemmer": {
          "type":       "stemmer",
          "language":   "possessive_english" (2)
        }
      },
      "analyzer": {
        "english": {
          "tokenizer":  "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  }
}
(1) The `keyword_marker` token filter lists words that should not be
stemmed. This defaults to the empty list.
(2) The `english` analyzer uses two stemmers: the `possessive_english`
and the `english` stemmer. The possessive stemmer removes `'s`
from any words before passing them on to the `english_stop`,
`english_keywords`, and `english_stemmer`.
Having reviewed the current configuration, we can use it as the basis for a new analyzer, with the following changes:
* Change the `english_stemmer` from `english` (which maps to the
  {ref}/analysis-porterstem-tokenfilter.html[`porter_stem`] token filter)
  to `light_english` (which maps to the less aggressive
  {ref}/analysis-kstem-tokenfilter.html[`kstem`] token filter).
* Add the `asciifolding` token filter to remove any diacritics from foreign words.
* Remove the `keyword_marker` token filter, as we don’t need it.
  (We discuss this in more detail in [controlling-stemming].)
Our new custom analyzer would look like this:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_"
        },
        "light_english_stemmer": {
          "type":       "stemmer",
          "language":   "light_english" (1)
        },
        "english_possessive_stemmer": {
          "type":       "stemmer",
          "language":   "possessive_english"
        }
      },
      "analyzer": {
        "english": {
          "tokenizer":  "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "light_english_stemmer", (1)
            "asciifolding" (2)
          ]
        }
      }
    }
  }
}
(1) Replaced the `english` stemmer with the less aggressive `light_english` stemmer
(2) Added the `asciifolding` token filter
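Once the index exists, the custom analyzer can be checked with the `_analyze` API. The request below is a sketch that assumes `my_index` was created with the settings above and that your Elasticsearch version accepts a JSON body for `_analyze`. The sample text is hypothetical, chosen to exercise the possessive stemmer, the stopword filter, the light stemmer, and `asciifolding` in one pass.

```json
GET /my_index/_analyze
{
  "analyzer": "english",
  "text": "The café's owners were running"
}
```

Comparing the returned tokens with those from the built-in `english` analyzer on the same text is a simple way to judge whether the lighter stemmer behaves as you want.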