Full text search - Handling the dot in Elasticsearch
I have a string property called summary that has its analyzer set to trigrams and its search_analyzer set to words.
"filter": {
  "words_splitter": {
    "type": "word_delimiter",
    "preserve_original": "true"
  },
  "english_words_filter": {
    "type": "stop",
    "stop_words": "_english_"
  },
  "trigrams_filter": {
    "type": "ngram",
    "min_gram": "2",
    "max_gram": "20"
  }
},
"analyzer": {
  "words": {
    "filter": [ "lowercase", "words_splitter", "english_words_filter" ],
    "type": "custom",
    "tokenizer": "whitespace"
  },
  "trigrams": {
    "filter": [ "lowercase", "words_splitter", "trigrams_filter", "english_words_filter" ],
    "type": "custom",
    "tokenizer": "whitespace"
  }
}
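For reference, the field mapping described above would look roughly like the fragment below. This is only a sketch: it assumes an older Elasticsearch version where the field type is "string" (on 5.x and later the type would be "text"), and it shows just the summary property, not the full mapping.

```json
{
  "summary": {
    "type": "string",
    "analyzer": "trigrams",
    "search_analyzer": "words"
  }
}
```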
I need query strings given in input like react , html (or react, html) to be matched against documents that contain in their summary the words react, reactjs, react.js, html, html5. The more matching keywords a document has, the higher its score should be (ideally, I would expect lower scores on documents where a word does not match at 100%).
The thing is, I guess that at the moment react.js is split into both react and js, since I get all the documents that contain js as well. On the other hand, reactjs returns nothing. I also think I need words_splitter in order to ignore the comma.
You can solve the problem for names like react.js with a keyword marker filter, by defining an analyzer that uses the keyword filter. This will prevent react.js from being split into react and js tokens.
Here is an example configuration for the filter:
"filter": {
  "keywords": {
    "type": "keyword_marker",
    "keywords": [ "react.js" ]
  }
}
And the analyzer:
"analyzer": {
  "main_analyzer": {
    "type": "custom",
    "tokenizer": "standard",
    "filter": [ "lowercase", "keywords", "synonym_filter", "german_stop", "german_stemmer" ]
  }
}
You can check whether your analyzer behaves as required by using the analyze command:
GET /<index_name>/_analyze?analyzer=main_analyzer&text="react.js is a nice library"
This should return the following tokens, where react.js is not split into separate tokens:
{
  "tokens": [
    { "token": "react.js", "start_offset": 1, "end_offset": 9, "type": "<ALPHANUM>", "position": 0 },
    { "token": "is", "start_offset": 10, "end_offset": 12, "type": "<ALPHANUM>", "position": 1 },
    { "token": "a", "start_offset": 13, "end_offset": 14, "type": "<ALPHANUM>", "position": 2 },
    { "token": "nice", "start_offset": 15, "end_offset": 19, "type": "<ALPHANUM>", "position": 3 },
    { "token": "library", "start_offset": 20, "end_offset": 27, "type": "<ALPHANUM>", "position": 4 }
  ]
}
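Note that the query-string form of the analyze command only works on older Elasticsearch versions. On newer versions, the same check is sent as a request body to GET /<index_name>/_analyze. A minimal sketch of that body, assuming the analyzer is registered on the index as main_analyzer as in the configuration above:

```json
{
  "analyzer": "main_analyzer",
  "text": "react.js is a nice library"
}
```

Because the text is no longer wrapped in quotes inside a URL, the reported offsets would be expected to start at 0 rather than 1.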
For words that are similar but not the same, such as react.js and reactjs, you could use a synonym filter. Do you have a fixed set of keywords that you want to match?
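If the set of keywords is fixed, the synonym_filter referenced in main_analyzer could be defined along these lines. This is a sketch, not a tested configuration: the synonym groups are illustrative assumptions based on the words mentioned in the question, and treating react or html5 as full equivalents means they would score the same as an exact match, which may not be what you want.

```json
{
  "filter": {
    "synonym_filter": {
      "type": "synonym",
      "synonyms": [
        "react.js, reactjs, react",
        "html, html5"
      ]
    }
  }
}
```

The german_stop and german_stemmer filters listed in main_analyzer would need their own definitions in the same "filter" section.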