full text search - Handling the dot in Elasticsearch


I have a string property called summary whose analyzer is set to trigrams and whose search_analyzer is set to words.

"filter": {     "words_splitter": {         "type": "word_delimiter",         "preserve_original": "true"     },     "english_words_filter": {         "type": "stop",         "stop_words": "_english_"     },     "trigrams_filter": {         "type": "ngram",         "min_gram": "2",         "max_gram": "20"     } }, "analyzer": {     "words": {         "filter": [             "lowercase",             "words_splitter",             "english_words_filter"         ],         "type": "custom",         "tokenizer": "whitespace"     },     "trigrams": {         "filter": [             "lowercase",             "words_splitter",             "trigrams_filter",             "english_words_filter"         ],         "type": "custom",         "tokenizer": "whitespace"     } } 

I need query strings given in input, such as react, html (or react html), to match documents that contain in their summary the words react, reactjs, react.js, html, html5. The more matching keywords there are, the higher the score should be (ideally I would expect lower scores on documents where the word match is not 100%).

The thing is, I guess that at the moment react.js is split into both react and js, since documents that contain just js match as well. On the other hand, reactjs returns nothing. I think I need words_splitter in order to ignore the comma.
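A quick way to confirm how react.js is being tokenized (a sketch; the index name `my_index` is an assumption, the analyzer name comes from the settings above) is to run the text through the _analyze API:

```json
GET /my_index/_analyze?analyzer=words&text=react.js
```

If the response contains separate react and js tokens alongside the original, the word_delimiter filter is doing the splitting (preserve_original keeps react.js but also emits the parts).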

You can solve the problem with names like react.js by using a keyword_marker filter and defining an analyzer that uses that keyword filter. This prevents react.js from being split into the react and js tokens.

Here is an example configuration for the filter:

     "filter": {         "keywords": {            "type": "keyword_marker",            "keywords": [               "react.js",            ]         }      } 

And for the analyzer:

     "analyzer": {         "main_analyzer": {            "type": "custom",            "tokenizer": "standard",            "filter": [               "lowercase",               "keywords",               "synonym_filter",               "german_stop",               "german_stemmer"            ]         }      } 

You can check whether the analyzer behaves as required using the _analyze API:

    GET /<index_name>/_analyze?analyzer=main_analyzer&text="react.js is a nice library"
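If you are on a newer Elasticsearch version (5.0 or later), the _analyze API takes its parameters in a request body rather than the query string; the equivalent request would be:

```json
GET /<index_name>/_analyze
{
    "analyzer": "main_analyzer",
    "text": "react.js is a nice library"
}
```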

This should return the following tokens, with react.js not split apart:

{    "tokens": [       {          "token": "react.js",          "start_offset": 1,          "end_offset": 9,          "type": "<alphanum>",          "position": 0       },       {          "token": "is",          "start_offset": 10,          "end_offset": 12,          "type": "<alphanum>",          "position": 1       },       {          "token": "a",          "start_offset": 13,          "end_offset": 14,          "type": "<alphanum>",          "position": 2       },       {          "token": "nice",          "start_offset": 15,          "end_offset": 19,          "type": "<alphanum>",          "position": 3       },       {          "token": "library",          "start_offset": 20,          "end_offset": 27,          "type": "<alphanum>",          "position": 4       }    ] } 

For words that are similar but not the same, such as react.js and reactjs, you can use a synonym filter. Do you have a fixed set of keywords you want to match?
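A sketch of such a synonym filter using the explicit-mapping syntax, which rewrites variants to a single canonical term (the mapping list is an assumption; extend it with your own keywords):

```json
"filter": {
    "js_synonyms": {
        "type": "synonym",
        "synonyms": [
            "reactjs => react.js",
            "html5 => html"
        ]
    }
}
```

With a one-way mapping like this, a document containing reactjs is indexed under react.js, so queries for either form hit the same term.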


