string - Hamming distance (Simhash python) giving out unexpected value -


i checking out simhash module ( https://github.com/leonsim/simhash ).

i presume simhash("string").distance(simhash("another string")) hamming distance between 2 strings. now, not sure understand "get_features(string) method completely, shown in (https://leons.im/posts/a-python-implementation-of-simhash-algorithm/).

def get_features(s):     width = 2     s = s.lower()     s = re.sub(r'[^\w]+', '', s)     return [s[i:i + width] in range(max(len(s) - width + 1, 1))] 

now, when try compute distance between "aaaa" , "aaas" using width 2, gives out distance 0.

from simhash import simhash  simhash(get_features("aaas")).distance(simhash(get_features("aaaa"))) 

i not sure missing out in here.

dig code

the width, in case, key parameter in get_features(), give different splitted words. get_features() in case output like:

['aa', 'aa', 'aa']

['aa', 'aa', 'as']

then simhash calculates these list unweighted features (which means default weight of each feature 1) , output like:

86f24ba207a4912

86f24ba207a4912

they same!

the reason simhash algorithm itself. let's code:

def build_by_features(self, features):     """     `features` might list of unweighted tokens (a weight of 1         assumed), list of (token, weight) tuples or         token -> weight dict.     """     v = [0] * self.f     masks = [1 << in range(self.f)]     if isinstance(features, dict):         features = features.items()     f in features:         if isinstance(f, basestring):             h = self.hashfunc(f.encode('utf-8'))             w = 1         else:             assert isinstance(f, collections.iterable)             h = self.hashfunc(f[0].encode('utf-8'))             w = f[1]         in range(self.f):             v[i] += w if h & masks[i] else -w     ans = 0     in range(self.f):         if v[i] >= 0:             ans |= masks[i]     self.value = ans 

from: leonsim/simhash

the calculation process can divied 4 steps: 1) hash each splitted word (feature), transform string binary numbers; 2) weight them; 3) assumble weighted bits together; 4) change assumbled number binary , output value.

now, in case, step 3 output like:

[-3, 3, -3, -3, 3, -3, -3, -3, 3, -3, -3, 3, -3, -3, 3, -3, -3, 3, -3, 3, 3, 3, 3, -3, -3, -3, -3, -3, -3, 3, -3, -3, -3, 3, -3, 3, 3, 3, -3, 3, -3, -3, 3, -3, -3, 3, -3, -3, 3, 3, 3, 3, -3, 3, 3, -3, -3, -3, -3, 3, -3, -3, -3, -3]  [-1, 3, -3, -1, 3, -3, -3, -1, 3, -3, -3, 1, -1, -1, 1, -3, -3, 3, -1, 3, 1, 3, 1, -3, -1, -3, -3, -1, -1, 3, -1, -1, -1, 3, -1, 1, 3, 1, -1, 1, -3, -3, 1, -1, -3, 3, -3, -1, 1, 3, 3, 3, -3, 3, 3, -3, -1, -1, -1, 1, -3, -3, -3, -1] 

and after step 4, 2 output same value.

other parameter

if change width 2 1, 3, 4, different result of simhash(get_features()).

your case shows limitation of simhash short length text.


Comments

Popular posts from this blog

jOOQ update returning clause with Oracle -

java - Warning equals/hashCode on @Data annotation lombok with inheritance -

java - BasicPathUsageException: Cannot join to attribute of basic type -