string - Hamming distance (Simhash python) giving out unexpected value -
i checking out simhash module ( https://github.com/leonsim/simhash ).
i presume simhash("string").distance(simhash("another string")) hamming distance between 2 strings. now, not sure understand "get_features(string) method completely, shown in (https://leons.im/posts/a-python-implementation-of-simhash-algorithm/).
def get_features(s): width = 2 s = s.lower() s = re.sub(r'[^\w]+', '', s) return [s[i:i + width] in range(max(len(s) - width + 1, 1))]
now, when try compute distance between "aaaa" , "aaas" using width 2, gives out distance 0.
from simhash import simhash simhash(get_features("aaas")).distance(simhash(get_features("aaaa")))
i not sure missing out in here.
dig code
the width, in case, key parameter in get_features()
, give different splitted words. get_features()
in case output like:
['aa', 'aa', 'aa']
['aa', 'aa', 'as']
then simhash calculates these list unweighted features (which means default weight of each feature 1) , output like:
86f24ba207a4912
86f24ba207a4912
they same!
the reason simhash algorithm itself. let's code:
def build_by_features(self, features): """ `features` might list of unweighted tokens (a weight of 1 assumed), list of (token, weight) tuples or token -> weight dict. """ v = [0] * self.f masks = [1 << in range(self.f)] if isinstance(features, dict): features = features.items() f in features: if isinstance(f, basestring): h = self.hashfunc(f.encode('utf-8')) w = 1 else: assert isinstance(f, collections.iterable) h = self.hashfunc(f[0].encode('utf-8')) w = f[1] in range(self.f): v[i] += w if h & masks[i] else -w ans = 0 in range(self.f): if v[i] >= 0: ans |= masks[i] self.value = ans
the calculation process can divied 4 steps: 1) hash each splitted word (feature), transform string binary numbers; 2) weight them; 3) assumble weighted bits together; 4) change assumbled number binary , output value.
now, in case, step 3 output like:
[-3, 3, -3, -3, 3, -3, -3, -3, 3, -3, -3, 3, -3, -3, 3, -3, -3, 3, -3, 3, 3, 3, 3, -3, -3, -3, -3, -3, -3, 3, -3, -3, -3, 3, -3, 3, 3, 3, -3, 3, -3, -3, 3, -3, -3, 3, -3, -3, 3, 3, 3, 3, -3, 3, 3, -3, -3, -3, -3, 3, -3, -3, -3, -3] [-1, 3, -3, -1, 3, -3, -3, -1, 3, -3, -3, 1, -1, -1, 1, -3, -3, 3, -1, 3, 1, 3, 1, -3, -1, -3, -3, -1, -1, 3, -1, -1, -1, 3, -1, 1, 3, 1, -1, 1, -3, -3, 1, -1, -3, 3, -3, -1, 1, 3, 3, 3, -3, 3, 3, -3, -1, -1, -1, 1, -3, -3, -3, -1]
and after step 4, 2 output same value.
other parameter
if change width 2 1, 3, 4, different result of simhash(get_features())
.
your case shows limitation of simhash short length text.
Comments
Post a Comment