character encoding - Python 3 - number of letters in an encoded string
I need the number of letters in a given string. However, len(txt) returns the number of letters in Unicode form (I guess), and the actual number of letters is less than what I get.
For example:
txt = "שלום וברכה"
len(txt)  # returns something different from 10
I saw a solution for Python 2 using string.decode, but it is not available in Python 3, and I'm not sure it is the appropriate answer for me. By the way, the encoding of the string is cp862.
Edit: more details: I read the text file using
with open(path, "r", encoding="cp862") as textfile:
This is the output of a line I read when I print it:
╫¬╫ñ╫¿╫ש╫ר ╫£╫ª╫ץ╫¥: ╫¢╫ת ╫¬╫ª╫£╫ק╫ץ ╫נ╫¬ ╫¢╫ש╫ñ╫ץ╫¿
Its length is 52. The real line is: תפריט לצום: כך תצלחו את כיפור, and its real length is 29.
You are probably opening the file with the wrong encoding scheme. Here is a demonstration:
>>> import sys
>>> sys.version
'3.4.3 (default, Oct 14 2015, 20:28:29) \n[GCC 4.8.4]'
>>>
>>> s = '╫¬╫ñ╫¿╫ש╫ר ╫£╫ª╫ץ╫¥: ╫¢╫ת ╫¬╫ª╫£╫ק╫ץ ╫נ╫¬ ╫¢╫ש╫ñ╫ץ╫¿'
>>> len(s)
52
>>>
>>> s = s.encode('cp862').decode('utf-8')
>>> s
'תפריט לצום: כך תצלחו את כיפור'
>>> len(s)
29
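The numbers 52 and 29 can be accounted for by counting bytes. The sketch below assumes the file really holds UTF-8 bytes, which the successful round trip above suggests; the variable names are only illustrative. Each Hebrew letter takes two bytes in UTF-8, and cp862 turns every single byte into one character, so the wrongly decoded string has as many characters as the UTF-8 text has bytes.

```python
real = "תפריט לצום: כך תצלחו את כיפור"

# Hebrew letters take two bytes each in UTF-8; spaces and the colon take one.
utf8_bytes = real.encode("utf-8")
print(len(real))        # 29 characters
print(len(utf8_bytes))  # 52 bytes

# Decoding those bytes as cp862 yields one character per byte (mojibake).
mojibake = utf8_bytes.decode("cp862")
print(len(mojibake))    # 52 characters

# Reversing the wrong decode recovers the original text.
repaired = mojibake.encode("cp862").decode("utf-8")
assert repaired == real
```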
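Try opening the file as UTF-8 instead, which is the default encoding on many systems. Passing encoding="utf-8" explicitly is safer, because the default depends on the platform.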
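As a rough illustration of that advice, the sketch below writes a small UTF-8 sample file and reads it back with both encodings. The helper name line_lengths and the temporary sample file are assumptions made for this example, not part of the original question.

```python
import os
import tempfile


def line_lengths(path, encoding):
    """Return the length of each line (newline excluded) when read with `encoding`."""
    with open(path, "r", encoding=encoding) as textfile:
        return [len(line.rstrip("\n")) for line in textfile]


# Hypothetical sample file, written as UTF-8 for the demonstration.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write("שלום וברכה\n")
    sample_path = tmp.name

wrong = line_lengths(sample_path, "cp862")  # one character per UTF-8 byte
right = line_lengths(sample_path, "utf-8")  # one character per letter
os.remove(sample_path)

print(wrong)  # [19]
print(right)  # [10]
```

With the matching encoding, len() already gives the number of letters, so no manual decode step is needed in Python 3.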