character encoding - Python 3 - number of letters in an encoded string


I need the number of letters in a given string. However, len(txt) returns the number of letters in Unicode form (I guess), and the actual number of letters is less than what I get.

For example:

txt = "שלום וברכה"
len(txt)   # returns something different from 10

I saw a solution for Python 2 using string.decode, but that is not available in Python 3, and I'm not sure it is the appropriate answer for me. By the way, the encoding of the string is cp862.

Edit: more details: I read the text file using

with open(path, "r", encoding="cp862") as textfile:

This is the output of a line I read when I print it:

╫¬╫ñ╫¿╫ש╫ר ╫£╫ª╫ץ╫¥: ╫¢╫ת ╫¬╫ª╫£╫ק╫ץ ╫נ╫¬ ╫¢╫ש╫ñ╫ץ╫¿ 

Its length is 52. The real line is: תפריט לצום: כך תצלחו את כיפור, and its real length is 29.

You are probably opening the file with the wrong encoding scheme. Here is a demonstration:

>>> import sys
>>> sys.version
'3.4.3 (default, Oct 14 2015, 20:28:29) \n[GCC 4.8.4]'
>>>
>>> s = '╫¬╫ñ╫¿╫ש╫ר ╫£╫ª╫ץ╫¥: ╫¢╫ת ╫¬╫ª╫£╫ק╫ץ ╫נ╫¬ ╫¢╫ש╫ñ╫ץ╫¿'
>>> len(s)
52
>>>
>>> s = s.encode('cp862').decode('utf-8')
>>> s
'תפריט לצום: כך תצלחו את כיפור'
>>> len(s)
29
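To see where the 52 in the demonstration comes from, here is a minimal sketch. It assumes the file really holds UTF-8 bytes: each Hebrew letter takes two bytes in UTF-8, and cp862 maps every single byte to one character, so decoding those bytes as cp862 yields one character per byte. The variable names are illustrative only.

```python
text = "תפריט לצום: כך תצלחו את כיפור"

utf8_bytes = text.encode("utf-8")      # 23 Hebrew letters * 2 bytes + 6 ASCII characters
garbled = utf8_bytes.decode("cp862")   # wrong codec: one character per byte

print(len(text))        # 29
print(len(utf8_bytes))  # 52
print(len(garbled))     # 52

# Undoing the wrong decoding recovers the original text
repaired = garbled.encode("cp862").decode("utf-8")
print(repaired == text)  # True
```

So the inflated length is a byte count rather than a letter count, and `len()` is correct once the text is decoded with the right codec.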

Try opening the file with the default encoding (UTF-8) instead of cp862.
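As a rough sketch of that suggestion, the snippet below writes a small UTF-8 file and reads it back with both encodings. The temporary file stands in for the real text file and is an assumption made for illustration; in practice `path` would point at the existing file. Passing `encoding="utf-8"` explicitly avoids relying on the platform default.

```python
import os
import tempfile

line = "תפריט לצום: כך תצלחו את כיפור"

# Stand-in for the real file (assumption: the real file is UTF-8 encoded)
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write(line)
    path = tmp.name

try:
    # Wrong codec: every UTF-8 byte becomes its own character
    with open(path, "r", encoding="cp862") as textfile:
        wrong = textfile.read()
    # Right codec: each Hebrew letter is decoded as one character
    with open(path, "r", encoding="utf-8") as textfile:
        right = textfile.read()
finally:
    os.remove(path)

print(len(wrong))  # 52
print(len(right))  # 29
```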

