Python - Removing encoded text from strings read from txt file
Here's the problem:
I copied and pasted an entire list into a txt file from https://www.cboe.org/mdx/mdi/mdiproducts.aspx
A sample of the text lines:
    BFLY - CBOE S&P 500 Iron Butterfly Index
    BPVIX - CBOE/CME FX British Pound Volatility Index
    BPVIX1 - CBOE/CME FX British Pound Volatility First Term Structure Index
    BPVIX2 - CBOE/CME FX British Pound Volatility Second Term Structure Index
These lines of course appear normal in my text file, and I saved the file with UTF-8 encoding.
My goal is to use Python to strip out the symbols from this long list, e.g. BFLY, BPVIX, etc., and write them to a new file.
I am using the following code to read the file and split it:
    x = open('sometextfile.txt', 'r')
    y = x.read().split()
The issue I'm seeing is that there are unfamiliar characters popping up, and they are affecting my ability to filter the list. Example:
    print(y[0])
    BFLY
I'm guessing these characters have something to do with encoding, and I have tried a few different things with the codecs module without success. Using .decode('utf-8') throws an error when I try to use it against the above variables x or y. I am able to use .encode('utf-8'), but that makes things worse.
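One possible explanation, offered as an assumption rather than a diagnosis, is that the editor saved the file with a UTF-8 byte order mark (BOM), and that `open()` then decoded the file with the platform default encoding instead of UTF-8. A minimal sketch of reading such a file with the standard `utf-8-sig` codec, which decodes UTF-8 and drops a leading BOM if one is present, might look like this. The helper name `read_tokens` and the temporary demo file are hypothetical and only there to make the example self-contained.

```python
import os
import tempfile


def read_tokens(path):
    """Read a text file as UTF-8, dropping a leading BOM if present, and split on whitespace."""
    with open(path, 'r', encoding='utf-8-sig') as f:
        return f.read().split()


# Demo file: the three BOM bytes followed by one line of the sample text.
sample = b'\xef\xbb\xbfBFLY - CBOE S&P 500 Iron Butterfly Index\n'
with tempfile.NamedTemporaryFile(delete=False, suffix='.txt') as tmp:
    tmp.write(sample)
    demo_path = tmp.name

tokens = read_tokens(demo_path)
os.remove(demo_path)

print(tokens[0])
```

If the file has no BOM, `utf-8-sig` behaves like plain `utf-8` when reading, so the same call should be safe either way.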
The main problem is when I try to loop through the list and remove any items that are not upper case or that contain non-alpha characters. Ex:
    y[0].isalpha()
    False
    y[0].isupper()
    False
So in this example, the symbol BFLY ends up being removed from the list.
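To illustrate how both checks can come back False, here is a small sketch that assumes the first token carries the three UTF-8 BOM bytes decoded as cp1252 (an assumed platform default, not something confirmed in the question). The helper `is_symbol` is a hypothetical name for the filter described above.

```python
# Simulate the three BOM bytes being decoded as cp1252 instead of UTF-8.
mangled = b'\xef\xbb\xbfBFLY'.decode('cp1252')

# ascii() keeps the output printable on any console.
print(ascii(mangled), mangled.isalpha(), mangled.isupper())


def is_symbol(token):
    """Keep only tokens made entirely of upper-case letters."""
    return token.isalpha() and token.isupper()


tokens = [mangled, '-', 'CBOE', 'S&P', '500', 'Iron', 'BPVIX']
kept = [t for t in tokens if is_symbol(t)]
print(kept)

# Undo the wrong decoding: back to the raw bytes, then decode as UTF-8 and drop the BOM.
repaired = mangled.encode('cp1252').decode('utf-8-sig')
print(repaired, is_symbol(repaired))
```

Note that a filter based on `isalpha()` would also drop symbols that end in a digit, such as BPVIX1, so it may need to be loosened depending on which symbols should survive.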
The funny thing is that these characters are not present in the txt file if I do something like:
    q = open('someotherfile.txt', 'w')
    q.write(y[0])
Any help would be appreciated. I would like to understand why this happens when copying and pasting text from web pages like this one.
Why not use regex?
I think this will catch the letters in caps:
    "[A-Z]{1,}/?[A-Z]{1,}[0-9]?"
This is better. I got a list of such symbols. Here's the result.
    ['BFLY', 'CBOE', 'BPVIX', 'CBOE/CME', 'FX', 'BPVIX1', 'CBOE/CME', 'FX', 'BPVIX2', 'CBOE/CME', 'FX']
Here's the code:
    import re
    # a is the text read from the file
    reg_obj = re.compile(r'[A-Z]{1,}/?[A-Z]{1,}[0-9]?')
    sym = reg_obj.findall(a)
    print(sym)
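The pattern above matches every all-caps run, so names like CBOE and FX come back alongside the symbols. As a rough sketch, assuming each product sits on its own line in the form `SYMBOL - description`, a narrower pattern anchored to the start of the line would return only the symbols. The `symbol_at_start` pattern is a hypothetical variation and not part of the answer above.

```python
import re

# Sample text, assuming one product per line.
text = (
    "BFLY - CBOE S&P 500 Iron Butterfly Index\n"
    "BPVIX - CBOE/CME FX British Pound Volatility Index\n"
    "BPVIX1 - CBOE/CME FX British Pound Volatility First Term Structure Index\n"
    "BPVIX2 - CBOE/CME FX British Pound Volatility Second Term Structure Index\n"
)

# Pattern from the answer: matches every run of two or more capitals.
broad = re.compile(r'[A-Z]{1,}/?[A-Z]{1,}[0-9]?')
all_caps = broad.findall(text)

# Narrower variant: capitals plus optional digits at the start of a line,
# followed by the " - " separator.
symbol_at_start = re.compile(r'^([A-Z]+[0-9]*) - ', re.MULTILINE)
symbols = symbol_at_start.findall(text)

print(all_caps)
print(symbols)
```

Because the search looks for the capitals themselves, stray characters in front of the first symbol do not stop the broad pattern from finding it. The anchored variant does require the symbol to be the very first thing on the line, so it works best on text read with the correct encoding.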