python - Removing encoded text from strings read from txt file


Here's the problem:

I copied and pasted the entire list from https://www.cboe.org/mdx/mdi/mdiproducts.aspx into a txt file.

A sample of the text lines:

BFLY - CBOE S&P 500 Iron Butterfly Index
BPVIX - CBOE/CME FX British Pound Volatility Index
BPVIX1 - CBOE/CME FX British Pound Volatility First Term Structure Index
BPVIX2 - CBOE/CME FX British Pound Volatility Second Term Structure Index

These lines of course appear normal in the text file, and I saved the file with UTF-8 encoding.

My goal is to use Python to strip the symbols out of this long list, e.g. BFLY, BPVIX, etc., and write them to a new file.

I am using the following code to read the file and split it:

    x = open('sometextfile.txt', 'r')
    y = x.read().split()

The issue I'm seeing is that there are unfamiliar characters popping up and affecting my ability to filter the list. Example:

    print(y[0])
    BFLY

I'm guessing these characters have something to do with encoding, and I have tried a few different things with the codecs module without success. Using .decode('utf-8') throws an error when I try to use it against the above variables x or y. I am able to use .encode('utf-8'), but that makes things even worse.

The main problem comes when I try to loop through the list and remove any items that are not upper case or that contain non-alpha characters. Ex:

    y[0].isalpha()
    False
    y[0].isupper()
    False

So in this example the symbol BFLY ends up being removed from the list.

The funny thing is that these characters are not present in a txt file if I do something like:

    q = open('someotherfile.txt', 'w')
    q.write(y[0])

Any help would be appreciated. I would like to understand why this happens when copying and pasting text from web pages like this one.
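One possible explanation, which the question itself does not confirm, is that the editor saved the file with a UTF-8 byte-order mark (BOM) and Python then decoded it with a legacy code page such as cp1252. The sketch below reproduces that situation with a hypothetical temporary file and shows that the `utf-8-sig` codec drops the BOM on read. The file contents and the choice of cp1252 are assumptions made for illustration.

```python
import codecs
import os
import tempfile

# Hypothetical sample file that starts with a UTF-8 byte-order mark (BOM),
# as some editors write when asked to save as "UTF-8".
path = os.path.join(tempfile.mkdtemp(), "sometextfile.txt")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + b"BFLY - CBOE S&P 500 Iron Butterfly Index\n")

# Assumption: the platform default encoding is a legacy code page (cp1252).
# The three BOM bytes then decode to three visible characters.
with open(path, "r", encoding="cp1252") as f:
    wrong = f.read().split()

# "utf-8-sig" decodes UTF-8 and skips a leading BOM if one is present.
with open(path, "r", encoding="utf-8-sig") as f:
    right = f.read().split()

print(repr(wrong[0]), wrong[0].isalpha(), wrong[0].isupper())
print(repr(right[0]), right[0].isalpha(), right[0].isupper())
```

Under these assumptions the first token read with cp1252 carries three extra leading characters, so both `isalpha()` and `isupper()` return False, while the token read with `utf-8-sig` is plain `BFLY`. It would also be consistent with the characters seeming to vanish when written back out: encoding those three characters with the same code page yields the original BOM bytes, which most editors hide.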

Why not use a regex?

I think this will catch the letters in caps:

"[A-Z]{1,}/?[A-Z]{1,}[0-9]?" 

This is better. I got a list of all such symbols. Here's my result.

['BFLY', 'CBOE', 'BPVIX', 'CBOE/CME', 'FX', 'BPVIX1', 'CBOE/CME', 'FX', 'BPVIX2', 'CBOE/CME', 'FX'] 

Here's my code:

    import re

    # a is assumed to hold the text read from the file
    reg_obj = re.compile(r'[A-Z]{1,}/?[A-Z]{1,}[0-9]?')
    sym = reg_obj.findall(a)
    print(sym)
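As a self-contained sketch of the approach above, the snippet below runs the upper-case pattern over the sample lines from the question. Because that pattern also matches words such as CBOE and FX inside the descriptions, it then shows one alternative for keeping only the leading symbol. The alternative assumes each entry sits on its own line in the form `SYMBOL - description`, which the question suggests but does not state; the variable names are illustrative.

```python
import re

# Sample lines from the question; in practice this would be the file contents.
a = (
    "BFLY - CBOE S&P 500 Iron Butterfly Index\n"
    "BPVIX - CBOE/CME FX British Pound Volatility Index\n"
    "BPVIX1 - CBOE/CME FX British Pound Volatility First Term Structure Index\n"
    "BPVIX2 - CBOE/CME FX British Pound Volatility Second Term Structure Index\n"
)

# Upper-case runs of two or more letters, optionally containing one slash
# and optionally followed by a single digit.
reg_obj = re.compile(r'[A-Z]{1,}/?[A-Z]{1,}[0-9]?')
all_caps = reg_obj.findall(a)

# Assumption: one "SYMBOL - description" entry per line, so the symbol is
# whatever comes before the first " - " separator.
symbols = [
    line.split(" - ", 1)[0].strip()
    for line in a.splitlines()
    if " - " in line
]

print(all_caps)
print(symbols)
```

With this sample, `all_caps` contains every all-caps token, including CBOE, CBOE/CME and FX, while `symbols` contains only BFLY, BPVIX, BPVIX1 and BPVIX2.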
