python - How to parse labeled values of columns into a Pandas Dataframe (some column values are missing)? -
the follow 2 rows unlabeled dataset, small subset:
random1 147 sub1 95 34 dewdfa3 15000 -1238 sbaasbaqsbarsbatsbausbaxbelaax aaa:col:uvtwuvwduwduwdww bbb:col:f ccc:col:gtatgtca ddd:col:k20 eee:col:54t ggg:col:-30.5 hhh:col:000.1 iii:col:2 jjj:col:0 random2 123 sub1 996 12 kwnc239 10027 144 lbprlbpslbrdlbsdlbsllbwb aaa:col:uwttutuvvuwwuuu bbb:col:f ddd:col:cacgtcgg eee:col:k19 fff:col:hcc16 ggg:col:873 iii:col:-77 jjj:col:0 kkk:col:0 lll:col:1 mmm:col:212
the first 9 columns consistent throughout dataset, , labeled.
my problem following columns. each value in row labeled column value first, e.g. aaa:col:uvtwuvwduwduwdww
column aaa
, bbb:col:f
column bbb
, etc.
however, (1) each row not have same number of columns , (2) columns "missing". first row missing column fff
, second row skips column ccc
, hhh
.
also, notice first row stops @ column jjj
, while second column stops @ column mmm
.
how 1 allocate 9 + 13 columns of dataframe, , parse these values such if column:value
pair didn't exist, column have nan
value.
would pandas.read_table()
have functionality this?
this "correct" format first row:
random int sub int2 int3 string1 int4 int5 string2 aaa bbb ccc ddd eee fff ggg .... mmm random1 147 sub1 95 34 dewdfa3 15000 -1238 sbaasbaqsbarsbatsbausbaxbelaax uvtwuvwduwduwdww f dfadfadfa k20 54t 'nan' -30.5 ....'nan'
related (and unanswered) question here: how import unlabeled , missing columns pandas dataframe?
this it:
text = """random1 147 sub1 95 34 dewdfa3 15000 -1238 sbaasbaqsbarsbatsbausbaxbelaax aaa:col:uvtwuvwduwduwdww bbb:col:f ccc:col:gtatgtca ddd:col:k20 eee:col:54t ggg:col:-30.5 hhh:col:000.1 iii:col:2 jjj:col:0 random2 123 sub1 996 12 kwnc239 10027 144 lbprlbpslbrdlbsdlbsllbwb aaa:col:uwttutuvvuwwuuu bbb:col:f ddd:col:cacgtcgg eee:col:k19 fff:col:hcc16 ggg:col:873 iii:col:-77 jjj:col:0 kkk:col:0 lll:col:1 mmm:col:212""" data = [line.split() line in text.split('\n')] data1 = [line[:9] line in data] data2 = [line[9:] line in data] # list of dictionaries data2, parse columns dict2 = [[dict([d.split(':col:') d in d1]) d1 in data2] result = pd.concat([pd.dataframe(data1), pd.dataframe(dict2)], axis=1) result.iloc[:, 9:]
Comments
Post a Comment