python - How to parse labeled values of columns into a Pandas Dataframe (some column values are missing)? -

February 15, 2015

the follow 2 rows unlabeled dataset, small subset:

random1 147 sub1    95  34  dewdfa3 15000   -1238   sbaasbaqsbarsbatsbausbaxbelaax  aaa:col:uvtwuvwduwduwdww    bbb:col:f   ccc:col:gtatgtca    ddd:col:k20 eee:col:54t ggg:col:-30.5   hhh:col:000.1   iii:col:2   jjj:col:0     random2 123 sub1    996 12  kwnc239 10027    144        lbprlbpslbrdlbsdlbsllbwb    aaa:col:uwttutuvvuwwuuu bbb:col:f   ddd:col:cacgtcgg    eee:col:k19 fff:col:hcc16   ggg:col:873 iii:col:-77 jjj:col:0   kkk:col:0   lll:col:1   mmm:col:212

the first 9 columns consistent throughout dataset, , labeled.

my problem following columns. each value in row labeled column value first, e.g. aaa:col:uvtwuvwduwduwdww column aaa, bbb:col:f column bbb, etc.

however, (1) each row not have same number of columns , (2) columns "missing". first row missing column fff, second row skips column ccc , hhh.

also, notice first row stops @ column jjj, while second column stops @ column mmm.

how 1 allocate 9 + 13 columns of dataframe, , parse these values such if column:value pair didn't exist, column have nan value.

would pandas.read_table() have functionality this?

this "correct" format first row:

random    int     sub    int2    int3    string1    int4    int5    string2                         aaa            bbb    ccc    ddd    eee    fff    ggg .... mmm random1   147    sub1    95      34      dewdfa3    15000   -1238   sbaasbaqsbarsbatsbausbaxbelaax  uvtwuvwduwduwdww    f   dfadfadfa   k20 54t 'nan' -30.5 ....'nan'

related (and unanswered) question here: how import unlabeled , missing columns pandas dataframe?

this it:

text = """random1 147 sub1    95  34  dewdfa3 15000   -1238   sbaasbaqsbarsbatsbausbaxbelaax  aaa:col:uvtwuvwduwduwdww    bbb:col:f   ccc:col:gtatgtca    ddd:col:k20 eee:col:54t ggg:col:-30.5    hhh:col:000.1   iii:col:2  jjj:col:0    random2 123 sub1    996 12  kwnc239 10027    144        lbprlbpslbrdlbsdlbsllbwb    aaa:col:uwttutuvvuwwuuu bbb:col:f   ddd:col:cacgtcgg    eee:col:k19 fff:col:hcc16   ggg:col:873 iii:col:-77 jjj:col:0   kkk:col:0   lll:col:1   mmm:col:212"""  data = [line.split() line in text.split('\n')] data1 = [line[:9] line in data] data2 = [line[9:] line in data]  # list of dictionaries data2, parse columns dict2 = [[dict([d.split(':col:') d in d1]) d1 in data2]  result = pd.concat([pd.dataframe(data1),                     pd.dataframe(dict2)],                    axis=1)  result.iloc[:, 9:]

Search This Blog

Perl

python - How to parse labeled values of columns into a Pandas Dataframe (some column values are missing)? -

Comments

Post a Comment

Popular posts from this blog

jOOQ update returning clause with Oracle -

java - Warning equals/hashCode on @Data annotation lombok with inheritance -

java - BasicPathUsageException: Cannot join to attribute of basic type -