linux - Remove lines from a file that do not appear in another file, error

April 15, 2012

I have two files, similar to the ones below:

File 1 holds phenotype information; the first column is the individual ID. The original file has 400 rows:

215 2 25 13.8354303 15.2841303
222 2 25.2 15.8507278 17.2994278
216 2 28.2 13.0482192 14.4969192
223 11 15.4 9.2714745 11.6494745

File 2 holds SNP information. The original file has 400 lines with 42,000 characters per line:

215          20211111201200125201212202220111202005111102
222          20111011212200025002211001111120211015112111
216          20210005201100025210212102210212201005101001
223          20222120201200125202202102210121201005010101
217          20211010202200025201202102210121201005010101
218          02022000252012021022101212010050101012021101

I need to remove from file 2 the individuals that do not appear in file 1. For the example above, the result should be:

215          20211111201200125201212202220111202005111102
222          20111011212200025002211001111120211015112111
216          20210005201100025210212102210212201005101001
223          20222120201200125202202102210121201005010101

I used this code:

awk 'NR==FNR{a[$1]; next} $1 in a{print $0}' file1 file2 > file3

However, when I run the main analysis on the generated file, the following errors appear:

*** Error in `./airemlf90': free(): invalid size: 0x00007f5041cc2010 ***
*** Error in `./postgsf90': free(): invalid size: 0x00007fec4a04f010 ***

airemlf90 and postgsf90 are the programs I am running. When I use the original file, the problem does not occur. Is the command I used to delete the individuals adequate? One detail I did not mention: some individuals have a 4-character identification. Could that be causing the error?

Thanks

I wrote a small Python script in a few minutes. I have tested it with 42,000-character lines and it works fine.

import sys

# rudimentary argument parsing
file1 = sys.argv[1]
file2 = sys.argv[2]
file3 = sys.argv[3]

present = set()

# first read file 1, discarding all fields except the first one (the key)
with open(file1, "r") as f1:
    for l in f1:
        toks = l.split()    # splits on whitespace runs, same as awk fields
        if toks:   # robustness against empty lines
            present.add(toks[0])

# now read the second file and write a line to the third file if its ID is in the set
with open(file2, "r") as f2:
    with open(file3, "w") as f3:
        for l in f2:
            toks = l.split()
            if toks and toks[0] in present:
                f3.write(l)

(First install Python if it is not already present.)

Save this sample script as mytool.py and run it like this:

python mytool.py file1.txt file2.txt file3.txt
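Before passing the filtered file to the analysis programs, it may help to look at its layout. The sketch below is only a diagnostic under assumptions: the helper name `summarize` is hypothetical, and whether airemlf90 or postgsf90 actually need every line to have the same length, or the genotype string to start at the same column, is a guess prompted by the question about 4-character IDs rather than something confirmed here. It counts the data lines and reports which line lengths and which start columns of the second field occur.

```python
import sys
from collections import Counter


def summarize(path):
    """Return (data line count, line-length counts, second-field start-column counts)."""
    lengths = Counter()
    starts = Counter()
    count = 0
    with open(path) as f:
        for line in f:
            line = line.rstrip("\r\n")
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or incomplete lines
            count += 1
            lengths[len(line)] += 1
            # 0-based column where the second field begins, searching
            # only after the end of the first field (the ID)
            id_end = line.index(fields[0]) + len(fields[0])
            starts[line.index(fields[1], id_end)] += 1
    return count, lengths, starts


if __name__ == "__main__":
    # Hypothetical usage: python checklayout.py file2.txt file3.txt
    for path in sys.argv[1:]:
        count, lengths, starts = summarize(path)
        print(f"{path}: {count} data lines")
        print(f"  line lengths seen: {dict(lengths)}")
        print(f"  second field starts at columns: {dict(starts)}")
```

If more than one length or start column shows up, comparing the same report for the original file 2 shows whether the irregularity was already there before filtering.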

To process several files at once from a bash script (as a replacement for your original solution), it is easy, although not optimal, because the whole job could be done in one go in Python:

<whatever loop you need>; do
    python mytool.py "$1" "$2" "$3"
done

That is, you call it exactly as you would call awk, with the 3 files.
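For the "all in one go in Python" variant, a minimal sketch could look like the following. The script name `mytool_batch.py`, the function names, and the convention of passing the files in groups of three are assumptions made for illustration; the filtering logic is the same as in the script above.

```python
import sys


def read_ids(path):
    """Collect the first whitespace-separated field of every non-empty line."""
    ids = set()
    with open(path) as f:
        for line in f:
            fields = line.split()
            if fields:
                ids.add(fields[0])
    return ids


def filter_file(keep_path, src_path, dst_path):
    """Copy the lines of src_path whose first field occurs in keep_path.

    Returns the number of lines written to dst_path.
    """
    present = read_ids(keep_path)
    kept = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            fields = line.split()
            if fields and fields[0] in present:
                dst.write(line)  # written unchanged, so the layout is preserved
                kept += 1
    return kept


def main(args):
    # Assumed convention: arguments come in groups of three
    # (file1 file2 file3), repeated as often as needed.
    if not args or len(args) % 3 != 0:
        print("usage: python mytool_batch.py FILE1 FILE2 FILE3 [FILE1 FILE2 FILE3 ...]",
              file=sys.stderr)
        return
    for i in range(0, len(args), 3):
        keep_path, src_path, dst_path = args[i:i + 3]
        kept = filter_file(keep_path, src_path, dst_path)
        print(f"{dst_path}: kept {kept} lines")


if __name__ == "__main__":
    main(sys.argv[1:])
```

It would be run as, for example, `python mytool_batch.py pheno_a.txt snps_a.txt out_a.txt pheno_b.txt snps_b.txt out_b.txt` (file names made up), which starts the interpreter only once for all file groups.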
