linux - Remove lines from a file that do not appear in another file, error
I have two files, similar to the ones below:
File 1 - phenotype information; the first column is the individual. The original file has 400 rows:
215 2 25 13.8354303 15.2841303
222 2 25.2 15.8507278 17.2994278
216 2 28.2 13.0482192 14.4969192
223 11 15.4 9.2714745 11.6494745
File 2 - SNP information. The original file has 400 lines, with 42,000 characters per line:
215 20211111201200125201212202220111202005111102
222 20111011212200025002211001111120211015112111
216 20210005201100025210212102210212201005101001
223 20222120201200125202202102210121201005010101
217 20211010202200025201202102210121201005010101
218 02022000252012021022101212010050101012021101
I need to remove from file 2 the individuals that do not appear in file 1. For example:
215 20211111201200125201212202220111202005111102
222 20111011212200025002211001111120211015112111
216 20210005201100025210212102210212201005101001
223 20222120201200125202202102210121201005010101
I used this code:
awk 'NR==FNR{a[$1]; next} $1 in a{print $0}' file1 file2 > file3
However, when the main analysis is run on the generated file, the following errors appear:
*** Error in `./airemlf90': free(): invalid size: 0x00007f5041cc2010 ***
*** Error in `./postgsf90': free(): invalid size: 0x00007fec4a04f010 ***
airemlf90 and postgsf90 are the software being run. When I use the original file, the problem does not occur. Is the command I used to delete the individuals adequate? One detail I did not mention: the individuals have 4-character identifiers. Could that be the cause of the error?
Thanks
I wrote a small Python script in a few minutes. I have tested it with 42000-character lines and it works fine.
import sys

# rudimentary argument parsing
file1 = sys.argv[1]
file2 = sys.argv[2]
file3 = sys.argv[3]

present = set()

# first read file 1, discarding all fields except the first one (the key)
with open(file1, "r") as f1:
    for l in f1:
        toks = l.split()  # whitespace splitting, same as awk fields
        if toks:  # robustness against empty lines
            present.add(toks[0])

# now read the second file and write a line to the third one only if its id is in the set
with open(file2, "r") as f2:
    with open(file3, "w") as f3:
        for l in f2:
            toks = l.split()
            if toks and toks[0] in present:
                f3.write(l)
(First install Python if it is not already present.)
Save the sample script as mytool.py and run it like this:
python mytool.py file1.txt file2.txt file3.txt
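Once file3.txt has been written, it may be worth checking that its IDs line up with file 1 before handing it to the analysis programs. The sketch below is only a diagnostic aid: the function names `first_fields` and `compare_ids` are made up for illustration, and whether airemlf90 or postgsf90 actually care about missing IDs, duplicates, ID length, or row order is an assumption that the question does not confirm.

```python
from collections import Counter


def first_fields(path):
    """Return the first whitespace-separated field of every non-blank line."""
    ids = []
    with open(path, "r") as handle:
        for line in handle:
            fields = line.split()
            if fields:
                ids.append(fields[0])
    return ids


def compare_ids(phenotype_path, filtered_path):
    """Summarise how the IDs of the two files relate to each other."""
    pheno = first_fields(phenotype_path)
    kept = first_fields(filtered_path)
    counts = Counter(kept)
    return {
        # IDs present in the phenotype file but absent from the filtered file
        "missing_from_filtered": sorted(set(pheno) - set(kept)),
        # IDs present in the filtered file but absent from the phenotype file
        "extra_in_filtered": sorted(set(kept) - set(pheno)),
        # IDs that occur more than once in the filtered file
        "duplicated_in_filtered": sorted(i for i, n in counts.items() if n > 1),
        # distinct ID lengths, e.g. [3, 4] if 3- and 4-character IDs are mixed
        "id_lengths": sorted({len(i) for i in kept}),
        # True only if both files list the same IDs in the same order
        "same_order": pheno == kept,
    }


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 3:
        for key, value in compare_ids(sys.argv[1], sys.argv[2]).items():
            print(key, value)
```

If this were saved under a hypothetical name such as check_ids.py, it could be run as `python check_ids.py file1.txt file3.txt`. Empty lists for the first three entries and `same_order True` would suggest the filtering itself did what was intended.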
To process several files at once from a bash script (to replace the original solution) it's easy, although not optimal, because it could all be done in one go in Python:
<whatever loop you need>; do python mytool.py $1 $2 $3; done
This is exactly how you would call awk with the 3 files.
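For the case where the looping is done inside Python rather than bash, a minimal sketch could look like the following. It assumes the arguments arrive in groups of three (key file, data file, output file); the function names `load_ids`, `filter_by_ids`, and `main` are illustrative and not part of the original script.

```python
import sys


def load_ids(path):
    """Collect the first whitespace-separated field of every non-blank line."""
    ids = set()
    with open(path, "r") as handle:
        for line in handle:
            fields = line.split()
            if fields:
                ids.add(fields[0])
    return ids


def filter_by_ids(keys_path, data_path, out_path):
    """Copy the lines of data_path whose first field occurs in keys_path.

    Returns the number of lines written to out_path.
    """
    present = load_ids(keys_path)
    written = 0
    with open(data_path, "r") as src, open(out_path, "w") as dst:
        for line in src:
            fields = line.split()
            if fields and fields[0] in present:
                dst.write(line)  # the line is written unchanged
                written += 1
    return written


def main(argv):
    # Arguments are consumed in groups of three: keys, data, output.
    if len(argv) % 3 != 0:
        raise SystemExit("expected arguments in groups of three: keys data output")
    for i in range(0, len(argv), 3):
        keys_path, data_path, out_path = argv[i:i + 3]
        count = filter_by_ids(keys_path, data_path, out_path)
        print(f"{out_path}: {count} lines kept")


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```

Saved under a hypothetical name such as mytool_batch.py, it would be called as `python mytool_batch.py a1.txt a2.txt a3.txt b1.txt b2.txt b3.txt`, which starts the interpreter once instead of once per triple.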