r - Repeated subsetting and arithmetic of dataset based on second dataframe
I have a dataframe "dfa" (65,000 rows) of the form:

    chr  pos    ncp     ncp_ratio
    1    72     1.06    0.599
    1    371    4.26    1.331
    1    633    2.10    2.442
    1    859    1.62    1.276
    1    1032   7.62    4.563
    1    1199   6.12    4.896
    1    1340   13.22   23.607
I wish to use the values of chr and pos in each row of dfa to sequentially subset a second data.frame, dfb, of the form:
    chr  pos  watson  crick
    1    1    5       0
    1    2    5       0
    1    4    1       0
    1    6    1       0
    1    7    1       0
    1    8    2       0
    1    9    2       0
    1    12   1       0
    1    14   1       0
    1    15   2       0
    1    22   1       0
dfb has 4 million rows.

Each time I subset dfb, I'd like to retrieve the values for the region of interest based on a range in pos (i.e. +/- 1000 around the value of pos in dfa), and add them to a third data.frame, dfc, which is prefilled with zeros.
I have this working by looping through each row of dfa, but because of the 65,000 rows it takes hours. My questions are:

Is there a better/more efficient way?
Which part of my code is slowing this down so terribly?

My code:
    temp=NULL
    width=300      # region upstream and downstream of centrepoint #
    padding=50     # add padding area to table #
    width1=width+padding
    dfc=data.frame(NULL)
    dfc[1:((width1*2)+1),"pos"]=(1:((width1*2)+1))   # create pos column #
    # prefill dfc table with zeros #
    dfc[1:((width1*2)+1),"watson"]=0
    dfc[1:((width1*2)+1),"crick"]=0
    for (chrom in 1:16) {   # loop1: specify chromosomes to process #
      dfb.1=subset(dfb,chr==chrom)   # make temp copy of dataframes for each chromosome #
      dfa.1=subset(dfa, chr==chrom)
      for (i in 1:nrow(dfa.1)) {   # loop2: for each row in dfa #
        # create temp matrix of hits in region #
        temp=subset(dfb.1, pos>=(dfa.1[i,"pos"]-width1) & pos<=(dfa.1[i,"pos"]+width1))
        temp$pos=temp$pos-dfa.1[i,"pos"]+width1+1
        dfc[temp$pos,"watson"]=dfc[temp$pos,"watson"]+temp[,"watson"]
        dfc[temp$pos,"crick"]=dfc[temp$pos,"crick"]+temp[,"crick"]
      }   # end of loop2 #
    }   # end of loop1 #
Example output is in the following form: pos contains values from 1 to 2000 (representing the region from -1000 to +1000 flanking each central pos position in dfa), and the watson/crick columns contain the sum of hits for each location.
    pos  watson  crick
    1    15      34
    2    35      32
    3    11      26
    4    19      52
    5    10      23
    6    32      17
    7    21      6
    8    15      38
    9    17      68
    10   28      54
    11   27      35
    etc
I cleaned up your code. I don't expect a great improvement, but I think this version might run marginally faster.
    width <- 300
    padding <- 50
    width1 <- width + padding
    dfc <- data.frame(pos=1:((width1*2)+1), watson=0, crick=0)

    for (chrom in 1:16) {
      dfb1 <- subset(dfb, chr == chrom)
      for (pos in dfa$pos[dfa$chr == chrom]) {
        # rows of dfb1 that fall inside the window around this centre
        dfb2 <- dfb1[(dfb1$pos >= pos - width1) & (dfb1$pos <= pos + width1), ]
        # shift positions so the window maps onto rows 1..(2*width1 + 1) of dfc
        rows <- dfb2$pos - pos + width1 + 1
        dfc$watson[rows] <- dfc$watson[rows] + dfb2$watson
        dfc$crick[rows] <- dfc$crick[rows] + dfb2$crick
      }
    }
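If the loop above is still too slow, one possible direction is to avoid the per-row subset entirely. The following is only a sketch, not something I have timed on data of this size: it swaps the inner loop for a binary search with `findInterval()` on sorted positions, then sums all hits per relative offset with `rowsum()`. The function name `window_sums` is made up for illustration, and it assumes chr is stored as a plain number or string in both data frames.

```r
# Hypothetical helper: sums watson/crick hits at each offset around every centre in dfa.
window_sums <- function(dfa, dfb, width1) {
  out <- data.frame(pos = seq_len(2 * width1 + 1), watson = 0, crick = 0)
  for (chrom in intersect(unique(dfa$chr), unique(dfb$chr))) {
    b <- dfb[dfb$chr == chrom, ]
    b <- b[order(b$pos), ]                      # findInterval() needs sorted positions
    centres <- dfa$pos[dfa$chr == chrom]
    # first/last row of b inside each window [centre - width1, centre + width1]
    first <- findInterval(centres - width1, b$pos, left.open = TRUE) + 1
    last  <- findInterval(centres + width1, b$pos)
    n <- last - first + 1                       # number of hits per centre (may be 0)
    if (sum(n) == 0) next
    idx <- rep(first, n) + sequence(n) - 1      # rows of b for every (centre, hit) pair
    rel <- as.integer(b$pos[idx] - rep(centres, n) + width1 + 1)
    # sum both strands per relative offset in one pass
    s <- rowsum(cbind(watson = b$watson[idx], crick = b$crick[idx]), rel)
    rows <- as.integer(rownames(s))
    out$watson[rows] <- out$watson[rows] + s[, "watson"]
    out$crick[rows]  <- out$crick[rows]  + s[, "crick"]
  }
  out
}

# Toy data (invented for this example) just to show the call:
toy_a <- data.frame(chr = c(1, 1, 2), pos = c(5, 8, 3))
toy_b <- data.frame(chr = c(1, 1, 1, 1, 2, 2), pos = c(1, 4, 6, 9, 2, 3),
                    watson = c(5, 1, 1, 2, 3, 4), crick = c(0, 1, 0, 2, 1, 1))
toy_c <- window_sums(toy_a, toy_b, width1 = 2)
```

With the real data this would be called as `dfc <- window_sums(dfa, dfb, width1)`. As for which part is slow, my guess is that the repeated subsetting of a large data.frame and the indexed assignments into dfc inside the loop dominate, but wrapping the loop in `Rprof()` / `summaryRprof()` would confirm that rather than relying on my guess.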