r - Repeated subsetting and arithmetic of dataset based on second dataframe

June 15, 2011

I have a data frame "dfa" (65,000 rows) of the form:

chr pos     ncp     ncp_ratio
1   72      1.06    0.599
1   371     4.26    1.331
1   633     2.10    2.442
1   859     1.62    1.276
1   1032    7.62    4.563
1   1199    6.12    4.896
1   1340    13.22   23.607

I wish to use the values of chr and pos in each row of dfa to sequentially subset a second data.frame, dfb, of the form:

chr pos watson  crick
1   1   5       0
1   2   5       0
1   4   1       0
1   6   1       0
1   7   1       0
1   8   2       0
1   9   2       0
1   12  1       0
1   14  1       0
1   15  2       0
1   22  1       0

dfb has 4 million rows.

Each time I subset dfb, I'd like to retrieve the values for a region of interest based on a range in pos (i.e. +/- 1000 from the value of pos in dfa), and add them to a third data.frame, dfc, which is prefilled with zeros.

I have this working by looping through each row of dfa. Due to the 65,000 rows, it takes hours. My questions are:

Is there a better/more efficient way?
Which part of my code is slowing this down so terribly?

My code:

temp=NULL
width=300 # region upstream and downstream of centrepoint #
padding=50 # add padding area to table #
width1=width+padding
dfc=data.frame(NULL)
dfc[1:((width1*2)+1),"pos"]=(1:((width1*2)+1)) # create pos column #

# prefill dfc table with zeros #
dfc[1:((width1*2)+1),"watson"]=0
dfc[1:((width1*2)+1),"crick"]=0

for (chrom in 1:16) { # loop1: specify chromosomes to process #
  dfb.1=subset(dfb,chr==chrom) # make temp copy of dataframes for each chromosome #
  dfa.1=subset(dfa, chr==chrom)
  for (i in 1:nrow(dfa.1)) { # loop2: for each row in dfa #
    temp=subset(dfb.1, pos>=(dfa.1[i,"pos"]-width1) & pos<=(dfa.1[i,"pos"]+width1)) # create temp matrix of hits in region #
    temp$pos=temp$pos-dfa.1[i,"pos"]+width1+1
    dfc[temp$pos,"watson"]=dfc[temp$pos,"watson"]+temp[,"watson"]
    dfc[temp$pos,"crick"]=dfc[temp$pos,"crick"]+temp[,"crick"]
  } # end of loop2 #
} # end of loop1 #
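One way to approach the second question is to time the two halves of the inner loop separately on a small sample. The sketch below is only an illustration: the data are simulated stand-ins (their sizes and values are made up, not taken from dfa or dfb), and the loop body is split into a "subset" step and an "add" step so that `system.time` can be applied to each. I would expect the subsetting to dominate, because each comparison touches every row of the per-chromosome table, but the timings should decide.

```r
# Simulated stand-ins for one chromosome of dfb and a few centres from dfa
# (assumption: sizes and values are invented for illustration only).
set.seed(1)
n <- 100000
dfb.1 <- data.frame(pos = sort(sample.int(1000000, n)),
                    watson = rpois(n, 1),
                    crick = rpois(n, 1))
centres <- sample.int(1000000, 50)
width1 <- 350
dfc <- data.frame(pos = 1:(2 * width1 + 1), watson = 0, crick = 0)

# Step 1: subsetting. Every centre triggers a scan over all rows of dfb.1.
t_subset <- system.time(
  hits <- lapply(centres, function(p)
    dfb.1[dfb.1$pos >= p - width1 & dfb.1$pos <= p + width1, ])
)

# Step 2: adding the (small) subsets into dfc.
t_add <- system.time(
  for (i in seq_along(centres)) {
    rows <- hits[[i]]$pos - centres[i] + width1 + 1
    dfc$watson[rows] <- dfc$watson[rows] + hits[[i]]$watson
    dfc$crick[rows] <- dfc$crick[rows] + hits[[i]]$crick
  }
)

# Elapsed seconds for each step
timings <- c(subset = t_subset[["elapsed"]], add = t_add[["elapsed"]])
print(timings)
```

For a finer breakdown on the real data, base R's `Rprof()` and `summaryRprof()` can be wrapped around a run over a few hundred rows of dfa.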

Example output is in the following form: pos contains values of 1 to 2000 (representing the region of -1000 to +1000 flanking each central pos position in dfa), and the watson/crick columns contain the sum of hits for each location.

pos watson  crick
1   15      34
2   35      32
3   11      26
4   19      52
5   10      23
6   32      17
7   21      6
8   15      38
9   17      68
10  28      54
11  27      35
etc
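The pos column of the output is a relative coordinate: a hit at absolute position p near a centre c lands in row p - c + width1 + 1. A minimal worked example, using the width1 of 350 from the code (so the table has 701 rows; a +/- 1000 window as described in the text would give 2001) and one centre from the dfa sample. The three hit positions are hypothetical values chosen to land on the left edge, the centre, and the right edge.

```r
width1 <- 350    # width + padding, as in the question's code
centre <- 1032   # one pos value from the dfa sample
hit_pos <- c(682, 1032, 1382)  # hypothetical dfb positions

# Shift absolute positions into row numbers of the output table
rows <- hit_pos - centre + width1 + 1
rows
# 1 351 701
```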

I cleaned up the code. I don't expect a great improvement, but I think this version might run marginally faster.

width <- 300
padding <- 50
width1 <- width + padding
dfc <- data.frame(pos=1:((width1*2)+1), watson=0, crick=0)
for (chrom in 1:16) {
    dfb1 <- subset(dfb, chr == chrom)
    for (pos in dfa$pos[dfa$chr == chrom]) {
        dfb2 <- dfb1[(dfb1$pos >= pos - width1) & (dfb1$pos <= pos + width1), ]
        rows <- dfb2$pos - pos + width1 + 1
        dfc$watson[rows] <- dfc$watson[rows] + dfb2$watson
        dfc$crick[rows] <- dfc$crick[rows] + dfb2$crick
    }
}
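A different approach, which is not part of the answer above, is to avoid subsetting altogether. The sketch below spreads each chromosome's counts into a dense vector indexed by position, then loops over the 2 * width1 + 1 offsets rather than over the 65,000 rows of dfa, so each iteration is a single vectorised lookup across all centres. The function name `sum_windows` is hypothetical. It assumes pos values are positive integers, that each (chr, pos) pair occurs at most once in dfb, and that a numeric vector as long as the largest position on a chromosome fits in memory.

```r
# Hypothetical helper: sum watson/crick hits at each offset around all centres.
sum_windows <- function(dfa, dfb, width1) {
  offsets <- -width1:width1
  watson <- numeric(length(offsets))
  crick <- numeric(length(offsets))
  for (chrom in unique(dfa$chr)) {
    centres <- dfa$pos[dfa$chr == chrom]
    b <- dfb[dfb$chr == chrom, ]
    if (nrow(b) == 0) next
    # Dense per-position vectors, shifted right by width1 so that
    # centre + offset can never fall below index 1.
    n <- max(b$pos, centres) + 2 * width1
    w <- numeric(n)
    cr <- numeric(n)
    w[b$pos + width1] <- b$watson
    cr[b$pos + width1] <- b$crick
    for (k in seq_along(offsets)) {
      idx <- centres + offsets[k] + width1
      watson[k] <- watson[k] + sum(w[idx])
      crick[k] <- crick[k] + sum(cr[idx])
    }
  }
  data.frame(pos = seq_along(offsets), watson = watson, crick = crick)
}

# Tiny made-up example: two chromosomes, window of +/- 2
dfa_small <- data.frame(chr = c(1, 1, 2), pos = c(5, 7, 3))
dfb_small <- data.frame(chr = c(1, 1, 1, 2, 2),
                        pos = c(4, 6, 9, 1, 3),
                        watson = c(1, 2, 3, 4, 5),
                        crick = c(0, 1, 0, 2, 1))
res <- sum_windows(dfa_small, dfb_small, width1 = 2)
print(res)
```

On data of the size described in the question, the call would look like `dfc <- sum_windows(dfa, dfb, width1)`. Whether this is actually faster depends on chromosome lengths and available memory, so it is worth checking the result against the loop on a subset first.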
