python - Join and sum on subset of rows in a dataframe


I have a pandas dataframe that stores date ranges and some associated columns:

            date_start    date_end ... lots of other columns ...
    1       2016-07-01  2016-07-02
    2       2016-07-01  2016-07-03
    3       2016-07-01  2016-07-04
    4       2016-07-02  2016-07-07
    5       2016-07-05  2016-07-06

And a dataframe of Pikachu sightings indexed by date:

                pikachu_sightings
    date
    2016-07-01                  2
    2016-07-02                  4
    2016-07-03                  6
    2016-07-04                  8
    2016-07-05                 10
    2016-07-06                 12
    2016-07-07                 14

For each row in the first df, I'd like to calculate the sum of pikachu_sightings within the date range (i.e., date_start to date_end, inclusive) and store it in a new column. The end df would look like this (numbers left unsummed for clarity):

            date_start    date_end    total_pikachu_sightings
    1       2016-07-01  2016-07-02                      2 + 4
    2       2016-07-01  2016-07-03                  2 + 4 + 6
    3       2016-07-01  2016-07-04              2 + 4 + 6 + 8
    4       2016-07-02  2016-07-07   4 + 6 + 8 + 10 + 12 + 14
    5       2016-07-05  2016-07-06                    10 + 12

If I were doing this iteratively, I'd iterate over each row in the table of date ranges, select the subset of rows in the table of sightings that match the date range, and perform a sum on it. But this is way too slow for my dataset:

    for row in ranges.itertuples():
        # boolean mask: sightings whose date falls inside this row's range (inclusive)
        in_range = (sightings.index >= row.date_start) & (sightings.index <= row.date_end)
        sum_sightings_in_range = sightings.loc[in_range, "pikachu_sightings"].sum()
        # row.Index is the row label; .loc is used because set_value is gone from current pandas
        ranges.loc[row.Index, "total_pikachu_sightings"] = sum_sightings_in_range

This is my attempt at using pandas, which fails because the lengths of the two dataframes do not match (and even if they did, there's probably some other flaw in my approach):

    # does not work: compares the whole sightings index against whole columns of ranges
    ranges["total_pikachu_sightings"] = sightings[
        (sightings.index >= ranges.date_start) & (sightings.index <= ranges.date_end)
    ]["pikachu_sightings"].sum()

I'm trying to understand what the general approach/design should be, as I'd like to aggregate with other functions too; sum just seems like the easiest example. Sorry if this is an obvious question - I'm new to pandas!

First, make sure pikachu_sightings has a datetime index and is sorted.

    import pandas as pd

    p = pikachu_sightings.squeeze()    # force the one-column frame to a Series
    p.index = pd.to_datetime(p.index)  # make the index a DatetimeIndex
    p = p.sort_index()
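The sorted datetime index matters because label-based slicing on it includes both endpoints, which matches the inclusive date_start to date_end ranges in the question. A minimal sketch, assuming the sample sightings data from the question (the `window` and `window_total` names are only illustrative):

```python
import pandas as pd

# Sample data copied from the question: a one-column DataFrame indexed by date strings.
pikachu_sightings = pd.DataFrame(
    {"pikachu_sightings": [2, 4, 6, 8, 10, 12, 14]},
    index=pd.Index(
        ["2016-07-01", "2016-07-02", "2016-07-03", "2016-07-04",
         "2016-07-05", "2016-07-06", "2016-07-07"],
        name="date",
    ),
)

p = pikachu_sightings.squeeze()      # one-column DataFrame -> Series
p.index = pd.to_datetime(p.index)    # strings -> DatetimeIndex
p = p.sort_index()

# Label slicing with .loc keeps both the start date and the end date.
window = p.loc[pd.Timestamp("2016-07-05"):pd.Timestamp("2016-07-06")]
window_total = window.sum()          # 10 + 12
```

Here `window` holds the two rows for 2016-07-05 and 2016-07-06, so the total corresponds to the last row of the expected output in the question.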

Then make sure date_start and date_end are datetime.

    df.date_start = pd.to_datetime(df.date_start)
    df.date_end   = pd.to_datetime(df.date_end)

Then simply slice p by each row's date range and sum:

    df.apply(lambda x: p[x.date_start:x.date_end].sum(), axis=1)

    0     6
    1    12
    2    20
    3    54
    4    22
    dtype: int64
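Because apply with axis=1 still performs one Python-level slice per row, it may remain slow on a very large table of ranges. For sums specifically, one alternative (not part of the answer above, and only a sketch) is to compute a cumulative sum once and look up both endpoints of every range with a binary search. It assumes the sightings index is sorted and timezone-naive; the helper name `total_in_ranges` is hypothetical.

```python
import numpy as np
import pandas as pd

def total_in_ranges(p, starts, ends):
    """Sum p over each inclusive [start, end] date range using prefix sums.

    Assumes p is a Series with a sorted DatetimeIndex, and that starts/ends
    are datetime Series of equal length.
    """
    idx = p.index.to_numpy()
    # prefix[i] is the sum of the first i values, so prefix[0] == 0.
    prefix = np.concatenate(([0], p.to_numpy().cumsum()))
    # Position of the first index entry >= start.
    lo = np.searchsorted(idx, starts.to_numpy(), side="left")
    # Position just past the last index entry <= end.
    hi = np.searchsorted(idx, ends.to_numpy(), side="right")
    # An end before its start yields an empty range, i.e. a sum of 0.
    hi = np.maximum(hi, lo)
    return prefix[hi] - prefix[lo]

# Sample data from the question.
p = pd.Series(
    [2, 4, 6, 8, 10, 12, 14],
    index=pd.date_range("2016-07-01", periods=7, freq="D"),
    name="pikachu_sightings",
)
df = pd.DataFrame({
    "date_start": pd.to_datetime(
        ["2016-07-01", "2016-07-01", "2016-07-01", "2016-07-02", "2016-07-05"]),
    "date_end": pd.to_datetime(
        ["2016-07-02", "2016-07-03", "2016-07-04", "2016-07-07", "2016-07-06"]),
})

df["total_pikachu_sightings"] = total_in_ranges(p, df["date_start"], df["date_end"])
```

On the sample data this gives the same totals as the apply version (6, 12, 20, 54, 22). The prefix-sum trick only covers sums and quantities derived from them, such as counts or means; for an arbitrary aggregation, the apply form generalizes more directly by swapping .sum() for another method.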
