python - Join and sum on subset of rows in a dataframe
I have a pandas dataframe that stores date ranges and some associated columns:
   date_start    date_end  ... lots of other columns ...
1  2016-07-01  2016-07-02
2  2016-07-01  2016-07-03
3  2016-07-01  2016-07-04
4  2016-07-02  2016-07-07
5  2016-07-05  2016-07-06
And a dataframe of Pikachu sightings indexed by date:
            pikachu_sightings
date
2016-07-01                  2
2016-07-02                  4
2016-07-03                  6
2016-07-04                  8
2016-07-05                 10
2016-07-06                 12
2016-07-07                 14
For each row in the first df, I'd like to calculate the sum of pikachu_sightings within that date range (i.e., from date_start to date_end, both inclusive) and store it in a new column. The end df would look like this (numbers left unsummed for clarity):
   date_start    date_end  total_pikachu_sightings
1  2016-07-01  2016-07-02  2 + 4
2  2016-07-01  2016-07-03  2 + 4 + 6
3  2016-07-01  2016-07-04  2 + 4 + 6 + 8
4  2016-07-02  2016-07-07  4 + 6 + 8 + 10 + 12 + 14
5  2016-07-05  2016-07-06  10 + 12
If I were doing this iteratively, I'd iterate over each row in the table of date ranges, select the subset of rows in the table of sightings that match the date range, and perform a sum on it. But this is way too slow for my dataset:
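for row in ranges.itertuples():
    # keep only the sightings whose date falls inside this row's range (inclusive)
    sightings_in_range = sightings[(sightings.index >= row.date_start) & (sightings.index <= row.date_end)]
    sum_sightings_in_range = sightings_in_range["pikachu_sightings"].sum()
    # itertuples() exposes the row label as .Index; .loc replaces the removed set_value()
    ranges.loc[row.Index, 'total_pikachu_sightings'] = sum_sightings_in_range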
This is my attempt at using pandas directly, but it fails because the lengths of the two dataframes do not match (and even if they did, there's probably some other flaw in my approach):
ranges["total_pikachu_sightings"] = sightings[(sightings.index >= ranges.date_start) & (sightings.index <= ranges.date_end)]["pikachu_sightings"].sum()
I'm trying to understand what the general approach/design should be, as I'd like to aggregate with other functions too; sum just seems like the easiest example. Sorry if this is an obvious question - I'm new to pandas!
First, make sure pikachu_sightings has a datetime index and is sorted.
p = pikachu_sightings.squeeze()  # force it into a Series
p.index = pd.to_datetime(p.index)
p = p.sort_index()
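The reason the sorted datetime index matters is that label-based slicing on it includes both endpoints, which is exactly the "date_start to date_end" window the question asks for. A minimal sketch, assuming the sample sightings from the question (the names `window` and `window_total` are made up for illustration, and `.loc` is used only to make the label-based slice explicit):

```python
import pandas as pd

# Sample sightings copied from the question; `p` mirrors the Series built above.
p = pd.Series(
    [2, 4, 6, 8, 10, 12, 14],
    index=pd.date_range("2016-07-01", periods=7, freq="D"),
    name="pikachu_sightings",
)

# Label-based slicing on a sorted DatetimeIndex keeps both endpoints,
# so this selects 2016-07-02, 2016-07-03 and 2016-07-04.
window = p.loc[pd.Timestamp("2016-07-02"):pd.Timestamp("2016-07-04")]
window_total = window.sum()  # 4 + 6 + 8
```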
Then make sure date_start and date_end are datetime.
df.date_start = pd.to_datetime(df.date_start)
df.date_end = pd.to_datetime(df.date_end)
Then it's simply:
df.apply(lambda x: p[x.date_start:x.date_end].sum(), axis=1)

0     6
1    12
2    20
3    54
4    22
dtype: int64
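Putting the steps together, here is a rough end-to-end sketch on the sample data from the question. It stores the result in a new column and shows one way to swap in another aggregate. The helper `aggregate_in_range`, its `how` parameter, and the column names `max_pikachu_sightings` and `total_vectorized` are made up for illustration. Since `apply` with `axis=1` still works row by row, the last part sketches a different technique that is not part of the answer above: prefix sums plus `np.searchsorted`, which only works for sums.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "date_start": ["2016-07-01", "2016-07-01", "2016-07-01", "2016-07-02", "2016-07-05"],
    "date_end": ["2016-07-02", "2016-07-03", "2016-07-04", "2016-07-07", "2016-07-06"],
})
pikachu_sightings = pd.DataFrame(
    {"pikachu_sightings": [2, 4, 6, 8, 10, 12, 14]},
    index=pd.Index(
        ["2016-07-01", "2016-07-02", "2016-07-03", "2016-07-04",
         "2016-07-05", "2016-07-06", "2016-07-07"],
        name="date",
    ),
)

# Preparation steps from the answer: Series with a sorted datetime index,
# and datetime columns for the range bounds.
p = pikachu_sightings.squeeze()
p.index = pd.to_datetime(p.index)
p = p.sort_index()
df["date_start"] = pd.to_datetime(df["date_start"])
df["date_end"] = pd.to_datetime(df["date_end"])


def aggregate_in_range(row, how="sum"):
    """Hypothetical helper: aggregate the sightings inside one row's date range."""
    return p.loc[row.date_start:row.date_end].agg(how)


# Store the per-row aggregate in a new column; other aggregates work the same way.
df["total_pikachu_sightings"] = df.apply(aggregate_in_range, axis=1)
df["max_pikachu_sightings"] = df.apply(lambda row: aggregate_in_range(row, "max"), axis=1)

# Alternative for sums only (an assumption, not the answer's method):
# csum[i] holds the sum of the first i sightings, so a range sum is csum[hi] - csum[lo].
dates = p.index.to_numpy()
csum = np.concatenate([[0], p.to_numpy().cumsum()])
lo = np.searchsorted(dates, df["date_start"].to_numpy(), side="left")   # first date >= date_start
hi = np.searchsorted(dates, df["date_end"].to_numpy(), side="right")    # one past last date <= date_end
df["total_vectorized"] = csum[hi] - csum[lo]
```

The prefix-sum version avoids the Python-level loop, so it should scale better on large tables, but it does not generalize to aggregates like max or median; for those, the `apply` form is the simpler starting point.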