python - pandas: plot time series with minimized gaps
I recently started to explore the depths of pandas and would like to visualize some time-series data that contains gaps, some of them rather large. An example, mydf:
              timestamp       val
0   2016-07-25 00:00:00  0.740442
1   2016-07-25 01:00:00  0.842911
2   2016-07-25 02:00:00 -0.873992
3   2016-07-25 07:00:00 -0.474993
4   2016-07-25 08:00:00 -0.983963
5   2016-07-25 09:00:00  0.597011
6   2016-07-25 10:00:00 -2.043023
7   2016-07-25 12:00:00  0.304668
8   2016-07-25 13:00:00  1.185997
9   2016-07-25 14:00:00  0.920850
10  2016-07-25 15:00:00  0.201423
11  2016-07-25 16:00:00  0.842970
12  2016-07-25 21:00:00  1.061207
13  2016-07-25 22:00:00  0.232180
14  2016-07-25 23:00:00  0.453964
Now, if I plot the DataFrame through mydf.plot(x='timestamp').get_figure().show(), the data along the x-axis is interpolated across the gaps (appearing as one continuous line):
What I would like to have instead is:
- visible gaps between the sections with data
- a consistent gap width for gaps of differing lengths
- perhaps some form of marker on the axis that helps to clarify that jumps in time are performed.
While researching this matter I've come across two approaches which come close to what I'm after, but the former would result in leaving the gaps out of the plotted figure, and the latter in large gaps that I would like to avoid (think of gaps that may span a few days).
As the second approach may be the closer one, I tried to use my timestamp column as the index through:
mydf2 = pd.DataFrame(data=list(mydf['val']), index=mydf['timestamp'], columns=['val'])
which allows me to fill the gaps with NaN through reindexing (wondering if there is a simpler solution to achieve this):
mydf3 = mydf2.reindex(pd.date_range('2016-07-25', periods=24, freq='h'))
leading to:
                          val
2016-07-25 00:00:00  0.740442
2016-07-25 01:00:00  0.842911
2016-07-25 02:00:00 -0.873992
2016-07-25 03:00:00       NaN
2016-07-25 04:00:00       NaN
2016-07-25 05:00:00       NaN
2016-07-25 06:00:00       NaN
2016-07-25 07:00:00 -0.474993
2016-07-25 08:00:00 -0.983963
2016-07-25 09:00:00  0.597011
2016-07-25 10:00:00 -2.043023
2016-07-25 11:00:00       NaN
2016-07-25 12:00:00  0.304668
2016-07-25 13:00:00  1.185997
2016-07-25 14:00:00  0.920850
2016-07-25 15:00:00  0.201423
2016-07-25 16:00:00  0.842970
2016-07-25 17:00:00       NaN
2016-07-25 18:00:00       NaN
2016-07-25 19:00:00       NaN
2016-07-25 20:00:00       NaN
2016-07-25 21:00:00  1.061207
2016-07-25 22:00:00  0.232180
2016-07-25 23:00:00  0.453964
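On the "simpler solution" point: one possible shortcut, sketched here under the assumption that the timestamp column already holds real datetimes, is to set it as the index and call asfreq. The small frame below is a hypothetical miniature of mydf, not the original data. Note that asfreq only spans from the first to the last timestamp present, so it does not pad out a full day the way the explicit date_range does.

```python
import pandas as pd

# Hypothetical miniature of mydf: hourly readings with a three-hour hole.
mydf = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-07-25 00:00", "2016-07-25 01:00",
        "2016-07-25 05:00", "2016-07-25 06:00",
    ]),
    "val": [0.74, 0.84, -0.47, -0.98],
})

# Index by timestamp, then conform to a regular hourly grid.
# Hours without a reading show up as NaN rows.
filled = mydf.set_index("timestamp").asfreq("h")
print(filled)
```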
From here on I might need to reduce the consecutive entries with missing data to a fixed number (representing my gap width), and do something so that the index values of these entries are plotted differently, but I got lost here, I guess, as I don't know how to achieve that.
While tinkering around I wondered if there might be a more direct and elegant approach, and I would be thankful if anyone knowing more about this could point me in the right direction.
Thanks for any hints and feedback in advance!
### addendum ###
After posting my question I came across another interesting idea, posted by Andy Hayden, that seems helpful. He uses a column to hold the results of comparing each time difference against a time delta. After performing a cumsum() on the int representation of the boolean results, he uses groupby() to cluster the entries of each ungapped series into a DataFrameGroupBy object.
As this was written some time ago, pandas now returns timedelta objects, so the comparison should be done with a timedelta object (based on the mydf from above, or on the reindexed mydf2 after copying its index to a column through mydf2['timestamp'] = mydf2.index):
from datetime import timedelta

mytd = timedelta(minutes=60)
mydf['nogap'] = mydf['timestamp'].diff() > mytd
mydf['nogap'] = mydf['nogap'].apply(lambda x: 1 if x else 0).cumsum()
## btw.: why not "... .apply(lambda x: int(x)) ..."?
dfg = mydf.groupby('nogap')
We can now iterate over the DataFrameGroupBy, getting the ungapped series, and do something with them. My pandas/matplotlib skills are way too immature, but could pandas plot the group elements into sub-plots? Maybe that way the discontinuity along the time axis could be represented best (in the form of an interrupted axis line or such)?
piRSquared's answer already leads to a quite usable result; the only thing kind of missing is a more striking visual cue along the time axis that a gap/time jump has occurred between two values.
Maybe with the grouped sections the width of the gap representation could be more configurable?
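As a rough illustration of the grouped-sections idea, the sketch below puts each ungapped run into its own panel, so the space between panels acts as the gap and its width can be tuned through wspace. The data is a hypothetical subset of mydf, and run_id, groups and the one-hour threshold are names and choices made up for this example. It is one possible layout, not the approach from either linked answer.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical subset of mydf with two gaps.
mydf = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-07-25 00:00", "2016-07-25 01:00", "2016-07-25 02:00",
        "2016-07-25 07:00", "2016-07-25 08:00",
        "2016-07-25 12:00", "2016-07-25 13:00", "2016-07-25 14:00",
    ]),
    "val": [0.74, 0.84, -0.87, -0.47, -0.98, 0.30, 1.19, 0.92],
})

# A step larger than one hour starts a new run; cumsum on the boolean
# result numbers the runs 0, 1, 2, ...
run_id = (mydf["timestamp"].diff() > pd.Timedelta(hours=1)).cumsum().rename("run")
groups = [g for _, g in mydf.groupby(run_id)]

# One panel per run, panel width proportional to the number of samples.
# wspace controls the visual gap width between the runs.
fig, axes = plt.subplots(
    1, len(groups), sharey=True, figsize=(10, 3), squeeze=False,
    gridspec_kw={"width_ratios": [len(g) for g in groups], "wspace": 0.05},
)
for ax, g in zip(axes[0], groups):
    ax.plot(g["timestamp"], g["val"], marker="o")
    ax.tick_params(axis="x", rotation=45)
```

Because every gap is just the spacing between two panels, all gaps get the same width regardless of how much time they span. Call fig.savefig(...) or plt.show() to see the result.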
I built a new series and plotted it. It's not super elegant, but I believe it gets you what you wanted.
Setup
Do this to get to the starting point:
from io import StringIO
import pandas as pd

text = """
timestamp            val
2016-07-25 00:00:00   0.740442
2016-07-25 01:00:00   0.842911
2016-07-25 02:00:00  -0.873992
2016-07-25 07:00:00  -0.474993
2016-07-25 08:00:00  -0.983963
2016-07-25 09:00:00   0.597011
2016-07-25 10:00:00  -2.043023
2016-07-25 12:00:00   0.304668
2016-07-25 13:00:00   1.185997
2016-07-25 14:00:00   0.920850
2016-07-25 15:00:00   0.201423
2016-07-25 16:00:00   0.842970
2016-07-25 21:00:00   1.061207
2016-07-25 22:00:00   0.232180
2016-07-25 23:00:00   0.453964"""

# Columns are separated by two or more spaces; the single space inside
# each timestamp must not split the field.
s1 = pd.read_csv(StringIO(text), index_col=0, parse_dates=[0],
                 engine='python', sep=r'\s{2,}').squeeze()

s1

timestamp
2016-07-25 00:00:00    0.740442
2016-07-25 01:00:00    0.842911
2016-07-25 02:00:00   -0.873992
2016-07-25 07:00:00   -0.474993
2016-07-25 08:00:00   -0.983963
2016-07-25 09:00:00    0.597011
2016-07-25 10:00:00   -2.043023
2016-07-25 12:00:00    0.304668
2016-07-25 13:00:00    1.185997
2016-07-25 14:00:00    0.920850
2016-07-25 15:00:00    0.201423
2016-07-25 16:00:00    0.842970
2016-07-25 21:00:00    1.061207
2016-07-25 22:00:00    0.232180
2016-07-25 23:00:00    0.453964
Name: val, dtype: float64
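Resample to hourly. resample is a deferred method, meaning it expects you to pass another method afterwards so it knows what to do. I used mean. For your example, it doesn't matter because we are sampling to a higher frequency, so no bin ever holds more than one value. Look it up if you care.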
s2 = s1.resample('h').mean()

s2

timestamp
2016-07-25 00:00:00    0.740442
2016-07-25 01:00:00    0.842911
2016-07-25 02:00:00   -0.873992
2016-07-25 03:00:00         NaN
2016-07-25 04:00:00         NaN
2016-07-25 05:00:00         NaN
2016-07-25 06:00:00         NaN
2016-07-25 07:00:00   -0.474993
2016-07-25 08:00:00   -0.983963
2016-07-25 09:00:00    0.597011
2016-07-25 10:00:00   -2.043023
2016-07-25 11:00:00         NaN
2016-07-25 12:00:00    0.304668
2016-07-25 13:00:00    1.185997
2016-07-25 14:00:00    0.920850
2016-07-25 15:00:00    0.201423
2016-07-25 16:00:00    0.842970
2016-07-25 17:00:00         NaN
2016-07-25 18:00:00         NaN
2016-07-25 19:00:00         NaN
2016-07-25 20:00:00         NaN
2016-07-25 21:00:00    1.061207
2016-07-25 22:00:00    0.232180
2016-07-25 23:00:00    0.453964
Freq: h, Name: val, dtype: float64
OK, so you also wanted equally sized gaps. This was a tad tricky. I used ffill(limit=1) to fill in only 1 space of each gap. Then I took the slice of s2 where this forward-filled thing was not null. This gives me a single null for each gap.
s3 = s2[s2.ffill(limit=1).notnull()]

s3

timestamp
2016-07-25 00:00:00    0.740442
2016-07-25 01:00:00    0.842911
2016-07-25 02:00:00   -0.873992
2016-07-25 03:00:00         NaN
2016-07-25 07:00:00   -0.474993
2016-07-25 08:00:00   -0.983963
2016-07-25 09:00:00    0.597011
2016-07-25 10:00:00   -2.043023
2016-07-25 11:00:00         NaN
2016-07-25 12:00:00    0.304668
2016-07-25 13:00:00    1.185997
2016-07-25 14:00:00    0.920850
2016-07-25 15:00:00    0.201423
2016-07-25 16:00:00    0.842970
2016-07-25 17:00:00         NaN
2016-07-25 21:00:00    1.061207
2016-07-25 22:00:00    0.232180
2016-07-25 23:00:00    0.453964
Name: val, dtype: float64
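To see why the mask keeps exactly one placeholder per gap, here is the same trick on a tiny made-up series (the values and the names keep, one_slot and two_slots are illustrative only). The forward fill touches just the first NaN of every run, so notnull() is True for real values plus that one slot.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2016-07-25", periods=8, freq="h")
s2 = pd.Series([1.0, 2.0, np.nan, np.nan, np.nan, 3.0, np.nan, 4.0], index=idx)

# ffill(limit=1) fills only the first NaN of each run of NaNs, so the
# mask is True for every real value and for one slot per gap.
keep = s2.ffill(limit=1).notnull()
one_slot = s2[keep]

# A larger limit widens the placeholder: up to two NaN slots per gap here.
two_slots = s2[s2.ffill(limit=2).notnull()]

print(one_slot)
print(two_slots)
```

The limit therefore works as a crude gap-width setting, with the caveat that a gap shorter than the limit keeps its real length rather than being padded.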
Lastly, if I plotted this, I'd still get irregular gaps. I need str indices so that matplotlib doesn't try to expand out my dates. I relabel the index with rename rather than reindex, because reindex looks the new labels up against the existing timestamps instead of simply replacing them.
s3.rename(lambda ts: ts.strftime('%H:%M'))

timestamp
00:00    0.740442
01:00    0.842911
02:00   -0.873992
03:00         NaN
07:00   -0.474993
08:00   -0.983963
09:00    0.597011
10:00   -2.043023
11:00         NaN
12:00    0.304668
13:00    1.185997
14:00    0.920850
15:00    0.201423
16:00    0.842970
17:00         NaN
21:00    1.061207
22:00    0.232180
23:00    0.453964
Name: val, dtype: float64
I'll plot them both so you can see the difference.
import matplotlib.pyplot as plt

f, a = plt.subplots(2, 1, sharey=True, figsize=(10, 5))
s2.plot(ax=a[0])
s3.rename(lambda ts: ts.strftime('%H:%M')).plot(ax=a[1])
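If a stronger visual cue for the time jumps is wanted, one option (a sketch, not part of the approach above) is to draw a dashed vertical line at each placeholder position and replace its tick label with a break marker. The series below is a hypothetical stand-in for s3, and the "//" label is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for s3: one NaN placeholder per gap.
s3 = pd.Series(
    [0.74, 0.84, np.nan, -0.47, -0.98, np.nan, 1.06],
    index=pd.to_datetime([
        "2016-07-25 00:00", "2016-07-25 01:00", "2016-07-25 02:00",
        "2016-07-25 07:00", "2016-07-25 08:00", "2016-07-25 09:00",
        "2016-07-25 21:00",
    ]),
)

# Plot against integer positions so every slot has the same width.
positions = np.arange(len(s3))
is_gap = s3.isna().to_numpy()
gap_positions = positions[is_gap]

fig, ax = plt.subplots(figsize=(10, 3))
ax.plot(positions, s3.to_numpy(), marker="o")    # NaN breaks the line
for x in gap_positions:
    ax.axvline(x, color="grey", linestyle="--")  # mark the time jump

# Time labels for real samples, a break marker for the placeholders.
labels = np.where(is_gap, "//", s3.index.strftime("%H:%M"))
ax.set_xticks(positions)
ax.set_xticklabels(labels)
```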