python - pandas plot time-series with minimized gaps -

I have started to explore the depths of pandas and would like to visualize time-series data that contains gaps, some of them rather large. An example mydf:

                 timestamp       val
    0  2016-07-25 00:00:00  0.740442
    1  2016-07-25 01:00:00  0.842911
    2  2016-07-25 02:00:00 -0.873992
    3  2016-07-25 07:00:00 -0.474993
    4  2016-07-25 08:00:00 -0.983963
    5  2016-07-25 09:00:00  0.597011
    6  2016-07-25 10:00:00 -2.043023
    7  2016-07-25 12:00:00  0.304668
    8  2016-07-25 13:00:00  1.185997
    9  2016-07-25 14:00:00  0.920850
    10 2016-07-25 15:00:00  0.201423
    11 2016-07-25 16:00:00  0.842970
    12 2016-07-25 21:00:00  1.061207
    13 2016-07-25 22:00:00  0.232180
    14 2016-07-25 23:00:00  0.453964

If I now plot the DataFrame through mydf.plot(x='timestamp').get_figure().show(), the data along the x-axis is interpolated across the gaps (appearing as one continuous line).

What I would like to have instead is:

visible gaps between the sections of data
a consistent gap width for differing gap lengths
perhaps some form of marker on the axis that helps to clarify the fact that jumps in time are performed

While researching this matter I came across two approaches that come close to what I'm after. However, the former would simply leave the gaps out of the plotted figure, and the latter would produce the large gaps that I would like to avoid (think of gaps that may span a few days).

As the second approach may be the closer one, I tried to use my timestamp column as the index through:

    mydf2 = pd.DataFrame(data=list(mydf['val']), index=mydf['timestamp'], columns=['val'])

which allows me to fill the gaps with NaN through reindexing (I'm wondering whether there is a simpler way to achieve this):

    mydf3 = mydf2.reindex(pd.date_range('2016-07-25', periods=24, freq='h'))

leading to:

                              val
    2016-07-25 00:00:00  0.740442
    2016-07-25 01:00:00  0.842911
    2016-07-25 02:00:00 -0.873992
    2016-07-25 03:00:00       NaN
    2016-07-25 04:00:00       NaN
    2016-07-25 05:00:00       NaN
    2016-07-25 06:00:00       NaN
    2016-07-25 07:00:00 -0.474993
    2016-07-25 08:00:00 -0.983963
    2016-07-25 09:00:00  0.597011
    2016-07-25 10:00:00 -2.043023
    2016-07-25 11:00:00       NaN
    2016-07-25 12:00:00  0.304668
    2016-07-25 13:00:00  1.185997
    2016-07-25 14:00:00  0.920850
    2016-07-25 15:00:00  0.201423
    2016-07-25 16:00:00  0.842970
    2016-07-25 17:00:00       NaN
    2016-07-25 18:00:00       NaN
    2016-07-25 19:00:00       NaN
    2016-07-25 20:00:00       NaN
    2016-07-25 21:00:00  1.061207
    2016-07-25 22:00:00  0.232180
    2016-07-25 23:00:00  0.453964
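On the side question of a simpler way to get the NaN rows: one option, sketched here under the assumption that the timestamp column already has a datetime dtype, is to set it as the index and call asfreq. The sample data is rebuilt inline so the snippet runs on its own; the names hours, vals and mydf_hourly are illustrative only.

```python
import pandas as pd

# Rebuild the sample frame from the question (hour offsets on 2016-07-25).
hours = [0, 1, 2, 7, 8, 9, 10, 12, 13, 14, 15, 16, 21, 22, 23]
vals = [0.740442, 0.842911, -0.873992, -0.474993, -0.983963,
        0.597011, -2.043023, 0.304668, 1.185997, 0.920850,
        0.201423, 0.842970, 1.061207, 0.232180, 0.453964]
mydf = pd.DataFrame({
    "timestamp": pd.Timestamp("2016-07-25") + pd.to_timedelta(hours, unit="h"),
    "val": vals,
})

# asfreq conforms the index to a regular hourly grid between the first and
# the last timestamp; hours without data become NaN rows.
mydf_hourly = mydf.set_index("timestamp").asfreq("h")
print(mydf_hourly)
```

Note that asfreq only spans the range from the first to the last timestamp that is present, whereas the explicit date_range above always yields 24 rows. For this sample the two coincide.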

From here on I might need to reduce the consecutive entries with missing data to a fixed number (representing my gap width) and do something about the index values of these entries so that they are plotted differently, but I got lost here, I guess, as I don't know how to achieve that.

While tinkering around I wondered whether there might be a more direct and elegant approach, and I would be thankful if anyone who knows more about this could point me in the right direction.

Thanks for any hints and feedback in advance!

### addendum ###

After posting my question I came across another interesting idea posted by Andy Hayden that seems helpful. He uses a column to hold the result of comparing the difference between consecutive timestamps against a time delta. After performing cumsum() on the int representation of the boolean results, he uses groupby() to cluster the entries of each ungapped series into a DataFrameGroupBy object.

As that was written some time ago and pandas now returns timedelta objects, the comparison should be done against a timedelta object, like so (based on mydf from above, or on the reindexed mydf2 after copying its index to a column through mydf2['timestamp'] = mydf2.index):

    from datetime import timedelta

    mytd = timedelta(minutes=60)
    mydf['nogap'] = mydf['timestamp'].diff() > mytd
    mydf['nogap'] = mydf['nogap'].apply(lambda x: 1 if x else 0).cumsum()
    ## btw.: why not "... .apply(lambda x: int(x)) ..."?
    dfg = mydf.groupby('nogap')
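On the inline question about int(x): as far as I can tell neither apply call is needed, because cumsum on a boolean Series already counts True as 1. A minimal sketch with the sample data rebuilt inline (the names gap_started and group_sizes are illustrative only):

```python
import pandas as pd

hours = [0, 1, 2, 7, 8, 9, 10, 12, 13, 14, 15, 16, 21, 22, 23]
mydf = pd.DataFrame({
    "timestamp": pd.Timestamp("2016-07-25") + pd.to_timedelta(hours, unit="h"),
    "val": range(len(hours)),  # placeholder values, only the timestamps matter here
})

# True wherever the step from the previous row is longer than one hour.
# The first row compares against NaT, which evaluates to False.
gap_started = mydf["timestamp"].diff() > pd.Timedelta(minutes=60)

# Summing the booleans cumulatively gives every ungapped run its own integer id.
mydf["nogap"] = gap_started.cumsum()

group_sizes = mydf.groupby("nogap").size().tolist()
print(group_sizes)
```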

We could now iterate over the DataFrameGroupBy object, getting the ungapped series, and do something with them. My pandas/matplotlib skills are way too immature, but could the group elements be plotted as sub-plots? Maybe that way the discontinuity along the time axis could be represented in some way (in the form of an interrupted axis line or such)?

piRSquared's answer already leads to a quite usable result. The only thing kind of missing is a more striking visual feedback along the time axis that a gap/time jump has occurred between two values.

Maybe with the grouped sections the width of the gap representation could be made more configurable?
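One way the sub-plot idea could look is sketched below. It is only an illustration, not a method taken from the answers: each ungapped group gets its own axes, the inner spines are hidden so the axis line appears interrupted, and the wspace value sets the visual gap width. The data is rebuilt inline, and the Agg backend is chosen only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

hours = [0, 1, 2, 7, 8, 9, 10, 12, 13, 14, 15, 16, 21, 22, 23]
vals = [0.740442, 0.842911, -0.873992, -0.474993, -0.983963,
        0.597011, -2.043023, 0.304668, 1.185997, 0.920850,
        0.201423, 0.842970, 1.061207, 0.232180, 0.453964]
mydf = pd.DataFrame({
    "timestamp": pd.Timestamp("2016-07-25") + pd.to_timedelta(hours, unit="h"),
    "val": vals,
})

# Label each ungapped run, then collect the runs as separate frames.
mydf["nogap"] = (mydf["timestamp"].diff() > pd.Timedelta(hours=1)).cumsum()
groups = [g for _, g in mydf.groupby("nogap")]

# One axes per run; widths proportional to the number of points in the run,
# wspace controls how wide the break between runs is drawn.
fig, axes = plt.subplots(
    1, len(groups), sharey=True, figsize=(10, 3), squeeze=False,
    gridspec_kw={"width_ratios": [len(g) for g in groups], "wspace": 0.1},
)
axes = axes[0]

for i, (ax, g) in enumerate(zip(axes, groups)):
    ax.plot(g["timestamp"], g["val"], marker="o")
    ax.tick_params(axis="x", labelrotation=45)
    # Hide the spines between neighbouring axes so the frame looks interrupted.
    if i > 0:
        ax.spines["left"].set_visible(False)
        ax.tick_params(axis="y", left=False)
    if i < len(groups) - 1:
        ax.spines["right"].set_visible(False)

# fig.savefig("gapped_subplots.png")  # or plt.show() in an interactive session
```

Because every panel keeps a real datetime axis, tick labels still show the true times on both sides of each break.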

I built a new series and plotted it. This is not super elegant, but I believe it gets you what you wanted.

Setup

Do this to get to a starting point.

    from io import StringIO
    import pandas as pd

    text = """          timestamp       val
    2016-07-25 00:00:00   0.740442
    2016-07-25 01:00:00   0.842911
    2016-07-25 02:00:00  -0.873992
    2016-07-25 07:00:00  -0.474993
    2016-07-25 08:00:00  -0.983963
    2016-07-25 09:00:00   0.597011
    2016-07-25 10:00:00  -2.043023
    2016-07-25 12:00:00   0.304668
    2016-07-25 13:00:00   1.185997
    2016-07-25 14:00:00   0.920850
    2016-07-25 15:00:00   0.201423
    2016-07-25 16:00:00   0.842970
    2016-07-25 21:00:00   1.061207
    2016-07-25 22:00:00   0.232180
    2016-07-25 23:00:00   0.453964"""

    # Columns are separated by two or more spaces; the single space inside
    # each timestamp is therefore kept as part of the first column.
    s1 = pd.read_csv(StringIO(text),
                     index_col=0,
                     parse_dates=[0],
                     engine='python',
                     sep=r'\s{2,}').squeeze()

    s1

    timestamp
    2016-07-25 00:00:00    0.740442
    2016-07-25 01:00:00    0.842911
    2016-07-25 02:00:00   -0.873992
    2016-07-25 07:00:00   -0.474993
    2016-07-25 08:00:00   -0.983963
    2016-07-25 09:00:00    0.597011
    2016-07-25 10:00:00   -2.043023
    2016-07-25 12:00:00    0.304668
    2016-07-25 13:00:00    1.185997
    2016-07-25 14:00:00    0.920850
    2016-07-25 15:00:00    0.201423
    2016-07-25 16:00:00    0.842970
    2016-07-25 21:00:00    1.061207
    2016-07-25 22:00:00    0.232180
    2016-07-25 23:00:00    0.453964
    Name: val, dtype: float64

Resample to hourly. resample is a deferred method, meaning it expects you to pass another method afterwards so it knows what to do. I used mean. For your example it doesn't matter which one, because we are sampling to a higher frequency. Look it up if you care.

    s2 = s1.resample('h').mean()

    s2

    timestamp
    2016-07-25 00:00:00    0.740442
    2016-07-25 01:00:00    0.842911
    2016-07-25 02:00:00   -0.873992
    2016-07-25 03:00:00         NaN
    2016-07-25 04:00:00         NaN
    2016-07-25 05:00:00         NaN
    2016-07-25 06:00:00         NaN
    2016-07-25 07:00:00   -0.474993
    2016-07-25 08:00:00   -0.983963
    2016-07-25 09:00:00    0.597011
    2016-07-25 10:00:00   -2.043023
    2016-07-25 11:00:00         NaN
    2016-07-25 12:00:00    0.304668
    2016-07-25 13:00:00    1.185997
    2016-07-25 14:00:00    0.920850
    2016-07-25 15:00:00    0.201423
    2016-07-25 16:00:00    0.842970
    2016-07-25 17:00:00         NaN
    2016-07-25 18:00:00         NaN
    2016-07-25 19:00:00         NaN
    2016-07-25 20:00:00         NaN
    2016-07-25 21:00:00    1.061207
    2016-07-25 22:00:00    0.232180
    2016-07-25 23:00:00    0.453964
    Freq: h, Name: val, dtype: float64

OK, you also wanted equally sized gaps. This was a tad tricky. I used ffill(limit=1) to fill in only one space of each gap. Then I took the slice of s2 where this forward-filled thing was not null. This gives me a single null for each gap.

    s3 = s2[s2.ffill(limit=1).notnull()]

    s3

    timestamp
    2016-07-25 00:00:00    0.740442
    2016-07-25 01:00:00    0.842911
    2016-07-25 02:00:00   -0.873992
    2016-07-25 03:00:00         NaN
    2016-07-25 07:00:00   -0.474993
    2016-07-25 08:00:00   -0.983963
    2016-07-25 09:00:00    0.597011
    2016-07-25 10:00:00   -2.043023
    2016-07-25 11:00:00         NaN
    2016-07-25 12:00:00    0.304668
    2016-07-25 13:00:00    1.185997
    2016-07-25 14:00:00    0.920850
    2016-07-25 15:00:00    0.201423
    2016-07-25 16:00:00    0.842970
    2016-07-25 17:00:00         NaN
    2016-07-25 21:00:00    1.061207
    2016-07-25 22:00:00    0.232180
    2016-07-25 23:00:00    0.453964
    Name: val, dtype: float64
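The same trick could be generalized to a configurable gap width by raising the limit. The helper below is a hypothetical sketch (shrink_gaps is not a pandas function), with the hourly series rebuilt inline so it runs on its own.

```python
import pandas as pd


def shrink_gaps(s, width=1):
    """Keep at most `width` consecutive NaN rows per gap (hypothetical helper)."""
    # ffill(limit=width) fills only the first `width` NaNs of every gap, so
    # notna() is True for real values plus those leading NaN positions.
    keep = s.ffill(limit=width).notna()
    return s[keep]


hours = [0, 1, 2, 7, 8, 9, 10, 12, 13, 14, 15, 16, 21, 22, 23]
vals = [0.740442, 0.842911, -0.873992, -0.474993, -0.983963,
        0.597011, -2.043023, 0.304668, 1.185997, 0.920850,
        0.201423, 0.842970, 1.061207, 0.232180, 0.453964]
idx = pd.Timestamp("2016-07-25") + pd.to_timedelta(hours, unit="h")
s2 = pd.Series(vals, index=idx, name="val").asfreq("h")

one_wide = shrink_gaps(s2, width=1)  # same result as s3 above
two_wide = shrink_gaps(s2, width=2)  # up to two NaN rows per gap
print(len(one_wide), len(two_wide))
```

Gaps shorter than width keep their original length, so the gap width is only uniform when every gap is at least width rows long.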

Lastly, if I plotted this as it is, I would still get irregular gaps. I need str indices so that matplotlib doesn't try to expand them out to dates.

    s3.set_axis(s3.index.strftime('%H:%M'))

    timestamp
    00:00    0.740442
    01:00    0.842911
    02:00   -0.873992
    03:00         NaN
    07:00   -0.474993
    08:00   -0.983963
    09:00    0.597011
    10:00   -2.043023
    11:00         NaN
    12:00    0.304668
    13:00    1.185997
    14:00    0.920850
    15:00    0.201423
    16:00    0.842970
    17:00         NaN
    21:00    1.061207
    22:00    0.232180
    23:00    0.453964
    Name: val, dtype: float64

I'll plot them both so you can see the difference.

    import matplotlib.pyplot as plt

    f, a = plt.subplots(2, 1, sharey=True, figsize=(10, 5))
    s2.plot(ax=a[0])
    s3.set_axis(s3.index.strftime('%H:%M')).plot(ax=a[1])
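For the marker along the axis that the question asks about, one possible sketch is to plot against integer positions and draw a dashed vertical line wherever the compressed series holds its placeholder NaN. This is an illustrative variation rather than part of the approach above; the data is rebuilt inline, and the Agg backend is chosen only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

hours = [0, 1, 2, 7, 8, 9, 10, 12, 13, 14, 15, 16, 21, 22, 23]
vals = [0.740442, 0.842911, -0.873992, -0.474993, -0.983963,
        0.597011, -2.043023, 0.304668, 1.185997, 0.920850,
        0.201423, 0.842970, 1.061207, 0.232180, 0.453964]
idx = pd.Timestamp("2016-07-25") + pd.to_timedelta(hours, unit="h")
s2 = pd.Series(vals, index=idx, name="val").asfreq("h")
s3 = s2[s2.ffill(limit=1).notna()]  # one NaN placeholder per gap

# Plot against integer positions so every row takes the same width.
pos = np.arange(len(s3))
gap_positions = pos[s3.isna().to_numpy()]

fig, ax = plt.subplots(figsize=(10, 3))
ax.plot(pos, s3.to_numpy(), marker="o")  # NaN values break the line
for p in gap_positions:
    ax.axvline(p, color="grey", linestyle="--")  # marks a jump in time
ax.set_xticks(pos)
ax.set_xticklabels(s3.index.strftime("%H:%M"), rotation=45)

# fig.savefig("gap_markers.png")  # or plt.show() in an interactive session
```

The tick label under each dashed line is the first missing hour of that gap, so the jump in time can be read from the labels on either side.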
