Advanced plotting with Pandas¶
At this point you should know the basics of making plots with the Matplotlib module. It is also possible to make Matplotlib plots directly from Pandas, because many of the basic functionalities of Matplotlib are integrated into Pandas. In this part, we will show how to visualize data and create plots using Pandas.
Downloading the data and preparing¶
For this second lesson on plotting data using Pandas, we will use hourly weather data from Helsinki. Download the weather data file from here.
- Save a copy of this file in your home directory or a directory for the materials for this week’s lesson.
- The data file contains observed hourly temperatures, wind speeds, etc. covering the years 2012 and 2013. Observations were recorded at the Malmi airport weather station in Helsinki. The file is derived from a data file of daily temperature measurements downloaded from the US National Oceanic and Atmospheric Administration’s National Centers for Environmental Information climate database.
- There should be around 16.5 thousand rows in the data.
The first rows of the data look like the following:
USAF WBAN YR--MODAHRMN DIR SPD GUS CLG SKC L M H VSB MW MW MW MW AW AW AW AW W TEMP DEWP SLP ALT STP MAX MIN PCP01 PCP06 PCP24 PCPXX SD
029750 99999 201201010050 280 3 *** 89 BKN * * * 7.0 ** ** ** ** ** ** ** ** * 28 25 ****** 29.74 ****** *** *** ***** ***** ***** ***** **
029750 99999 201201010150 310 3 *** 89 OVC * * * 7.0 ** ** ** ** ** ** ** ** * 27 25 ****** 29.77 ****** *** *** ***** ***** ***** ***** **
029750 99999 201201010250 280 1 *** *** *** * * * 6.2 ** ** ** ** ** ** ** ** * 25 21 ****** 29.77 ****** *** *** ***** ***** ***** ***** **
029750 99999 201201010350 200 1 *** *** *** * * * 6.2 ** ** ** ** ** ** ** ** * 21 21 ****** 29.80 ****** *** *** ***** ***** ***** ***** **
Parsing datetime when reading data¶
One of the most useful and powerful features in Pandas is its ability to work with time data. When reading data from a file, we can tell Pandas that the values in a certain column should be interpreted as time, and we can even use that column as our index, which is cool! You will see later why.
Let’s start by importing some modules that will be useful when plotting.
In [1]: import pandas as pd
In [2]: import matplotlib.pyplot as plt
In [3]: from datetime import datetime
In [4]: import numpy as np
Next, let’s read the data into Pandas and specify that the values from the YR--MODAHRMN column should be interpreted and converted into a time index.
In [5]: fp = "1924927457196dat.txt"
When reading the data, we can use the parse_dates parameter to parse the time information. The na_values parameter lists the asterisk strings that mark missing values in the file.
In [5]: data = pd.read_csv(fp, sep=r'\s+', parse_dates=['YR--MODAHRMN'], na_values=['*', '**', '***', '****', '*****', '******'])
Let’s check the datatypes of our columns.
In [6]: data.dtypes
As we can see, the data type of the YR--MODAHRMN column (third from the top) is datetime64[ns]. This means that the values in that column are interpreted as time objects.
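The reading step can be tried without the downloaded file. The following is a minimal sketch, not the lesson's own code: it reads two rows modeled on the listing above from an in-memory string, reduced to four columns, with one SPD value replaced by `***` purely to illustrate the missing-value handling. It also assumes the timestamp is parsed in a separate step with an explicit format string instead of relying on parse_dates, which avoids any guessing about the compact YYYYMMDDHHMM layout.

```python
import io
import pandas as pd

# Hypothetical mini-sample modeled on the listing above (four columns only;
# the '***' in the SPD column is made up to demonstrate na_values).
sample = io.StringIO(
    "USAF YR--MODAHRMN SPD TEMP\n"
    "029750 201201010050 3 28\n"
    "029750 201201010150 *** 27\n"
)

df = pd.read_csv(
    sample,
    sep=r"\s+",                      # columns are separated by runs of whitespace
    dtype={"YR--MODAHRMN": str},     # keep the timestamp as text for now
    na_values=["*", "**", "***", "****", "*****", "******"],
)

# Convert the text timestamps using an explicit year-month-day-hour-minute format.
df["YR--MODAHRMN"] = pd.to_datetime(df["YR--MODAHRMN"], format="%Y%m%d%H%M")

print(df.dtypes)
print(df)
```

Because one SPD value is missing, Pandas stores that column as floating-point numbers, while the timestamp column is reported with a datetime64 data type.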
Let’s see what our data looks like.
In [7]: data.head()
As we can see, the values in YR--MODAHRMN indeed look like time information, where the first part represents the date (yyyy-mm-dd) and the second part represents the hours:minutes:seconds.
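To make the date and time parts concrete, here is a small sketch that parses the raw value from the first data row into a single Pandas timestamp and prints its components. The explicit format string is an assumption about the YR--MODAHRMN layout (year, month, day, hour, minute), based on the column name and the sample rows.

```python
import pandas as pd

# Raw value from the first data row, parsed with an explicit format.
ts = pd.to_datetime("201201010050", format="%Y%m%d%H%M")

print(ts)                         # 2012-01-01 00:50:00
print(ts.year, ts.month, ts.day)  # 2012 1 1
print(ts.hour, ts.minute)         # 0 50
```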
Before continuing with plotting in Pandas, let’s process our data a bit by selecting only a few columns, renaming them, and converting the Fahrenheit temperatures into Celsius. If you don’t remember how the following steps work, you might want to take another look at the Lesson 6 materials.
# Select data
selected_cols = ['YR--MODAHRMN', 'TEMP', 'SPD']
data = data[selected_cols]
# Rename columns
name_conversion = {'YR--MODAHRMN': 'TIME', 'SPD': 'SPEED'}
data = data.rename(columns=name_conversion)
# Convert Fahrenheit temperature into Celsius
data['Celsius'] = (data['TEMP'] - 32) / 1.8
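The same three steps can be checked on a tiny hand-made table. This is a sketch under stated assumptions: the table `toy`, its values, and the extra DIR column are hypothetical, and the temperatures are chosen so that the expected Celsius values (0, 10 and 100) are easy to verify by hand. The `.copy()` call is an addition that makes the selection an independent table, which avoids a SettingWithCopyWarning in some Pandas versions.

```python
import pandas as pd

# Hypothetical mini-table with easy-to-check Fahrenheit values.
toy = pd.DataFrame({
    "YR--MODAHRMN": pd.to_datetime(
        ["2012-01-01 00:50", "2012-01-01 01:50", "2012-01-01 02:50"]
    ),
    "TEMP": [32, 50, 212],
    "SPD": [3, 3, 1],
    "DIR": [280, 310, 280],
})

# Select columns (DIR is dropped) and work on an independent copy
toy = toy[["YR--MODAHRMN", "TEMP", "SPD"]].copy()

# Rename columns
toy = toy.rename(columns={"YR--MODAHRMN": "TIME", "SPD": "SPEED"})

# Convert Fahrenheit temperature into Celsius: 32 F is 0 C, 50 F is 10 C, 212 F is 100 C
toy["Celsius"] = (toy["TEMP"] - 32) / 1.8

print(toy)
```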
Let’s confirm that everything looks correct.
In [8]: data.head()
Okay, great, now our data looks better and we can continue. Let’s see what our data looks like by plotting the Celsius temperatures.
Basic line plot in Pandas¶
In Pandas, it is extremely easy to plot data from your DataFrame. You can do this by using the plot() method.
Let’s plot all the Celsius temperatures (y-axis) against the time (x-axis). You can specify the columns that you want to plot with the x and y parameters:
In [9]: data.plot(x='TIME', y='Celsius');
Cool, it was this easy to produce a line plot that can be used to understand our data better. We can clearly see that there is quite a lot of variation in the temperatures, and the different seasons stand out clearly in the data.
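If you do not have the weather file at hand, the plotting call can still be tried on synthetic data. The sketch below is an illustration only: the one-week table `toy`, its sine-shaped temperatures, and the output file name are all made up, and the Agg backend is selected so that the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; the figure is written to a file
import numpy as np
import pandas as pd

# Synthetic stand-in for the weather data: one week of hourly values
# with a simple daily cycle around -5 degrees Celsius.
hours = np.arange(7 * 24)
toy = pd.DataFrame({
    "TIME": pd.Timestamp("2012-01-01") + pd.to_timedelta(hours, unit="h"),
    "Celsius": -5 + 3 * np.sin(2 * np.pi * hours / 24),
})

# Same call as in the lesson: TIME on the x-axis, Celsius on the y-axis
ax = toy.plot(x="TIME", y="Celsius")
ax.set_ylabel("Temperature (Celsius)")

# Hypothetical output file name
ax.figure.savefig("celsius_line_plot.png")
```

The plot() call returns a Matplotlib Axes object, so the usual Matplotlib methods such as set_ylabel() can be used to adjust the figure afterwards.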
Selecting data based on time in Pandas¶
What is obvious from the figure above is that hourly data is too detailed for a plot covering two full years. Let’s look at a trick for aggregating the data really easily using Pandas.
First we need to set the TIME column as the index of our DataFrame. We can do this by using the set_index() method.
In [10]: data = data.set_index('TIME')
In [11]: data.head()
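As a rough sketch of why a time index is useful, the example below builds three days of made-up hourly values, sets TIME as the index, selects one day with a date string, and computes daily means. The table `toy` and its values are hypothetical, and the use of resample() here is an assumption about one common way to aggregate time-indexed data, not necessarily the approach taken later in this lesson.

```python
import numpy as np
import pandas as pd

# Three days of hypothetical hourly observations with a constant
# temperature per day, so the daily means are easy to check.
hours = np.arange(3 * 24)
toy = pd.DataFrame({
    "TIME": pd.Timestamp("2012-01-01") + pd.to_timedelta(hours, unit="h"),
    "Celsius": np.where(hours < 24, -2.0, np.where(hours < 48, 0.0, 4.0)),
})

# Use the time column as the index
toy = toy.set_index("TIME")

# With a datetime index, a date string selects all rows from that day
second_day = toy.loc["2012-01-02"]
print(len(second_day))  # 24

# Aggregate the hourly values into daily means
daily = toy.resample("D").mean()
print(daily)
```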