{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Basic plotting with Pandas and Matplotlib\n",
"\n",
"```{attention}\n",
"Finnish university students are encouraged to use the CSC Notebooks platform.
\n",
"\n",
"\n",
"Others can follow the lesson and fill in their student notebooks using Binder.
\n",
"\n",
"```\n",
"\n",
"As we're now familiar with some of the features of [Pandas](https://pandas.pydata.org/), we will wade into visualizing our data in Python using the built-in plotting options available directly in Pandas.\n",
"Much like the case of Pandas being built upon [NumPy](https://numpy.org/), plotting in Pandas takes advantage of plotting features from the [Matplotlib](https://matplotlib.org/) plotting library.\n",
"Plotting in Pandas provides a basic framework for visualizing our data, but as you'll see we will sometimes need to also use features from Matplotlib to enhance our plots. In particular, we will use features from the the `pyplot` module in Matplotlib, which provides [MATLAB](https://www.mathworks.com/products/matlab.html)-like plotting.\n",
"\n",
"Toward the end of the lesson we will also briefly explore creating interactive plots using the [Pandas-Bokeh](https://github.com/PatrikHlobil/Pandas-Bokeh) plotting backend, which allows us to produce plots similar to those available in the [Bokeh plotting library](https://docs.bokeh.org/en/latest/index.html) using plotting syntax similar to that used normally in Pandas. This is an optional part of the lesson, but will allow you to see an example for further exploration of interactive plotting using Pandas-Bokeh."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Input data\n",
"\n",
"In the lesson this week we are using some of the same weather observation data from Finland [downloaded from NOAA](https://www7.ncdc.noaa.gov/CDO/cdopoemain.cmd?datasetabbv=DS3505&countryabbv=&georegionabbv=&resolution=40) that we used in Lesson 6. In this case we'll focus on weather observation station data from the Helsinki-Vantaa airport.\n",
"\n",
"## Downloading the data\n",
"\n",
"```{attention}\n",
"It is recommended to use the Geo-Python Lite blueprint for this lesson.\n",
"```\n",
"\n",
"Just like last week, the first step for today's lesson is to get the data. Unlike last week, we'll all download and use the same data.\n",
"\n",
"You can download the data by opening a new terminal window in Jupyter Lab by going to **File** -> **New** -> **Terminal** in the Jupyter Lab menu bar. Once the terminal is open, you will need to navigate to the directory for Lesson 7 by typing\n",
"\n",
"```bash\n",
"cd notebooks/L7/\n",
"```\n",
"\n",
"or the equivalent command to navigate to the location of the Lesson 7 files on your computer (for those running Jupyter on their own computers).\n",
"\n",
"\n",
"You can now confirm you're in the correct directory by typing\n",
"\n",
"```bash\n",
"ls\n",
"```\n",
"\n",
"You should see something like the following output:\n",
"\n",
"```bash\n",
"advanced-plotting.ipynb matplotlib.ipynb\n",
"img metadata\n",
"```\n",
"\n",
"If so, you're in the correct directory and you can download the data files by typing\n",
"\n",
"```bash\n",
"wget https://davewhipp.github.io/data/Finland-weather-data-L7.tar.gz\n",
"```\n",
"\n",
"After the download completes, you can extract the data files by typing\n",
"\n",
"```bash\n",
"tar zxvf Finland-weather-data-L7.tar.gz\n",
"```\n",
"\n",
"At this stage you should have a new directory called `data` that contains the data for this week's lesson. You can confirm this by typing\n",
"\n",
"```bash\n",
"ls data\n",
"```\n",
"\n",
"You should see something like the following:\n",
"\n",
"```bash\n",
"029740.txt 6367598020644inv.txt\n",
"3505doc.txt 6367598020644stn.txt\n",
"```\n",
"\n",
"Now you should be all set to proceed with the lesson!\n",
"\n",
"### Binder users\n",
"\n",
"It is not recommended to complete this lesson using Binder.\n",
"\n",
"## About the data\n",
"\n",
"As part of the download there are a number of files that describe the weather data. These *metadata* files include:\n",
"\n",
"- A list of stations: [data/6367598020644stn.txt](metadata/6367598020644stn.txt)\n",
"- Details about weather observations at each station: [data/6367598020644inv.txt](metadata/6367598020644inv.txt)\n",
"- A data description (i.e., column names): [data/3505doc.txt](metadata/3505doc.txt)\n",
"\n",
"The input data for this week are separated with varying number of spaces (i.e., fixed width). The first lines and columns of the data look like following:\n",
"\n",
"``` \n",
" USAF WBAN YR--MODAHRMN DIR SPD GUS CLG SKC L M H VSB MW MW MW MW AW AW AW AW W TEMP DEWP SLP ALT STP MAX MIN PCP01 PCP06 PCP24 PCPXX SD\n",
"029740 99999 195201010000 200 23 *** 15 OVC 7 2 * 5.0 63 ** ** ** ** ** ** ** 6 36 32 989.2 ***** ****** *** *** ***** ***** ***** ***** **\n",
"029740 99999 195201010600 220 18 *** 8 OVC 7 2 * 2.2 63 ** ** ** ** ** ** ** 6 37 37 985.9 ***** ****** *** 34 ***** ***** ***** ***** **\n",
"029740 99999 195201011200 220 21 *** 5 OVC 7 * * 3.8 59 ** ** ** ** ** ** ** 5 39 36 988.1 ***** ****** *** *** ***** ***** ***** ***** **\n",
"029740 99999 195201011800 250 16 *** 722 CLR 0 0 0 12.5 02 ** ** ** ** ** ** ** 5 36 27 991.9 ***** ****** 39 *** ***** ***** ***** ***** **\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting started\n",
"\n",
"Let's start by importing Pandas and reading our data file."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```{admonition} Datetime in Python\n",
"For the lesson this week we will be using a datetime index for our weather observations.\n",
"We skipped over the datetime data type in Lesson 6, but you can find [a brief introduction to datetime in Lesson 6](https://geo-python-site.readthedocs.io/en/latest/notebooks/L6/advanced-data-processing-with-pandas.html#datetime-optional-for-lesson-6).\n",
"```\n",
"\n",
"Just as we did last week, we'll read our data file by passing a few parameters to the Pandas `read_csv()` function. In this case, however, we'll include a few additional parameters in order to read the data with a *datetime index*. Let's read the data first, then see what happened."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"fp = r'data/029740.txt'\n",
"\n",
"data = pd.read_csv(fp, delim_whitespace=True, \n",
" na_values=['*', '**', '***', '****', '*****', '******'],\n",
" usecols=['YR--MODAHRMN', 'TEMP', 'MAX', 'MIN'],\n",
" parse_dates=['YR--MODAHRMN'], index_col='YR--MODAHRMN')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So what's different here? Well, we have added two new parameters: `parse_dates` and `index_col`.\n",
"\n",
"- `parse_dates` takes a Python list of column name(s) containing date data that Pandas will parse and convert to the *datetime* data type. For many common date formats this parameter will automatically recognize and convert the date data.\n",
"- `index_col` is used to state a column that should be used to index the data in the DataFrame. In this case, we end up with our date data as the DataFrame index. This is a very useful feature in Pandas as we'll see below.\n",
"\n",
"Having read in the data, let's have a quick look at what we have using `data.head()`."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n", " | TEMP | \n", "MAX | \n", "MIN | \n", "
---|---|---|---|
YR--MODAHRMN | \n", "\n", " | \n", " | \n", " |
1952-01-01 00:00:00 | \n", "36.0 | \n", "NaN | \n", "NaN | \n", "
1952-01-01 06:00:00 | \n", "37.0 | \n", "NaN | \n", "34.0 | \n", "
1952-01-01 12:00:00 | \n", "39.0 | \n", "NaN | \n", "NaN | \n", "
1952-01-01 18:00:00 | \n", "36.0 | \n", "39.0 | \n", "NaN | \n", "
1952-01-02 00:00:00 | \n", "36.0 | \n", "NaN | \n", "NaN | \n", "