Advanced data processing with Pandas¶
This week, we will continue developing our skills using Pandas to analyze climate data. The aim of this lesson is to learn different functions for manipulating the data and doing simple analyses. In the end, our goal is to detect weather anomalies (stormy winds) in Helsinki during August 2017.
Downloading and reading the data¶
Notice that this time we will read the data exactly as obtained from NOAA, without any modifications by us.
Start by downloading the data file 6591337447542dat_sample.txt from this link.
The first rows of the data look like the following:
USAF WBAN YR--MODAHRMN DIR SPD GUS CLG SKC L M H VSB MW MW MW MW AW AW AW AW W TEMP DEWP SLP ALT STP MAX MIN PCP01 PCP06 PCP24 PCPXX SD
029740 99999 201708040000 114 6 *** *** BKN * * * 25.0 03 ** ** ** ** ** ** ** 2 58 56 1005.6 ***** 999.2 *** *** ***** ***** ***** ***** 0
029740 99999 201708040020 100 6 *** 75 *** * * * 6.2 ** ** ** ** ** ** ** ** * 59 57 ****** 29.68 ****** *** *** ***** ***** ***** ***** **
029740 99999 201708040050 100 5 *** 60 *** * * * 6.2 ** ** ** ** ** ** ** ** * 59 57 ****** 29.65 ****** *** *** ***** ***** ***** ***** **
029740 99999 201708040100 123 8 *** 63 OVC * * * 10.0 ** ** ** ** 23 ** ** ** * 59 58 1004.7 ***** 998.4 *** *** ***** ***** ***** ***** 0
029740 99999 201708040120 110 7 *** 70 *** * * * 6.2 ** ** ** ** ** ** ** ** * 59 59 ****** 29.65 ****** *** *** ***** ***** ***** ***** **
Notice from above that the columns in our data are separated by a varying number of spaces (fixed width).
Note
Write the code for this lesson into a separate script called weather_analysis.py because we will reuse the code we write here later.
Let’s start by importing pandas and specifying the filepath to the file that we want to read.
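The import and file path step could look like the following minimal sketch. The `data` folder in the path is a hypothetical placeholder, not something stated in this lesson; point `fp` to wherever you saved the downloaded file.

```python
import pandas as pd

# Hypothetical location: adjust this to the folder where you saved the file.
fp = "data/6591337447542dat_sample.txt"
```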
As the data is separated by a varying number of spaces, we need to tell Pandas to read it that way with the sep parameter, which also accepts a regular expression. The expression \s+ matches one or more whitespace characters. Hence, we can separate the columns by a varying number of spaces with the sep='\s+' parameter.
Our data also includes No Data values marked with a varying number of * characters, so we need to take those into account when reading the data. We can tell Pandas to treat those strings as NaN by specifying na_values=['*', '**', '***', '****', '*****', '******'].
In [1]: data = pd.read_csv(fp, sep='\s+', na_values=['*', '**', '***', '****', '*****', '******'])
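If the data file is not at hand, the reading step can be tried on an in-memory copy of the sample rows shown earlier. This is only a sketch: `sample` and `io.StringIO` stand in for the real file, and the `sep` and `na_values` arguments are the ones described above.

```python
import io

import pandas as pd

# Header and first two observations, copied from the sample shown earlier.
sample = """\
USAF WBAN YR--MODAHRMN DIR SPD GUS CLG SKC L M H VSB MW MW MW MW AW AW AW AW W TEMP DEWP SLP ALT STP MAX MIN PCP01 PCP06 PCP24 PCPXX SD
029740 99999 201708040000 114 6 *** *** BKN * * * 25.0 03 ** ** ** ** ** ** ** 2 58 56 1005.6 ***** 999.2 *** *** ***** ***** ***** ***** 0
029740 99999 201708040020 100 6 *** 75 *** * * * 6.2 ** ** ** ** ** ** ** ** * 59 57 ****** 29.68 ****** *** *** ***** ***** ***** ***** **
"""

# Runs of one to six asterisks mark missing values in this file.
na_values = ['*', '**', '***', '****', '*****', '******']

# A raw string (r'...') keeps the backslash of the regular expression intact.
data = pd.read_csv(io.StringIO(sample), sep=r'\s+', na_values=na_values)

print(data.shape)
print(data[['DIR', 'SPD', 'GUS', 'TEMP']])
```

In these two rows every GUS value is an asterisk run, so the whole column is read as NaN, while DIR, SPD and TEMP are read as numbers.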
Exploring data and renaming columns¶
Let’s see how the data looks by printing the first five rows with the head() function.
In [2]: data.head()
Let’s continue and check what columns we have.
In [3]: data.columns
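A small sketch of what head() and columns return, again using an in-memory copy of two of the sample rows as a stand-in for the real file. One thing worth noticing is that the header repeats the names MW and AW four times each; Pandas makes such repeated names unique by appending a numeric suffix.

```python
import io

import pandas as pd

# Header and two observations, copied from the sample shown earlier.
sample = """\
USAF WBAN YR--MODAHRMN DIR SPD GUS CLG SKC L M H VSB MW MW MW MW AW AW AW AW W TEMP DEWP SLP ALT STP MAX MIN PCP01 PCP06 PCP24 PCPXX SD
029740 99999 201708040000 114 6 *** *** BKN * * * 25.0 03 ** ** ** ** ** ** ** 2 58 56 1005.6 ***** 999.2 *** *** ***** ***** ***** ***** 0
029740 99999 201708040100 123 8 *** 63 OVC * * * 10.0 ** ** ** ** 23 ** ** ** * 59 58 1004.7 ***** 998.4 *** *** ***** ***** ***** ***** 0
"""

data = pd.read_csv(
    io.StringIO(sample),
    sep=r'\s+',
    na_values=['*', '**', '***', '****', '*****', '******'],
)

# head() returns at most the first five rows (here only two exist).
print(data.head())

# The repeated MW and AW headers appear as MW, MW.1, MW.2, MW.3 and so on.
print(list(data.columns))
```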
Okay, there are quite many columns and we are not interested in using all of them. Let’s select only the columns that might be used to detect exceptional weather conditions, i.e. YR--MODAHRMN, DIR, SPD, GUS, TEMP, MAX, and MIN.
In [4]: select_cols = ['YR--MODAHRMN', 'DIR', 'SPD', 'GUS','TEMP', 'MAX', 'MIN']
In [5]: data = data[select_cols]
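The selection step can be illustrated with a small stand-in DataFrame. The values below are taken from the first two sample rows, but the DataFrame itself is built by hand for illustration and contains only a few of the original columns.

```python
import pandas as pd

nan = float('nan')

# Hand-built stand-in for the full table; NaN marks No Data.
data = pd.DataFrame({
    'USAF': [29740, 29740],
    'YR--MODAHRMN': [201708040000, 201708040020],
    'DIR': [114, 100],
    'SPD': [6, 6],
    'GUS': [nan, nan],
    'TEMP': [58, 59],
    'DEWP': [56, 57],
    'MAX': [nan, nan],
    'MIN': [nan, nan],
})

select_cols = ['YR--MODAHRMN', 'DIR', 'SPD', 'GUS', 'TEMP', 'MAX', 'MIN']

# Indexing with a list of names keeps only those columns, in the order of the list.
data = data[select_cols]

print(data.columns.tolist())
```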
Let’s see what our data looks like now by printing the last five rows and the data types.
In [6]: data.tail()
In [7]: data.dtypes
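As a rough illustration of what tail() and dtypes report, here is a sketch on a hand-built DataFrame holding the selected columns for the first three sample rows. It is a stand-in for the real table, so the full file may show different types if it contains values that these rows lack.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in with values from the first three sample rows.
data = pd.DataFrame({
    'YR--MODAHRMN': [201708040000, 201708040020, 201708040050],
    'DIR': [114, 100, 100],
    'SPD': [6, 6, 5],
    'GUS': [np.nan, np.nan, np.nan],
    'TEMP': [58, 59, 59],
    'MAX': [np.nan, np.nan, np.nan],
    'MIN': [np.nan, np.nan, np.nan],
})

# tail(n) returns the last n rows; without an argument it returns the last five.
print(data.tail(2))

# Columns containing NaN are stored as floats, because NaN is a floating-point value.
print(data.dtypes)
```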