Advanced data processing with NumPy

This week we will continue developing our skills using NumPy to process real data. In the lesson this week we are using daily weather observations from the Detroit Willow Run Airport station, near Ypsilanti, Michigan in the U.S. This data file is in the same format as the one you’ll work with in this week’s exercise.

Getting started

Previewing the data file.

To start, let’s have a look at the top few lines of our data file.

----------------- ---------- ---------- ---------- -------- -------- -------- -------- --------
GHCND:USW00014853      236.8   42.23333  -83.53333 20070101 0.03     -9999    50       32
GHCND:USW00014853      236.8   42.23333  -83.53333 20070102 0.00     -9999    43       26
GHCND:USW00014853      236.8   42.23333  -83.53333 20070103 0.00     -9999    46       30

So, we have 9 columns in the file, with values separated by variable numbers of spaces. Another thing to notice are the -9999 values, which indicate missing data. Otherwise, some columns contain numerical values, and those will be our focus for today’s lesson. If you prefer, you’re welcome to open the data file in the JupyterLab editor to explore more.

Importing NumPy and loading the data

Now let’s import NumPy and read in our data to a variable called data.

import numpy as np

Now we can read the data file. Note that we only want the DATE, TAVG, TMAX, and TMIN values, which we can select using the usecols() parameter. Don’t forget about skipping the header!

fp = '1495566.txt'
data = np.genfromtxt(fp, skip_header=2, usecols=(4, 6, 7, 8))

Now let’s see what we’re dealing with…

[[  2.00701010e+07  -9.99900000e+03   5.00000000e+01   3.20000000e+01]
 [  2.00701020e+07  -9.99900000e+03   4.30000000e+01   2.60000000e+01]
 [  2.00701030e+07  -9.99900000e+03   4.60000000e+01   3.00000000e+01]
 [  2.01712290e+07  -9.99900000e+03   2.00000000e+01   1.10000000e+01]
 [  2.01712300e+07  -9.99900000e+03   1.70000000e+01   4.00000000e+00]
 [  2.01712310e+07  -9.99900000e+03   1.70000000e+01  -3.00000000e+00]]

Converting the missing data to nan values

Our first processing task will be to convert the missing data values to np.nan values, which will make it easier to use masks later in our data processing. We can go ahead and do this to all of the data to start, following a similar approach to how we used masks in the last lesson.

First, we need to identify a useful test for identifying missing data. We could could convert the data to integer values everywhere and test for values being equal to -9999, but there’s an even easier option. Since we know all of our data are dates or temperatures, we can simply look for numbers below -9998 to identify missing data.

data_mask = (data < -9998)
data[data_mask] = np.nan
[[  2.00701010e+07              nan   5.00000000e+01   3.20000000e+01]
 [  2.00701020e+07              nan   4.30000000e+01   2.60000000e+01]
 [  2.00701030e+07              nan   4.60000000e+01   3.00000000e+01]
 [  2.01712290e+07              nan   2.00000000e+01   1.10000000e+01]
 [  2.01712300e+07              nan   1.70000000e+01   4.00000000e+00]
 [  2.01712310e+07              nan   1.70000000e+01  -3.00000000e+00]]

Now we see the expected nan values in place of the -9999 values. Note that a nan value in NumPy is refered to as np.nan.

Splitting the data into column arrays

Like the last lesson, we can now separate our data into different arrays with meaningful names. For example, we can create a date array using the first column of data. Remember that we’ll do this using array slicing.

date = data[:,0]
[ 20070101.  20070102.  20070103. ...,  20171229.  20171230.  20171231.]

Looks good, let’s do the others.

tavg = data[:,1]
tmax = data[:,2]
tmin = data[:,3]

Examining our data

Now that we have loaded the data, let’s have a look at what we’re dealing with.

Basic observations

We can start by looking at some basic information, such as the starting and ending dates for the observations.


OK, so we can see our observations start on January 1, 2007 and continue until December 31, 2017.

Checking for missing/bad data

Let’s make sure that we don’t have any missing dates by looking for any nan values in the date array. To do this, we’ll need a new NumPy function we’ve not yet used called np.count_nonzero(). np.count_nonzero() does exactly what is says, counts all values that are not equal to zero in an array and returns the sum. In this case, False values in mask arrays, for example, are considered to be equal to zero in the count. Thus, if we have a mask array with all of the finite values for an array like date, that should give np.count_nonzero() will give us the number of dates with useful data. Let’s see how it works.

date_mask = np.isfinite(date)
print("Number of days:", np.count_nonzero(date_mask))
Number of days: 4016

OK, so we have 4016 dates in the data we’re using, but how do we know whether we’re missing any dates? One way to check is to take the opposite of the mask array values by using the ~ character, the logical not operator. How does it work? Let’s see.

missing_date_mask = ~np.isfinite(date)
print("Number of missing days:", np.count_nonzero(missing_date_mask))
Number of missing days: 0

OK, so no missing days, which is not really a big surprise. Good to know at least. How about for the other arrays like tavg?

tavg_mask = np.isfinite(tavg)
print("Number of average temps:", np.count_nonzero(tavg_mask))
Number of average temps: 0

Oh no! So we don’t have any average temperature observations. Looks like we’ll have to use another array for exploring the temperature data.

Calculations within a date range

One common thing we’d like to do is calculate something about our data over a fixed time range, for example, one year. We’ve seen this idea previously, but just to remind you we can do this by masking our data.

Removing missing data

First, let’s check for any missing data in the tmax array.

missing_tmax_mask = ~np.isfinite(tmax)
print("Number of missing tmax values:", np.count_nonzero(missing_tmax_mask))
Number of missing tmax values: 7

OK, so we need to remove the missing values first, which we can do as we’ve done before: Make a mask array, then remove bad values using the mask.

tmax_mask = np.isfinite(tmax)
tmax_clean = tmax[tmax_mask]

We can also remove the corresponding dates by using the same mask (tmax_mask) on our date array.

date_clean = date[tmax_mask]

Average annual temperature, method 1

OK, now let’s use a range of dates to find the average maximum temperature for one year. In this case, let’s go for 2010.

tmax_2010 = tmax[(date >= 20100101) & (date <= 20101231)]

OK, so this works, but we’ll see another more flexible approach in just a moment.

Average monthly temperatures

So one possibility is for us to use numerical ranges to calculate things like average annual temperatures. An alternative is to split date into different arrays, with one for each year, month, and day in the data. To do this, we first need to convert the date_clean array to be character string values rather than float64 values. Let’s do that.

array(['20070101.0', '20070102.0', '20070103.0', ..., '20171229.0',
       '20171230.0', '20171231.0'],

Well, that’s not quite what we want. Ideally, we want a date in the form YYYYMMDD where YYYY is the year, MM is the month, and DD is the day. Looks like we need to convert the date to an integer first.

date_clean_int = date_clean.astype(int)
date_clean_str = date_clean_int.astype(str)
['20070101' '20070102' '20070103' ..., '20171229' '20171230' '20171231']

That’s better.

Splitting dates into separate year, month, and day

Now we can go about splitting our dates into separate arrays for the year, month, and day.

Splitting a single date string

As a reminder, it is first useful to recall that character strings can be sliced using index values just like arrays.

date_fmt = "YYYYMMDD"

So by taking the first 4 values from the string YYYYMMDD we get YYYY. We can do the same for the month, for example.


Splitting dates in a NumPy array

With this in mind, we can now attempt the same in our NumPy array date_clean_str.

year = [datenow[0:4] for datenow in date_clean_str]
year = np.array(year)
['2007' '2007' '2007' ..., '2017' '2017' '2017']

OK, so we have the year now, but what happened with the [datenow[0:4] for datenow in date_clean_str] stuff. We’ve never seen that before. You’re right, this is new, and is an example of list comprehension in Python. Basically, it is a shorthand approach for looping over all of the values and doing something to them. In our case, we find the first 4 characters for each value in the date array.

To be more clear,

[datenow[0:4] for datenow in date_clean_str]

is equivalent to doing

year = []
for datenow in date_clean_str:

since this would produce a new list called year, we then need to use the np.array() function to convert that list to a NumPy array. Let’s do the same for the months and days.

month = [datenow[4:6] for datenow in date_clean_str]
month = np.array(month)
day = [datenow[6:8] for datenow in date_clean_str]
day = np.array(day)

Finding averages in a date range

Now that we have separate arrays for the year, month, and day we can do some fun stuff like calculate the monthly average temperature for a given year.

Finding the average monthly max temperature

Let’s take 2010 again as our example and find the average temperatures for each month in 2010. For this, we can use a for loop.

means_2010 = np.zeros(12)
index = 0
for month_now in np.unique(month):
    means_2010[index] = tmax_clean[(month == month_now) & (year == '2010')].mean()
    index = index + 1
[ 29.77419355  32.46428571  52.51612903  65.76666667  71.96774194
  80.23333333  87.5483871   85.          74.36666667  65.5483871
  51.83333333  30.80645161]

OK, so let’s break this down as there were a few things that happened here that might be unfamiliar.

  1. We created an array of zeros with 12 locations to store the monthly average temperatures.
  2. We used an index variable called index to define where to store those monthly average temperatures. We could have done this differently, but there is a good reason to use index here because it is more flexible if we wanted to consider more than one year of data as you’ll see in this week’s exercise.
  3. We then used a for loop to go through each unique month (found using np.unique()).
  4. For each month we found the max temperature using the condition month == month_now, and taking the tmax_clean.mean() for that month. We also limit the months to those in 2010 by using the year == '2010' condition.
  5. We added 1 to the index value to ensure the next mean value goes into the correct location in the means_2010 array.

This might seem a bit complicated, but it is rather powerful as well. We’ve efficiently calcuated mean monthly max temperatures in only a few lines of code!

Using functions on NumPy array data

As you might have noticed above, the means_2010 array has temperatures in Fahrenheit, which might be easier for us to work with in Celsius. Fortunately, we can easily convert NumPy array data from Fahrenheit to Celsius using a familiar function.

def fahrToCelsius(temp):
    return (temp - 32.0) / 1.8

Now, let’s go about that conversion.

means_2010_celsius = fahrToCelsius(means_2010)
[ -1.23655914   0.25793651  11.39784946  18.75925926  22.20430108
  26.7962963   30.86021505  29.44444444  23.53703704  18.63799283
  11.01851852  -0.66308244]

Finally, something easy to do with NumPy! All we have to do hear is plug in a NumPy array with our function and we get back the converted temperature values.

Computing monthly average temperatures for a range of years

One thing we’ll encounter in this week’s exercise is a need to calculate monthly average temperatures for a number of years. For example, the average temperature in February from 2010-2015. We can do that with the tools we have, but in a slightly different way than calculating the monthly average temperature for each month in those years one by one. First, let’s make a few masks, starting with the month.

month_mask = month == '02'

Nothing crazy here, just ensuring the month is February, or '02'. Now for the years.

year_mask = (year.astype(int) >= 2010) & (year.astype(int) <= 2015)

Again, nothing too scary here. We do need to use the .astype(int) here to be able to check the range of years, but that is not so bad. Now let’s find that monthly average temperature for February between 2010 and 2015.

mean_2010_2015 = tmax_clean[month_mask & year_mask].mean()

Looks about right, but perhaps we can check the value in Celsius to see whether that makes sense for winter in the northern hemisphere.


Right around 0°C, looks good!

Calculating temperature anomalies

One thing we can do now that we have a mean temperature for February in 2010-2015 is to compare one of those years to the average max temperature in the time period. This would give us a temperature anomaly, the difference between the mean value and a given year. Let’s give that a shot.

anomaly = means_2010[1] - mean_2010_2015

As we can see, February of 2010 was only 0.08° Fahrenheit warmer than the average February between 2010 and 2015. Pretty typical February, it would seem. We’ll explore exactly this kind of thing in more detail in this week’s exercise.