A Brief Tale of Two ggplots

Probably the most fundamental aspect of exploratory data analysis is plotting your data. Just getting a look at the spread or distribution of the data can provide sometimes insight into the processes that generated the data. When plotting a data set, my "goto" tool is R. The traditional R graphics packages are easy to use. Although not fancy, they get the job done and it's possible to produce publication quality plots with the basics plotting functions. In fact, to my eye, R's standard plots look better than the default plots provided by something like matplotlib for Python.

Lately, I have been generating most of my plots in R using ggplot2. ggplot2 is a plotting package based on something called the  grammar of graphics developed by Leland Wilkinson. ggplot2 is a set of independent components that can be combined in many different ways. Plots are constructed in a layered manner. You start with a description of the data and add descriptions of how you want to present and summarize the data. The descriptions of the presentations are composed of high-level representations and transformations of the raw data.

ggplot2 has been ported to Python. The Python version lacks some of the features of the R version and there are differences in some methods calls. You could always use a Python library like rpy2 to call ggplot2 in R but it's nice to be able create ggplot2 type graphs directly from Python.

In this post, I would like to compare R and Python's versions of ggplot2 by means of a very simple example. I have been examining some data relating the concentration of  some ketones, called alkenones, to sea water temperature. Alkenones are produced by some ocean phytoplankton. These phytoplankton species respond to changes in water temperature by altering the relative proportions of the different alkenones they produce. The relative degree of two types of alkenones can be used to estimate the temperature of the water in which the phytoplankton grew. For our purposes in this post, the details of alkenones are not important. We just need to know that sea water temperature can be estimated by a proxy call UK'37 which is the ratio of one form of alkenone to the total. UK'37 values range from 0 to 1. I'll go into more details about alkenones and how we can uses them to estimate sea water temperature in a future post.

The data we will be using comes from Conte et al. Global temperature calibration of the alkenone unsaturation index (UK37) in surface waters and comparison with surface sediments (2006, doi:10.1029/2005GC001054) . Here's what the head of the data file looks like:


Basin
Basin2
Outlier
UK37
Temp
1
NAtlantic
NaN
0.39
12.4
1
NAtlantic
NaN
0.39
13.1
1
NAtlantic
NaN
0.29
11.0
1
NAtlantic
NaN
0.45
12.6
1
NAtlantic
NaN
0.43
12.3

We would like to fit a simple linear model, temperature vs. UK37, and plot the data along with the least squares line. In addition, we would like to color the data points according to their locations, Basin2 in the table.

This process is almost trivial in R.

df <- read.csv(file='SWalkenone data_conte.csv', header=TRUE)  # read the data
df <- df[!is.na(df$UK37), ]   # throw away rows with NA in UK37
ggplot(df, aes(x=UK37, y=Temp)) + geom_point(aes(color=Basin2)) + 
   geom_smooth(method="lm", formula = 'y ~ x', se=FALSE)

Here's the result:



In the R code above, ggplot(df, aes(x=UK37, y=Temp)) establishes the data frame containing the source data and the x and y values to be plotted. aes() is an aesthetic.  Aesthetic mappings describe how variables in the data are mapped to visual properties of geometries. The ggplot() statement describes the data but doesn't say how we will plot it. geom_point(aes(color=Basin2)) 
tells R that we want to plot points, a scatter plot. Finally, geom_smooth(method="lm", formula = 'y ~ x', se=FALSE) creates a simple y ~ x linear model and plots the line. Notice that we did not have to build a separate linear model to get the least squares line. Of course, we through away all of the  information about coefficients, fit. etc. so in a real analysis, we would want to build some sort of model before plotting.

The same process is easy in Python, but there are differences as we'll see. You'll need ggplot for Python. You can get it from yhat or if you use Anaconda like I do, you can install it with conda. You will also need pandas and Statsmodels. If you use Anaconda, pandas and Statsmodels are included.

As an aside, if you do data analysis with Python, Anaconda is wonderful. It gives you almost everything you need, numpy, pandas, scypi etc. If a package isn't installed by default, it's easy to install it.

To get started, load the necessary libraries.

import pandas as pd
from ggplot import *
import statsmodels.formula.api as sm

As in the R code, we need to load the data and remove rows with NA's in the UK37 columns.

def ReadDataFile(filename):
    df = pd.read_csv(filename)
    df = df.dropna(subset=['UK37'])  # drop rows containing nan in UK37 column
    return df

In R, we were now ready to call ggplot to plot the data and the least squares line. The Python version of ggplot lacks the geom_smooth method. It does have the stat_smooth method, but the method lacks many of the features of the R version including method and formula. Instead, we'll use statsmodel's ols method to build a linear model and plot the least squares line separately.

def LinearModel(data_frame):
    res = sm.ols(formula = 'Temp ~ UK37', data = data_frame).fit()
    return res

The advantage to this approach is that we now have access to all the features of the linear model. For example, we could get a summary of the fitted model.

print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   Temp   R-squared:                       0.963
Model:                            OLS   Adj. R-squared:                  0.963
Method:                 Least Squares   F-statistic:                 1.005e+04
Date:                Wed, 15 Nov 2017   Prob (F-statistic):          1.81e-277
Time:                        09:56:47   Log-Likelihood:                -673.57
No. Observations:                 386   AIC:                             1351.
Df Residuals:                     384   BIC:                             1359.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.3191      0.171     19.449      0.000       2.984       3.655
UK37          25.2301      0.252    100.256      0.000      24.735      25.725
==============================================================================
Omnibus:                        7.429   Durbin-Watson:                   0.880
Prob(Omnibus):                  0.024   Jarque-Bera (JB):                7.265
Skew:                          -0.315   Prob(JB):                       0.0265
Kurtosis:                       3.237   Cond. No.                         5.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Given the data and the linear model, we can plot them.

def PlotLinearModel(data_frame, lm):
    gg = ggplot(data_frame, aes(x='UK37', y='Temp', color = 'Basin2')) + \
        geom_point() + \
        geom_abline(intercept = lm.params.Intercept, slope = lm.params.UK37, color = 'black') + \
        scale_x_continuous(limits = (0,1)) + \
        scale_y_continuous(limits = (0,35)) + \
        ggtitle('Temperature vs UK37')
    print(gg)

We use geom_abline to plot the least squares line. The linear model parameters provide the slope and intercept for the line. Notice that we had to scale the x and y axis. The Python version of geom_abline rescales the axis so to get the proper ranges, we have to rescale.

To put it all together, we just call the functions.

def Main():
    data_file = 'SWalkenone data_conte.csv'
    df = ReadDataFile(data_file)
    lm = LinearModel(df)
    PlotLinearModel(df, lm)
    
if __name__ == "__main__":
    Main()

Here are the results.



If you are interested in ggplot2 for R, this tutorial is a good place to start. Probably the ultimate resource for the ideas behind and use of ggplot2 is Hadley Wickham's book ggplot2: Elegant Graphics for Data Analysis. Yhat has a brief tutorial showing the use of pandas and ggplot with Python.

Thanks to Greg Houston's blog for formatting the code.

No comments:

Post a Comment