Another Logistic Regression from Scratch

The world doesn't really need another description of how to code logistic regression. A good description of how to implement it can already be found here. In addition, there are many great packages for logistic regression: Python's sklearn.linear_model.LogisticRegression, R's glm, Julia's GLM, and many more.

I started following this course on Udemy. The course began with a brief discussion of logistic regression. I have used logistic regression techniques many times, but I didn't have a clear idea of how to implement it. I thought I might as well try. What follows is a very simple implementation of binary logistic regression.

Logistic Regression - Who lives, who dies?

Consider the following data. It comes from the R package alr4 and describes the fate of the infamous Donner party. It consists of 91 observations of five variables. The important columns for our purposes are age, y (survival), and sex. We want to know how age and sex affected the survival of members of the Donner party in the terrible circumstances in which they found themselves.

> head(Donner, n = 20)
                           age        y    sex family.name status
Breen_Edward_               13 survived   Male       Breen Family
Breen_Margaret_Isabella      1 survived Female       Breen Family
Breen_James_Frederick        5 survived   Male       Breen Family
Breen_John                  14 survived   Male       Breen Family
Breen_Margaret_Bulger       40 survived Female       Breen Family
Breen_Patrick               51 survived   Male       Breen Family
Breen_Patrick_Jr.            9 survived   Male       Breen Family
Breen_Peter                  3 survived   Male       Breen Family
Breen_Simon_Preston          8 survived   Male       Breen Family
Donner_Elitha_Cumi          13 survived Female    G_Donner Family
Donner_Eliza_Poor            3 survived Female    G_Donner Family
Donner_Elizabeth_Blue_Hook  45     died Female    J_Donner Family
Donner_Frances_Eustis        6 survived Female    G_Donner Family
Donner_George               61     died   Male    G_Donner Family
Donner_George_Jr.            9 survived   Male    J_Donner Family
Donner_Georgia_Ann           4 survived Female    G_Donner Family
Donner_Isaac                 5     died   Male    J_Donner Family
Donner_Jacob                56     died   Male    J_Donner Family
Donner_Leanna_Charity       11 survived Female    G_Donner Family
Donner_Lewis                 3     died   Male    J_Donner Family

In order to simplify the Python code, we will transform the data by recoding both y and sex as 0/1 variables: y = 1 means the person died and y = 0 means they survived; sex = 1 is Male and sex = 0 is Female.

> df_donner <- Donner %>% 
     select(y, age, sex) %>% 
     drop_na(any_of('age')) %>% 
     mutate(y = ifelse(y == 'survived', 0, 1)) %>% 
     mutate(sex = ifelse(sex == 'Female', 0, 1)) %>% 
     remove_rownames()
> head(df_donner, n = 20)
   y age sex
1  0  13   1
2  0   1   0
3  0   5   1
4  0  14   1
5  0  40   0
6  0  51   1
7  0   9   1
8  0   3   1
9  0   8   1
10 0  13   0
11 0   3   0
12 1  45   0
13 0   6   0
14 1  61   1
15 0   9   1
16 0   4   0
17 1   5   1
18 1  56   1
19 0  11   0
20 1   3   1

Three rows containing NAs were removed, leaving 88 observations. The data were saved in a CSV file and will be used by the Python version of logistic regression.

Plotting the data gives a first look at the relationship between age, sex, and y (survival).

  p <- ggplot(df_donner) +
    geom_point(aes(x = age, y = y, color = as.factor(sex)))
  p <- p +
    scale_color_discrete(name = 'sex',
                         labels = c("Female", "Male"))
  
  print(p)

The plot shows that there is a relationship among the variables y, sex, and age, but its exact form is not obvious.

The Logistic Function

Data such as the Donner data above present us with a binary classification problem. The dependent variable takes on one of two possible values; in the data above, y can be either 0 or 1. Modeling such data can be thought of as a learning problem. We wish to learn a model which will allow us to predict the class, 0 or 1, conditioned on the independent variables. Once we have a model, we could use it to predict the class of new data items. For example, if we were trapped in the high Sierra mountains with no food, we could predict whether we would survive or not. ;-(

The principled approach is to split the data into training and test sets: train the model, then test its performance on the test data, performing cross-validation to ensure that the model is robust. For the purposes of this article, we will cheat and simply model the Donner data.

We want to create a model that predicts the probability that y equals 1 based on some function of sex and age.

\[P(y = 1|X) = f(X\beta )\]

X is the data matrix consisting of the columns age and sex with a column of 1's prepended as an intercept. $\beta $ is a column vector of coefficients.

Why $X\beta $ and not some other form? We could use something more complicated, but it's best to start simply. A linear combination of the columns is probably the simplest form that might yield useful predictions.

What form should $f(X\beta ) $ take? It should have a range of 0 to 1 and be such that some values are mapped close to 0 and others close to 1. Here are three possible candidates, shown in one dimension. There are many more.



tanh is the hyperbolic tangent function. In one dimension it is defined as:

\[\tanh (x) = \frac{{{e^x} - {e^{ - x}}}}{{{e^x} + {e^{ - x}}}}\]

The range of tanh is from -1 to 1, so to use it we would rescale it to the range 0 to 1, for example as $(\tanh (x) + 1)/2$.

The logistic function is:

\[\sigma (x) = \frac{1}{{1 + {e^{ - \frac{{x - \mu }}{s}}}}} = \frac{1}{{1 + {e^{ - ({\beta _0} + {\beta _1}x)}}}}\]

The step function is:

\[step(x) = \begin{array}{*{20}{c}}{1\,if\,x \ge 0}\\{0\,if\,x < 0}\end{array}\]

We can easily extend these definitions to multiple dimensions.
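
If you want to see the candidates for yourself, here is a quick sketch (not part of the post's code, and assuming numpy and matplotlib are installed) that draws all three in one dimension.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-6, 6, 400)

plt.plot(x, (np.tanh(x) + 1) / 2, label = 'tanh rescaled to [0, 1]')
plt.plot(x, 1 / (1 + np.exp(-x)), label = 'logistic')
plt.step(x, (x >= 0).astype(float), label = 'step')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.legend()
plt.show()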

Since the title of this post is about logistic regression, you can guess which one we will choose.

\[P(y = 1|X;\beta ) = \sigma (X\beta )\]

The logistic function has some nice properties. In order to perform regression using the logistic function, we will calculate the likelihood of the data given the coefficients.

\[L(\beta) = \prod\limits_{i = 1}^m {P({y_i}|{x_i};\beta)} \]

In practice, we minimize the normalized negative log likelihood.

\[\begin{array}{c} - \frac{1}{m}\log (L(\beta)) =  - \frac{1}{m}\sum\limits_{i = 1}^m {\left[ {{y_i}\log (P({y_i} = 1|{x_i};\beta)) + (1 - {y_i})\log (1 - P({y_i} = 1|{x_i};\beta))} \right]} \\ =  - \frac{1}{m}\sum\limits_{i = 1}^m {\left[ {{y_i}\log (\sigma ({\beta^T}{x_i})) + (1 - {y_i})\log (1 - \sigma ({\beta^T}{x_i}))} \right]} \end{array}\]

One nice property of this function is that it is convex. If we can find the minimum value, it will be a global minimum. 

Again, in practice we would use any standard optimization technique to find the minimum of the negative log likelihood. In the code below, we use gradient descent.

The gradient of the negative log likelihood is

\[{\nabla _\beta}\left( { - \frac{1}{m}\log (L(\beta))} \right) = \frac{1}{m}{X^T}(\sigma (X\beta) - y)\]

We won't derive that expression. It's an application of the chain rule. You can see a derivation here.
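
If you would rather not take the gradient on faith, a quick numerical check is to compare it against a finite-difference approximation of the negative log likelihood. This is only a sketch on toy data, not part of the code below.

import numpy as np
from scipy.special import expit

def neg_log_likelihood(beta, X, y):
    # normalized negative log likelihood from the text
    p = expit(np.matmul(X, beta))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) / X.shape[0]

rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size = (20, 2))])  # toy data with an intercept column
y = rng.integers(0, 2, size = (20, 1)).astype(float)
beta = rng.normal(size = (3, 1))

# analytic gradient: (1/m) X^T (sigma(X beta) - y)
analytic = np.matmul(X.T, expit(np.matmul(X, beta)) - y) / X.shape[0]

# central finite differences, one coefficient at a time
eps = 1.0e-06
numeric = np.zeros_like(beta)
for j in range(beta.shape[0]):
    b_plus, b_minus = beta.copy(), beta.copy()
    b_plus[j] += eps
    b_minus[j] -= eps
    numeric[j] = (neg_log_likelihood(b_plus, X, y) - neg_log_likelihood(b_minus, X, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be tiny, close to round-off level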

To minimize the negative log likelihood, we will iteratively update the coefficients using a simple rule.

\[\beta  = \beta  - \alpha {\nabla _\beta }\left( { - \frac{1}{m}\log (L(\beta ))} \right)\]

We keep updating $\beta $ until the norm of the gradient becomes very small. $\alpha $ is called the learning rate or step size.

To make predictions, we calculate

\[\begin{array}{l}1\,if\,P(y = 1|X;\beta ) > 0.5\\0\,if\,P(y = 1|X;\beta ) \le 0.5\end{array}\]
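
Putting the update rule and the prediction rule together, a bare-bones version of the whole procedure looks something like this. It's only a sketch; the class in the next section does the same thing with a stopping tolerance, debugging output, and a scikit-learn-like interface.

import numpy as np
from scipy.special import expit  # the logistic function

def fit_logistic(X, y, learn_rate = 0.01, maxiter = 1000):
    m, n = X.shape
    beta = np.zeros((n, 1))
    for _ in range(maxiter):
        grad = np.matmul(X.T, expit(np.matmul(X, beta)) - y) / m  # gradient of the negative log likelihood
        beta -= learn_rate * grad                                 # beta = beta - alpha * gradient
    return beta

def predict(X, beta):
    return (expit(np.matmul(X, beta)) > 0.5).astype(int)          # threshold P(y = 1|X) at 0.5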

The choice of the logistic function leads to an interesting property. We can calculate the odds that a value of 1 will be observed.

\[\begin{array}{c}\frac{{P(y = 1|X;\beta )}}{{P(y = 0|X;\beta )}} = \frac{{\sigma (X\beta )}}{{1 - \sigma (X\beta )}}\\ = \frac{{\frac{1}{{1 + {e^{ - X\beta }}}}}}{{1 - \frac{1}{{1 + {e^{ - X\beta }}}}}}\\ = {e^{X\beta }}\end{array}\]

Taking the log of both sides we obtain

\[\log (\frac{{P(y = 1|X;\beta )}}{{P(y = 0|X;\beta )}}) = X\beta \]

The log odds is a linear function of the data.

Binary Logistic Regression

Now for some code. A class for logistic regression is defined below. Defining a class is probably overkill for this simple example, but it mirrors the approach taken in scikit-learn: an object of the class is created with a few parameters, the fit method is called, and attributes can be accessed to obtain the coefficients.

Here's the code.


#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
lr.py - a class for binary logistic regression
author: Bill Thompson
license: GPL 3
copyright: 2020-11-05
"""
import numpy as np
from scipy.linalg import norm
from scipy.special import expit  # logistic function

class lr(object):
    def __init__(self, 
            maxiter = 1000, 
            tol = 1.0e-06, 
            learn_rate = 0.01,
            debug = False):
        """
        Initialize object instance and set some parameters.
        Parameters
        ----------
        maxiter : int, optional
            maximum number of iterations for fit, by default 1000
        tol : float, optional
            exit fit when gradient is less than tol, by default 1.0e-06
        learn_rate : float, optional
            learning rate, decrease coefficients by this times gradient, by default 0.01
        debug : bool, optional
            print debugging information, by default False
        """            
        self.maxiter = maxiter
        self.tol = tol
        self.learn_rate = learn_rate
        self.debug = debug

    def probability(self, X):
        """
        Calculate P(Y = 1|X, w)

        Parameters
        ----------
        X : numpy array
            n X m data matrix of observations

        Returns
        -------
        numpy array n X 1
            1/(1 + exp(-Xw)),
        """        
        return expit(np.matmul(X, self.weights))

    def predict(self, X):
        """
        Calculate predictions P(Y = 1|X, w) > 0.5

        Parameters
        ----------
        X : numpy array
            n X m data matrix of observations

        Returns
        -------
        numpy array
            0/1 prediction of y
        """        
        return np.rint(self.probability(X)).astype(int)

    def score(self, X, y):
        """
        Calculate the number of differences between y and the current prediction

        Parameters
        ----------
        X : numpy array
            n X m data matrix of observations
        y : numpy array
            n X 1 dependent variable from data CSV

        Returns
        -------
        int
            number of differences between predicted labels and data labels
        """        
        return self._diffs(y, self.predict(X))

    def _diffs(self, y, y_hat):
        """
        Calculate the number of differences between y and the current prediction

        Parameters
        ----------
        y : numpy array
            n X 1 dependent variable from data CSV
        y_hat : numpy array
            current prediction of P(Y = 1|X, w)

        Returns
        -------
        int
            number of differences between predicted labels and data labels
        """        
        labs = np.rint(y_hat).astype(int) # predicted labels
        return np.sum(np.abs(y - labs))

    @property
    def weights(self):
        """
        return current coefficients

        Returns
        -------
        numpy array
            m X 1 array of coefficients
        """        
        return self._w

    def log_likelihood(self, X, y):
        """
        Calculate negative log likelihood of data given current coefficients

        Parameters
        ----------
        X : numpy array
            n X m data matrix of observations
        y : numpy array
            n X 1 dependent variable from data CSV

        Returns
        -------
        float
            negative log likelihood of data given current coefficients
        """        
        p = self.probability(X)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    def fit(self, X, y):
        """
        Fit y ~ X

        Parameters
        ----------
        X : numpy array
            n X m data matrix of observations
        y : numpy array
            n X 1 dependent variable from data CSV

        Returns
        -------
        lr object
            this object. coefficients can be obtained from weight method.
        """        
        m, n = X.shape

        self._w = np.zeros((n, 1))   # coefficients

        iteration = 0
        while iteration < self.maxiter:
            y_hat = self.probability(X)   # current estimate
            grad = np.matmul(X.transpose(), (y_hat - y)) / m  # gradient
            self._w -= self.learn_rate * grad  # update coefficients

            if self.debug and (iteration % 100 == 0):
                print(iteration)
                print(self.weights)
                print(self._diffs(y, y_hat))
                print(grad, norm(grad)**2)
                print()

            # quit if the gradient is flat
            if norm(grad) ** 2 < self.tol:
                break

            iteration += 1
            
        return self

Here's a short demo using the Donner data.


def main():
    args = GetArgs()

    data = pd.read_csv(args.data_file)
    X, y = get_data(data)

    reg = lr(maxiter = args.max_iter, 
            tol = args.tolerance, 
            learn_rate = args.learning_rate,
            debug = args.debug)
    reg.fit(X, y)

    print(reg.weights, reg.score(X, y))
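
The helper functions GetArgs and get_data aren't shown here; they're in the full code on GitHub. As a rough sketch, get_data might look something like this, assuming the CSV has the y, age, and sex columns produced by the R code above and that X needs a column of 1's prepended for the intercept.

import numpy as np
import pandas as pd

def get_data(data):
    # hypothetical sketch of the helper, not the version in the repository
    y = data['y'].values.reshape(-1, 1)              # n X 1 column vector
    X = data[['age', 'sex']].values.astype(float)    # age and sex columns
    X = np.hstack([np.ones((X.shape[0], 1)), X])     # prepend a column of 1's for the intercept
    return X, y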

We can run the code as

python src/logreg.py data/donner.csv --max_iter 100000 --learning_rate 0.01 --tolerance 1.0e-10

We give it a large number of iterations and a very small tolerance for declaring the gradient flat. Even so, the code is fairly fast and executes quickly.

It prints

[[-1.6215459 ]
 [ 0.03560203]
 [ 1.06773709]] 29

The first list contains the intercept and the coefficients for age and sex. The number following is the number of misclassified observations.
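
Because the log odds is a linear function of the data (see above), the coefficients have a direct interpretation as log odds ratios. Exponentiating them, just as a quick check:

import numpy as np

# age and sex coefficients from the fit above
print(np.exp([0.03560203, 1.06773709]))  # roughly 1.036 and 2.909

In other words, each additional year of age multiplies the odds of dying by about 1.04, and being male multiplies them by about 2.9, all else being equal.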

To confirm that it is working correctly, here's logistic regression on the same data using R.


> summary(model_donner <- glm(y ~ age + sex, data = df_donner, family = 'binomial'))

Call:
glm(formula = y ~ age + sex, family = "binomial", data = df_donner)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7386  -1.0261  -0.6511   1.0383   1.8828  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) -1.62180    0.50279  -3.226  0.00126 **
age          0.03561    0.01525   2.336  0.01952 * 
sex          1.06798    0.48229   2.214  0.02680 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 120.86  on 87  degrees of freedom
Residual deviance: 108.87  on 85  degrees of freedom
AIC: 114.87

Number of Fisher Scoring iterations: 4

> 

The full code and data are available on GitHub.
