The world doesn't really need another description of how to code logistic regression. A good description of how to implement it can already be found here. In addition, there are many excellent packages: sklearn.linear_model.LogisticRegression in Python, glm in R, GLM in Julia, and many more.
I started following this course on Udemy. It begins with a brief discussion of logistic regression. I have used logistic regression many times, but I never had a clear picture of how it is actually implemented, so I thought I might as well try it myself. What follows is a very simple implementation of binary logistic regression.
Logistic Regression - Who lives, who dies?
Consider the following data, from the R package alr4. It describes the fate of the infamous Donner party and consists of 91 observations of five variables. The important columns for our purposes are age, y (survival), and sex. We want to know how age and sex affected survival of the members of the Donner party in the terrible circumstances in which they found themselves.
> head(Donner, n = 20)
                           age        y    sex family.name status
Breen_Edward_               13 survived   Male       Breen Family
Breen_Margaret_Isabella      1 survived Female       Breen Family
Breen_James_Frederick        5 survived   Male       Breen Family
Breen_John                  14 survived   Male       Breen Family
Breen_Margaret_Bulger       40 survived Female       Breen Family
Breen_Patrick               51 survived   Male       Breen Family
Breen_Patrick_Jr.            9 survived   Male       Breen Family
Breen_Peter                  3 survived   Male       Breen Family
Breen_Simon_Preston          8 survived   Male       Breen Family
Donner_Elitha_Cumi          13 survived Female    G_Donner Family
Donner_Eliza_Poor            3 survived Female    G_Donner Family
Donner_Elizabeth_Blue_Hook  45     died Female    J_Donner Family
Donner_Frances_Eustis        6 survived Female    G_Donner Family
Donner_George               61     died   Male    G_Donner Family
Donner_George_Jr.            9 survived   Male    J_Donner Family
Donner_Georgia_Ann           4 survived Female    G_Donner Family
Donner_Isaac                 5     died   Male    J_Donner Family
Donner_Jacob                56     died   Male    J_Donner Family
Donner_Leanna_Charity       11 survived Female    G_Donner Family
Donner_Lewis                 3     died   Male    J_Donner Family
> df_donner <- Donner %>%
select(y, age, sex) %>%
drop_na(any_of('age')) %>%
mutate(y = ifelse(y == 'survived', 0, 1)) %>%
mutate(sex = ifelse(sex == 'Female', 0, 1)) %>%
remove_rownames()
> head(df_donner, n = 20)
   y age sex
1  0  13   1
2  0   1   0
3  0   5   1
4  0  14   1
5  0  40   0
6  0  51   1
7  0   9   1
8  0   3   1
9  0   8   1
10 0  13   0
11 0   3   0
12 1  45   0
13 0   6   0
14 1  61   1
15 0   9   1
16 0   4   0
17 1   5   1
18 1  56   1
19 0  11   0
20 1   3   1
p <- ggplot(df_donner) + geom_point(aes(x = age, y = y, color = as.factor(sex)))
p <- p + scale_color_discrete(name = 'sex', labels = c("Female", "Male"))
print(p)
The plot shows that there is a relationship among the variables y, sex, and age, but it's not obvious.
The Logistic Function
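At the heart of the model is the logistic (sigmoid) function, σ(t) = 1 / (1 + exp(-t)), which maps any real number into the interval (0, 1), so its output can be read as a probability. In the code below it is computed with scipy.special.expit, a numerically stable implementation of the same function. A quick illustration (not part of the original script):

import numpy as np
from scipy.special import expit  # expit(t) = 1 / (1 + exp(-t))

t = np.array([-5.0, 0.0, 5.0])
print(expit(t))  # [0.00669285 0.5        0.99330715]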
Binary Logistic Regression
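In outline, the model assumes P(y = 1 | x, w) = σ(x·w), where x is one row of the data matrix X (including a leading 1 for the intercept) and w is the vector of coefficients. The coefficients are chosen to minimize the negative log likelihood, -Σ [ y log(p) + (1 - y) log(1 - p) ] with p = σ(Xw). The class below does this with plain gradient descent: the gradient of the loss, averaged over the m observations, is Xᵀ(p - y) / m, and at each step the weights move a small distance against it, w ← w - η Xᵀ(p - y) / m, where η is the learning rate. The loop exits when the squared norm of the gradient falls below tol or after maxiter iterations.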
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
lr.py - a class for binary logistic regression
author: Bill Thompson
license: GPL 3
copyright: 2020-11-05
"""
import numpy as np
from scipy.linalg import norm
from scipy.special import expit  # logistic function

class lr(object):
    def __init__(self, maxiter = 1000, tol = 1.0e-06, learn_rate = 0.01, debug = False):
        """
        Initialize object instance and set some parameters.

        Parameters
        ----------
        maxiter : int, optional
            maximum number of iterations for fit, by default 1000
        tol : float, optional
            exit fit when gradient is less than tol, by default 1.0e-06
        learn_rate : float, optional
            learning rate, decrease coefficients by this times gradient, by default 0.01
        debug : bool, optional
            print debugging information, by default False
        """
        self.maxiter = maxiter
        self.tol = tol
        self.learn_rate = learn_rate
        self.debug = debug

    def probability(self, X):
        """
        Calculate P(Y = 1|X, w)

        Parameters
        ----------
        X : numpy array
            n X m data matrix of observations

        Returns
        -------
        numpy array
            n X 1 array, 1/(1 + exp(-Xw))
        """
        return expit(np.matmul(X, self.weights))

    def predict(self, X):
        """
        Calculate predictions P(Y = 1|X, w) > 0.5

        Parameters
        ----------
        X : numpy array
            n X m data matrix of observations

        Returns
        -------
        numpy array
            0/1 prediction of y
        """
        return np.rint(self.probability(X)).astype(int)

    def score(self, X, y):
        """
        Calculate number of differences between y and current prediction

        Parameters
        ----------
        X : numpy array
            n X m data matrix of observations
        y : numpy array
            n X 1 dependent variable from data CSV

        Returns
        -------
        int
            number of differences between predicted labels and data labels
        """
        return self._diffs(y, self.predict(X))

    def _diffs(self, y, y_hat):
        """
        Calculate number of differences between y and current prediction

        Parameters
        ----------
        y : numpy array
            n X 1 dependent variable from data CSV
        y_hat : numpy array
            current prediction of P(Y = 1|X, w)

        Returns
        -------
        float
            number of differences between predicted labels and data labels
        """
        labs = np.rint(y_hat).astype(int)  # predicted labels
        return np.sum(np.abs(y - labs))

    @property
    def weights(self):
        """
        Return current coefficients

        Returns
        -------
        numpy array
            m X 1 array of coefficients
        """
        return self._w

    def log_likelihood(self, X, y):
        """
        Calculate negative log likelihood of data given current coefficients

        Parameters
        ----------
        X : numpy array
            n X m data matrix of observations
        y : numpy array
            n X 1 dependent variable from data CSV

        Returns
        -------
        float
            negative log likelihood of data given current coefficients
        """
        p = self.probability(X)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    def fit(self, X, y):
        """
        Fit y ~ X

        Parameters
        ----------
        X : numpy array
            n X m data matrix of observations
        y : numpy array
            n X 1 dependent variable from data CSV

        Returns
        -------
        lr object
            this object. coefficients can be obtained from the weights property.
        """
        m, n = X.shape
        self._w = np.zeros((n, 1))  # coefficients

        iter = 0
        while iter < self.maxiter:
            y_hat = self.probability(X)                        # current estimate
            grad = np.matmul(X.transpose(), (y_hat - y)) / m   # gradient
            self._w -= self.learn_rate * grad                  # update coefficients

            if self.debug and (iter % 100 == 0):
                print(iter)
                print(self.weights)
                print(self._diffs(y, y_hat))
                print(grad, norm(grad)**2)
                print()

            # quit if gradient is flat
            if norm(grad) ** 2 < self.tol:
                break

            iter += 1

        return self
def main():
    args = GetArgs()

    data = pd.read_csv(args.data_file)
    X, y = get_data(data)

    reg = lr(maxiter = args.max_iter,
             tol = args.tolerance,
             learn_rate = args.learning_rate,
             debug = args.debug)
    reg.fit(X, y)

    print(reg.weights, reg.score(X, y))
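GetArgs and get_data are small helpers in the full script: GetArgs parses the command line, and get_data turns the data frame (presumably df_donner written out from R as data/donner.csv) into the X matrix and y vector. Since the fit produces three coefficients, get_data evidently prepends a column of ones to X so the first weight acts as the intercept. The original helper isn't shown; here is a minimal, hypothetical sketch of what it might look like, assuming the CSV has columns y, age, and sex:

import numpy as np
import pandas as pd

def get_data(data):
    # Hypothetical reconstruction of the helper used in main() above.
    # Dependent variable as an n x 1 column vector.
    y = data['y'].values.reshape(-1, 1)

    # Feature matrix with a leading column of ones for the intercept.
    X = np.column_stack([np.ones(len(data)),
                         data['age'].values,
                         data['sex'].values])
    return X, y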
python src/logreg.py data/donner.csv --max_iter 100000 --learning_rate 0.01 --tolerance 1.0e-10
[[-1.6215459 ]
 [ 0.03560203]
 [ 1.06773709]] 29
For comparison, here is the same model fit with R's glm:

> summary(model_donner <- glm(y ~ age + sex, data = df_donner, family = 'binomial'))

Call:
glm(formula = y ~ age + sex, family = "binomial", data = df_donner)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7386  -1.0261  -0.6511   1.0383   1.8828  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) -1.62180    0.50279  -3.226  0.00126 **
age          0.03561    0.01525   2.336  0.01952 * 
sex          1.06798    0.48229   2.214  0.02680 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 120.86  on 87  degrees of freedom
Residual deviance: 108.87  on 85  degrees of freedom
AIC: 114.87

Number of Fisher Scoring iterations: 4
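As one more sanity check (not in the original post), scikit-learn should land in roughly the same place once its default regularization is made negligible. A minimal sketch, assuming the same data/donner.csv with columns y, age, and sex:

import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv('data/donner.csv')   # assumed path and column names
X = df[['age', 'sex']].values
y = df['y'].values

# C is the inverse regularization strength; a very large value makes the
# default L2 penalty negligible, so the fit approaches the plain maximum
# likelihood estimates produced by glm and the gradient descent code above.
clf = LogisticRegression(C=1e9, max_iter=10000)
clf.fit(X, y)

print(clf.intercept_, clf.coef_)      # expect roughly -1.62 and [0.036, 1.07]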