  • patrickwalsh6079

Logistic Regression Made Easy

You may have heard of the term Logistic Regression. It sounds intimidating, like the sort of thing only super smart math geniuses and scientists can understand. While there is a lot of math involved in the inner workings of these models, you don't have to memorize advanced formulas or have an advanced degree to actually do Logistic Regression. Here, I am going to show you how to use Logistic Regression to produce a simple, straightforward answer to the following question: did this passenger survive?


Titanic dataset

We will be using the famous Titanic dataset, found here. This dataset is a tabular set of values containing information on the passengers aboard the Titanic when it sank on its maiden voyage in 1912. The columns of this dataset include passenger name, cabin, home and destination port, and a binary value for whether or not they survived. The dataset also includes whether the passengers are 1st class, 2nd class, or 3rd class, as well as their age and sex.



For our purposes, we will be trying to predict whether the passenger survived based only on their passenger class, age, and sex. Before we can perform Logistic Regression, however, there are a few preprocessing steps that must be taken with our data - a process known as data wrangling.


Data wrangling

Data wrangling refers to the process of exploring your dataset and transforming it into a format suited to further analytics, such as Machine Learning and Logistic Regression. Typical data wrangling tasks include changing the data format (for example, converting text to numbers), cleaning the data (for example, removing or filling in null values), and visualizing it.


In order to run Logistic Regression on our Titanic dataset, the data must all be in a numeric format. Since the values in the sex column are text, we will need to convert these to numbers. I added a column and used the Excel formula =IF(F2="female",0,1) to fill each row of it with a 0 if female, 1 if male.
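If you would rather do this step in code than in a spreadsheet, the same conversion can be sketched in Python. This is only an illustration: the three rows below are made up for the example (they are not records from the Titanic dataset), and the names passengers, encode_sex, and sex_code are my own.

```python
# Toy rows for illustration only; these are not records from the real dataset.
passengers = [
    {"pclass": 1, "age": 29.0, "sex": "female"},
    {"pclass": 3, "age": 40.0, "sex": "male"},
    {"pclass": 2, "age": None, "sex": "female"},
]


def encode_sex(value):
    """Mirror the Excel formula =IF(F2="female",0,1)."""
    return 0 if value == "female" else 1


# Add a new numeric column next to the original text column.
for row in passengers:
    row["sex_code"] = encode_sex(row["sex"])

sex_codes = [row["sex_code"] for row in passengers]
print(sex_codes)  # [0, 1, 0]
```

Like the Excel formula, this treats anything that is not exactly "female" as 1, so misspelled or blank values would silently become "male". On real data it is worth checking the distinct values in the column first.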











Next, we need to deal with rows that have empty cells. These are typically handled in one of two ways: delete the rows or fill the blanks with a substitute value, such as the column average. To keep things simple, I deleted them: about 300 rows had a blank age, so I removed those rows.


Hint: You may not want to delete rows if you have a very small dataset. With small datasets, it may be better to populate blank cells with another value, such as an average, so that you aren't losing data.
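Both options for handling blanks can be sketched in a few lines of Python. The ages below are made-up values for illustration, with None standing in for a blank cell; the variable names are my own.

```python
from statistics import mean

# Toy ages for illustration only; None marks a blank cell.
ages = [22.0, None, 38.0, None, 30.0]

# Option 1: delete the rows where age is blank.
kept = [age for age in ages if age is not None]

# Option 2: fill each blank with the average of the known ages.
average_age = mean(kept)
filled = [average_age if age is None else age for age in ages]

print(len(kept))  # 3
print(filled)     # [22.0, 30.0, 38.0, 30.0, 30.0]
```

Deleting shrinks the dataset from five rows to three, while filling keeps all five at the cost of inventing an age for two passengers.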


Now we can perform Logistic Regression.
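For readers who want to see what fitting a logistic model looks like in code, here is a minimal sketch in plain Python. It is not the method used to produce the statistics discussed in this post. It assumes eight made-up rows (not real Titanic records), ages divided by 100 so the inputs are on a similar scale, and a simple gradient-descent fit; the function names, learning rate, and number of passes are all my own choices.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, inputs):
    """Predicted probability of survival for one row of inputs."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return sigmoid(z)


def fit_logistic(X, y, learning_rate=0.1, epochs=5000):
    """Fit [intercept, w1, w2, ...] by batch gradient descent on log-loss."""
    n_rows, n_inputs = len(X), len(X[0])
    weights = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        gradient = [0.0] * (n_inputs + 1)
        for inputs, target in zip(X, y):
            error = predict_proba(weights, inputs) - target
            gradient[0] += error
            for j, x in enumerate(inputs):
                gradient[j + 1] += error * x
        weights = [w - learning_rate * g / n_rows
                   for w, g in zip(weights, gradient)]
    return weights


# Toy rows for illustration only: [pclass, age / 100, sex_code (0 = female)].
X = [
    [1, 0.29, 0], [1, 0.45, 1], [2, 0.30, 0], [2, 0.25, 1],
    [3, 0.22, 0], [3, 0.35, 1], [1, 0.50, 1], [3, 0.28, 0],
]
y = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = survived, 0 = did not survive

weights = fit_logistic(X, y)
p_female_first = predict_proba(weights, [1, 0.30, 0])
p_male_third = predict_proba(weights, [3, 0.30, 1])
print(p_female_first > p_male_third)  # True
```

The model outputs a probability between 0 and 1 for each passenger. To answer "did this passenger survive?", you would typically call it a yes when the probability is above 0.5.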


Analyzing the Results


The regression statistics provide valuable insights into our data model. For instance, we can state that 36.7% of the variation in the outcome (whether or not a passenger survives, in this case) can be explained by passenger class, age, and sex. Whether or not this value (known as the Adjusted R-Square) is high enough really depends on the use-case. Some business use-cases might demand a much higher Adjusted R-Square in order to narrow down the list of possible factors influencing a certain outcome. Strictly speaking, Adjusted R-Square and Significance of F are statistics of a linear least-squares fit to the 0/1 outcome; a logistic fit reports analogous measures, such as a pseudo R-squared and a likelihood-ratio p-value.
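To see where an Adjusted R-Square comes from, here is a small sketch of the standard formula, which shrinks the plain R-Square according to how many inputs the model uses. The numbers passed in are hypothetical and chosen for easy arithmetic; they are not the values from the Titanic analysis.

```python
def adjusted_r_squared(r_squared, n_rows, n_inputs):
    """Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_rows - 1) / (n_rows - n_inputs - 1)


# Hypothetical example: R-squared of 0.5 from 11 rows and 2 inputs.
adjusted = adjusted_r_squared(r_squared=0.5, n_rows=11, n_inputs=2)
print(adjusted)  # 0.375
```

With only 11 rows the penalty is large. With hundreds of rows and three inputs, as in this post, the adjusted value sits very close to the plain R-Square.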


To be confident that our Logistic Regression model can be trusted, we must also pay close attention to two other important metrics: Significance of F and the P-values of our Coefficients.




Significance of F

The Significance of F is the p-value of the F-test, which tells us whether our overall model is statistically significant. The rule of thumb is that a Significance of F under 0.05 is statistically significant: if none of the inputs had any real relationship with the outcome, a fit this good would show up less than 5% of the time. A value greater than 0.05 is not statistically significant and cannot be relied upon for accurate results. If your Significance of F is greater than 0.05, stop with your analysis. Your model is no good.
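The F statistic behind this number can be computed directly from the R-Square of a least-squares fit. The sketch below uses hypothetical inputs, not the values from the Titanic analysis. Turning the F statistic into a Significance of F requires the F distribution, which Python's standard library does not include, so that step is only described in a comment.

```python
def f_statistic(r_squared, n_rows, n_inputs):
    """Overall F statistic of a least-squares fit, computed from R-squared."""
    explained = r_squared / n_inputs
    unexplained = (1.0 - r_squared) / (n_rows - n_inputs - 1)
    return explained / unexplained


# Hypothetical example: R-squared of 0.5 from 11 rows and 2 inputs.
f_value = f_statistic(r_squared=0.5, n_rows=11, n_inputs=2)
print(f_value)  # 4.0

# The Significance of F is the upper-tail probability of f_value under an
# F distribution with (n_inputs, n_rows - n_inputs - 1) degrees of freedom.
# A statistics library such as SciPy can supply it; that step is not run here.
```

The larger the F statistic, the smaller the Significance of F, and the less likely it is that the fit is a fluke.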


P-values

The P-values carry a similar meaning: a value under 0.05 is considered statistically significant. If the P-value for a given coefficient is greater than 0.05, ignore that coefficient. It cannot be relied upon for statistical accuracy. The distinction between P-value and the Significance of F is that P-values refer to the reliability of individual coefficients while the Significance of F refers to the model as a whole.
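The 0.05 rule is easy to apply mechanically. Here is a tiny sketch with hypothetical P-values for three made-up inputs; they are not the P-values from the Titanic output.

```python
ALPHA = 0.05  # significance threshold used in this post

# Hypothetical P-values for illustration only.
p_values = {"input_a": 0.001, "input_b": 0.030, "input_c": 0.400}

# Keep coefficients under the threshold; set the rest aside.
significant = [name for name, p in p_values.items() if p < ALPHA]
ignored = [name for name, p in p_values.items() if p >= ALPHA]

print(significant)  # ['input_a', 'input_b']
print(ignored)      # ['input_c']
```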


Conclusion

This post introduced us to the concept of Logistic Regression using the Titanic dataset. Since the regression requires all inputs to be numeric, we first cleaned (or wrangled) our dataset by converting text fields to numbers and deleting rows with empty values. We then ran a simple regression and analyzed the Adjusted R-Square, Significance of F, and P-values of the coefficients.

