Link: Logistic regression Python

Exploratory data analysis with charts

Count plot
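A minimal sketch of a count plot with seaborn, assuming a Titanic-style `train` DataFrame (the small inline DataFrame here is a stand-in for the real data loaded from CSV):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed in a notebook
import pandas as pd
import seaborn as sns

# Hypothetical Titanic-style data; in practice, train = pd.read_csv(...)
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1],
    "Sex": ["male", "female", "female", "male", "male", "female"],
})

# One bar per class of x; hue splits each class by a second categorical column
ax = sns.countplot(x="Survived", hue="Sex", data=train)
```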


Distribution plot

method 1

# the kde argument will show a kde curve

method 2
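A sketch of method 2 using pandas' built-in plotting (a thin wrapper over matplotlib), again assuming a numeric Series like `train['Age']`:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed in a notebook
import pandas as pd

# Hypothetical ages; in practice this would be train['Age'].dropna()
ages = pd.Series([22, 38, 26, 35, 35, 54, 2, 27, 14, 4])

# pandas calls matplotlib's hist under the hood; one bar per bin
ax = ages.plot.hist(bins=5)
```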


Data cleaning

Define and apply an imputation function

def impute_age(cols):
    Age = cols[0]
    Pclass = cols[1]
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age

# Apply the function using apply()
train['Age'] = train[['Age', 'Pclass']].apply(impute_age, axis=1)

Drop/remove columns
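A minimal sketch of dropping a column with `DataFrame.drop`, assuming a Titanic-style frame where `Cabin` has too many missing values to keep (the inline DataFrame is a placeholder):

```python
import pandas as pd

# Hypothetical Titanic-style data
train = pd.DataFrame({
    "Age": [22, 38],
    "Cabin": ["C85", None],
    "Fare": [7.25, 71.28],
})

# axis=1 targets columns; inplace=True modifies train directly
train.drop("Cabin", axis=1, inplace=True)
```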


Drop NAs
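A sketch of dropping the remaining rows with missing values via `dropna`, on placeholder data:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Age
train = pd.DataFrame({
    "Age": [22, np.nan, 26],
    "Fare": [7.25, 71.28, 8.05],
})

# drop any row that still contains a missing value
train.dropna(inplace=True)
```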


Convert categorical features

Convert categorical variables to indicator variables so that the model can work with them (e.g. Female/Male -> 0/1)


The first argument alone would convert Sex into two indicator columns (Female and Male, each 0/1), which introduces multicollinearity. Therefore we add drop_first=True to keep only a single column.

Multicollinearity occurs when two or more independent variables are highly correlated with one another, which makes it difficult to isolate the individual effect of each variable.
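The conversion above can be sketched with `pd.get_dummies` on a placeholder frame; `drop_first=True` keeps n-1 of the n indicator columns:

```python
import pandas as pd

# Hypothetical Titanic-style categorical columns
train = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# drop_first=True drops the first (alphabetical) category,
# leaving one column for Sex and two for Embarked
sex = pd.get_dummies(train["Sex"], drop_first=True)
embark = pd.get_dummies(train["Embarked"], drop_first=True)
```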

Concatenate columns

train = pd.concat([train, sex, embark], axis=1)

Build LR model

from sklearn.linear_model import LogisticRegression
# set max_iter to avoid the convergence warning
logmodel = LogisticRegression(max_iter=1000)
logmodel.fit(X_train, y_train)
predictions = logmodel.predict(X_test)
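End to end, the fit/predict step can be sketched on synthetic data (a stand-in for the cleaned Titanic features, with `train_test_split` supplying the X_train/X_test split assumed above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the cleaned Titanic frame
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

logmodel = LogisticRegression(max_iter=1000)
logmodel.fit(X_train, y_train)
predictions = logmodel.predict(X_test)
```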

Evaluate the model

Print precision, recall, and accuracy via a classification report

from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))

Print the confusion matrix

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, predictions))