Link: Logistic regression Python

Exploratory data analysis with charts

Count plot

sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Sex',data=train,palette='RdBu_r')

Distribution plot

Method 1

# the kde argument will show a kde curve
sns.displot(train['Age'].dropna(),kde=True,color='darkred',bins=30)

Method 2

train['Age'].hist(bins=30,color='darkred',alpha=0.7)

Data cleaning

Set and apply function

def impute_age(cols):
    # Positional access: each row passed in holds ['Age', 'Pclass']
    Age = cols.iloc[0]
    Pclass = cols.iloc[1]

    # Fill missing ages with the approximate average age of each class
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age

# Apply the function row by row using apply()
train['Age'] = train[['Age', 'Pclass']].apply(impute_age, axis=1)
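The hard-coded values (37, 29, 24) are roughly the average age within each passenger class. A quick sketch to sanity-check them with a boxplot, assuming the same train DataFrame:

# Age distribution per passenger class; the medians motivate 37/29/24
sns.boxplot(x='Pclass', y='Age', data=train)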

Drop/remove columns

train.drop('Cabin', axis=1, inplace=True)

Drop NAs

train.dropna(inplace=True)

Convert categorical features

Convert categorical variables to indicator (dummy) variables so that the model can work with them (e.g. Female/Male -> 0/1).

sex = pd.get_dummies(train['Sex'], drop_first=True)
embark = pd.get_dummies(train['Embarked'], drop_first=True)

Without drop_first, get_dummies would convert Sex into two columns (Female and Male with 0/1) that are perfectly correlated, which causes multicollinearity. Adding drop_first=True keeps only a single column.

Multicollinearity occurs when two or more independent variables are highly correlated with one another, which makes it difficult to isolate the individual effect of each variable.
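A minimal sketch of the difference, assuming the same train DataFrame (the two dummy columns are fully redundant, since male = 1 - female):

# Two columns: values in each row always sum to 1, so one is redundant
pd.get_dummies(train['Sex']).head()

# drop_first=True removes the redundant 'female' column
pd.get_dummies(train['Sex'], drop_first=True).head()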

Concatenate columns

train = pd.concat([train, sex, embark], axis=1)
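The fit step below uses x_train and y_train, so the remaining non-numeric columns have to be dropped and the data split first; a sketch assuming the standard Titanic column names and a 30% hold-out:

from sklearn.model_selection import train_test_split

# Drop the original text columns now that dummy versions exist
train.drop(['Sex', 'Embarked', 'Name', 'Ticket'], axis=1, inplace=True)

# Split features and target into training and test sets
x_train, x_test, y_train, y_test = train_test_split(
    train.drop('Survived', axis=1), train['Survived'],
    test_size=0.3, random_state=101)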

Build LR model

from sklearn.linear_model import LogisticRegression
# set max_iter to avoid the convergence warning
logmodel = LogisticRegression(max_iter=1000)
logmodel.fit(x_train, y_train)
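The evaluation steps below use a predictions variable; it comes from predicting on the held-out test set:

predictions = logmodel.predict(x_test)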

Evaluate the model

Print the classification report (precision, recall, F1-score, and accuracy)

from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))

Print the confusion matrix

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, predictions))
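For readability, the raw array can be wrapped in a labelled DataFrame; a small sketch, assuming the same y_test and predictions:

# Rows are actual classes, columns are predicted classes
pd.DataFrame(confusion_matrix(y_test, predictions),
             index=['actual 0', 'actual 1'],
             columns=['predicted 0', 'predicted 1'])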