Link: Logistic regression Python
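Setup assumed by all the snippets below. This is a sketch: the CSV file name and the train DataFrame name are assumptions (the columns used later, such as Survived, Pclass, Cabin, match the standard Titanic dataset).
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load the Titanic training data; the file name is a hypothetical placeholder
train = pd.read_csv('titanic_train.csv')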
Exploratory data analysis with charts
Count plot
sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Sex',data=train,palette='RdBu_r')
Distribution plot
Method 1
# the kde argument will show a kde curve
sns.displot(train['Age'].dropna(),kde=True,color='darkred',bins=30)
Method 2
train['Age'].hist(bins=30,color='darkred',alpha=0.7)
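The cleaning steps below target missing values; a quick way to spot them first is a heatmap of nulls (a sketch, assuming the same train DataFrame):
# Each highlighted cell marks a missing value; Age and Cabin typically stand out
sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap='viridis')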
Data cleaning
Define and apply an imputation function
def impute_age(cols):
    Age = cols[0]
    Pclass = cols[1]
    # Fill a missing age with a typical age for that passenger class
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age
# Apply the function using apply()
train['Age'] = train[['Age', 'Pclass']].apply(impute_age, axis=1)
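The values 37/29/24 are typical ages per passenger class; one way to derive such values is to look at the per-class average (a sketch, using the median as an assumption):
# Typical age for each passenger class; the hard-coded values above approximate these
train.groupby('Pclass')['Age'].median()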
Drop/remove columns
train.drop('Cabin', axis=1, inplace=True)
Drop NAs
train.dropna(inplace=True)
Convert categorical features
Convert categorical variables to indicator variables so that the model can work with them (e.g. Female/Male -> 0/1)
pd.get_dummies(train['Sex'], drop_first=True)
Without drop_first, get_dummies creates two columns (female and male) that are perfectly redundant (each is 1 minus the other), which introduces multicollinearity. Adding drop_first=True keeps a single indicator column.
Multicollinearity occurs when two or more independent variables are highly correlated with one another, which makes it difficult to isolate the individual effect of each variable.
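A toy illustration of the difference (the sample values are made up):
s = pd.Series(['male', 'female', 'male'])
pd.get_dummies(s)                   # two redundant columns: female, male
pd.get_dummies(s, drop_first=True)  # single column: male (0 = female, 1 = male)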
Concatenate columns
train = pd.concat([train, sex, embark], axis=1)
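Here sex and embark are the dummy DataFrames from get_dummies; a sketch of their definitions and the follow-up cleanup (the Embarked column and the dropped-column list are assumptions based on the standard Titanic dataset):
sex = pd.get_dummies(train['Sex'], drop_first=True)
embark = pd.get_dummies(train['Embarked'], drop_first=True)
# After the concat, drop the original text columns now that indicator columns exist
train.drop(['Sex', 'Embarked', 'Name', 'Ticket'], axis=1, inplace=True)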
Build LR model
from sklearn.linear_model import LogisticRegression
# set max_iter to avoid the convergence warning
logmodel = LogisticRegression(max_iter=1000)
logmodel.fit(X_train, y_train)
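X_train, y_train, and the predictions evaluated below come from a standard train/test split; a sketch (the test_size and random_state values are assumptions):
from sklearn.model_selection import train_test_split

X = train.drop('Survived', axis=1)
y = train['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

# After fitting, generate the predictions used in the evaluation step
predictions = logmodel.predict(X_test)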
Evaluate the model
Print the classification report (precision, recall, F1 score, and accuracy)
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))
Print the confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, predictions)
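In scikit-learn's convention, rows are true labels and columns are predicted labels, so for this binary problem the matrix reads:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_test, predictions)
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()  # matches the report's accuracy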