## Exploratory data analysis by chart §

### Count plot §

```python
sns.set_style('whitegrid')
sns.countplot(x='Survived', hue='Sex', data=train, palette='RdBu_r')
```

### Distribution plot §

Method 1: seaborn `displot`

```python
# the kde argument overlays a KDE curve on the histogram
sns.displot(train['Age'].dropna(), kde=True, color='darkred', bins=30)
```

Method 2: pandas `hist`

```python
train['Age'].hist(bins=30, color='darkred', alpha=0.7)
```

## Data cleaning §

### Define and apply an imputation function §

```python
# Impute a missing Age with a typical age for the passenger's class
def impute_age(cols):
    Age = cols['Age']
    Pclass = cols['Pclass']

    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age

# Apply the function row-wise using apply() with axis=1
train['Age'] = train[['Age', 'Pclass']].apply(impute_age, axis=1)
```
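To see the mechanics in isolation, here is a minimal sketch of the same imputation on a tiny synthetic frame (hypothetical data, not the real Titanic set):

```python
import numpy as np
import pandas as pd

def impute_age(cols):
    """Fill a missing Age with a typical age for the passenger's class."""
    Age, Pclass = cols['Age'], cols['Pclass']
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    return Age

# Tiny made-up frame: one known age, one missing age per class
df = pd.DataFrame({'Age': [22.0, np.nan, np.nan, np.nan],
                   'Pclass': [3, 1, 2, 3]})
df['Age'] = df[['Age', 'Pclass']].apply(impute_age, axis=1)
print(df['Age'].tolist())  # [22.0, 37.0, 29.0, 24.0]
```

Only the missing values are replaced; the existing age of 22 passes through unchanged.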

### Drop/remove columns §

```python
train.drop('Cabin', axis=1, inplace=True)
```

### Drop NAs §

```python
train.dropna(inplace=True)
```

## Convert categorical features §

Convert categorical variables to indicator (dummy) variables so that the model can work with them (e.g. Female/Male -> 0/1).

```python
pd.get_dummies(train['Sex'], drop_first=True)
```

Without `drop_first`, `get_dummies` converts `Sex` into two 0/1 columns (female and male). Each column is perfectly predictable from the other, which causes multicollinearity, so we pass `drop_first=True` to keep a single column.

Multicollinearity occurs when two or more independent variables are highly correlated with one another, which makes it difficult to isolate the individual effect of each variable.
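A minimal sketch on a hypothetical `Sex` Series (names assumed for illustration) shows which column `drop_first` removes:

```python
import pandas as pd

sex = pd.Series(['male', 'female', 'female', 'male'], name='Sex')

full = pd.get_dummies(sex)                     # two redundant columns
single = pd.get_dummies(sex, drop_first=True)  # first category dropped

print(list(full.columns))    # ['female', 'male']
print(list(single.columns))  # ['male']
```

Since the categories are sorted alphabetically, `drop_first=True` drops `female`, leaving a single `male` indicator that still carries all the information.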

### Concatenate columns §

```python
# sex and embark hold the dummy columns created above
train = pd.concat([train, sex, embark], axis=1)
```
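After concatenating the dummies, the original text columns are typically dropped so only numeric features remain. A sketch on a hypothetical mini-frame (made-up data standing in for the Titanic set):

```python
import pandas as pd

# Hypothetical mini-frame with the two categorical columns
train = pd.DataFrame({'Survived': [0, 1, 1],
                      'Sex': ['male', 'female', 'female'],
                      'Embarked': ['S', 'C', 'Q']})

sex = pd.get_dummies(train['Sex'], drop_first=True)
embark = pd.get_dummies(train['Embarked'], drop_first=True)

train = pd.concat([train, sex, embark], axis=1)
# Drop the original text columns once the dummies exist
train.drop(['Sex', 'Embarked'], axis=1, inplace=True)

print(list(train.columns))  # ['Survived', 'male', 'Q', 'S']
```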

## Build LR model §

```python
from sklearn.linear_model import LogisticRegression

# set max_iter to avoid the convergence warning
logmodel = LogisticRegression(max_iter=1000)
logmodel.fit(x_train, y_train)
```
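The `x_train`/`y_train` arrays above come from a train/test split that is not shown. A minimal end-to-end sketch, using synthetic data from `make_classification` as a stand-in for the cleaned Titanic features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the cleaned feature matrix and labels
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

logmodel = LogisticRegression(max_iter=1000)
logmodel.fit(x_train, y_train)
predictions = logmodel.predict(x_test)
print(logmodel.score(x_test, y_test))  # mean accuracy on the test split
```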

## Evaluate the model §

Print the classification report (precision, recall, F1-score and accuracy):

```python
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))
```

Print the confusion matrix:

```python
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, predictions))
```
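To make the output concrete, a small sketch on hypothetical labels and predictions (made up for illustration):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and model predictions
y_test = [0, 0, 1, 1, 1, 0]
predictions = [0, 1, 1, 1, 0, 0]

print(classification_report(y_test, predictions))

cm = confusion_matrix(y_test, predictions)
# Rows are true classes, columns are predictions: [[TN, FP], [FN, TP]]
print(cm)  # [[2 1]
           #  [1 2]]
```

Here there are 2 true negatives, 1 false positive, 1 false negative and 2 true positives.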