Feature Engineering
Introduction
Intuitively, feature engineering is the process of understanding the data intimately, so that we can handcraft new features that represent the dataset better and improve the model's predictions.
Some common methods are:
- Binning/bucketing: For example, in a dataset about home credit default rates, when collecting clients' ages it can make more sense to divide the range into categories: under 20, 20 to 30, 30 to 40, 40 to 50, and above 50. The reasoning is that clients under 20 are not allowed to take a loan, and clients above 50 can be grouped together since the most common ages for taking a loan are 20 to 50; that middle range is then divided into equal buckets. This unequal division of ages actually makes more sense and generalizes the age groups better.
- Polynomial features: We can take the square of a feature, for example, when we assume it has a nonlinear relationship with the target.
- Log transform: Variables with a long-tailed distribution can be log-transformed so that the new feature is closer to a normal distribution.
- Feature interaction: This combines different features, assuming they are related to one another. For example, we can combine the family-related features of a client (as a simple linear combination or a more complicated equation) into a new feature that summarizes the client's family status.
- Categorical feature handling: Since we usually need to transform categorical features into numerical ones, there are methods such as one-hot encoding (encode the value as a vector of 1s and 0s, with the 1 in the position of the category it belongs to) or label encoding (encode each category as a different integer).
- Date-time variables: If we have dates and times, we can add a lagged variable (the value of a feature at some day in the past) or compute the interval between two dates (for example, the age of the house or car of the client who applies for a loan).
- Feature scaling: Features differ in nature, so they naturally use different units and scales. That can hurt the model, since the model does not really grasp the differences in scale. We can bring all features onto one scale so that, in a way, the machine understands the dataset a bit better. The two most popular approaches are min-max scaling and standardization. In min-max scaling we rescale each feature to a fixed range, for example 0 to 1; this is also called normalization. In standardization we subtract the mean from each value and divide by the standard deviation of the sample. A minimal sketch of a few of these transformations follows this list.
Code example
The points above are general advice. In reality, the feature engineering process depends on the nature of the dataset itself (its dimensions, its purpose, the underlying patterns). Here we explore the Home Credit default risk dataset. Looking at the features, we can see there are about 50 features describing the building the client lives in. We can combine those features into a new one named "living_condition" using machine learning techniques such as kernel PCA and the K-means algorithm. Then we can add a financial_stress variable as a weighted combination of common factors such as the credit-to-income ratio, current income, number of children, family size, spouse situation, and other bills. Thirdly, we add a statistical aggregation of each client's previous loan applications to bring their credibility and creditworthiness into the dataset. Finally, we consider some other features, such as a red flag that takes credit evaluations from external sources into account, and polynomial features that capture a nonlinear relationship with the target.
Preprocessing
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
np.set_printoptions(suppress=True)
labels = pd.read_csv('home-credit-default-risk/HomeCredit_columns_description.csv',encoding='ISO-8859-1')
data = pd.read_csv('home-credit-default-risk/application_train.csv')
labels.head()
Unnamed: 0 | Table | Row | Description | Special | |
---|---|---|---|---|---|
0 | 1 | application_{train|test}.csv | SK_ID_CURR | ID of loan in our sample | NaN |
1 | 2 | application_{train|test}.csv | TARGET | Target variable (1 - client with payment diffi... | NaN |
2 | 5 | application_{train|test}.csv | NAME_CONTRACT_TYPE | Identification if loan is cash or revolving | NaN |
3 | 6 | application_{train|test}.csv | CODE_GENDER | Gender of the client | NaN |
4 | 7 | application_{train|test}.csv | FLAG_OWN_CAR | Flag if the client owns a car | NaN |
# First take all the names of the features related to the building
living_condition = labels['Row'][44:91]
living_condition
44 APARTMENTS_AVG
45 BASEMENTAREA_AVG
46 YEARS_BEGINEXPLUATATION_AVG
47 YEARS_BUILD_AVG
48 COMMONAREA_AVG
49 ELEVATORS_AVG
50 ENTRANCES_AVG
51 FLOORSMAX_AVG
52 FLOORSMIN_AVG
53 LANDAREA_AVG
54 LIVINGAPARTMENTS_AVG
55 LIVINGAREA_AVG
56 NONLIVINGAPARTMENTS_AVG
57 NONLIVINGAREA_AVG
58 APARTMENTS_MODE
59 BASEMENTAREA_MODE
60 YEARS_BEGINEXPLUATATION_MODE
61 YEARS_BUILD_MODE
62 COMMONAREA_MODE
63 ELEVATORS_MODE
64 ENTRANCES_MODE
65 FLOORSMAX_MODE
66 FLOORSMIN_MODE
67 LANDAREA_MODE
68 LIVINGAPARTMENTS_MODE
69 LIVINGAREA_MODE
70 NONLIVINGAPARTMENTS_MODE
71 NONLIVINGAREA_MODE
72 APARTMENTS_MEDI
73 BASEMENTAREA_MEDI
74 YEARS_BEGINEXPLUATATION_MEDI
75 YEARS_BUILD_MEDI
76 COMMONAREA_MEDI
77 ELEVATORS_MEDI
78 ENTRANCES_MEDI
79 FLOORSMAX_MEDI
80 FLOORSMIN_MEDI
81 LANDAREA_MEDI
82 LIVINGAPARTMENTS_MEDI
83 LIVINGAREA_MEDI
84 NONLIVINGAPARTMENTS_MEDI
85 NONLIVINGAREA_MEDI
86 FONDKAPREMONT_MODE
87 HOUSETYPE_MODE
88 TOTALAREA_MODE
89 WALLSMATERIAL_MODE
90 EMERGENCYSTATE_MODE
Name: Row, dtype: object
# Now preprocess the data a bit
data.head()
SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 122 columns
y_train = data['TARGET']
X_train = data.drop(['TARGET'], axis=1)
y_train = y_train.to_frame()
y_train
TARGET | |
---|---|
0 | 1 |
1 | 0 |
2 | 0 |
3 | 0 |
4 | 0 |
... | ... |
307506 | 0 |
307507 | 0 |
307508 | 0 |
307509 | 1 |
307510 | 0 |
307511 rows × 1 columns
# Let's handle categorical / numerical variables and missing values
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
categoricals = ['object']
X_train_categorical = X_train.select_dtypes(include=categoricals)
X_train_numerical = X_train.select_dtypes(include=numerics)
categorical_columns = X_train_categorical.columns
numerical_columns = X_train_numerical.columns
categorical_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
numerical_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# imputer = imputer.fit(X_train)
X_train_categorical = categorical_imputer.fit_transform(X_train_categorical)
X_train_categorical = pd.DataFrame(data=X_train_categorical, columns=categorical_columns)
X_train_numerical = numerical_imputer.fit_transform(X_train_numerical)
X_train_numerical = pd.DataFrame(data=X_train_numerical, columns=numerical_columns)
The thing about using a label encoder instead of a one-hot encoder is that a label encoder carries an inherent assumption that the values are ordered in a meaningful way. This may or may not reflect the qualitative meaning of the values in reality. For example, suppose we categorize houses into three districts (district 1, district 2, district 3) and encode them as 0, 1, and 2. Since 2 > 1, the encoding might suggest that district 3 is somehow "greater" than district 2, which may not reflect the real situation: there may be no inherent difference between the two locations (they could be equally distant from the center, for example). We can take this bias into account and create a new variable (via clustering or via distance to the center) to compensate for it. The same goes for the days of the week: the progression from Monday through Sunday is not inherently linear, and we can only hope the model has enough data to learn a useful representation. One-hot encoding, on the other hand, assumes the categories are all equal: it puts a 1 for the observed category and 0s for the others in the representation vector. For example, a house in district 2 would be represented as [0, 1, 0].
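As a quick illustration of the difference, here is a small sketch comparing label encoding and one-hot encoding on a toy district column (the column and values are made up for this example, not taken from the dataset):
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Toy categorical column
toy = pd.DataFrame({'district': ['district 1', 'district 2', 'district 3', 'district 2']})
# Label encoding: each category becomes an integer, which implies an ordering
toy['district_label'] = LabelEncoder().fit_transform(toy['district'])
# One-hot encoding: one binary column per category, no implied ordering
one_hot = pd.get_dummies(toy['district'], prefix='district')
print(toy)
print(one_hot)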
Building living_condition feature
X_train_categorical = X_train_categorical.apply(LabelEncoder().fit_transform)
X_train_categorical
NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | OCCUPATION_TYPE | WEEKDAY_APPR_PROCESS_START | ORGANIZATION_TYPE | FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 1 | 6 | 7 | 4 | 3 | 1 | 8 | 6 | 5 | 2 | 0 | 5 | 0 |
1 | 0 | 0 | 0 | 0 | 1 | 4 | 1 | 1 | 1 | 3 | 1 | 39 | 2 | 0 | 0 | 0 |
2 | 1 | 1 | 1 | 1 | 6 | 7 | 4 | 3 | 1 | 8 | 1 | 11 | 2 | 0 | 4 | 0 |
3 | 0 | 0 | 0 | 1 | 6 | 7 | 4 | 0 | 1 | 8 | 6 | 5 | 2 | 0 | 4 | 0 |
4 | 0 | 1 | 0 | 1 | 6 | 7 | 4 | 3 | 1 | 3 | 4 | 37 | 2 | 0 | 4 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
307506 | 0 | 1 | 0 | 0 | 6 | 7 | 4 | 2 | 5 | 14 | 4 | 43 | 2 | 0 | 5 | 0 |
307507 | 0 | 0 | 0 | 1 | 6 | 3 | 4 | 5 | 1 | 8 | 1 | 57 | 2 | 0 | 5 | 0 |
307508 | 0 | 0 | 0 | 1 | 6 | 7 | 1 | 2 | 1 | 10 | 4 | 39 | 2 | 0 | 4 | 0 |
307509 | 0 | 0 | 0 | 1 | 6 | 1 | 4 | 1 | 1 | 8 | 6 | 3 | 2 | 0 | 5 | 0 |
307510 | 0 | 0 | 0 | 0 | 6 | 1 | 1 | 1 | 1 | 8 | 4 | 5 | 2 | 0 | 4 | 0 |
307511 rows × 16 columns
# Some of the features are categorical ('FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE')
# the rest are numerical
living_condition_categoricals = ['FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE']
# living_condition_numericals = [e for e in living_condition if e not in living_condition_categoricals]
living_condition_numericals = np.setdiff1d(living_condition, living_condition_categoricals)
X_train_numerical[living_condition_numericals]
APARTMENTS_AVG | APARTMENTS_MEDI | APARTMENTS_MODE | BASEMENTAREA_AVG | BASEMENTAREA_MEDI | BASEMENTAREA_MODE | COMMONAREA_AVG | COMMONAREA_MEDI | COMMONAREA_MODE | ELEVATORS_AVG | ... | NONLIVINGAREA_AVG | NONLIVINGAREA_MEDI | NONLIVINGAREA_MODE | TOTALAREA_MODE | YEARS_BEGINEXPLUATATION_AVG | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_AVG | YEARS_BUILD_MEDI | YEARS_BUILD_MODE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.02470 | 0.02500 | 0.025200 | 0.036900 | 0.036900 | 0.038300 | 0.014300 | 0.014400 | 0.014400 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.014900 | 0.972200 | 0.972200 | 0.972200 | 0.619200 | 0.624300 | 0.634100 |
1 | 0.09590 | 0.09680 | 0.092400 | 0.052900 | 0.052900 | 0.053800 | 0.060500 | 0.060800 | 0.049700 | 0.080000 | ... | 0.009800 | 0.010000 | 0.000000 | 0.071400 | 0.985100 | 0.985100 | 0.985100 | 0.796000 | 0.798700 | 0.804000 |
2 | 0.11744 | 0.11785 | 0.114231 | 0.088442 | 0.087955 | 0.087543 | 0.044621 | 0.044595 | 0.042553 | 0.078942 | ... | 0.028358 | 0.028236 | 0.027022 | 0.102547 | 0.977735 | 0.977752 | 0.977065 | 0.752471 | 0.755746 | 0.759637 |
3 | 0.11744 | 0.11785 | 0.114231 | 0.088442 | 0.087955 | 0.087543 | 0.044621 | 0.044595 | 0.042553 | 0.078942 | ... | 0.028358 | 0.028236 | 0.027022 | 0.102547 | 0.977735 | 0.977752 | 0.977065 | 0.752471 | 0.755746 | 0.759637 |
4 | 0.11744 | 0.11785 | 0.114231 | 0.088442 | 0.087955 | 0.087543 | 0.044621 | 0.044595 | 0.042553 | 0.078942 | ... | 0.028358 | 0.028236 | 0.027022 | 0.102547 | 0.977735 | 0.977752 | 0.977065 | 0.752471 | 0.755746 | 0.759637 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
307506 | 0.20210 | 0.20400 | 0.100800 | 0.088700 | 0.088700 | 0.017200 | 0.020200 | 0.020300 | 0.017200 | 0.220000 | ... | 0.109500 | 0.111800 | 0.012500 | 0.289800 | 0.987600 | 0.987600 | 0.978200 | 0.830000 | 0.832300 | 0.712500 |
307507 | 0.02470 | 0.02500 | 0.025200 | 0.043500 | 0.043500 | 0.045100 | 0.002200 | 0.002200 | 0.002200 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.021400 | 0.972700 | 0.972700 | 0.972700 | 0.626000 | 0.631000 | 0.640600 |
307508 | 0.10310 | 0.10410 | 0.105000 | 0.086200 | 0.086200 | 0.089400 | 0.012300 | 0.012400 | 0.012400 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.797000 | 0.981600 | 0.981600 | 0.981600 | 0.748400 | 0.751800 | 0.758300 |
307509 | 0.01240 | 0.01250 | 0.012600 | 0.088442 | 0.087955 | 0.087543 | 0.044621 | 0.044595 | 0.042553 | 0.078942 | ... | 0.028358 | 0.028236 | 0.027022 | 0.008600 | 0.977100 | 0.977100 | 0.977200 | 0.752471 | 0.755746 | 0.759637 |
307510 | 0.07420 | 0.07490 | 0.075600 | 0.052600 | 0.052600 | 0.054600 | 0.017600 | 0.017700 | 0.017800 | 0.080000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.071800 | 0.988100 | 0.988100 | 0.988100 | 0.752471 | 0.755746 | 0.759637 |
307511 rows × 43 columns
X_train_categorical[living_condition_categoricals]
FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | |
---|---|---|---|---|
0 | 2 | 0 | 5 | 0 |
1 | 2 | 0 | 0 | 0 |
2 | 2 | 0 | 4 | 0 |
3 | 2 | 0 | 4 | 0 |
4 | 2 | 0 | 4 | 0 |
... | ... | ... | ... | ... |
307506 | 2 | 0 | 5 | 0 |
307507 | 2 | 0 | 5 | 0 |
307508 | 2 | 0 | 4 | 0 |
307509 | 2 | 0 | 5 | 0 |
307510 | 2 | 0 | 4 | 0 |
307511 rows × 4 columns
X_train_living_condition = pd.concat([X_train_numerical[living_condition_numericals], X_train_categorical[living_condition_categoricals]],axis=1)
X_train_living_condition
APARTMENTS_AVG | APARTMENTS_MEDI | APARTMENTS_MODE | BASEMENTAREA_AVG | BASEMENTAREA_MEDI | BASEMENTAREA_MODE | COMMONAREA_AVG | COMMONAREA_MEDI | COMMONAREA_MODE | ELEVATORS_AVG | ... | YEARS_BEGINEXPLUATATION_AVG | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_AVG | YEARS_BUILD_MEDI | YEARS_BUILD_MODE | FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.02470 | 0.02500 | 0.025200 | 0.036900 | 0.036900 | 0.038300 | 0.014300 | 0.014400 | 0.014400 | 0.000000 | ... | 0.972200 | 0.972200 | 0.972200 | 0.619200 | 0.624300 | 0.634100 | 2 | 0 | 5 | 0 |
1 | 0.09590 | 0.09680 | 0.092400 | 0.052900 | 0.052900 | 0.053800 | 0.060500 | 0.060800 | 0.049700 | 0.080000 | ... | 0.985100 | 0.985100 | 0.985100 | 0.796000 | 0.798700 | 0.804000 | 2 | 0 | 0 | 0 |
2 | 0.11744 | 0.11785 | 0.114231 | 0.088442 | 0.087955 | 0.087543 | 0.044621 | 0.044595 | 0.042553 | 0.078942 | ... | 0.977735 | 0.977752 | 0.977065 | 0.752471 | 0.755746 | 0.759637 | 2 | 0 | 4 | 0 |
3 | 0.11744 | 0.11785 | 0.114231 | 0.088442 | 0.087955 | 0.087543 | 0.044621 | 0.044595 | 0.042553 | 0.078942 | ... | 0.977735 | 0.977752 | 0.977065 | 0.752471 | 0.755746 | 0.759637 | 2 | 0 | 4 | 0 |
4 | 0.11744 | 0.11785 | 0.114231 | 0.088442 | 0.087955 | 0.087543 | 0.044621 | 0.044595 | 0.042553 | 0.078942 | ... | 0.977735 | 0.977752 | 0.977065 | 0.752471 | 0.755746 | 0.759637 | 2 | 0 | 4 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
307506 | 0.20210 | 0.20400 | 0.100800 | 0.088700 | 0.088700 | 0.017200 | 0.020200 | 0.020300 | 0.017200 | 0.220000 | ... | 0.987600 | 0.987600 | 0.978200 | 0.830000 | 0.832300 | 0.712500 | 2 | 0 | 5 | 0 |
307507 | 0.02470 | 0.02500 | 0.025200 | 0.043500 | 0.043500 | 0.045100 | 0.002200 | 0.002200 | 0.002200 | 0.000000 | ... | 0.972700 | 0.972700 | 0.972700 | 0.626000 | 0.631000 | 0.640600 | 2 | 0 | 5 | 0 |
307508 | 0.10310 | 0.10410 | 0.105000 | 0.086200 | 0.086200 | 0.089400 | 0.012300 | 0.012400 | 0.012400 | 0.000000 | ... | 0.981600 | 0.981600 | 0.981600 | 0.748400 | 0.751800 | 0.758300 | 2 | 0 | 4 | 0 |
307509 | 0.01240 | 0.01250 | 0.012600 | 0.088442 | 0.087955 | 0.087543 | 0.044621 | 0.044595 | 0.042553 | 0.078942 | ... | 0.977100 | 0.977100 | 0.977200 | 0.752471 | 0.755746 | 0.759637 | 2 | 0 | 5 | 0 |
307510 | 0.07420 | 0.07490 | 0.075600 | 0.052600 | 0.052600 | 0.054600 | 0.017600 | 0.017700 | 0.017800 | 0.080000 | ... | 0.988100 | 0.988100 | 0.988100 | 0.752471 | 0.755746 | 0.759637 | 2 | 0 | 4 | 0 |
307511 rows × 47 columns
Since the dataset is quite large, I use only the first 10,000 observations. We then map X_train_living_condition into a higher-dimensional space with the RBF (radial basis function) kernel, where the points can be separated better, and run K-means on the kernel PCA projection to find clusters. The elbow plot suggests that 3 clusters is optimal, which roughly means the living conditions of the clients can be grouped into three clusters. We then create a new feature from these cluster labels.
X_train_living_condition = X_train_living_condition[:10000]
X_train_living_condition
APARTMENTS_AVG | APARTMENTS_MEDI | APARTMENTS_MODE | BASEMENTAREA_AVG | BASEMENTAREA_MEDI | BASEMENTAREA_MODE | COMMONAREA_AVG | COMMONAREA_MEDI | COMMONAREA_MODE | ELEVATORS_AVG | ... | YEARS_BEGINEXPLUATATION_AVG | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_AVG | YEARS_BUILD_MEDI | YEARS_BUILD_MODE | FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.02470 | 0.02500 | 0.025200 | 0.036900 | 0.036900 | 0.038300 | 0.014300 | 0.014400 | 0.014400 | 0.000000 | ... | 0.972200 | 0.972200 | 0.972200 | 0.619200 | 0.624300 | 0.634100 | 2 | 0 | 5 | 0 |
1 | 0.09590 | 0.09680 | 0.092400 | 0.052900 | 0.052900 | 0.053800 | 0.060500 | 0.060800 | 0.049700 | 0.080000 | ... | 0.985100 | 0.985100 | 0.985100 | 0.796000 | 0.798700 | 0.804000 | 2 | 0 | 0 | 0 |
2 | 0.11744 | 0.11785 | 0.114231 | 0.088442 | 0.087955 | 0.087543 | 0.044621 | 0.044595 | 0.042553 | 0.078942 | ... | 0.977735 | 0.977752 | 0.977065 | 0.752471 | 0.755746 | 0.759637 | 2 | 0 | 4 | 0 |
3 | 0.11744 | 0.11785 | 0.114231 | 0.088442 | 0.087955 | 0.087543 | 0.044621 | 0.044595 | 0.042553 | 0.078942 | ... | 0.977735 | 0.977752 | 0.977065 | 0.752471 | 0.755746 | 0.759637 | 2 | 0 | 4 | 0 |
4 | 0.11744 | 0.11785 | 0.114231 | 0.088442 | 0.087955 | 0.087543 | 0.044621 | 0.044595 | 0.042553 | 0.078942 | ... | 0.977735 | 0.977752 | 0.977065 | 0.752471 | 0.755746 | 0.759637 | 2 | 0 | 4 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9995 | 0.01630 | 0.01670 | 0.016800 | 0.000000 | 0.000000 | 0.000000 | 0.044621 | 0.044595 | 0.042553 | 0.080000 | ... | 0.980600 | 0.980600 | 0.980600 | 0.632800 | 0.637700 | 0.647200 | 2 | 0 | 5 | 0 |
9996 | 0.11744 | 0.11785 | 0.114231 | 0.088442 | 0.087955 | 0.087543 | 0.044621 | 0.044595 | 0.042553 | 0.078942 | ... | 0.977735 | 0.977752 | 0.977065 | 0.752471 | 0.755746 | 0.759637 | 2 | 0 | 4 | 0 |
9997 | 0.11744 | 0.11785 | 0.114231 | 0.088442 | 0.087955 | 0.087543 | 0.044621 | 0.044595 | 0.042553 | 0.078942 | ... | 0.977735 | 0.977752 | 0.977065 | 0.752471 | 0.755746 | 0.759637 | 2 | 0 | 4 | 0 |
9998 | 0.11744 | 0.11785 | 0.114231 | 0.088442 | 0.087955 | 0.087543 | 0.044621 | 0.044595 | 0.042553 | 0.078942 | ... | 0.977735 | 0.977752 | 0.977065 | 0.752471 | 0.755746 | 0.759637 | 2 | 0 | 4 | 0 |
9999 | 0.11744 | 0.11785 | 0.114231 | 0.088442 | 0.087955 | 0.087543 | 0.044621 | 0.044595 | 0.042553 | 0.078942 | ... | 0.977735 | 0.977752 | 0.977065 | 0.752471 | 0.755746 | 0.759637 | 2 | 0 | 4 | 0 |
10000 rows × 47 columns
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.cluster import KMeans
from sklearn import metrics
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt
# Use the living-condition subset as the input matrix
X = X_train_living_condition.values
# Kernel PCA transformation using RBF
kpca = KernelPCA(kernel="rbf")
X_kpca = kpca.fit_transform(X)
# finding the optimal number of clusters for KMeans after transformation
distortions = []
K = range(1, 10)
for k in K:
    kmeanModel = KMeans(n_clusters=k).fit(X_kpca)  # fit once per candidate k
    distortions.append(sum(np.min(cdist(X_kpca, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X_kpca.shape[0])
# Plot the elbow
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
# Apply KMeans with the optimal number of clusters (you can find it from the above plot)
kmeans = KMeans(n_clusters=3) # here 3 is used as an example, replace it with your optimal number
kmeans.fit(X_kpca)
# getting the cluster labels for each sample
labels = kmeans.labels_
X_train_living_condition['living_condition_cluster_label'] = labels
the_rest = []
for e in X_train.columns:
    if e not in living_condition.values:  # check against the Series values, not its index
        the_rest.append(e)
the_rest
X_train_the_rest = X_train[the_rest][:10000]
X_train_new = pd.concat([X_train_the_rest, X_train_living_condition], axis=1)
Now we can create more features based on this clustering of living conditions. For example, we can take the mean, min, max, and sum of income within each living-condition cluster. The clusters differ in total income, while their mean incomes are fairly similar.
living_condition_cluster_label | count | income mean | income min | income max | income sum |
---|---|---|---|---|---|
0 | 7300 | 164439 | 25650 | 1935000 | 1000000000 |
1 | 454 | 188565 | 36000 | 810000 | 80000000 |
2 | 2246 | 172962 | 33300 | 810000 | 300000000 |
We can see that cluster 1, the smallest group (454 of the 10,000 observations), has the highest mean income (about 189,000), even though its summed income is the smallest (about 80,000,000). Cluster 0, the biggest group (7,300 observations), has the lowest mean income (about 164,000), even though its summed income is the highest. The biggest group also has the widest income range, from about 25,650 to 1,935,000. This is internal structure of the dataset that kernel PCA and K-means uncover. We can then merge this living-situation feature back into the original dataset.
np.bincount(labels)
X_train_living_situation = X_train_new.groupby('living_condition_cluster_label').agg({
    'AMT_INCOME_TOTAL': ['mean', 'min', 'max', 'sum'],
    # Add more columns as needed
}).reset_index()
X_train_living_situation
X_train_living_situation.columns = ['living_condition_cluster_label', 'AMT_INCOME_TOTAL_mean','AMT_INCOME_TOTAL_min','AMT_INCOME_TOTAL_max','AMT_INCOME_TOTAL_sum' ]
X_train_new = X_train_new.merge(X_train_living_situation, on='living_condition_cluster_label', how='left')
X_train_new
SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | AMT_CREDIT_min | AMT_CREDIT_sum | AMT_ANNUITY_mean | AMT_ANNUITY_max | AMT_ANNUITY_min | DAYS_DECISION_mean | DAYS_DECISION_max | DAYS_DECISION_min | CNT_PAYMENT_mean | CNT_PAYMENT_sum | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100002 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | ... | 179055.0 | 179055.0 | 9251.775000 | 9251.775 | 9251.775 | -606.000000 | -606.0 | -606.0 | 24.000000 | 24.0 |
1 | 100003 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | ... | 68053.5 | 1452573.0 | 56553.990000 | 98356.995 | 6737.310 | -1305.000000 | -746.0 | -2341.0 | 10.000000 | 30.0 |
2 | 100004 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | ... | 20106.0 | 20106.0 | 5357.250000 | 5357.250 | 5357.250 | -815.000000 | -815.0 | -815.0 | 4.000000 | 4.0 |
3 | 100006 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | ... | 0.0 | 2625259.5 | 23651.175000 | 39954.510 | 2482.920 | -272.444444 | -181.0 | -617.0 | 23.000000 | 138.0 |
4 | 100007 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | ... | 14616.0 | 999832.5 | 12278.805000 | 22678.785 | 1834.290 | -1222.833333 | -374.0 | -2357.0 | 20.666667 | 124.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
307506 | 456251 | Cash loans | M | N | N | 0 | 157500.0 | 254700.0 | 27558.0 | 225000.0 | ... | 40455.0 | 40455.0 | 6605.910000 | 6605.910 | 6605.910 | -273.000000 | -273.0 | -273.0 | 8.000000 | 8.0 |
307507 | 456252 | Cash loans | F | N | Y | 0 | 72000.0 | 269550.0 | 12001.5 | 225000.0 | ... | 56821.5 | 56821.5 | 10074.465000 | 10074.465 | 10074.465 | -2497.000000 | -2497.0 | -2497.0 | 6.000000 | 6.0 |
307508 | 456253 | Cash loans | F | N | Y | 0 | 153000.0 | 677664.0 | 29979.0 | 585000.0 | ... | 13945.5 | 41251.5 | 4770.405000 | 5567.715 | 3973.095 | -2380.000000 | -1909.0 | -2851.0 | 5.000000 | 10.0 |
307509 | 456254 | Cash loans | F | N | Y | 0 | 171000.0 | 370107.0 | 20205.0 | 319500.0 | ... | 21456.0 | 268879.5 | 10681.132500 | 19065.825 | 2296.440 | -299.500000 | -277.0 | -322.0 | 15.000000 | 30.0 |
307510 | 456255 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | 675000.0 | ... | 45000.0 | 3395448.0 | 20775.391875 | 54022.140 | 2250.000 | -587.625000 | -171.0 | -991.0 | 21.750000 | 174.0 |
307511 rows × 133 columns
Adding financial_stress feature
There are multiple factors behind financial stress, but some main ones are: the credit-to-income ratio, current income, current credit line, number of children, number of family members, the spouse's income, the spouse's credit line, and bills. We can also weight these factors differently, since they affect the client differently.
original = pd.concat([X_train[['AMT_INCOME_TOTAL','AMT_CREDIT','CNT_FAM_MEMBERS','CNT_CHILDREN']],X_train_categorical['NAME_FAMILY_STATUS']],axis=1)
original
AMT_INCOME_TOTAL | AMT_CREDIT | CNT_FAM_MEMBERS | CNT_CHILDREN | NAME_FAMILY_STATUS | |
---|---|---|---|---|---|
0 | 202500.0 | 406597.5 | 1.0 | 0 | 3 |
1 | 270000.0 | 1293502.5 | 2.0 | 0 | 1 |
2 | 67500.0 | 135000.0 | 1.0 | 0 | 3 |
3 | 135000.0 | 312682.5 | 2.0 | 0 | 0 |
4 | 121500.0 | 513000.0 | 1.0 | 0 | 3 |
... | ... | ... | ... | ... | ... |
307506 | 157500.0 | 254700.0 | 1.0 | 0 | 2 |
307507 | 72000.0 | 269550.0 | 1.0 | 0 | 5 |
307508 | 153000.0 | 677664.0 | 1.0 | 0 | 2 |
307509 | 171000.0 | 370107.0 | 2.0 | 0 | 1 |
307510 | 157500.0 | 675000.0 | 2.0 | 0 | 1 |
307511 rows × 5 columns
def compute_financial_stress(row):
    w_credit_income_ratio = 2
    w_family_size = 1
    w_family_status = 1
    w_children = 1.5
    stress_score = (
        (row['AMT_CREDIT'] / row['AMT_INCOME_TOTAL']) * w_credit_income_ratio
        + row['CNT_FAM_MEMBERS'] * w_family_size
        + row['NAME_FAMILY_STATUS'] * w_family_status
        + row['CNT_CHILDREN'] * w_children
    )
    return stress_score
original['financial_stress'] = original.apply(compute_financial_stress, axis=1)
original
AMT_INCOME_TOTAL | AMT_CREDIT | CNT_FAM_MEMBERS | CNT_CHILDREN | NAME_FAMILY_STATUS | financial_stress | |
---|---|---|---|---|---|---|
0 | 202500.0 | 406597.5 | 1.0 | 0 | 3 | 8.015778 |
1 | 270000.0 | 1293502.5 | 2.0 | 0 | 1 | 12.581500 |
2 | 67500.0 | 135000.0 | 1.0 | 0 | 3 | 8.000000 |
3 | 135000.0 | 312682.5 | 2.0 | 0 | 0 | 6.632333 |
4 | 121500.0 | 513000.0 | 1.0 | 0 | 3 | 12.444444 |
... | ... | ... | ... | ... | ... | ... |
307506 | 157500.0 | 254700.0 | 1.0 | 0 | 2 | 6.234286 |
307507 | 72000.0 | 269550.0 | 1.0 | 0 | 5 | 13.487500 |
307508 | 153000.0 | 677664.0 | 1.0 | 0 | 2 | 11.858353 |
307509 | 171000.0 | 370107.0 | 2.0 | 0 | 1 | 7.328737 |
307510 | 157500.0 | 675000.0 | 2.0 | 0 | 1 | 11.571429 |
307511 rows × 6 columns
Add previous application situation
The previous application situation is an aggregation of previous loans by the same person. It can roughly tell us the credibility of the person, plus other important information.
previous_application = pd.read_csv('home-credit-default-risk/previous_application.csv')
previous_application.head()
SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | ... | Connectivity | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 |
1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | ... | XNA | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 |
2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | ... | XNA | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 |
3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | ... | XNA | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 |
4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | ... | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 37 columns
# Aggregate based on the SK_ID_CURR
prev_app_agg = previous_application.groupby('SK_ID_CURR').agg({
    'AMT_CREDIT': ['mean', 'max', 'min', 'sum'],
    'AMT_ANNUITY': ['mean', 'max', 'min'],
    'DAYS_DECISION': ['mean', 'max', 'min'],
    'CNT_PAYMENT': ['mean', 'sum'],
})
# Handle multi-level column names
prev_app_agg.columns = ['_'.join(col).strip() for col in prev_app_agg.columns.values]
# Reset the index
prev_app_agg.reset_index(inplace=True)
prev_app_agg
SK_ID_CURR | AMT_CREDIT_mean | AMT_CREDIT_max | AMT_CREDIT_min | AMT_CREDIT_sum | AMT_ANNUITY_mean | AMT_ANNUITY_max | AMT_ANNUITY_min | DAYS_DECISION_mean | DAYS_DECISION_max | DAYS_DECISION_min | CNT_PAYMENT_mean | CNT_PAYMENT_sum | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100001 | 23787.00 | 23787.0 | 23787.0 | 23787.0 | 3951.000000 | 3951.000 | 3951.000 | -1740.000 | -1740 | -1740 | 8.00 | 8.0 |
1 | 100002 | 179055.00 | 179055.0 | 179055.0 | 179055.0 | 9251.775000 | 9251.775 | 9251.775 | -606.000 | -606 | -606 | 24.00 | 24.0 |
2 | 100003 | 484191.00 | 1035882.0 | 68053.5 | 1452573.0 | 56553.990000 | 98356.995 | 6737.310 | -1305.000 | -746 | -2341 | 10.00 | 30.0 |
3 | 100004 | 20106.00 | 20106.0 | 20106.0 | 20106.0 | 5357.250000 | 5357.250 | 5357.250 | -815.000 | -815 | -815 | 4.00 | 4.0 |
4 | 100005 | 20076.75 | 40153.5 | 0.0 | 40153.5 | 4813.200000 | 4813.200 | 4813.200 | -536.000 | -315 | -757 | 12.00 | 12.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
338852 | 456251 | 40455.00 | 40455.0 | 40455.0 | 40455.0 | 6605.910000 | 6605.910 | 6605.910 | -273.000 | -273 | -273 | 8.00 | 8.0 |
338853 | 456252 | 56821.50 | 56821.5 | 56821.5 | 56821.5 | 10074.465000 | 10074.465 | 10074.465 | -2497.000 | -2497 | -2497 | 6.00 | 6.0 |
338854 | 456253 | 20625.75 | 27306.0 | 13945.5 | 41251.5 | 4770.405000 | 5567.715 | 3973.095 | -2380.000 | -1909 | -2851 | 5.00 | 10.0 |
338855 | 456254 | 134439.75 | 247423.5 | 21456.0 | 268879.5 | 10681.132500 | 19065.825 | 2296.440 | -299.500 | -277 | -322 | 15.00 | 30.0 |
338856 | 456255 | 424431.00 | 1271929.5 | 45000.0 | 3395448.0 | 20775.391875 | 54022.140 | 2250.000 | -587.625 | -171 | -991 | 21.75 | 174.0 |
338857 rows × 13 columns
X_train_new = X_train.merge(prev_app_agg, on='SK_ID_CURR', how='left')
X_train_new
SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | AMT_CREDIT_min | AMT_CREDIT_sum | AMT_ANNUITY_mean | AMT_ANNUITY_max | AMT_ANNUITY_min | DAYS_DECISION_mean | DAYS_DECISION_max | DAYS_DECISION_min | CNT_PAYMENT_mean | CNT_PAYMENT_sum | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100002 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | ... | 179055.0 | 179055.0 | 9251.775000 | 9251.775 | 9251.775 | -606.000000 | -606.0 | -606.0 | 24.000000 | 24.0 |
1 | 100003 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | ... | 68053.5 | 1452573.0 | 56553.990000 | 98356.995 | 6737.310 | -1305.000000 | -746.0 | -2341.0 | 10.000000 | 30.0 |
2 | 100004 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | ... | 20106.0 | 20106.0 | 5357.250000 | 5357.250 | 5357.250 | -815.000000 | -815.0 | -815.0 | 4.000000 | 4.0 |
3 | 100006 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | ... | 0.0 | 2625259.5 | 23651.175000 | 39954.510 | 2482.920 | -272.444444 | -181.0 | -617.0 | 23.000000 | 138.0 |
4 | 100007 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | ... | 14616.0 | 999832.5 | 12278.805000 | 22678.785 | 1834.290 | -1222.833333 | -374.0 | -2357.0 | 20.666667 | 124.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
307506 | 456251 | Cash loans | M | N | N | 0 | 157500.0 | 254700.0 | 27558.0 | 225000.0 | ... | 40455.0 | 40455.0 | 6605.910000 | 6605.910 | 6605.910 | -273.000000 | -273.0 | -273.0 | 8.000000 | 8.0 |
307507 | 456252 | Cash loans | F | N | Y | 0 | 72000.0 | 269550.0 | 12001.5 | 225000.0 | ... | 56821.5 | 56821.5 | 10074.465000 | 10074.465 | 10074.465 | -2497.000000 | -2497.0 | -2497.0 | 6.000000 | 6.0 |
307508 | 456253 | Cash loans | F | N | Y | 0 | 153000.0 | 677664.0 | 29979.0 | 585000.0 | ... | 13945.5 | 41251.5 | 4770.405000 | 5567.715 | 3973.095 | -2380.000000 | -1909.0 | -2851.0 | 5.000000 | 10.0 |
307509 | 456254 | Cash loans | F | N | Y | 0 | 171000.0 | 370107.0 | 20205.0 | 319500.0 | ... | 21456.0 | 268879.5 | 10681.132500 | 19065.825 | 2296.440 | -299.500000 | -277.0 | -322.0 | 15.000000 | 30.0 |
307510 | 456255 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | 675000.0 | ... | 45000.0 | 3395448.0 | 20775.391875 | 54022.140 | 2250.000 | -587.625000 | -171.0 | -991.0 | 21.750000 | 174.0 |
307511 rows × 133 columns
Misc
A red_flag feature can be added based on credit evaluations from external sources; as before, the sources can be weighted differently according to their credibility. Some polynomial features can also be added to capture nonlinear relationships. Here we only demonstrate how to apply a polynomial transformation to one feature: the price of the goods the client applies for. Since it is not clear that the price of the goods behaves linearly with the default risk, we apply a quadratic transformation.
original = X_train_numerical[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].copy()  # copy to avoid SettingWithCopyWarning when adding red_flag
original
EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | |
---|---|---|---|
0 | 0.083037 | 0.262949 | 0.139376 |
1 | 0.311267 | 0.622246 | 0.510853 |
2 | 0.502130 | 0.555912 | 0.729567 |
3 | 0.502130 | 0.650442 | 0.510853 |
4 | 0.502130 | 0.322738 | 0.510853 |
... | ... | ... | ... |
307506 | 0.145570 | 0.681632 | 0.510853 |
307507 | 0.502130 | 0.115992 | 0.510853 |
307508 | 0.744026 | 0.535722 | 0.218859 |
307509 | 0.502130 | 0.514163 | 0.661024 |
307510 | 0.734460 | 0.708569 | 0.113922 |
307511 rows × 3 columns
def compute_red_flag(row):
    w_source_1 = 2
    w_source_2 = 1
    w_source_3 = 1
    red_flag = row['EXT_SOURCE_1'] * w_source_1 + row['EXT_SOURCE_2'] * w_source_2 + row['EXT_SOURCE_3'] * w_source_3
    return red_flag
original['red_flag'] = original.apply(compute_red_flag, axis=1)
original
EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | red_flag | |
---|---|---|---|---|
0 | 0.083037 | 0.262949 | 0.139376 | 0.568398 |
1 | 0.311267 | 0.622246 | 0.510853 | 1.755633 |
2 | 0.502130 | 0.555912 | 0.729567 | 2.289738 |
3 | 0.502130 | 0.650442 | 0.510853 | 2.165554 |
4 | 0.502130 | 0.322738 | 0.510853 | 1.837851 |
... | ... | ... | ... | ... |
307506 | 0.145570 | 0.681632 | 0.510853 | 1.483626 |
307507 | 0.502130 | 0.115992 | 0.510853 | 1.631105 |
307508 | 0.744026 | 0.535722 | 0.218859 | 2.242634 |
307509 | 0.502130 | 0.514163 | 0.661024 | 2.179446 |
307510 | 0.734460 | 0.708569 | 0.113922 | 2.291411 |
307511 rows × 4 columns
# Quadratic transformation
original = X_train_numerical['AMT_GOODS_PRICE'][:10000]
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)  # interaction_only=True would drop the squared term for a single feature
goods_price_poly = poly.fit_transform(original.values.reshape(-1, 1))  # reshape the Series into an (n_samples, 1) column
original
0 351000.0
1 1129500.0
2 135000.0
3 297000.0
4 513000.0
...
9995 270000.0
9996 900000.0
9997 450000.0
9998 315000.0
9999 270000.0
Name: AMT_GOODS_PRICE, Length: 10000, dtype: float64
Conclusion
So far we have studied some forms of feature engineering in the wild, and the examples use non-trivial methods to engineer the features. They reflect the domain knowledge a financial institution needs for due diligence when extending credit lines or cash to consumers, as well as the knowledge a data scientist needs to process a large dataset. To recap: there are many input features describing the building the client lives in, which indirectly reflect their living condition, so we created a new variable from those features. Then we saw that a financial stress indicator can be computed from the client's situation; it is helpful because we are trying to judge whether the client can pay the loan back. For example, the credit-to-income ratio tells us how much the credit line would weigh on the person's income, especially if they have children. The dataset lacks information on the spouse's situation (their income and credit), so one logical step would be to go back and collect that information to make the prediction more sensible. The third feature we engineered summarizes the client's previous loan applications: the past can say a lot about the present and future of the same client, so we aggregate those statistics into the current model's inputs.
Together, these examples provide an overview of how to do feature engineering for a dataset, a crucial process that directly affects the model's performance. Translated into real-world business terms, it can help the institution make better decisions in aiding people in need.