Dealing with missing values
What to do if some values are missing?
There are three options:
- Remove the rows with missing values
- Remove the columns with missing values
- Replace the missing values with the median of the values in the corresponding column
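The three options can be sketched with pandas as follows. This is a minimal illustration, not the only way to do it; the dataframe df and its column names (rooms, area, price) are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
df = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "area": [70.0, 85.0, np.nan, 90.0],
    "price": [200.0, 250.0, 310.0, 280.0],
})

# Option 1: drop every row that contains at least one missing value.
rows_dropped = df.dropna(axis=0)

# Option 2: drop every column that contains at least one missing value.
cols_dropped = df.dropna(axis=1)

# Option 3: fill each column's missing values with that column's median.
medians = df.median(numeric_only=True)
median_filled = df.fillna(medians)

print(rows_dropped.shape)          # rows 1 and 2 are gone
print(list(cols_dropped.columns))  # only the complete column remains
print(median_filled)
```

Options 1 and 2 discard data, so they are usually reasonable only when few rows or columns are affected. Option 3 keeps the shape of the dataframe but replaces unknown values with an estimate.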
Replacing the missing values using the SimpleImputer class
You can also do the replacement with the SimpleImputer class:
- Remove the non-numerical attributes from the dataframe. Call the result df_num.
- Import the SimpleImputer class: from sklearn.impute import SimpleImputer
- Create an instance of the SimpleImputer class, specifying the strategy parameter (for example, strategy="median").
- Fit the SimpleImputer object by passing df_num as a parameter. This computes the statistic named by strategy (here, the median) for each column.
- Call the transform() method to replace the missing values with the medians of the corresponding columns. This returns a NumPy array. Assign it to the variable X.
- Write the NumPy array back into a dataframe and call it df_num_tr. Pass df_num.columns to the columns parameter to keep the column names:
df_num_tr = pd.DataFrame(X, columns=df_num.columns)
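Put together, the steps might look like the sketch below. The dataframe df and its columns (rooms, area, city) are hypothetical, select_dtypes is just one way to drop the non-numerical attributes, and passing index=df_num.index is an optional addition that keeps the original row labels.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one non-numerical column ("city").
df = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "area": [70.0, 85.0, np.nan, 90.0],
    "city": ["A", "B", "A", "C"],
})

# Keep only the numerical attributes.
df_num = df.select_dtypes(include=[np.number])

# Create the imputer and choose the median as the replacement statistic.
imputer = SimpleImputer(strategy="median")

# fit() computes the median of each column and stores it in statistics_.
imputer.fit(df_num)

# transform() fills the missing values and returns a NumPy array.
X = imputer.transform(df_num)

# Write the array back into a dataframe with the original column names.
# index=df_num.index is an optional extra that preserves the row labels.
df_num_tr = pd.DataFrame(X, columns=df_num.columns, index=df_num.index)

print(imputer.statistics_)
print(df_num_tr)
```

Because the fitted imputer keeps the medians it learned, the same object can later be applied to new data with transform(), so the new data is filled with the medians from the original fit.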
