Dealing with missing values
What should you do if some values are missing?
There are three options:
- Remove the rows with missing values
- Remove the columns with missing values
- Replace the missing values with the median of the values in the corresponding column
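The three options can be sketched with pandas as follows. This is a minimal illustration, not a prescribed method: the dataframe df and its column names (rooms, area, price) are hypothetical toy data made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
df = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "area": [70.0, 85.0, np.nan, 90.0],
    "price": [200.0, 250.0, 300.0, 280.0],
})

# Option 1: remove the rows that contain missing values.
df_rows_dropped = df.dropna(axis=0)

# Option 2: remove the columns that contain missing values.
df_cols_dropped = df.dropna(axis=1)

# Option 3: replace each missing value with the median of its column.
df_filled = df.fillna(df.median(numeric_only=True))
```

Options 1 and 2 discard data, so they are usually reasonable only when few rows or columns are affected. Option 3 keeps the shape of the dataframe unchanged.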
Replacing the missing values using the SimpleImputer class
You can also do the replacements by using the SimpleImputer class.
- Remove the non-numerical attributes from the dataframe, since the median can only be computed for numerical columns. Call the result df_num
- Import the SimpleImputer class: from sklearn.impute import SimpleImputer
- Create an instance of the SimpleImputer class, specifying the strategy parameter (here, strategy="median")
- Fit the SimpleImputer object by passing df_num to fit(). This calculates the statistic named by strategy (here, the median) for each column
- Call the transform() method to replace missing values with the median values of the corresponding columns. This returns a numpy array. Assign it to the variable X
- Write the numpy array back into a dataframe and call it df_num_tr. Set the column names by passing df_num.columns to the columns parameter:
df_num_tr = pd.DataFrame(X, columns=df_num.columns)
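Put together, the steps above might look like the sketch below. The dataframe df and its columns (rooms, area, city) are hypothetical toy data, and select_dtypes is just one possible way to drop the non-numerical attributes. The names df_num, X and df_num_tr follow the steps above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; "city" stands in for a non-numerical attribute.
df = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "area": [70.0, 85.0, np.nan, 90.0],
    "city": ["A", "B", "A", "C"],
})

# Keep only the numerical attributes (one possible way to do it).
df_num = df.select_dtypes(include=[np.number])

# Create the imputer and choose the median as the replacement strategy.
imputer = SimpleImputer(strategy="median")

# fit() computes the median of each column and stores it in statistics_.
imputer.fit(df_num)

# transform() fills the missing values and returns a numpy array.
X = imputer.transform(df_num)

# Write the array back into a dataframe with the original column names.
# Passing index=df_num.index is an addition here, to keep the row labels.
df_num_tr = pd.DataFrame(X, columns=df_num.columns, index=df_num.index)
```

After fitting, imputer.statistics_ holds the median of each column, which is a quick way to check what values will be filled in. The fit() and transform() calls can also be combined into a single imputer.fit_transform(df_num) call.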