
Dealing with missing values

What to do if some values are missing?

There are three options:

  1. Remove the rows with missing values

  2. Remove the columns with missing values

  3. Replace the missing values with the median of the corresponding column
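
As a rough sketch, the three options could look like this in pandas. The dataframe df and its column names are made up for illustration; they are not from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration
df = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "area": [70.0, 85.0, 120.0, 95.0],
    "age": [10.0, 20.0, np.nan, 40.0],
})

# Option 1: remove every row that contains a missing value
df_rows_dropped = df.dropna(axis=0)

# Option 2: remove every column that contains a missing value
df_cols_dropped = df.dropna(axis=1)

# Option 3: replace each missing value with the median of its column
df_filled = df.fillna(df.median())
```

Options 1 and 2 throw data away, so option 3 is usually the one to reach for when only a few values are missing.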



Replacing the missing values using the SimpleImputer class

You can also do the replacement with scikit-learn's SimpleImputer class.

  • Drop the non-numerical attributes from the dataframe, since the median can only be computed on numerical columns. Call the result df_num
  • Import SimpleImputer class: from sklearn.impute import SimpleImputer
  • Create an instance of the SimpleImputer class, specifying the strategy parameter (here, strategy="median")
  • Fit the SimpleImputer object by passing df_num to its fit() method. This computes the statistic named by the strategy (the median of each column) and stores the results in the statistics_ attribute
  • Call the transform() method to replace the missing values with the median of the corresponding column. This returns a numpy array. Assign it to the variable X
  • Write the numpy array back into a dataframe, and call it df_num_tr. Restore the column names by passing df_num.columns as the columns parameter:
    df_num_tr = pd.DataFrame(X, columns=df_num.columns)
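
Put together, the steps above could look like the following sketch. The dataframe df and its columns are made up for illustration. Using select_dtypes to drop the non-numerical column and passing index=df_num.index are my own choices here, not something the steps require.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one non-numerical column ("city")
df = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "age": [10.0, 20.0, np.nan, 40.0],
    "city": ["Oslo", "Bergen", "Oslo", "Tromso"],
})

# Keep only the numerical attributes (one possible way to do it)
df_num = df.select_dtypes(include="number")

# Create the imputer and choose the median strategy
imputer = SimpleImputer(strategy="median")

# fit() computes the median of each column and stores it in statistics_
imputer.fit(df_num)

# transform() fills the missing values and returns a numpy array
X = imputer.transform(df_num)

# Write the array back into a dataframe; index= keeps the original row labels
df_num_tr = pd.DataFrame(X, columns=df_num.columns, index=df_num.index)
```

After fitting, imputer.statistics_ holds one median per column of df_num, so you can check what will be filled in before calling transform().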