Random pandas notes
A dump of pandas functions which I have found useful. Use with discretion.
- Plotting code example
import matplotlib.pyplot as plt
import pandas as pd
fig = plt.figure(figsize=(6,6)) # Create a Figure object
ax = fig.subplots() # Create an AxesSubplot object
#Both Figure and AxesSubplot are classes defined in matplotlib
#Plot CO2 on y-axis and index along x-axis.
df['CO2'].plot(kind='line', color='b', ax=ax)
#If two columns are being plotted, then we should specify both of them
df.plot(x='x-column', y='y-column', kind='line', color='b', ax=ax)
ax.set(title='Title', xlabel='XLabel', ylabel='Ylabel')
ax.legend().set_visible(False) #Remove legend
ax.legend(loc="upper left") #Or keep the legend and place it in the upper left
#Rotate x-tick labels. For datetime objects see below
for tick in ax.get_xticklabels():
    tick.set_rotation(45)
#Formatting datetime tick labels
fig.autofmt_xdate()
fig.savefig('file.png', transparent=False, dpi=300, bbox_inches='tight')
plt.show()
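The snippet above assumes that a dataframe df already exists. A minimal self-contained sketch of the same steps follows; the CO2 values, the dates, and the choice to save into an in-memory buffer are assumptions made only so that it runs without a data file or a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: 24 monthly CO2 readings (values are made up)
dates = pd.date_range("2020-01-01", periods=24, freq="MS")
df = pd.DataFrame({"CO2": np.linspace(410.0, 416.0, 24)}, index=dates)

fig = plt.figure(figsize=(6, 6))
ax = fig.subplots()
df["CO2"].plot(kind="line", color="b", ax=ax)
ax.set(title="CO2 over time", xlabel="Date", ylabel="CO2 (ppm)")
fig.autofmt_xdate()  # rotate and align the datetime tick labels

# Save to an in-memory buffer instead of a file on disk
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=100, bbox_inches="tight")
plt.close(fig)
```

Replace the buffer with a file name such as 'file.png' to write the figure to disk as in the notes above.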
- Change datatype of selected columns while creating a dataframe
Use the converters parameter:
df = pd.read_excel("datafile.xlsx", converters={'Year':np.int32, 'Month':np.int32})
In the above example, each value in the columns ‘Year’ and ‘Month’ is converted with numpy.int32 while the file is read. This needs import numpy as np.
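A small sketch of the same idea, using read_csv on an in-memory string so that it runs without an Excel file. The CSV content is made up, and the dtype parameter is shown next to converters as an alternative that was not part of the original note.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content standing in for the data file
csv_text = "Year,Month,Value\n2020,1,3.5\n2021,2,4.0\n"

# converters: each function is called on every raw cell of its column
df_conv = pd.read_csv(
    io.StringIO(csv_text),
    converters={"Year": np.int32, "Month": np.int32},
)

# dtype: sets the column dtype directly, without a per-cell function
df_dtype = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Year": np.int32, "Month": np.int32},
)

print(df_conv["Year"].tolist())
print(df_dtype.dtypes)
```

converters is the more flexible option because any function can be applied to a cell, while dtype is usually enough when only the storage type has to change.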
- Changing indices in a dataframe
df.index = df['Year'] # Indices changed to the values of the Year column
df = df.reset_index(drop=True) # Indices changed back to default values
df = df.set_index('Year') # Year column becomes the index
In the above example, the indices are first changed to ‘Year’ values. reset_index with the parameter drop=True switches the indices back to default values and discards the old index. Both reset_index and set_index return a new dataframe, so the result is assigned back to df.
- Print entire dataframe/series/column
print(df.to_string()) # Print entire dataframe
print(df['Year'].to_string()) # Print all values of the column Year
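To see why to_string helps, the sketch below compares it with the default display on a made-up dataframe of 100 rows. It assumes the default pandas display options, under which long dataframes are shortened.

```python
import pandas as pd

# Hypothetical dataframe with 100 rows, more than the default display limit
df = pd.DataFrame({"Year": range(1900, 2000)})

short_text = repr(df)        # default view: middle rows are elided
full_text = df.to_string()   # every row is rendered

print(len(short_text.splitlines()), len(full_text.splitlines()))
```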
- Sorting a data frame by column/s
df = df.sort_values(['Year', 'Month'])
df.sort_values(['Year', 'Month'], inplace=True) # Same as above
Year values are sorted first, and then Month values are sorted without affecting the order of Year values.
Suppose there are 5 year values which are not sorted. Each year has 12 month values which again are not sorted. The above command first sorts the rows by year, and then sorts the month values within each year.
sort_values has the following parameters. (Refer to the documentation for a complete list.)
  - ascending: boolean. Default is True.
  - inplace: boolean. Default is False. Returns a sorted object if False, and returns nothing if True.
  - axis: 0 or index to sort rows. 1 or columns to sort columns.
  - na_position: first or last. Sets the position of NaN values. Default: last.
  - kind: Algorithm used for sorting. Accepted values: quicksort, mergesort, heapsort. Default: quicksort.
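A short sketch of how the column order and the parameters interact, on a made-up dataframe with four rows. Passing a list to ascending, one entry per column, is an addition to the notes above.

```python
import numpy as np
import pandas as pd

# Hypothetical unsorted data with one missing Month value
df = pd.DataFrame({
    "Year": [2021, 2020, 2021, 2020],
    "Month": [5, 12, 1, np.nan],
})

# Sort by Year, then by Month within each Year (NaN goes last by default)
by_year_month = df.sort_values(["Year", "Month"])

# Year descending, Month ascending, NaN placed first
mixed = df.sort_values(
    ["Year", "Month"], ascending=[False, True], na_position="first"
)

print(by_year_month)
print(mixed)
```

Neither call changes df itself, because inplace is left at its default of False.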
- Dealing with NaN values
df.dropna() # Drop rows with NaN values
df.isnull().sum() # Find number of NaN values in each column
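The two calls can be tried on a small made-up dataframe, as sketched below. Note that isnull().sum() counts per column and that dropna returns a new dataframe rather than changing the original.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values
df = pd.DataFrame({
    "Year": [2020, 2021, 2022],
    "CO2": [412.5, np.nan, 418.6],
    "Temp": [np.nan, np.nan, 15.1],
})

nan_per_column = df.isnull().sum()     # Series: NaN count for each column
nan_total = int(nan_per_column.sum())  # NaN count for the whole dataframe
cleaned = df.dropna()                  # new dataframe; df itself is unchanged

print(nan_per_column)
print(nan_total, len(cleaned), len(df))
```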
- Selection by row number
df.iloc[2:60] #selects rows from 2 to 59
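iloc counts positions from zero and ignores the index labels. The sketch below uses a made-up dataframe whose labels start at 1000 to make the difference visible.

```python
import pandas as pd

# Hypothetical dataframe whose labels differ from the row positions
df = pd.DataFrame({"Value": range(100)}, index=range(1000, 1100))

subset = df.iloc[2:60]  # positions 2 through 59; position 60 is excluded

print(subset.shape)
print(subset.index[0], subset.index[-1])
```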
- Number of rows and columns in a dataframe
df.shape #Returns the tuple (rows, cols)
df.shape[0] #Returns the first entry in the tuple. This is the no. of rows
df.shape[1] #Returns the second entry in the tuple. This is the no. of cols
- Reading date and time from a file
To read string objects as datetime objects, set the parse_dates parameter.
You can either set it to True, which parses the index, or to a list of the columns whose values must be interpreted as datetime objects.
By default it is set to False.
df = pd.read_csv("filename.csv", parse_dates=["Column"])
df = pd.read_csv("filename.csv", parse_dates=True)
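A runnable sketch of both forms, reading from an in-memory string instead of a file. The CSV content and column names are made up, and index_col is added in the last call because parse_dates=True acts on the index.

```python
import io

import pandas as pd

# Hypothetical CSV content; the column names are made up
csv_text = "Date,CO2\n2020-01-15,412.5\n2020-02-15,413.1\n"

# Without parse_dates the Date column stays as plain strings
raw = pd.read_csv(io.StringIO(csv_text))

# A list of column names: those columns are parsed as datetimes
by_column = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# True: the index is parsed, so pick the index column with index_col
by_index = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)

print(raw["Date"].dtype, by_column["Date"].dtype, type(by_index.index).__name__)
```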