Displaying Data in Pandas
Pandas relies on Matplotlib for plotting, and the two work together seamlessly. Matplotlib must be installed for any Pandas plotting method to work, and matplotlib.pyplot
must be imported to display the result with plt.show().
Graphs and Charts
Many basic charts can be created through the plot
method for Dataframes.
import matplotlib.pyplot as plt
import pandas as pd
df=pd.read_csv("some_file.csv")
df.plot() #plot all columns (must be numbers)
df.plot(x='Column2',y='Column4') #choose columns for the x and y axes
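Choosing columns for the axes is different from choosing which columns to draw. The sketch below uses a small made-up dataframe in place of some_file.csv; the column names and values are assumptions for illustration only.

```python
import matplotlib.pyplot as plt
import pandas as pd

#made-up data standing in for the contents of a CSV file
df=pd.DataFrame({"Column1":[1,2,3,4],
                 "Column2":[10,20,30,40],
                 "Column4":[2.5,3.5,1.5,4.0]})

#one line: Column4 on the y axis against Column2 on the x axis
ax_xy=df.plot(x="Column2",y="Column4")

#two lines: both selected columns drawn against the index
ax_sel=df[["Column2","Column4"]].plot()
```

Calling plt.show() afterwards displays both figures.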
Example from the Pandas documentation:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
ts = pd.Series(np.random.randn(1000),index=pd.date_range("1/1/2000",periods=1000))
ts = ts.cumsum()
df = pd.DataFrame(np.random.randn(1000,4),index=ts.index,columns=list("ABCD"))
df = df.cumsum()
df.plot();
plt.show()
Other types of charts and graphs are available. They can be selected through the kind
keyword to plot, as in df.plot(kind='bar'), or through the corresponding method of df.plot:
df.plot.bar(): Vertical bar chart
df.plot.barh(): Horizontal bar chart
df.plot.box(): Box plot
df.plot.pie(): Pie chart
df.plot.scatter(): Scatter plot
Scatter plots require the x and y columns to be named:
df.plot.scatter(x='A',y='B')
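As a minimal sketch of the two ways to select a chart type, the following draws the same bar chart with the kind keyword and with the method form, then makes a scatter plot. The data are random numbers invented for the illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#made-up data: 5 rows, 3 numeric columns
rng=np.random.default_rng(0)
df=pd.DataFrame(rng.random((5,3)),columns=list("ABC"))

fig,(ax1,ax2)=plt.subplots(1,2,figsize=(8,3))

#two spellings of the same request
df.plot(kind="bar",ax=ax1,title='kind="bar"')
df.plot.bar(ax=ax2,title="plot.bar()")

#scatter needs explicit x and y column names
ax3=df.plot.scatter(x="A",y="B")
```

Both panels should show identical bars. Passing ax places each chart on an existing set of axes; without it, each call opens a new figure, as the scatter plot does here.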
Other keywords to plot
and its kinds control the appearance of the output.
In addition, there are separate methods df.hist()
and df.boxplot()
that have their own sets of arguments.
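The sketch below illustrates a few appearance keywords (title, figsize, grid, style) together with the separate hist and boxplot methods. The data and the particular keyword values are assumptions chosen for the illustration; the documentation lists many more options.

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#made-up data: 200 normally distributed values in each of two columns
rng=np.random.default_rng(1)
df=pd.DataFrame(rng.normal(size=(200,2)),columns=["A","B"])

#appearance keywords passed to plot; one line style per column
ax=df.plot(title="Two random series",figsize=(6,3),grid=True,style=["-","--"])
ax.set_ylabel("value")

#hist draws one histogram per numeric column and returns an array of axes
hist_axes=df.hist(bins=20,figsize=(6,3))

#boxplot draws into the current axes unless ax is given, so make a new figure
fig,box_ax=plt.subplots()
df.boxplot(column=["A","B"],ax=box_ax,grid=False)
```

Calling plt.show() afterwards displays the three figures.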
If multiple columns of a dataframe are compatible numerically, they can be specified and the plot
method will create a superimposed chart with a legend. By default the index is used for the horizontal axis. Remember that the index is not a column of the dataframe, so it can hold dates while the columns hold numbers.
This example is a modification of another one from the Pandas documentation.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dates = pd.date_range("1/1/2000", periods=1000)
df = pd.DataFrame(np.random.randn(1000, 4), index=dates, columns=list("ABCD"))
df = df.cumsum()
df.plot();
plt.show()
Bar charts are similarly simple to create, with nice default labeling.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame(abs(np.random.randn(10, 4)), columns=list("ABCD"))
df.plot.bar();
plt.show()
Tables
A Dataframe is similar to a table, but printing it directly does not always produce good-looking output. For Jupyter, Pandas has a “Styler” that can make a pretty table from a Dataframe. It uses Web standards and the output is HTML. For printing to a plot or paper, the table
function of Matplotlib can be used within Pandas.
Example
For the HTML output, this example is based on the Pandas documentation, with some modifications. The table
version is based on online sources.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
def rain_condition(v):
    if v < 1.75:
        return "Dry"
    elif v < 2.75:
        return "Rain"
    return "Heavy Rain"
def make_pretty(styler):
    styler.set_caption("Weather Conditions")
    styler.format(rain_condition)
    # The following line is only valid for Pandas >= 1.4
    # styler.format_index(lambda v: v.strftime("%A"))
    styler.background_gradient(axis=None, vmin=1, vmax=5, cmap="YlGnBu")
    return styler
weather_df = pd.DataFrame(np.random.rand(10,2)*5,
index=pd.date_range(start="2021-01-01", periods=10),
columns=["Tokyo", "Beijing"])
print(weather_df)
#HTML for Jupyter
weather_df.loc["2021-01-04":"2021-01-08"].style.pipe(make_pretty)
#Plot
ax = plt.subplot(111, frame_on=False) # no visible frame
ax.xaxis.set_visible(False) # hide the x axis
ax.yaxis.set_visible(False) # hide the y axis
tabla = pd.plotting.table(ax, weather_df, loc='upper right', colWidths=[0.17]*len(weather_df.columns))
tabla.scale(1.5,1.5)
plt.show()
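To see the “pretty” version, paste the code into a Jupyter notebook. The styled table is displayed only when the Styler expression is the last statement in its cell, so if using the “table” version as well, place that into a separate cell.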
Documentation
The Pandas visualization documentation is very thorough and shows a wide range of examples.
Exercise
Return to the
bodyfat.csv file from a previous exercise.
Use Pandas to read the data into a Dataframe. Use your BMI function from a
previous exercise to compute BMI for each row. Add a new column for BMI. Plot BMI versus body fat percentage. Look up the documentation for pandas.DataFrame.plot.scatter
for this plot. Does one value seem out of place?
One way to remove outliers is to compute the 25% and 75% quantiles, take the difference QIF=quantile(75)-quantile(25) (the interquartile range), then use 1.5*QIF as the threshold, i.e. anything less than quantile(25)-1.5*QIF or greater than quantile(75)+1.5*QIF is rejected.
Figure out a way to set the BMI values that are outside the cutoffs to np.nan
using the .loc
indexer. Redo the scatter plot. Pandas automatically omits missing data when plotting.
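To make the cutoff rule concrete, here is a small worked sketch on made-up numbers, not the body fat data. The column name value and the variable names are chosen for this illustration only.

```python
import numpy as np
import pandas as pd

#made-up values with one obvious outlier
df=pd.DataFrame({"value":[10.0,11.0,12.0,13.0,14.0,50.0]})

q1=df["value"].quantile(0.25)   #11.25
q3=df["value"].quantile(0.75)   #13.75
iqr=q3-q1                       #2.5, called QIF in the text
lower=q1-1.5*iqr                #7.5
upper=q3+1.5*iqr                #17.5

#Boolean mask of rows outside the cutoffs
outside=(df["value"]<lower) | (df["value"]>upper)

#.loc with a row mask and a column label changes only that column
df.loc[outside,"value"]=np.nan
```

Only the value 50.0 falls outside the interval from 7.5 to 17.5, so it is the only entry replaced by NaN.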
Example solution
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
def inch_to_m(length):
    return length*0.0254
def pound_to_kg(weight):
    return weight*0.453592
def bmi(wt,ht):
    return wt/ht**2
bf_data=pd.read_csv("bodyfat.csv")
bf_data.columns=['bodyfat','age','weight_lbs','height_inch']
wt=pound_to_kg(bf_data.weight_lbs)
ht=inch_to_m(bf_data.height_inch)
bmi_vals=bmi(wt,ht)
bf_data['BMI']=bmi_vals
bf_data.plot.scatter(x='bodyfat',y='BMI')
Q1=bf_data.BMI.quantile(.25)
Q3=bf_data.BMI.quantile(.75)
QIF=Q3-Q1
lower_limit=Q1-1.5*QIF
upper_limit=Q3+1.5*QIF
#set only the BMI values outside the cutoffs to NaN, leaving the other columns unchanged
bf_data.loc[(bf_data['BMI']>upper_limit) | (bf_data['BMI']<lower_limit),'BMI']=np.nan
bf_data.plot.scatter(x='bodyfat',y='BMI')
plt.show()