Displaying Data in Pandas

Pandas relies on Matplotlib for plotting, and the two work together seamlessly. Matplotlib must be installed to use any Pandas plotting method. In a script, matplotlib.pyplot should also be imported so that the figure can be displayed with plt.show().

Graphs and Charts

Many basic charts can be created through the plot method of a DataFrame.

import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("some_file.csv")
df.plot()  # plot all columns (must be numeric)
df.plot(x='Column2', y='Column4')  # choose the columns for the x and y axes

Example from the Pandas documentation:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

ts = pd.Series(np.random.randn(1000),index=pd.date_range("1/1/2000",periods=1000))
ts = ts.cumsum()

df = pd.DataFrame(np.random.randn(1000,4),index=ts.index,columns=list("ABCD"))
df = df.cumsum()

df.plot();
plt.show()

Other types of charts and graphs are available. Select one with the kind keyword to plot, for example df.plot(kind="bar"), or call the matching method df.plot.<kind>(), for example df.plot.bar().

  • df.plot.bar()
    Vertical bar chart
  • df.plot.barh()
    Horizontal bar chart
  • df.plot.box()
    Box plot
  • df.plot.pie()
    Pie chart (for a DataFrame, pass y= or subplots=True)
  • df.plot.scatter()
    Scatter plot; the x and y columns must be named:
    df.plot.scatter(x='A', y='B')
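The two ways of choosing a chart type can be compared side by side. The following is a minimal sketch on made-up random data (the column names A, B, C and the titles are only for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Small made-up table: 5 rows, 3 numeric columns
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 3)), columns=list("ABC"))

# These two calls request the same chart type
ax1 = df.plot(kind="barh", title="kind='barh'")
ax2 = df.plot.barh(title="df.plot.barh()")

# Scatter plots need the x and y columns named explicitly
ax3 = df.plot(kind="scatter", x="A", y="B")

plt.show()
```

Each call opens its own figure and returns the Matplotlib Axes it drew on, so the result can be adjusted further with ordinary Matplotlib methods.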

Other keywords to plot and its kinds control the appearance of the output. In addition, there are separate methods df.hist() and df.boxplot() that have their own sets of arguments.
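As a rough illustration, the sketch below passes a few common appearance keywords (title, figsize, grid) to plot and then calls the separate df.hist() and df.boxplot() methods. The data are random and the particular keyword values are arbitrary choices, not recommendations.

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data: 200 rows of normally distributed values
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=list("ABC"))

# Appearance keywords are passed straight to plot
line_ax = df.cumsum().plot(title="Cumulative sums", figsize=(8, 4), grid=True)

# One histogram per numeric column; returns an array of Axes
hist_axes = df.hist(bins=20)

# Box plot of two chosen columns, drawn on its own new figure
fig, ax = plt.subplots()
box_ax = df.boxplot(column=["A", "B"], ax=ax)

plt.show()
```

Passing ax= to df.boxplot() keeps the boxes from being drawn on top of whichever figure happens to be current.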

If multiple columns of a dataframe are compatible numerically, they can be specified and the plot method will create a superimposed chart with a legend. Remember that the index is not a column of the dataframe, so it can be a date.

This example is a modification of another one from Pandas documentation.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dates = pd.date_range("1/1/2000", periods=1000)

df = pd.DataFrame(np.random.randn(1000, 4), index=dates, columns=list("ABCD"))
df = df.cumsum()
df.plot();

plt.show()
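To plot only some of the columns, select them before calling plot, or name them with the y keyword. A minimal sketch using the same kind of random data as above:

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dates = pd.date_range("1/1/2000", periods=1000)
df = pd.DataFrame(np.random.randn(1000, 4), index=dates, columns=list("ABCD"))
df = df.cumsum()

# Plot only columns A and C; the date index supplies the x axis
ax = df[["A", "C"]].plot()

# Same result: keep the index on the x axis and name the y columns
ax2 = df.plot(y=["A", "C"])

plt.show()
```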

Bar charts are similarly simple to create, with nice default labeling.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame(abs(np.random.randn(10, 4)), columns=list("ABCD"))

df.plot.bar();
plt.show()
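The default labels come from the DataFrame itself: the index supplies the tick labels and the column names supply the legend. The sketch below uses a small made-up table with a text index, and also shows the stacked and rot keywords.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up counts; the index values become the tick labels
df = pd.DataFrame({"A": [3, 1, 2], "B": [2, 4, 1]},
                  index=["Mon", "Tue", "Wed"])

# stacked=True stacks the columns; rot=0 keeps the labels horizontal
ax = df.plot.bar(stacked=True, rot=0)
ax.set_ylabel("Count")

plt.show()
```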

Tables

A DataFrame is similar to a table, but printing it directly does not always produce good-looking output. For Jupyter, Pandas has a “Styler” that can make a pretty table from a DataFrame. It uses Web standards and the output is HTML. For printing to a plot or paper, the table function of Matplotlib can be used within Pandas.

Example

This example is based on the Pandas documentation, with some modifications, for the HTML output. The table version is based on online sources.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

def rain_condition(v):
    if v < 1.75:
        return "Dry"
    elif v < 2.75:
        return "Rain"
    return "Heavy Rain"

def make_pretty(styler):
    styler.set_caption("Weather Conditions")
    styler.format(rain_condition)
    # The following line is only valid for Pandas >= 1.4
    # styler.format_index(lambda v: v.strftime("%A"))
    styler.background_gradient(axis=None, vmin=1, vmax=5, cmap="YlGnBu")
    return styler


weather_df = pd.DataFrame(np.random.rand(10,2)*5,
                          index=pd.date_range(start="2021-01-01", periods=10),
                          columns=["Tokyo", "Beijing"])
print(weather_df)

#HTML for Jupyter
weather_df.loc["2021-01-04":"2021-01-08"].style.pipe(make_pretty)

#Plot
ax = plt.subplot(111, frame_on=False) # no visible frame
ax.xaxis.set_visible(False)  # hide the x axis
ax.yaxis.set_visible(False)  # hide the y axis
tabla = pd.plotting.table(ax, weather_df, loc='upper right', colWidths=[0.17]*len(weather_df.columns)) 
tabla.scale(1.5,1.5)
plt.show()

To see the “pretty” version, paste the code into a Jupyter notebook. Jupyter renders the styled table only when the style expression is the last statement in its cell, so place the “table” (plot) version in a separate cell.

Documentation

The Pandas visualization documentation is very thorough and shows a wide range of examples.

Exercise

Return to the bodyfat.csv file from a previous exercise. Use Pandas to read the data into a DataFrame. Use your BMI function from a previous exercise to compute the BMI for each row, and add it as a new column. Make a scatter plot of BMI versus body fat percentage; look up the documentation for pandas.DataFrame.plot.scatter for this plot. Does one value seem out of place?

One way to remove outliers is to compute the 25% and 75% quantiles, take their difference (the interquartile range) QIF = quantile(75) - quantile(25), then use 1.5*QIF as the threshold: anything less than quantile(25) - 1.5*QIF or greater than quantile(75) + 1.5*QIF is rejected.
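As an illustration of this rule on made-up numbers (not the bodyfat data), here is a minimal sketch. The variable names are arbitrary, and quantile uses the Pandas default of linear interpolation.

```python
import pandas as pd

# Made-up values; 40.0 is deliberately far from the rest
values = pd.Series([21.0, 22.5, 23.0, 24.0, 25.5, 26.0, 40.0])

q1 = values.quantile(0.25)   # 25% quantile
q3 = values.quantile(0.75)   # 75% quantile
qif = q3 - q1                # interquartile range

lower = q1 - 1.5 * qif
upper = q3 + 1.5 * qif

# True where a value falls outside the cutoffs
is_outlier = (values < lower) | (values > upper)
print(values[is_outlier])
```

Only the value 40.0 falls outside the cutoffs here. The boolean Series is_outlier is the kind of mask that can be passed to .loc.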

Figure out a way to set the BMI values that are outside the cutoffs to np.nan using the .loc method. Redo the scatter plot. Pandas plotting automatically skips missing data.

Example solution

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

def inch_to_m(length):
    return length*0.0254

def pound_to_kg(weight):
    return weight*0.453592

def bmi(wt,ht):
    return wt/ht**2

bf_data=pd.read_csv("bodyfat.csv")
bf_data.columns=['bodyfat','age','weight_lbs','height_inch']

wt=pound_to_kg(bf_data.weight_lbs)
ht=inch_to_m(bf_data.height_inch)

bmi_vals=bmi(wt,ht)
bf_data['BMI']=bmi_vals

bf_data.plot.scatter(x='bodyfat',y='BMI')

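# Interquartile-range cutoffs for outliers
Q1=bf_data.BMI.quantile(.25)
Q3=bf_data.BMI.quantile(.75)
QIF=Q3-Q1
lower_limit=Q1-1.5*QIF
upper_limit=Q3+1.5*QIF
# Set only the out-of-range BMI values to NaN, leaving the other columns intact
bf_data.loc[(bf_data['BMI']>upper_limit) | (bf_data['BMI']<lower_limit), 'BMI']=np.nan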

bf_data.plot.scatter(x='bodyfat',y='BMI')
plt.show()

