Working with Dataframes
Selecting and Manipulating Data
Many of the operations we perform with data involve working with parts of it. Pandas has many powerful ways to extract and reorganize data.
Accessing Rows and Columns
Series
Series can be sliced
select=randns[1:41]
This slices both the index and the data. To extract values with specific row numbers, use the iloc
method (note the square brackets).
some_vals=randns[1:41]
one_val=randns.iloc[11]
other_vals=randns.iloc[3:9]
Pandas Dataframe rows are ordered, so you can call rows by index, just as for lists or NumPy Ndarrays.
grade_book[2:4] #returns third and fourth rows
Remember that the upper bound is not included, as is usual for Python ranges.
You can use .iloc
to achieve the same result. The iloc
method selects rows based on their integer indexes.
grade_book.iloc[[2,3]]
The .loc
method is more flexible. It allows you to access row indexes based on the value of a column. loc
is label-based rather than index-based. It is used to extract a block of rows and columns by row identifier and column name, or by a Boolean.
first_student= grade_book.loc[0,'Name']
first_student_data = grade_book.loc[0]
The first_student
variable picks out the content of row 0, column Name
, which is a string in this case. The first_student_data
object is a new Series containing the information about the student in the first row.
Note that when the index is an integer, iloc[0]
and loc[0]
are equivalent. However, the index need not be the default integers and loc
can use a more general type.
We can extract multiple specified columns into a new Dataframe by providing a list of columns.
grades_only=grade_book.loc[:,["Name","Grade"]]
Observe that label slicing is inclusive.
Specifying Rows and Columns
By default, Pandas row indexes are integers starting at 0.
weather=pd.DataFrame({"Date":["2000-01-01 00:00:00","2000-01-02 00:00:00",
"2000-01-03 00:00:00","2000-01-04 00:00:00",
"2000-01-05 00:00:00","2000-01-06 00:00:00"],
"Minimum Temp":[-5.87,-3.82,-4.58,-6.40,-5.50,-3.29],
"Maximum Temp":[8.79,4.78,5.10,2.68,6.18,4.50],
"Cloud Cover":[3,5,3,2,3,5]})
print(weather)
print(weather.index)
for s in weather.index:
print(weather.loc[s])
Date Minimum Temp Maximum Temp Cloud Cover
0 2000-01-01 00:00:00 -5.87 8.79 3
1 2000-01-02 00:00:00 -3.82 4.78 5
2 2000-01-03 00:00:00 -4.58 5.10 3
3 2000-01-04 00:00:00 -6.40 2.68 2
4 2000-01-05 00:00:00 -5.50 6.18 3
5 2000-01-06 00:00:00 -3.29 4.50 5
RangeIndex(start=0, stop=6, step=1)
Date 2000-01-01 00:00:00
Minimum Temp -5.87
Maximum Temp 8.79
Cloud Cover 3
Name: 0, dtype: object
# Rest of loop output omitted
We would probably prefer to access the data by date, rather than trying to determine which rows to use. Pandas has a built-in date generator:
date_ranges=pd.date_range("2000-01-01",periods=6)
date_ranges
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
'2000-01-05', '2000-01-06'],
dtype='datetime64[ns]', freq='D')
The default is an interval (freq
) of one day. Start and end dates can be specified. Multiple date formats are accepted. For this example, however, we will make a list and use it as the index. We can then select items by dates.
Accessing and Renaming the Column Names
If we’d like to save some typing, we can rename columns to make them conform to Python variable-naming rules. Then we can treat the column name as an attribute.
weather_df.columns=["Tmin","Tmax","Cloud Cover"]
Order matters, and each column must be included even if we do not wish to rename it. In order to rename only certain columns, we can use the rename
method, which uses a dictionary format.
weather_df.rename(columns={'Minimum Temp':'Tmin','Maximum Temp':'Tmax'},inplace=True)
Be careful with the period (“dot”) notation for column names, since if one happens to coincide with a built-in Pandas attribute or method, the method will be assumed, which may result in unpredictable or incorrect behavior.
Without an assignment, the columns
attribute holds the names of the columns.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dates=["2000-01-01 00:00:00","2000-01-02 00:00:00",
"2000-01-03 00:00:00","2000-01-04 00:00:00",
"2000-01-05 00:00:00","2000-01-06 00:00:00"]
weather=pd.DataFrame({"Minimum Temp":[-5.87,-3.82,-4.58,-6.40,-5.50,-3.29],
"Maximum Temp":[8.79,4.78,5.10,2.68,6.18,4.50],
"Cloud Cover":[3,5,3,2,3,5]},
index=dates)
print(weather.columns)
print(weather["2000-01-02":"2000-01-04"])
#Two ways to rename the columns
weather.rename(columns={'Minimum Temp':'Tmin','Maximum Temp':'Tmax'},inplace=True)
#weather.columns=["Tmin","Tmax","Cloud Cover"]
print(weather.Tmin.mean())
Index(['Minimum Temp', 'Maximum Temp', 'Cloud Cover'], dtype='object')
Minimum Temp Maximum Temp Cloud Cover
2000-01-02 00:00:00 -3.82 4.78 5
2000-01-03 00:00:00 -4.58 5.10 3
-4.91
This range syntax for the row range is not inclusive, as is usual for Python.
Extracting Row Indices
The index
attribute contains the index values
weather_df.index
To rename the indexes we use rename
much as for the column names.
weather_df.rename(index={'2000-01-01 00:00:00':'2000-01-01 00:00:10'},inplace=True
We can obtain the equivalent NumPy values for the indices. We can also convert the index object into a list.
inds_array=weather_df.index.values
inds_list=weather_df.index.tolist()
Exercise
The Seaborn package includes some sample datasets. We will look at the “iris” dataset.
import seaborn as sn
iris=sn.load_dataset('iris')
Describe the dataset. Print the column names. Iterate through the indexes and print the corresponding value of the species name for indexes 0 to 30 inclusive. Print the mean petal length. Print the series data for row 90. Make a new dataframe that contains only the petal length, petal width, and species. Summarize the new dataframe.
Example solution
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
import pandas as pd
iris=sn.load_dataset('iris')
print(iris.describe())
print(iris.columns)
for i in iris.index.tolist()[0:31]:
print(iris.loc[i,'species'])
print(iris['petal_length'].mean())
print('The 91st row')
print(iris.iloc[90])
petal_data=iris.loc[:,["petal_length","petal_width","species"]]
print(petal_data.describe())