Filtering Data with Pandas

We often need to extract data based on specified criteria, and to regroup and reorganize it. Pandas provides a wealth of methods for these operations.

Missing Data

We frequently must filter or account for missing data. Pandas can easily handle this task. Conventionally, we use the IEEE Not A Number (nan/NaN) indicator for missing data, since it is easy to check for its presence. We can also use None; Pandas treats None and NAN as essentially the same thing. In documentation, missing data is usually referenced as NA.

To check for missing data in our student_grades dataframe we can use isna() and notna(), both of which are Boolean functions so will return True or False.


dates=["2000-01-01 00:00:00","2000-01-02 00:00:00",
       "2000-01-03 00:00:00","2000-01-04 00:00:00",
       "2000-01-05 00:00:00","2000-01-06 00:00:00"]
weather=pd.DataFrame({"Minimum Temp":[-5.87,np.nan,-4.58,-6.40,-5.50,-3.29],
                      "Maximum Temp":[8.79,np.nan,5.10,2.68,6.18,4.50],
                      "Cloud Cover":[3,5,3,2,3,5]},
                      index=dates)

>>>weather.isna()
                     Minimum Temp  Maximum Temp  Cloud Cover
2000-01-01 00:00:00         False         False        False
2000-01-02 00:00:00          True          True        False
2000-01-03 00:00:00         False         False        False
2000-01-04 00:00:00         False         False        False
2000-01-05 00:00:00         False         False        False
2000-01-06 00:00:00         False         False        False

Most built-in numerical methods omit rows with missing data.

>>>weather["Minimum Temp"].mean()
-5.128

We can fill missing values with a quantity.

>>>new_weather=weather.fillna(-999.)
>>>new_weather
                     Minimum Temp  Maximum Temp  Cloud Cover
2000-01-01 00:00:00         -5.87          8.79            3
2000-01-02 00:00:00       -999.00       -999.00            5
2000-01-03 00:00:00         -4.58          5.10            3
2000-01-04 00:00:00         -6.40          2.68            2
2000-01-05 00:00:00         -5.50          6.18            3
2000-01-06 00:00:00         -3.29          4.50            5

In many cases we just wish to remove rows with missing data.

>>>corrected_weather=weather.dropna()
>>>corrected_weather
                     Minimum Temp  Maximum Temp  Cloud Cover
2000-01-01 00:00:00         -5.87          8.79            3
2000-01-03 00:00:00         -4.58          5.10            3
2000-01-04 00:00:00         -6.40          2.68            2
2000-01-05 00:00:00         -5.50          6.18            3
2000-01-06 00:00:00         -3.29          4.50            5

We can find the number of elements that are not NA in a dataframe with count

>>>weather.count()
Minimum Temp    5
Maximum Temp    5
Cloud Cover     6
dtype: int64

Searching DataFrames

Conditionals

We can search for values in columns that satisfy specified conditions. Conditionals in indexes are accepted much as for NumPy arrays. The operators are <,<=,>,>=,==, and != as for Numpy.

freezing_days=weather[weather.Tmax<=0.]

Conditionals may be compounded

cold_days=weather[(weather.Tmax<=6.) & (weather.Tmin>-5.)]

The ampersand & indicates and. Use the pipe symbol | for or. The parentheses are generally required when creating a compound conditional.

Conditional searches return a new DataFrame. Returning to the grade book example, we can select students by various criteria

second_years = grade_book.loc[grade_book['Year'] == 2]
top_students=grade_book.loc[grade_book['Grade']>=90]

With no row specification, the loc is optional in the above code.

Where

The where method returns a dataframe with NaN values for rows where the conditional is not satisfied, and the original values where it is true.

>>>grade_record.where(grade_record["Test4"]>85)
                Test1  Test2  Test3  Test4
Jim Dandy         NaN    NaN    NaN    NaN
Betty Boop       91.7   89.8   92.4   87.2
Minnie Moocher    NaN    NaN    NaN    NaN
Joe Friday        NaN    NaN    NaN    NaN
Teddy Salad      98.5   96.3   96.8   93.9

If we want to remove the NaNs we can add dropna

>>>grade_record.where(grade_record["Test4"]>85).dropna()
             Test1  Test2  Test3  Test4
Betty Boop    91.7   89.8   92.4   87.2
Teddy Salad   98.5   96.3   96.8   93.9

Exercise Return to the weather_data.ipynb notebook from the previous section. Loop through the dataframe and use some things you have learned about strings to find the first line for the station location of Richmond, Virginia. Get the corresponding station code. Use this to extract the data for Richmond, Virginia into a new dataframe. Print the overall mean temperature and the means of the minimum and maximum to two decimal places. (If using f strings, don’t forget the rule about writing quotes inside the f-string.) Print the number of days with a minimum below freezing. Hint: “False” is 0 and True is 1.

Example solution, zipped Jupyter notebook

pandas_weather_ex2.zip

Last updated on Nov 17, 2020