Selecting and Filtering Data#

Learning Objectives#

  • Understand that selecting and filtering are ways of retrieving a subset of a dataset.
  • Retrieve any subset of the columns of a Pandas DataFrame.
  • Predict when code that subsets a DataFrame will return a new DataFrame or a Series.
  • Compare a Series to a constant (numeric or string) or to another Series. Identify the return type as a Series of Boolean values, and be able to predict the values in the returned Series.
  • Filter a DataFrame using a single condition.
    • Be able to use arithmetic operators to compare values in a column to a constant.
    • Check whether values are in a list by using the Pandas isin function.
    • Compare values from multiple columns.
    • Retrieve rows with values missing (isna and notna functions).
    • Retrieve rows with no missing values (dropna function).
  • Count the number of missing values in a column of a DataFrame.
  • Filter using multiple conditions.

Overview#

Usually, you only need to work with a subset of your data. You may only need certain columns of your dataset. Or, you may wish to sample your data, in which case you will choose a random sample of the rows. The process of choosing a subset of your data is called selecting. Often, you will want to work with rows that meet certain criteria. The process of keeping the rows of interest and discarding the rest is called filtering.

In this chapter, you will learn how to select and filter data using Pandas. Pandas provides many features for selecting and filtering data and there are often multiple ways to do the same thing. We will only scratch the surface of Pandas’ capabilities in this chapter. We will show you common and simple ways to select and filter your data. If you wish to dive deeper and learn more of the available features, consult the pandas documentation on indexing and selecting data.

Throughout this chapter, we will use the graduate admissions data from Kaggle to illustrate selecting and filtering.

Imports and Loading the Data#

import numpy as np
import pandas as pd

The cell below loads the graduate admissions dataset. Each row represents a student’s application to one university. Here is a list of the columns and their meanings:

| Column | Meaning |
| --- | --- |
| Serial No. | A unique identifier for each row |
| GRE Score | GRE score (out of 340) |
| TOEFL Score | TOEFL score (out of 120) |
| University Rating | The quality/prestige of the university to which the student applied (out of 5) |
| SOP | Quality of the statement of purpose (out of 5) |
| LOR | Strength of the letter of recommendation (out of 5) |
| CGPA | Undergraduate GPA (out of 10) |
| Research | 1 if the student has research experience, 0 if not |
| Chance of Admit | Chance of admission, ranging from 0 to 1 |

dfAdmission = pd.read_csv('data/Admission_Predict.csv')
dfAdmission.head()
Serial No. GRE Score TOEFL Score University Rating SOP LOR CGPA Research Chance of Admit
0 1 337 118 4 4.5 4.5 9.65 1 0.92
1 2 324 107 4 4.0 4.5 8.87 1 0.76
2 3 316 104 3 3.0 3.5 8.00 1 0.72
3 4 322 110 3 3.5 2.5 8.67 1 0.80
4 5 314 103 2 2.0 3.0 8.21 0 0.65

Selecting Columns of Your Dataset#

Retrieving a subset of a data frame is like slicing a list. The idea is to get a piece of the original. However, it’s slightly more complicated with data frames because there are two dimensions: rows and columns.

In this section, we will focus on retrieving a subset of the columns of a data frame. In the next section, we will show you how to retrieve a subset of the rows.

Retrieving a subset of the columns#

It is very easy to retrieve a subset of the columns of a data frame. We will show you how with an example. Let’s take the data frame dfAdmission that we loaded above. Say we want the GRE Score and TOEFL Score columns. To extract those columns, and only those two columns, run the code below:

dfAdmission[['GRE Score', 'TOEFL Score']]
GRE Score TOEFL Score
0 337 118
1 324 107
2 316 104
3 322 110
4 314 103
... ... ...
395 324 110
396 325 107
397 330 116
398 312 103
399 333 117

400 rows × 2 columns

Let’s look at the code in the cell above:

dfAdmission[['GRE Score', 'TOEFL Score']]

All we did was to type the name of the data frame, followed by double brackets. Inside the double brackets, we typed the names of the columns, separated by commas.

Why double brackets? Let’s explain that with another example. Look at the code cell below:

columns_we_want = ['GRE Score', 'TOEFL Score']
dfAdmission[columns_we_want]
GRE Score TOEFL Score
0 337 118
1 324 107
2 316 104
3 322 110
4 314 103
... ... ...
395 324 110
396 325 107
397 330 116
398 312 103
399 333 117

400 rows × 2 columns

The code was:

columns_we_want = ['GRE Score', 'TOEFL Score']
dfAdmission[columns_we_want]

The first line of that code creates a list containing the names of the columns we want. The second line of that code was dfAdmission[columns_we_want]. That should look very similar to retrieving an item from a list. We follow our variable, dfAdmission, by square brackets. Inside the square brackets, we put a list of the desired columns. Thus, you can think of the list of column names like a list index.

Column order#

You can put the columns in any order you want. The list of names does not have to be in the same order as the original data frame. Also, you can repeat columns if you want.

dfAdmission[['TOEFL Score', 'TOEFL Score', 'GRE Score']].head()
TOEFL Score TOEFL Score GRE Score
0 118 118 337
1 107 107 324
2 104 104 316
3 110 110 322
4 103 103 314

In the above, notice that we got the ‘TOEFL Score’ column twice. Also notice that, unlike in the original example, we retrieved TOEFL Score before GRE Score.

Return type when retrieving a subset of the columns#

When you use the double-bracket syntax to retrieve a subset of the columns, Pandas returns a new DataFrame. To see this, consider the following code.

type(dfAdmission[['TOEFL Score', 'GRE Score']])
pandas.core.frame.DataFrame

This will always be the case when you use double brackets, even if you retrieve only one column.

type(dfAdmission[['TOEFL Score']])
pandas.core.frame.DataFrame

Also notice that, since the return type is a DataFrame, you can call DataFrame methods directly without creating a new variable. To see this, consider the following two code snippets that produce identical output.

# Print the first 2 rows of a DataFrame with a subset of the columns
dfAdmission[['SOP', 'LOR']].head(2)
SOP LOR
0 4.5 4.5
1 4.0 4.5
dfSubset = dfAdmission[['SOP', 'LOR']]
dfSubset.head(2)
SOP LOR
0 4.5 4.5
1 4.0 4.5

In the first code cell, we retrieved the columns SOP (statement of purpose) and LOR (letters of recommendation). The code dfAdmission[['SOP', 'LOR']] took the DataFrame dfAdmission, extracted two columns, and returned a new DataFrame. We then used the head command to retrieve the first two rows of that DataFrame.

In the second code cell, the code dfSubset = dfAdmission[['SOP', 'LOR']] took the DataFrame dfAdmission, extracted two columns, and returned a new DataFrame. It then saved that new DataFrame into the variable dfSubset. We then used the head command to retrieve the first two rows of dfSubset.

Retrieving a single column as a Series#

When we first introduced Pandas, we described the DataFrame object and the Series object. The former is used to store tables of data (i.e. rows and columns). The latter is used to store a single column of data.

When retrieving a single column of a DataFrame, it is possible to retrieve that column as either a DataFrame or as a Series. We’ll explain why this matters later in this chapter. For now, let’s learn how to extract a single column.

Given the previous section of this chapter, you shouldn’t be surprised that this code will return a DataFrame. As we mentioned before, when you use double brackets, the result will always be a DataFrame.

dfAdmission[['SOP']]

If instead you use single brackets, the return type will be a Series. Let’s see this with some examples:

# Retrieve a single column as a DataFrame
dfAdmission[['SOP']].head(3)
SOP
0 4.5
1 4.0
2 3.0
# Verify that this is a DataFrame
type(dfAdmission[['SOP']])
pandas.core.frame.DataFrame
# Retrieve a single column as a Series
dfAdmission['SOP'].head(3)
0    4.5
1    4.0
2    3.0
Name: SOP, dtype: float64
# Verify that this is a Series
type(dfAdmission['SOP'])
pandas.core.series.Series

In the above examples, notice that a Series looks different from a DataFrame. The DataFrame is formatted nicely, while the Series looks more spartan. Also notice that when you print a Series, you get information about its data type, which in this case was float64, a floating-point type.

Testing the Values in a Series#

In the next section of this chapter, we will show you how to filter your data by retrieving rows that meet some criteria. In order to do that, you need to understand how to work with Pandas Series. We will show you how to do that in this section.

Comparing a Series to a Numeric Constant#

Sometimes, you want to test every value in a column. For example, consider the Research column in the DataFrame dfAdmission. Let’s say we want to subset our data so we only have rows for students with research experience. In other words, we only want rows where the value in the Research column is 1. To filter the data this way, we will need to take the column Research and compare every value to 1. Let’s see what happens when we do that:

# Reminder: the code dfAdmission['Research'] will return a Series
dfAdmission['Research'] == 1
0       True
1       True
2       True
3       True
4      False
       ...  
395     True
396     True
397     True
398    False
399     True
Name: Research, Length: 400, dtype: bool

Whoa! What happened? Pandas took the column Research, a Series, and compared every value in the column to 1. It returned a new Series of Boolean values. The new Series has the exact same length as the original series.

Let’s try this again. This time, let’s compare the GRE score to 300. We would do this if we only want rows where the GRE score is at least 300:

dfAdmission['GRE Score'].head()
0    337
1    324
2    316
3    322
4    314
Name: GRE Score, dtype: int64
dfAdmission['GRE Score'] >= 300
0      True
1      True
2      True
3      True
4      True
       ... 
395    True
396    True
397    True
398    True
399    True
Name: GRE Score, Length: 400, dtype: bool

Again, notice that Pandas took the Series dfAdmission['GRE Score'], compared every value in that Series to 300, and returned a new Series of Booleans. Each row in the new Series corresponds to a row in dfAdmission['GRE Score']. Thus, since the first five values of dfAdmission['GRE Score'] are greater than or equal to 300, the first five values of the returned Series are True.

When testing a Series, it is possible to compare values to numbers, strings, or to other Series! Let’s illustrate these ideas with more sample code:

Comparing a Series to a String#

Since none of the columns in dfAdmission are of type string, let’s create a new Series.

s = pd.Series(['cat', 'dog', 'moose'])
s == 'moose'
0    False
1    False
2     True
dtype: bool

The above Series, s, contains the strings 'cat', 'dog', and 'moose'. When we compared s to 'moose', the returned Series was False, False, True since only the last value in the Series equals 'moose'.

Comparing a Series to Another Series#

dfAdmission['SOP'] == dfAdmission['LOR']
0       True
1      False
2      False
3      False
4      False
       ...  
395     True
396    False
397    False
398    False
399    False
Length: 400, dtype: bool

The above code compares the columns SOP and LOR in dfAdmission. For every row in which the SOP and LOR scores are equal, the new Series will have a True value. If the SOP and LOR scores differ in a given row, the corresponding value in the returned Series will be False. To see this, let’s look at the first two rows of dfAdmission.

dfAdmission[['SOP', 'LOR']].head(2)
SOP LOR
0 4.5 4.5
1 4.0 4.5

Notice that, in the first row, the SOP and LOR values are equal, and therefore the first value in the returned Series was True. In the second row, SOP and LOR are unequal, and hence the second value in the returned Series was False.

Retrieving Rows of a DataFrame via Filtering#

Filtering is the process of extracting only the rows of your dataset that meet certain criteria. Once you understand the ideas in the previous section about working with Series, filtering in Pandas is fairly straightforward. In the remainder of this section, we will show you how to filter Pandas DataFrames.

Filtering with a single condition#

Often, when filtering data, you only need to specify one condition. For example, I might want to filter my admissions data to only show students with research experience. In the remainder of this subsection, we will assume you only have one filtering criterion. In the next subsection, we will show you how to combine multiple criteria in your filter.

Comparing to a constant#

Let’s continue with the example where I only want students with research experience. In the previous section, we showed you how to compare the Research column to a constant, and showed you that the return value was a Series of Booleans. The code we used was:

dfAdmission['Research'] == 1

The above code returned a Series of True/False values. For every row where the value in the Research column was 1, our new Series had a True. For every row where the value in the Research column was 0, our new Series had a False.

To filter a Pandas DataFrame, you simply need to pass this Series of Booleans to your DataFrame! That’s it! Let’s run the following code to see this:

dfAdmission[dfAdmission['Research'] == 1]
Serial No. GRE Score TOEFL Score University Rating SOP LOR CGPA Research Chance of Admit
0 1 337 118 4 4.5 4.5 9.65 1 0.92
1 2 324 107 4 4.0 4.5 8.87 1 0.76
2 3 316 104 3 3.0 3.5 8.00 1 0.72
3 4 322 110 3 3.5 2.5 8.67 1 0.80
5 6 330 115 5 4.5 3.0 9.34 1 0.90
... ... ... ... ... ... ... ... ... ...
394 395 329 111 4 4.5 4.0 9.23 1 0.89
395 396 324 110 3 3.5 3.5 9.04 1 0.82
396 397 325 107 3 3.0 3.5 9.11 1 0.84
397 398 330 116 4 5.0 4.5 9.45 1 0.91
399 400 333 117 4 5.0 4.0 9.66 1 0.95

219 rows × 9 columns

Let’s analyze the code we just wrote:

dfAdmission[dfAdmission['Research'] == 1]

We took the DataFrame dfAdmission and put brackets after it. Within those brackets, we passed it a Series of True/False values. Pandas then returned only those rows for which the Series was True. In other words, it returned only those rows for which the value in the Research column was 1.
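Just as we stored a list of column names in the variable columns_we_want earlier, you can store the Boolean Series in a variable and then pass that variable to the DataFrame. Here is a minimal sketch using a small made-up DataFrame (dfStudents is purely illustrative, not part of the admissions data):

```python
import pandas as pd

# A small made-up DataFrame for illustration
dfStudents = pd.DataFrame({'Name': ['Ana', 'Ben', 'Cal'],
                           'Research': [1, 0, 1]})

# Save the Boolean Series in a variable, then pass it to the DataFrame
has_research = dfStudents['Research'] == 1
dfStudents[has_research]
```

Storing the condition in a variable can make long filters easier to read, and it lets you reuse the same mask for several operations.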

Here’s another example. The University Rating column contains the values 1-5, where 5 indicates the most selective universities. Let’s return only those rows with a University Rating of 4 or 5.

dfAdmission[dfAdmission['University Rating'] > 3]
Serial No. GRE Score TOEFL Score University Rating SOP LOR CGPA Research Chance of Admit
0 1 337 118 4 4.5 4.5 9.65 1 0.92
1 2 324 107 4 4.0 4.5 8.87 1 0.76
5 6 330 115 5 4.5 3.0 9.34 1 0.90
11 12 327 111 4 4.0 4.5 9.00 1 0.84
12 13 328 112 4 4.0 4.5 9.10 1 0.78
... ... ... ... ... ... ... ... ... ...
385 386 335 117 5 5.0 5.0 9.82 1 0.96
392 393 326 112 4 4.0 3.5 9.12 1 0.84
394 395 329 111 4 4.5 4.0 9.23 1 0.89
397 398 330 116 4 5.0 4.5 9.45 1 0.91
399 400 333 117 4 5.0 4.0 9.66 1 0.95

134 rows × 9 columns

Notice we used the condition > 3. We could have also used >= 4. In the next subsection, we’ll show you yet another option: checking whether the value is in a list such as [4, 5]. There are many ways to do the same thing.
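To see that the two conditions are interchangeable here, we can compare them on a small made-up DataFrame (dfRatings is hypothetical) whose column holds only whole numbers:

```python
import pandas as pd

# Hypothetical ratings column containing only the whole numbers 1-5
dfRatings = pd.DataFrame({'University Rating': [1, 4, 3, 5, 2]})

# For whole-number ratings, > 3 and >= 4 keep exactly the same rows
over3 = dfRatings[dfRatings['University Rating'] > 3]
atLeast4 = dfRatings[dfRatings['University Rating'] >= 4]
over3.equals(atLeast4)
```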

Checking whether values are in a list using .isin()#

The DataFrame dfAdmission has numerical data. Sometimes, when you have categorical data, you will want to filter out certain categories. An easy way to do this is with the Pandas Series method isin. That method allows you to check whether the values in a column are in a list of values that you specify.

Consider the following DataFrame:

dfAnimals = pd.DataFrame({'Type': ['cat', 'dog', 'hamster', 'hippo', 'giraffe', 'cockroach']})
dfAnimals
Type
0 cat
1 dog
2 hamster
3 hippo
4 giraffe
5 cockroach

Let’s say I only want rows containing animals that are likely to be kept as pets. I could write the following code:

dfAnimals[dfAnimals['Type'].isin(['hamster', 'dog', 'cat'])]
Type
0 cat
1 dog
2 hamster

The isin method of Series returns a new Series of Booleans that is True wherever the value in the original Series matches any element of the list. To see this, run the following code to generate the Boolean Series that we used as our filter.

dfAnimals['Type'].isin(['hamster', 'dog', 'cat'])
0     True
1     True
2     True
3    False
4    False
5    False
Name: Type, dtype: bool

Comparing values from multiple columns#

Sometimes you want to compare values in multiple columns. For example, say we only want rows from our Admissions dataset where the value of LOR exceeds the value of SOP. We could run the following code:

dfAdmission[dfAdmission['LOR'] > dfAdmission['SOP']]
Serial No. GRE Score TOEFL Score University Rating SOP LOR CGPA Research Chance of Admit
1 2 324 107 4 4.0 4.5 8.87 1 0.76
2 3 316 104 3 3.0 3.5 8.00 1 0.72
4 5 314 103 2 2.0 3.0 8.21 0 0.65
6 7 321 109 3 3.0 4.0 8.20 1 0.75
7 8 308 101 2 3.0 4.0 7.90 0 0.68
... ... ... ... ... ... ... ... ... ...
389 390 320 108 3 3.5 4.0 8.44 1 0.76
390 391 314 102 2 2.0 2.5 8.24 0 0.64
391 392 318 106 3 2.0 3.0 8.65 0 0.71
396 397 325 107 3 3.0 3.5 9.11 1 0.84
398 399 312 103 3 3.5 4.0 8.78 0 0.67

163 rows × 9 columns

Working with missing values#

Often, you will have dirty data that contains missing values. There are many ways to deal with missing values, but before you can deal with them, you need to identify them. Pandas provides some very nice functions to help you do this. To illustrate those functions, let’s create a simple DataFrame that contains missing values:

dfMissing = pd.DataFrame(
    {'x': [np.nan,'B','C','D','E'], 
     'y': [1,np.nan,3,4,5], 
     'z': [92, -17, np.nan, np.nan,0]})
dfMissing
x y z
0 NaN 1.0 92.0
1 B NaN -17.0
2 C 3.0 NaN
3 D 4.0 NaN
4 E 5.0 0.0

First, notice that missing values appear as NaN, which means “not a number”. We can test a column for missing values by using the Pandas function pandas.isna. This function returns True when it sees a NaN value and returns False otherwise.

The following code tests every value in the x column of dfMissing and returns True when it finds a missing value.

pd.isna(dfMissing['x'])
0     True
1    False
2    False
3    False
4    False
Name: x, dtype: bool
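Note that isna is also available as a method on a Series, so dfMissing['x'].isna() gives the same result as pd.isna(dfMissing['x']). A quick sketch (recreating dfMissing so the snippet stands alone):

```python
import numpy as np
import pandas as pd

dfMissing = pd.DataFrame(
    {'x': [np.nan, 'B', 'C', 'D', 'E'],
     'y': [1, np.nan, 3, 4, 5],
     'z': [92, -17, np.nan, np.nan, 0]})

# The method form produces the same Boolean Series as pd.isna
dfMissing['x'].isna()
```

Which form you use is a matter of taste; this chapter sticks with pd.isna, but you will see both styles in practice.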

We can therefore use the isna function to return rows where a given column has a missing value.

dfMissing[pd.isna(dfMissing['x'])]
x y z
0 NaN 1.0 92.0

Usually, you will either want to know how many values are missing, or you will want to drop rows with missing values from your dataset. Let’s see how to do that.

Count the number of missing values in a column#

Remember when we showed you how to perform arithmetic with different data types? One thing we showed you was that when you add Boolean values, True is converted to 1 and False to 0. We can use this to help us count the number of missing values in a column. We’ll simply take the Series of True/False values and sum it. Like this:

pd.isna(dfMissing['z'])
0    False
1    False
2     True
3     True
4    False
Name: z, dtype: bool
pd.isna(dfMissing['z']).sum()
np.int64(2)

Notice that, since there were two missing values, there were two True values in the Boolean series, so when we summed it up, we got 2.

It turns out that the pd.isna function works on DataFrames as well as Series, so if we want to count the number of missing values in each column of the DataFrame, we can just type:

pd.isna(dfMissing).sum()
x    1
y    1
z    2
dtype: int64

The above code tells us that column x has one missing value, column y has one missing value, and column z has two missing values.
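If you want a single grand total across the entire DataFrame rather than per-column counts, you can sum the per-column counts again. A minimal sketch (recreating dfMissing so the snippet stands alone):

```python
import numpy as np
import pandas as pd

dfMissing = pd.DataFrame(
    {'x': [np.nan, 'B', 'C', 'D', 'E'],
     'y': [1, np.nan, 3, 4, 5],
     'z': [92, -17, np.nan, np.nan, 0]})

# The first .sum() counts missing values per column;
# the second .sum() adds those per-column counts together
total_missing = pd.isna(dfMissing).sum().sum()
total_missing
```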

Dropping missing values from a DataFrame#

More often than not, you will want to drop, not keep, missing values. There are two ways to do this. You can use the Pandas function notna, which is the opposite of isna. Alternatively, you can use the DataFrame method dropna. Let’s illustrate both.

To drop all rows in which a value is missing from column y, you could type:

dfMissing[pd.notna(dfMissing['y'])]
x y z
0 NaN 1.0 92.0
2 C 3.0 NaN
3 D 4.0 NaN
4 E 5.0 0.0

Notice that the code above dropped the second row (with index 1) because it had a missing value in the y column.

An alternative is to use the dropna method, which is more powerful and has a cleaner syntax. By default, dropna will drop every row that has a missing value in any column. For example:

dfMissing.dropna()
x y z
4 E 5.0 0.0

Notice that dropna only returned the last row because that was the only row with no missing values.

Sometimes, you only want to drop rows that have missing values in some columns. To do this, pass the subset argument to dropna. For example, say we want to drop rows with missing values in either column x or column y, but we don’t care if there are missing values in column z. You could do that as follows:

dfMissing.dropna(subset=['x', 'y'])
x y z
2 C 3.0 NaN
3 D 4.0 NaN
4 E 5.0 0.0

The above code dropped row 0 because it had a missing value in column x. It also dropped row 1 because it had a missing value in column y.
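dropna also accepts a thresh argument, which keeps only the rows that have at least that many non-missing values. A minimal sketch with a small made-up DataFrame (dfSparse is hypothetical):

```python
import numpy as np
import pandas as pd

# Rows 0 and 2 each have only one non-missing value; row 1 has three
dfSparse = pd.DataFrame(
    {'x': [np.nan, 'B', np.nan],
     'y': [np.nan, 2, 3],
     'z': [1, 2, np.nan]})

# Keep only the rows with at least 2 non-missing values
dfSparse.dropna(thresh=2)
```

This is useful when a row with one or two gaps is still worth keeping, but a mostly-empty row is not.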

Filtering with multiple conditions#

If you are comfortable filtering with one condition, then it is fairly easy to filter with multiple conditions. The only new things you need to know are that you need to:

  • Use the logical operators &, |, and ~ to combine the conditions.

  • Enclose each condition in parentheses.

We’ll elaborate on each of these now.

Logical operators#

Remember when we taught you basic Python and you learned the operators and, or, and not? The same ideas apply, but when working with Series, you need to use symbols for these operators. The following table shows you the symbols:

| Logical Operator | Symbol |
| --- | --- |
| and | & |
| or | \| |
| not | ~ |

In other words, when you want to join two conditions using a logical and, you need to use the & operator. When you want to join two conditions using a logical or, you need to use the | operator.

Joining conditions#

When you have multiple conditions, you will need to enclose each condition in parentheses. The reason for this is order of operations: the & and | operators have higher precedence than comparison operators like == and >=. Without the parentheses, Pandas performs the operations in the wrong order, and you will either get an error or incorrect results.

Let’s work through some examples. First, let’s say we want all rows from dfAdmission where the student has research experience and is applying to a top school. We could write:

dfAdmission[(dfAdmission['Research'] == 1) & (dfAdmission['University Rating'] == 5)].head(3)
Serial No. GRE Score TOEFL Score University Rating SOP LOR CGPA Research Chance of Admit
5 6 330 115 5 4.5 3.0 9.34 1 0.90
22 23 328 116 5 5.0 5.0 9.50 1 0.94
23 24 334 119 5 5.0 4.5 9.70 1 0.95

The above code filtered first on dfAdmission['Research'] == 1. It also filtered on dfAdmission['University Rating'] == 5. Since we wanted both conditions to be true, we joined the conditions with the and operator, &. Notice that both conditions are in parentheses.

Now say we want students who either have a TOEFL score of at least 119, or who have a letter of recommendation score of 5 (the highest). We would type:

dfAdmission[(dfAdmission['TOEFL Score'] >= 119) | (dfAdmission['LOR'] == 5.0)].head(3)
Serial No. GRE Score TOEFL Score University Rating SOP LOR CGPA Research Chance of Admit
22 23 328 116 5 5.0 5.0 9.5 1 0.94
23 24 334 119 5 5.0 4.5 9.7 1 0.95
24 25 336 119 5 4.0 3.5 9.8 1 0.97

The above code filtered first on dfAdmission['TOEFL Score'] >= 119. It also filtered on dfAdmission['LOR'] == 5.0. Since we wanted students for which either condition was true (or both were true), we joined the conditions with the or operator, |. Notice that both conditions are in parentheses.
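Finally, the table of logical operators also listed ~ for logical not, which we haven’t demonstrated yet. Placed in front of a parenthesized condition, ~ flips every Boolean in the Series. A minimal sketch with a small made-up DataFrame (dfStudents is purely illustrative):

```python
import pandas as pd

dfStudents = pd.DataFrame({'Name': ['Ana', 'Ben', 'Cal'],
                           'Research': [1, 0, 1]})

# ~ negates the condition: keep rows where Research is NOT 1
dfStudents[~(dfStudents['Research'] == 1)]
```

Here ~(dfStudents['Research'] == 1) happens to be equivalent to dfStudents['Research'] == 0, but ~ becomes essential for negating conditions that have no simple opposite, such as isin.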