Selecting and Filtering Data#

Learning Objectives#

  • Understand that selecting and filtering are ways of retrieving a subset of a dataset.
  • Retrieve any subset of the columns of a Pandas DataFrame.
  • Predict when code that subsets a DataFrame will return a new DataFrame or a Series.
  • Compare a Series to a constant (numeric or string) or to another Series. Identify the return type as a Series of Boolean values, and be able to predict the values in the returned Series.
  • Filter a DataFrame using a single condition.
    • Be able to use arithmetic operators to compare values in a column to a constant.
    • Check whether values are in a list by using the Pandas isin function.
    • Compare values from multiple columns.
    • Retrieve rows with values missing (isna and notna functions).
    • Retrieve rows with no missing values (dropna function).
  • Count the number of missing values in a column of a DataFrame.
  • Filter using multiple conditions.

Overview#

Usually, you only need to work with a subset of your data. You may only need certain columns of your dataset. Or, you may wish to sample your data, in which case you will choose a random sample of the rows. The process of choosing a subset of your data is called selecting. Often, you will want to work with rows that meet certain criteria. The process of keeping the rows of interest and discarding the rest is called filtering.

In this chapter, you will learn how to select and filter data using Pandas. Pandas provides many features for selecting and filtering data and there are often multiple ways to do the same thing. We will only scratch the surface of Pandas’ capabilities in this chapter. We will show you common and simple ways to select and filter your data. If you wish to dive deeper and learn more of the available features, consult the pandas documentation on indexing and selecting data.

Throughout this chapter, we will use the graduate admissions data from Kaggle to illustrate selecting and filtering.

Imports and Loading the Data#

import numpy as np
import pandas as pd

The cell below loads the graduate admissions dataset. Each row represents a student’s application to one university. Here is a list of the columns and their meanings:

| Column | Meaning |
| --- | --- |
| Serial No. | A unique identifier for each row |
| GRE Score | GRE score (out of 340) |
| TOEFL Score | TOEFL score (out of 120) |
| University Rating | The quality/prestige of the university to which the student applied (out of 5) |
| SOP | Quality of the statement of purpose (out of 5) |
| LOR | Strength of the letter of recommendation (out of 5) |
| CGPA | Undergraduate GPA (out of 10) |
| Research | 1 if the student has research experience, 0 if not |
| Chance of Admit | Chance of admission, ranging from 0 to 1 |

dfAdmission = pd.read_csv('data/Admission_Predict.csv')
dfAdmission.head()
Serial No. GRE Score TOEFL Score University Rating SOP LOR CGPA Research Chance of Admit
0 1 337 118 4 4.5 4.5 9.65 1 0.92
1 2 324 107 4 4.0 4.5 8.87 1 0.76
2 3 316 104 3 3.0 3.5 8.00 1 0.72
3 4 322 110 3 3.5 2.5 8.67 1 0.80
4 5 314 103 2 2.0 3.0 8.21 0 0.65

Selecting Columns of Your Dataset#

Retrieving a subset of a data frame is like slicing a list. The idea is to get a piece of the original. However, it’s slightly more complicated with data frames because there are two dimensions: rows and columns.

In this section, we will focus on retrieving a subset of the columns of a data frame. In the next section, we will show you how to retrieve a subset of the rows.

Retrieving a subset of the columns#

It is very easy to retrieve a subset of the columns of a data frame. We will show you how with an example. Let’s take the data frame dfAdmission that we loaded above. Say we want the GRE Score and TOEFL Score columns. To extract those columns, and only those two columns, run the code below:

dfAdmission[['GRE Score', 'TOEFL Score']]
GRE Score TOEFL Score
0 337 118
1 324 107
2 316 104
3 322 110
4 314 103
... ... ...
395 324 110
396 325 107
397 330 116
398 312 103
399 333 117

400 rows × 2 columns

Let’s look at the code in the cell above:

dfAdmission[['GRE Score', 'TOEFL Score']]

All we did was to type the name of the data frame, followed by double brackets. Inside the double brackets, we typed the names of the columns, separated by commas.

Why double brackets? Let’s explain that with another example. Look at the code cell below:

columns_we_want = ['GRE Score', 'TOEFL Score']
dfAdmission[columns_we_want]
GRE Score TOEFL Score
0 337 118
1 324 107
2 316 104
3 322 110
4 314 103
... ... ...
395 324 110
396 325 107
397 330 116
398 312 103
399 333 117

400 rows × 2 columns

The code was:

columns_we_want = ['GRE Score', 'TOEFL Score']
dfAdmission[columns_we_want]

The first line of that code creates a list containing the names of the columns we want. The second line of that code was dfAdmission[columns_we_want]. That should look very similar to retrieving an item from a list. We follow our variable, dfAdmission, by square brackets. Inside the square brackets, we put a list of the desired columns. Thus, you can think of the list of column names like a list index.

Column order#

You can put the columns in any order you want. The list of names does not have to be in the same order as the original data frame. Also, you can repeat columns if you want.

dfAdmission[['TOEFL Score', 'TOEFL Score', 'GRE Score']].head()
TOEFL Score TOEFL Score GRE Score
0 118 118 337
1 107 107 324
2 104 104 316
3 110 110 322
4 103 103 314

In the above, notice that we got the ‘TOEFL Score’ column twice. Also notice that, unlike in the original example, we retrieved TOEFL Score before GRE Score.

Return type when retrieving a subset of the columns#

When you use the double-bracket syntax to retrieve a subset of the columns, Pandas returns a new DataFrame. To see this, consider the following code.

type(dfAdmission[['TOEFL Score', 'GRE Score']])
pandas.core.frame.DataFrame

This will always be the case when you use double brackets, even if you retrieve only one column.

type(dfAdmission[['TOEFL Score']])
pandas.core.frame.DataFrame

Also notice that, since the return type is a DataFrame, you can call DataFrame methods directly without creating a new variable. To see this, consider the following two code snippets that produce identical output.

# Print the first 2 rows of a DataFrame with a subset of the columns
dfAdmission[['SOP', 'LOR']].head(2)
SOP LOR
0 4.5 4.5
1 4.0 4.5
dfSubset = dfAdmission[['SOP', 'LOR']]
dfSubset.head(2)
SOP LOR
0 4.5 4.5
1 4.0 4.5

In the first code cell, we retrieved the columns SOP (statement of purpose) and LOR (letters of recommendation). The code dfAdmission[['SOP', 'LOR']] took the DataFrame dfAdmission, extracted two columns, and returned a new DataFrame. We then used the head command to retrieve the first two rows of that DataFrame.

In the second code cell, the code dfSubset = dfAdmission[['SOP', 'LOR']] took the DataFrame dfAdmission, extracted two columns, and returned a new DataFrame. It then saved that new DataFrame into the variable dfSubset. We then used the head command to retrieve the first two rows of dfSubset.

Retrieving a single column as a Series#

When we first introduced Pandas, we described the DataFrame object and the Series object. The former is used to store tables of data (i.e. rows and columns). The latter is used to store a single column of data.

When retrieving a single column of a DataFrame, it is possible to retrieve that column as either a DataFrame or as a Series. We’ll explain why this matters later in this chapter. For now, let’s learn how to extract a single column.

Given the previous section of this chapter, you shouldn’t be surprised that this code will return a DataFrame. As we mentioned before, when you use double brackets, the result will always be a DataFrame.

dfAdmission[['SOP']]

If instead you use single brackets, the return type will be a Series. Let’s see this with some examples:

# Retrieve a single column as a DataFrame
dfAdmission[['SOP']].head(3)
SOP
0 4.5
1 4.0
2 3.0
# Verify that this is a DataFrame
type(dfAdmission[['SOP']])
pandas.core.frame.DataFrame
# Retrieve a single column as a Series
dfAdmission['SOP'].head(3)
0    4.5
1    4.0
2    3.0
Name: SOP, dtype: float64
# Verify that this is a Series
type(dfAdmission['SOP'])
pandas.core.series.Series

In the above examples, notice that a Series looks different from a DataFrame. The DataFrame is formatted nicely, while the Series looks more spartan. Also notice that when you print a Series, you get information about its data type, which in this case was float64, a floating-point type.

Testing the Values in a Series#

In the next section of this chapter, we will show you how to filter your data by retrieving rows that meet some criteria. In order to do that, you need to understand how to work with Pandas Series. We will show you how to do that in this section.

Comparing a Series to a Numeric Constant#

Sometimes, you want to test every value in a column. For example, consider the Research column in the DataFrame dfAdmission. Let’s say we want to subset our data so we only have rows for students with research experience. In other words, we only want rows where the value in the Research column is 1. To filter the data this way, we will need to take the column Research and compare every value to 1. Let’s see what happens when we do that:

# Reminder: the code dfAdmission['Research'] will return a Series
dfAdmission['Research'] == 1
0       True
1       True
2       True
3       True
4      False
       ...  
395     True
396     True
397     True
398    False
399     True
Name: Research, Length: 400, dtype: bool

Whoa! What happened? Pandas took the column Research, a Series, and compared every value in the column to 1. It returned a new Series of Boolean values. The new Series has the exact same length as the original series.

Let’s try this again. This time, let’s compare the GRE score to 300. We would do this if we only want rows where the GRE score is at least 300:

dfAdmission['GRE Score'].head()
0    337
1    324
2    316
3    322
4    314
Name: GRE Score, dtype: int64
dfAdmission['GRE Score'] >= 300
0      True
1      True
2      True
3      True
4      True
       ... 
395    True
396    True
397    True
398    True
399    True
Name: GRE Score, Length: 400, dtype: bool

Again, notice that Pandas took the Series dfAdmission['GRE Score'], compared every value in that Series to 300, and returned a new Series of Booleans. Each row in the new Series corresponds to a row in dfAdmission['GRE Score']. Thus, since the first five values of dfAdmission['GRE Score'] are greater than or equal to 300, the first five values of the returned Series are True.

When testing a Series, it is possible to compare values to numbers, strings, or to other Series! Let’s illustrate these ideas with more sample code:

Comparing a Series to a String#

Since none of the columns in dfAdmission are of type string, let’s create a new Series.

s = pd.Series(['cat', 'dog', 'moose'])
s == 'moose'
0    False
1    False
2     True
dtype: bool

The above Series, s, contains the strings 'cat', 'dog', and 'moose'. When we compared s to 'moose', the returned Series was False, False, True since only the last value in the Series equals 'moose'.

Comparing a Series to Another Series#

dfAdmission['SOP'] == dfAdmission['LOR']
0       True
1      False
2      False
3      False
4      False
       ...  
395     True
396    False
397    False
398    False
399    False
Length: 400, dtype: bool

The above code compares the columns SOP and LOR in dfAdmission. For every row in which the SOP and LOR scores are equal, the new Series will have a True value. If the SOP and LOR scores differ in a given row, the corresponding value in the returned Series will be False. To see this, let’s look at the first two rows of dfAdmission.

dfAdmission[['SOP', 'LOR']].head(2)
SOP LOR
0 4.5 4.5
1 4.0 4.5

Notice that, in the first row, the SOP and LOR values are equal, and therefore the first value in the returned Series was True. In the second row, SOP and LOR are unequal, and hence the second value in the returned Series was False.

Retrieving Rows of a DataFrame via Filtering#

Filtering is the process of extracting only the rows of your dataset that meet certain criteria. Once you understand the ideas in the previous section about working with Series, filtering in Pandas is fairly straightforward. In the remainder of this section, we will show you how to filter Pandas DataFrames.

Filtering with a single condition#

Often, when filtering data, you only need to specify one condition. For example, I might want to filter my admissions data to only show students with research experience. In the remainder of this subsection, we will assume you only have one filtering criterion. In the next subsection, we will show you how to combine multiple criteria in your filter.

Comparing to a constant#

Let’s continue with the example where I only want students with research experience. In the previous section, we showed you how to compare the Research column to a constant, and showed you that the return value was a Series of Booleans. The code we used was:

dfAdmission['Research'] == 1

The above code returned a Series of True/False values. For every row where the value in the Research column was 1, our new Series had a True. For every row where the value in the Research column was 0, our new Series had a False.

To filter a Pandas DataFrame, you simply need to pass this Series of Booleans to your DataFrame! That’s it! Let’s run the following code to see this:

dfAdmission[dfAdmission['Research'] == 1]
Serial No. GRE Score TOEFL Score University Rating SOP LOR CGPA Research Chance of Admit
0 1 337 118 4 4.5 4.5 9.65 1 0.92
1 2 324 107 4 4.0 4.5 8.87 1 0.76
2 3 316 104 3 3.0 3.5 8.00 1 0.72
3 4 322 110 3 3.5 2.5 8.67 1 0.80
5 6 330 115 5 4.5 3.0 9.34 1 0.90
... ... ... ... ... ... ... ... ... ...
394 395 329 111 4 4.5 4.0 9.23 1 0.89
395 396 324 110 3 3.5 3.5 9.04 1 0.82
396 397 325 107 3 3.0 3.5 9.11 1 0.84
397 398 330 116 4 5.0 4.5 9.45 1 0.91
399 400 333 117 4 5.0 4.0 9.66 1 0.95

219 rows × 9 columns

Let’s analyze the code we just wrote:

dfAdmission[dfAdmission['Research'] == 1]

We took the DataFrame dfAdmission and put brackets after it. Within those brackets, we passed it a Series of True/False values. Pandas then returned only those rows for which the Series was True. In other words, it returned only those rows for which the value in the Research column was 1.
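Just as we stored a list of column names in the variable columns_we_want earlier, you can store the Boolean Series in a variable and then pass that variable to the DataFrame. Here is a minimal sketch using a small made-up DataFrame (dfStudents is purely illustrative, not part of the admissions data):

```python
import pandas as pd

# A small made-up DataFrame for illustration
dfStudents = pd.DataFrame({'Name': ['Ana', 'Ben', 'Cal'],
                           'Research': [1, 0, 1]})

# Save the Boolean Series in a variable, then pass it to the DataFrame
has_research = dfStudents['Research'] == 1
dfStudents[has_research]
```

Storing the condition in a variable can make long filters easier to read, and it lets you reuse the same mask for several operations.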

Here’s another example. The University Rating column contains the values 1-5, where 5 indicates the most selective universities. Let’s return only those rows with a University Rating of 4 or 5.

dfAdmission[dfAdmission['University Rating'] > 3]
Serial No. GRE Score TOEFL Score University Rating SOP LOR CGPA Research Chance of Admit
0 1 337 118 4 4.5 4.5 9.65 1 0.92
1 2 324 107 4 4.0 4.5 8.87 1 0.76
5 6 330 115 5 4.5 3.0 9.34 1 0.90
11 12 327 111 4 4.0 4.5 9.00 1 0.84
12 13 328 112 4 4.0 4.5 9.10 1 0.78
... ... ... ... ... ... ... ... ... ...
385 386 335 117 5 5.0 5.0 9.82 1 0.96
392 393 326 112 4 4.0 3.5 9.12 1 0.84
394 395 329 111 4 4.5 4.0 9.23 1 0.89
397 398 330 116 4 5.0 4.5 9.45 1 0.91
399 400 333 117 4 5.0 4.0 9.66 1 0.95

134 rows × 9 columns

Notice we used the condition > 3. We could have also used >= 4. In the next subsection, we’ll show you yet another option: checking whether the value is in a list such as [4, 5]. There are many ways to do the same thing.
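To see that the two conditions are interchangeable here, we can compare them on a small made-up DataFrame (dfRatings is hypothetical) whose column holds only whole numbers:

```python
import pandas as pd

# Hypothetical ratings column containing only the whole numbers 1-5
dfRatings = pd.DataFrame({'University Rating': [1, 4, 3, 5, 2]})

# For whole-number ratings, > 3 and >= 4 keep exactly the same rows
over3 = dfRatings[dfRatings['University Rating'] > 3]
atLeast4 = dfRatings[dfRatings['University Rating'] >= 4]
over3.equals(atLeast4)
```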

Checking whether values are in a list using .isin()#

The DataFrame dfAdmission has numerical data. Sometimes, when you have categorical data, you will want to filter out certain categories. An easy way to do this is with the Pandas Series method isin. That method allows you to check whether the values in a column are in a list of values that you specify.

Consider the following DataFrame:

dfAnimals = pd.DataFrame({'Type': ['cat', 'dog', 'hamster', 'hippo', 'giraffe', 'cockroach']})
dfAnimals
Type
0 cat
1 dog
2 hamster
3 hippo
4 giraffe
5 cockroach

Let’s say I only want rows containing animals that are likely to be kept as pets. I could write the following code:

dfAnimals[dfAnimals['Type'].isin(['hamster', 'dog', 'cat'])]
Type
0 cat
1 dog
2 hamster

The isin method of Series returns a new Series of Booleans that is True wherever the value in the original Series matches any element of the list. To see this, run the following code to generate the Boolean Series that we used as our filter.

dfAnimals['Type'].isin(['hamster', 'dog', 'cat'])
0     True
1     True
2     True
3    False
4    False
5    False
Name: Type, dtype: bool

Comparing values from multiple columns#

Sometimes you want to compare values in multiple columns. For example, say we only want rows from our Admissions dataset where the value of LOR exceeds the value of SOP. We could run the following code:

dfAdmission[dfAdmission['LOR'] > dfAdmission['SOP']]
Serial No. GRE Score TOEFL Score University Rating SOP LOR CGPA Research Chance of Admit
1 2 324 107 4 4.0 4.5 8.87 1 0.76
2 3 316 104 3 3.0 3.5 8.00 1 0.72
4 5 314 103 2 2.0 3.0 8.21 0 0.65
6 7 321 109 3 3.0 4.0 8.20 1 0.75
7 8 308 101 2 3.0 4.0 7.90 0 0.68
... ... ... ... ... ... ... ... ... ...
389 390 320 108 3 3.5 4.0 8.44 1 0.76
390 391 314 102 2 2.0 2.5 8.24 0 0.64
391 392 318 106 3 2.0 3.0 8.65 0 0.71
396 397 325 107 3 3.0 3.5 9.11 1 0.84
398 399 312 103 3 3.5 4.0 8.78 0 0.67

163 rows × 9 columns

Working with missing values#

Often, you will have dirty data that contains missing values. There are many ways to deal with missing values, but before you can deal with them, you need to identify them. Pandas provides some very nice functions to help you do this. To illustrate those functions, let’s create a simple DataFrame that contains missing values:

dfMissing = pd.DataFrame(
    {'x': [np.nan,'B','C','D','E'], 
     'y': [1,np.nan,3,4,5], 
     'z': [92, -17, np.nan, np.nan,0]})
dfMissing
x y z
0 NaN 1.0 92.0
1 B NaN -17.0
2 C 3.0 NaN
3 D 4.0 NaN
4 E 5.0 0.0

First, notice that missing values appear as NaN, which means “not a number”. We can test a column for missing values by using the Pandas function pandas.isna. This function returns True when it sees a NaN value and returns False otherwise.

The following code tests every value in the x column of dfMissing and returns True when it finds a missing value.

pd.isna(dfMissing['x'])
0     True
1    False
2    False
3    False
4    False
Name: x, dtype: bool
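Note that isna is also available as a method on a Series, so dfMissing['x'].isna() gives the same result as pd.isna(dfMissing['x']). A quick sketch (recreating dfMissing so the snippet stands alone):

```python
import numpy as np
import pandas as pd

dfMissing = pd.DataFrame(
    {'x': [np.nan, 'B', 'C', 'D', 'E'],
     'y': [1, np.nan, 3, 4, 5],
     'z': [92, -17, np.nan, np.nan, 0]})

# The method form produces the same Boolean Series as pd.isna
dfMissing['x'].isna()
```

Which form you use is a matter of taste; this chapter sticks with pd.isna, but you will see both styles in practice.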

We can therefore use the isna function to return rows where a given column has a missing value.

dfMissing[pd.isna(dfMissing['x'])]
x y z
0 NaN 1.0 92.0

Usually, you will either want to know how many values are missing, or you will want to drop rows with missing values from your dataset. Let’s see how to do that.

Count the number of missing values in a column#

Remember when we showed you how to perform arithmetic with different data types? One thing we showed you was that when you add Boolean values, True is converted to 1 and False to 0. We can use this to help us count the number of missing values in a column. We’ll simply take the Series of True/False values and sum it. Like this:

pd.isna(dfMissing['z'])
0    False
1    False
2     True
3     True
4    False
Name: z, dtype: bool
pd.isna(dfMissing['z']).sum()
np.int64(2)

Notice that, since there were two missing values, there were two True values in the Boolean series, so when we summed it up, we got 2.

It turns out that the pd.isna function works on DataFrames as well as Series, so if we want to count the number of missing values in each column of the DataFrame, we can just type:

pd.isna(dfMissing).sum()
x    1
y    1
z    2
dtype: int64

The above code tells us that column x has one missing value, column y has one missing value, and column z has two missing values.
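If you want a single grand total across the entire DataFrame rather than per-column counts, you can sum the per-column counts again. A minimal sketch (recreating dfMissing so the snippet stands alone):

```python
import numpy as np
import pandas as pd

dfMissing = pd.DataFrame(
    {'x': [np.nan, 'B', 'C', 'D', 'E'],
     'y': [1, np.nan, 3, 4, 5],
     'z': [92, -17, np.nan, np.nan, 0]})

# The first .sum() counts missing values per column;
# the second .sum() adds those per-column counts together
total_missing = pd.isna(dfMissing).sum().sum()
total_missing
```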

Dropping missing values from a DataFrame#

More often than not, you will want to drop, not keep, missing values. There are two ways to do this. You can use the Pandas function notna, which is the opposite of isna. Alternatively, you can use the DataFrame method dropna. Let’s illustrate both.

To drop all rows in which a value is missing from column y, you could type:

dfMissing[pd.notna(dfMissing['y'])]
x y z
0 NaN 1.0 92.0
2 C 3.0 NaN
3 D 4.0 NaN
4 E 5.0 0.0

Notice that the code above dropped the second row (with index 1) because it had a missing value in the y column.

An alternative is to use the dropna method, which is more powerful and has a cleaner syntax. By default, dropna will drop every row that has a missing value in any column. For example:

dfMissing.dropna()
x y z
4 E 5.0 0.0

Notice that dropna only returned the last row because that was the only row with no missing values.

Sometimes, you only want to drop rows that have missing values in some columns. To do this, pass the subset argument to dropna. For example, say we want to drop rows with missing values in either column x or column y, but we don’t care if there are missing values in column z. You could do that as follows:

dfMissing.dropna(subset=['x', 'y'])
x y z
2 C 3.0 NaN
3 D 4.0 NaN
4 E 5.0 0.0

The above code dropped row 0 because it had a missing value in column x. It also dropped row 1 because it had a missing value in column y.
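dropna also accepts a thresh argument, which keeps only the rows that have at least that many non-missing values. A minimal sketch with a small made-up DataFrame (dfSparse is hypothetical):

```python
import numpy as np
import pandas as pd

# Rows 0 and 2 each have only one non-missing value; row 1 has three
dfSparse = pd.DataFrame(
    {'x': [np.nan, 'B', np.nan],
     'y': [np.nan, 2, 3],
     'z': [1, 2, np.nan]})

# Keep only the rows with at least 2 non-missing values
dfSparse.dropna(thresh=2)
```

This is useful when a row with one or two gaps is still worth keeping, but a mostly-empty row is not.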

Filtering with multiple conditions#

If you are comfortable filtering with one condition, then it is fairly easy to filter with multiple conditions. The only new things you need to know are that you need to:

  • Use the logical operators &, |, and ~ to combine the conditions.

  • Enclose each condition in parentheses.

We’ll elaborate on each of these now.

Logical operators#

Remember when we taught you basic Python and you learned the operators and, or, and not? The same ideas apply, but when working with Series, you need to use symbols for these operators. The following table shows you the symbols:

| Logical Operator | Symbol |
| --- | --- |
| and | & |
| or | \| |
| not | ~ |

In other words, when you want to join two conditions using a logical and, you need to use the & operator. When you want to join two conditions using a logical or, you need to use the | operator.

Joining conditions#

When you have multiple conditions, you will need to enclose each condition in parentheses. The reason for this is order of operations: the & and | operators have higher precedence than comparison operators like == and >=. Without the parentheses, Pandas performs the operations in the wrong order, and you will either get an error or incorrect results.

Let’s work through some examples. First, let’s say we want all rows from dfAdmission where the student has research experience and is applying to a top school. We could write:

dfAdmission[(dfAdmission['Research'] == 1) & (dfAdmission['University Rating'] == 5)].head(3)
Serial No. GRE Score TOEFL Score University Rating SOP LOR CGPA Research Chance of Admit
5 6 330 115 5 4.5 3.0 9.34 1 0.90
22 23 328 116 5 5.0 5.0 9.50 1 0.94
23 24 334 119 5 5.0 4.5 9.70 1 0.95

The above code filtered first on dfAdmission['Research'] == 1. It also filtered on dfAdmission['University Rating'] == 5. Since we wanted both conditions to be true, we joined the conditions with the and operator, &. Notice that both conditions are in parentheses.

Now say we want students who either have a TOEFL score of at least 119, or who have a letter of recommendation score of 5 (the highest). We would type:

dfAdmission[(dfAdmission['TOEFL Score'] >= 119) | (dfAdmission['LOR'] == 5.0)].head(3)
Serial No. GRE Score TOEFL Score University Rating SOP LOR CGPA Research Chance of Admit
22 23 328 116 5 5.0 5.0 9.5 1 0.94
23 24 334 119 5 5.0 4.5 9.7 1 0.95
24 25 336 119 5 4.0 3.5 9.8 1 0.97

The above code filtered first on dfAdmission['TOEFL Score'] >= 119. It also filtered on dfAdmission['LOR'] == 5.0. Since we wanted students for which either condition was true (or both were true), we joined the conditions with the or operator, |. Notice that both conditions are in parentheses.
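Finally, the table of logical operators also listed ~ for logical not, which we haven’t demonstrated yet. Placed in front of a parenthesized condition, ~ flips every Boolean in the Series. A minimal sketch with a small made-up DataFrame (dfStudents is purely illustrative):

```python
import pandas as pd

dfStudents = pd.DataFrame({'Name': ['Ana', 'Ben', 'Cal'],
                           'Research': [1, 0, 1]})

# ~ negates the condition: keep rows where Research is NOT 1
dfStudents[~(dfStudents['Research'] == 1)]
```

Here ~(dfStudents['Research'] == 1) happens to be equivalent to dfStudents['Research'] == 0, but ~ becomes essential for negating conditions that have no simple opposite, such as isin.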