Sorting Data#
Learning Objectives#
- Define sorting.
- Explain the difference between ascending and descending order.
- Perform a simple sort by hand in either order.
- Explain the concept of a multi-level sort.
- Perform a multi-level sort by hand using an ascending or descending order for any column.
- Perform sorts in place in Pandas.
- Predict when the Pandas sort_values method will change the original data and when it will return a copy.
- Reset the index of a Pandas DataFrame.
Overview#
In this chapter, you will learn about sorting data, and how to do it in Pandas. You probably think that sorting is trivial, and it generally is. However, some people have never been exposed to a multi-level sort, and we want to ensure you know about that.
To sort means to arrange items in some order. This concept is easy for most people to grasp because they are used to sorting words in alphabetical order, and to sorting numbers by magnitude. Sometimes, you will need to first sort by one column, then by another. For example, imagine a dataset of property sales in New York City. You might want to sort by ZIP code and then by sales amount. In this chapter you will see what that looks like. We will also introduce the terms ascending and descending to describe sort order.
Imports and Loading the Data#
import numpy as np
import pandas as pd
In this chapter, we will work with a toy dataset. That will allow you to see the entire dataset as you work.
The following cell creates a small DataFrame called dfPets
. Imagine an animal lover who lives in the country and who has many pets. This dataset contains a list of pets that such a person might own.
Run the cell below to create the DataFrame.
dfPets = pd.DataFrame({'Name': ['Duke', 'Killer', 'Lucy', 'Oliver', 'Milo', 'Barbie', 'Nero', 'Donkey Hoté', 'Dave', 'Hazel', 'Azure', 'Wilbur', 'Stella', 'Skunky', 'Bruce', 'Crystal'],
'Type': ['Dog', 'Dog', 'Dog', 'Cat', 'Cat', 'Horse', 'Horse', 'Donkey', 'Donkey', 'Donkey', 'Parakeet', 'Pig', 'Pig', 'Skunk', 'Fish', 'Fish'],
'Weight (kg)': [35, 25, 5, 2, 10, 500, 420, 225, 210, 240, 0.036, 45, 32, 1, 0.105, 0.1],
'Indoor Pet': [False, False, True, True, False, False, False, False, False, False, True, False, False, False, True, True]
})
dfPets
Name | Type | Weight (kg) | Indoor Pet | |
---|---|---|---|---|
0 | Duke | Dog | 35.000 | False |
1 | Killer | Dog | 25.000 | False |
2 | Lucy | Dog | 5.000 | True |
3 | Oliver | Cat | 2.000 | True |
4 | Milo | Cat | 10.000 | False |
5 | Barbie | Horse | 500.000 | False |
6 | Nero | Horse | 420.000 | False |
7 | Donkey Hoté | Donkey | 225.000 | False |
8 | Dave | Donkey | 210.000 | False |
9 | Hazel | Donkey | 240.000 | False |
10 | Azure | Parakeet | 0.036 | True |
11 | Wilbur | Pig | 45.000 | False |
12 | Stella | Pig | 32.000 | False |
13 | Skunky | Skunk | 1.000 | False |
14 | Bruce | Fish | 0.105 | True |
15 | Crystal | Fish | 0.100 | True |
A Simple Sort#
Pandas has a function sort_values
. The reason it is sort_values
and not just sort is that there’s also a Pandas function called sort_index
which, as you might guess, sorts by the index. We’re not going to use sort_index
in this class so let’s focus on sort_values
.
Say you want to sort the pets by weight. You could run the following code:
dfPets.sort_values('Weight (kg)')
Name | Type | Weight (kg) | Indoor Pet | |
---|---|---|---|---|
10 | Azure | Parakeet | 0.036 | True |
15 | Crystal | Fish | 0.100 | True |
14 | Bruce | Fish | 0.105 | True |
13 | Skunky | Skunk | 1.000 | False |
3 | Oliver | Cat | 2.000 | True |
2 | Lucy | Dog | 5.000 | True |
4 | Milo | Cat | 10.000 | False |
1 | Killer | Dog | 25.000 | False |
12 | Stella | Pig | 32.000 | False |
0 | Duke | Dog | 35.000 | False |
11 | Wilbur | Pig | 45.000 | False |
8 | Dave | Donkey | 210.000 | False |
7 | Donkey Hoté | Donkey | 225.000 | False |
9 | Hazel | Donkey | 240.000 | False |
6 | Nero | Horse | 420.000 | False |
5 | Barbie | Horse | 500.000 | False |
Notice a few things about the code above:
You passed one argument to the
sort_values
method of the DataFrame object, a string containing the column name.The data was sorted from smallest to largest weight.
Pandas returned a copy of the original data. If you check the original DataFrame, you will see that its order has not been changed.
Pandas reorganized the rows. It found the lowest weight, 0.036 kg. That belongs to the bird named Azure. Pandas moved that entire row to the top. It then found the fish Crystal has the next highest weight and moved her entire row to the second spot in the dataset. And so on.
Ascending vs. Descending Order#
Sometimes, you wish to sort in reverse order, meaning from largest to smallest. This is called “descending order”. When you sort from smallest to largest, like we did above, that’s called “ascending order”.
Let’s resort the pets data in descending order. Look at the following code cell:
dfPets.sort_values('Weight (kg)', ascending=False)
Name | Type | Weight (kg) | Indoor Pet | |
---|---|---|---|---|
5 | Barbie | Horse | 500.000 | False |
6 | Nero | Horse | 420.000 | False |
9 | Hazel | Donkey | 240.000 | False |
7 | Donkey Hoté | Donkey | 225.000 | False |
8 | Dave | Donkey | 210.000 | False |
11 | Wilbur | Pig | 45.000 | False |
0 | Duke | Dog | 35.000 | False |
12 | Stella | Pig | 32.000 | False |
1 | Killer | Dog | 25.000 | False |
4 | Milo | Cat | 10.000 | False |
2 | Lucy | Dog | 5.000 | True |
3 | Oliver | Cat | 2.000 | True |
13 | Skunky | Skunk | 1.000 | False |
14 | Bruce | Fish | 0.105 | True |
15 | Crystal | Fish | 0.100 | True |
10 | Azure | Parakeet | 0.036 | True |
By passing the optional argument ascending
and setting it to False, we told Pandas to sort in descending order. If you don’t specify that argument, Pandas assumes you want ascending order.
Changing the original data#
As we mentioned above, by default the sort_values
method does not change the original data. However, you can tell it to change the original data. Let’s see how. First, we’ll make a copy of the original data so we can work with the copy. Then we’ll reorder the copy. Take a look at the code cell below:
dfPets_copy = dfPets.copy()
dfPets_copy.sort_values('Name', inplace=True)
Well that was anticlimactic. It looks like nothing happened! Don’t worry, Pandas performed the sort. The inplace=True
argument told Pandas to perform the sort “in place”, which means that Pandas altered the DataFrame dfPets_copy
. I don’t know why sort_values
works this way, but if you call it with inplace=True
, it sorts the data and returns None
. When you call it without the inplace=True
argument, it returns a sorted copy of the data.
To show you we know what we’re talking about, let’s first check the original DataFrame and ensure it hasn’t been sorted:
dfPets.head()
Name | Type | Weight (kg) | Indoor Pet | |
---|---|---|---|---|
0 | Duke | Dog | 35.0 | False |
1 | Killer | Dog | 25.0 | False |
2 | Lucy | Dog | 5.0 | True |
3 | Oliver | Cat | 2.0 | True |
4 | Milo | Cat | 10.0 | False |
Now let’s check dfPets_copy
and see if it was sorted:
dfPets_copy
Name | Type | Weight (kg) | Indoor Pet | |
---|---|---|---|---|
10 | Azure | Parakeet | 0.036 | True |
5 | Barbie | Horse | 500.000 | False |
14 | Bruce | Fish | 0.105 | True |
15 | Crystal | Fish | 0.100 | True |
8 | Dave | Donkey | 210.000 | False |
7 | Donkey Hoté | Donkey | 225.000 | False |
0 | Duke | Dog | 35.000 | False |
9 | Hazel | Donkey | 240.000 | False |
1 | Killer | Dog | 25.000 | False |
2 | Lucy | Dog | 5.000 | True |
4 | Milo | Cat | 10.000 | False |
6 | Nero | Horse | 420.000 | False |
3 | Oliver | Cat | 2.000 | True |
13 | Skunky | Skunk | 1.000 | False |
12 | Stella | Pig | 32.000 | False |
11 | Wilbur | Pig | 45.000 | False |
Notice that dfPets_copy
was sorted in ascending order by name, just as we expected.
Resetting the index#
Look at dfPets_copy
above. Check out the index. It’s messed up! Originally, the index went from 0 to 15. Now it’s in a seemingly random order. Sometimes, you won’t care, but sometimes, you’ll want a nice, clean index.
Pandas allows you to reset the index. Check out the code cell below.
dfPets_copy.reset_index(drop=True, inplace=True)
dfPets_copy.head()
Name | Type | Weight (kg) | Indoor Pet | |
---|---|---|---|---|
0 | Azure | Parakeet | 0.036 | True |
1 | Barbie | Horse | 500.000 | False |
2 | Bruce | Fish | 0.105 | True |
3 | Crystal | Fish | 0.100 | True |
4 | Dave | Donkey | 210.000 | False |
The above code deleted the index and renumbered the rows, beginning with zero. The argument drop=True
told Pandas to delete the original index. The argument inplace=True
told Pandas to alter the DataFrame. If we hadn’t used inplace=True
, the method reset_index
would have returned a copy of the original DataFrame.
FYI, we could have also written:
dfPets_copy = dfPets_copy.reset_index(drop=True)
This code would have had the same effect as the code above.
Multi-Level Sorting#
Often, you will want to sort by more than one column. For example, say I want to sort by animal type and within each animal type, I want to sort by weight. To perform a multi-level sort, pass the sort_values
method a list of column names.
Let’s perform a multi-level sort by type and weight and then examine the results:
dfPets.sort_values(['Type', 'Weight (kg)'])
Name | Type | Weight (kg) | Indoor Pet | |
---|---|---|---|---|
3 | Oliver | Cat | 2.000 | True |
4 | Milo | Cat | 10.000 | False |
2 | Lucy | Dog | 5.000 | True |
1 | Killer | Dog | 25.000 | False |
0 | Duke | Dog | 35.000 | False |
8 | Dave | Donkey | 210.000 | False |
7 | Donkey Hoté | Donkey | 225.000 | False |
9 | Hazel | Donkey | 240.000 | False |
15 | Crystal | Fish | 0.100 | True |
14 | Bruce | Fish | 0.105 | True |
6 | Nero | Horse | 420.000 | False |
5 | Barbie | Horse | 500.000 | False |
10 | Azure | Parakeet | 0.036 | True |
12 | Stella | Pig | 32.000 | False |
11 | Wilbur | Pig | 45.000 | False |
13 | Skunky | Skunk | 1.000 | False |
Look at the Type column and notice that it is sorted alphabetically. Now look within any type. You will see that the rows within that type are sorted by weight. For example, look at the dogs. There are 3 dogs and they are sorted by weight. Lucy comes first at 5 kg, then Killer at 25 kg, and then Duke at 35 kg.
When you perform a multi-level sort, Pandas first sorts the data by the first variable. Then, for each level of the first variable, it sorts by the second variable. For example, it takes the cats (2 rows) and sorts the cats by weight. Then it takes the dogs (3 rows) and sorts them by weight. And so on. You can sort by as many levels as you want.
Specifying Ascending and Descending Order in a Multi-Level Sort#
Say you want to sort by indoor/outdoor, then type, then name. And say you want to sort:
Indoor Pet in descending order
Type in descending order
Name in ascending order.
To do that, you would run the following code:
dfPets.sort_values(['Indoor Pet', 'Type', 'Name'],
ascending=[False,False,True])
Name | Type | Weight (kg) | Indoor Pet | |
---|---|---|---|---|
10 | Azure | Parakeet | 0.036 | True |
14 | Bruce | Fish | 0.105 | True |
15 | Crystal | Fish | 0.100 | True |
2 | Lucy | Dog | 5.000 | True |
3 | Oliver | Cat | 2.000 | True |
13 | Skunky | Skunk | 1.000 | False |
12 | Stella | Pig | 32.000 | False |
11 | Wilbur | Pig | 45.000 | False |
5 | Barbie | Horse | 500.000 | False |
6 | Nero | Horse | 420.000 | False |
8 | Dave | Donkey | 210.000 | False |
7 | Donkey Hoté | Donkey | 225.000 | False |
9 | Hazel | Donkey | 240.000 | False |
0 | Duke | Dog | 35.000 | False |
1 | Killer | Dog | 25.000 | False |
4 | Milo | Cat | 10.000 | False |
First, let’s look at the data. The column Indoor Pet is sorted in descending order (remember that Boolean True is converted to 1 and False to 0). Within each level of Indoor Pet, the Type is sorted in descending order. Thus, for the indoor pets, we have parakeet, fish, dog, and cat. For each type of pet, the pets are sorted in ascending order by name. Thus, for the indoor fish, we get Bruce and Crystal. For the outdoor donkeys, we get Dave, Hazel, and My Ass.
Next, let’s look at the arguments we passed to sort_values
. The first argument is a list of column names. The second argument, ascending
, gets a list of Booleans. Each corresponds to a column name and tells Pandas whether we want to sort that column in ascending (True) or descending (False) order.