Sorting Data#

Learning Objectives#

  • Define sorting.
  • Explain the difference between ascending and descending order.
  • Perform a simple sort by hand in either order.
  • Explain the concept of a multi-level sort.
  • Perform a multi-level sort by hand using an ascending or descending order for any column.
  • Perform sorts in place in Pandas.
  • Predict when the Pandas sort_values method will change the original data and when it will return a copy.
  • Reset the index of a Pandas DataFrame.

Overview#

In this chapter, you will learn about sorting data, and how to do it in Pandas. You probably think that sorting is trivial, and it generally is. However, some people have never been exposed to a multi-level sort, and we want to ensure you know about that.

To sort means to arrange items in some order. This concept is easy for most people to grasp because they are used to sorting words in alphabetical order, and to sorting numbers by magnitude. Sometimes, you will need to first sort by one column, then by another. For example, imagine a dataset of property sales in New York City. You might want to sort by ZIP code and then by sales amount. In this chapter you will see what that looks like. We will also introduce the terms ascending and descending to describe sort order.

Imports and Loading the Data#

import numpy as np
import pandas as pd

In this chapter, we will work with a toy dataset. That will allow you to see the entire dataset as you work.

The following cell creates a small DataFrame called dfPets. Imagine an animal lover who lives in the country and who has many pets. This dataset contains a list of pets that such a person might own.

Run the cell below to create the DataFrame.

dfPets = pd.DataFrame({'Name': ['Duke', 'Killer', 'Lucy', 'Oliver', 'Milo', 'Barbie', 'Nero', 'Donkey Hoté', 'Dave', 'Hazel', 'Azure', 'Wilbur', 'Stella', 'Skunky', 'Bruce', 'Crystal'],
                       'Type': ['Dog', 'Dog', 'Dog', 'Cat', 'Cat', 'Horse', 'Horse', 'Donkey', 'Donkey', 'Donkey', 'Parakeet', 'Pig', 'Pig', 'Skunk', 'Fish', 'Fish'],
                       'Weight (kg)': [35, 25, 5, 2, 10, 500, 420, 225, 210, 240, 0.036, 45, 32, 1, 0.105, 0.1],
                       'Indoor Pet': [False, False, True, True, False, False, False, False, False, False, True, False, False, False, True, True]
                       })
dfPets
Name Type Weight (kg) Indoor Pet
0 Duke Dog 35.000 False
1 Killer Dog 25.000 False
2 Lucy Dog 5.000 True
3 Oliver Cat 2.000 True
4 Milo Cat 10.000 False
5 Barbie Horse 500.000 False
6 Nero Horse 420.000 False
7 Donkey Hoté Donkey 225.000 False
8 Dave Donkey 210.000 False
9 Hazel Donkey 240.000 False
10 Azure Parakeet 0.036 True
11 Wilbur Pig 45.000 False
12 Stella Pig 32.000 False
13 Skunky Skunk 1.000 False
14 Bruce Fish 0.105 True
15 Crystal Fish 0.100 True

A Simple Sort#

Pandas has a function sort_values. The reason it is sort_values and not just sort is that there’s also a Pandas function called sort_index which, as you might guess, sorts by the index. We’re not going to use sort_index in this class so let’s focus on sort_values.

Say you want to sort the pets by weight. You could run the following code:

dfPets.sort_values('Weight (kg)')
Name Type Weight (kg) Indoor Pet
10 Azure Parakeet 0.036 True
15 Crystal Fish 0.100 True
14 Bruce Fish 0.105 True
13 Skunky Skunk 1.000 False
3 Oliver Cat 2.000 True
2 Lucy Dog 5.000 True
4 Milo Cat 10.000 False
1 Killer Dog 25.000 False
12 Stella Pig 32.000 False
0 Duke Dog 35.000 False
11 Wilbur Pig 45.000 False
8 Dave Donkey 210.000 False
7 Donkey Hoté Donkey 225.000 False
9 Hazel Donkey 240.000 False
6 Nero Horse 420.000 False
5 Barbie Horse 500.000 False

Notice a few things about the code above:

  • You passed one argument to the sort_values method of the DataFrame object, a string containing the column name.

  • The data was sorted from smallest to largest weight.

  • Pandas returned a copy of the original data. If you check the original DataFrame, you will see that its order has not been changed.

  • Pandas reorganized the rows. It found the lowest weight, 0.036 kg. That belongs to the bird named Azure. Pandas moved that entire row to the top. It then found the fish Crystal has the next highest weight and moved her entire row to the second spot in the dataset. And so on.

Ascending vs. Descending Order#

Sometimes, you wish to sort in reverse order, meaning from largest to smallest. This is called “descending order”. When you sort from smallest to largest, like we did above, that’s called “ascending order”.

Let’s resort the pets data in descending order. Look at the following code cell:

dfPets.sort_values('Weight (kg)', ascending=False)
Name Type Weight (kg) Indoor Pet
5 Barbie Horse 500.000 False
6 Nero Horse 420.000 False
9 Hazel Donkey 240.000 False
7 Donkey Hoté Donkey 225.000 False
8 Dave Donkey 210.000 False
11 Wilbur Pig 45.000 False
0 Duke Dog 35.000 False
12 Stella Pig 32.000 False
1 Killer Dog 25.000 False
4 Milo Cat 10.000 False
2 Lucy Dog 5.000 True
3 Oliver Cat 2.000 True
13 Skunky Skunk 1.000 False
14 Bruce Fish 0.105 True
15 Crystal Fish 0.100 True
10 Azure Parakeet 0.036 True

By passing the optional argument ascending and setting it to False, we told Pandas to sort in descending order. If you don’t specify that argument, Pandas assumes you want ascending order.

Changing the original data#

As we mentioned above, by default the sort_values method does not change the original data. However, you can tell it to change the original data. Let’s see how. First, we’ll make a copy of the original data so we can work with the copy. Then we’ll reorder the copy. Take a look at the code cell below:

dfPets_copy = dfPets.copy()
dfPets_copy.sort_values('Name', inplace=True)

Well that was anticlimactic. It looks like nothing happened! Don’t worry, Pandas performed the sort. The inplace=True argument told Pandas to perform the sort “in place”, which means that Pandas altered the DataFrame dfPets_copy. I don’t know why sort_values works this way, but if you call it with inplace=True, it sorts the data and returns None. When you call it without the inplace=True argument, it returns a sorted copy of the data.

To show you we know what we’re talking about, let’s first check the original DataFrame and ensure it hasn’t been sorted:

dfPets.head()
Name Type Weight (kg) Indoor Pet
0 Duke Dog 35.0 False
1 Killer Dog 25.0 False
2 Lucy Dog 5.0 True
3 Oliver Cat 2.0 True
4 Milo Cat 10.0 False

Now let’s check dfPets_copy and see if it was sorted:

dfPets_copy
Name Type Weight (kg) Indoor Pet
10 Azure Parakeet 0.036 True
5 Barbie Horse 500.000 False
14 Bruce Fish 0.105 True
15 Crystal Fish 0.100 True
8 Dave Donkey 210.000 False
7 Donkey Hoté Donkey 225.000 False
0 Duke Dog 35.000 False
9 Hazel Donkey 240.000 False
1 Killer Dog 25.000 False
2 Lucy Dog 5.000 True
4 Milo Cat 10.000 False
6 Nero Horse 420.000 False
3 Oliver Cat 2.000 True
13 Skunky Skunk 1.000 False
12 Stella Pig 32.000 False
11 Wilbur Pig 45.000 False

Notice that dfPets_copy was sorted in ascending order by name, just as we expected.

Resetting the index#

Look at dfPets_copy above. Check out the index. It’s messed up! Originally, the index went from 0 to 15. Now it’s in a seemingly random order. Sometimes, you won’t care, but sometimes, you’ll want a nice, clean index.

Pandas allows you to reset the index. Check out the code cell below.

dfPets_copy.reset_index(drop=True, inplace=True)
dfPets_copy.head()
Name Type Weight (kg) Indoor Pet
0 Azure Parakeet 0.036 True
1 Barbie Horse 500.000 False
2 Bruce Fish 0.105 True
3 Crystal Fish 0.100 True
4 Dave Donkey 210.000 False

The above code deleted the index and renumbered the rows, beginning with zero. The argument drop=True told Pandas to delete the original index. The argument inplace=True told Pandas to alter the DataFrame. If we hadn’t used inplace=True, the method reset_index would have returned a copy of the original DataFrame.

FYI, we could have also written:

dfPets_copy = dfPets_copy.reset_index(drop=True)

This code would have had the same effect as the code above.

Multi-Level Sorting#

Often, you will want to sort by more than one column. For example, say I want to sort by animal type and within each animal type, I want to sort by weight. To perform a multi-level sort, pass the sort_values method a list of column names.

Let’s perform a multi-level sort by type and weight and then examine the results:

dfPets.sort_values(['Type', 'Weight (kg)'])
Name Type Weight (kg) Indoor Pet
3 Oliver Cat 2.000 True
4 Milo Cat 10.000 False
2 Lucy Dog 5.000 True
1 Killer Dog 25.000 False
0 Duke Dog 35.000 False
8 Dave Donkey 210.000 False
7 Donkey Hoté Donkey 225.000 False
9 Hazel Donkey 240.000 False
15 Crystal Fish 0.100 True
14 Bruce Fish 0.105 True
6 Nero Horse 420.000 False
5 Barbie Horse 500.000 False
10 Azure Parakeet 0.036 True
12 Stella Pig 32.000 False
11 Wilbur Pig 45.000 False
13 Skunky Skunk 1.000 False

Look at the Type column and notice that it is sorted alphabetically. Now look within any type. You will see that the rows within that type are sorted by weight. For example, look at the dogs. There are 3 dogs and they are sorted by weight. Lucy comes first at 5 kg, then Killer at 25 kg, and then Duke at 35 kg.

When you perform a multi-level sort, Pandas first sorts the data by the first variable. Then, for each level of the first variable, it sorts by the second variable. For example, it takes the cats (2 rows) and sorts the cats by weight. Then it takes the dogs (3 rows) and sorts them by weight. And so on. You can sort by as many levels as you want.

Specifying Ascending and Descending Order in a Multi-Level Sort#

Say you want to sort by indoor/outdoor, then type, then name. And say you want to sort:

  • Indoor Pet in descending order

  • Type in descending order

  • Name in ascending order.

To do that, you would run the following code:

dfPets.sort_values(['Indoor Pet', 'Type', 'Name'], 
                   ascending=[False,False,True])
Name Type Weight (kg) Indoor Pet
10 Azure Parakeet 0.036 True
14 Bruce Fish 0.105 True
15 Crystal Fish 0.100 True
2 Lucy Dog 5.000 True
3 Oliver Cat 2.000 True
13 Skunky Skunk 1.000 False
12 Stella Pig 32.000 False
11 Wilbur Pig 45.000 False
5 Barbie Horse 500.000 False
6 Nero Horse 420.000 False
8 Dave Donkey 210.000 False
7 Donkey Hoté Donkey 225.000 False
9 Hazel Donkey 240.000 False
0 Duke Dog 35.000 False
1 Killer Dog 25.000 False
4 Milo Cat 10.000 False

First, let’s look at the data. The column Indoor Pet is sorted in descending order (remember that Boolean True is converted to 1 and False to 0). Within each level of Indoor Pet, the Type is sorted in descending order. Thus, for the indoor pets, we have parakeet, fish, dog, and cat. For each type of pet, the pets are sorted in ascending order by name. Thus, for the indoor fish, we get Bruce and Crystal. For the outdoor donkeys, we get Dave, Hazel, and My Ass.

Next, let’s look at the arguments we passed to sort_values. The first argument is a list of column names. The second argument, ascending, gets a list of Booleans. Each corresponds to a column name and tells Pandas whether we want to sort that column in ascending (True) or descending (False) order.