Modifying Data#

Learning Objectives#

  • Drop columns from a Pandas DataFrame.
  • Rename columns in a Pandas DataFrame.
  • Append rows to a Pandas DataFrame.
  • Concatenate Pandas DataFrames by stacking them.

Overview of Modifying Data#

There are many ways to modify data in Pandas. In this chapter, we will barely scratch the surface. We just want to show you some common modifications that you are likely to use. Specifically, we will show you how to:

  • Drop columns from a DataFrame

  • Rename columns in a DataFrame

  • Append rows to a DataFrame

  • Concatenate multiple DataFrames

We will not show you how to modify values in cells based on a condition. To do that, you need to learn how to use the .loc method, which is extremely powerful. You might learn that method in ACCY 575. If you wish to read ahead, check out the Pandas documentation for .loc. See here and here.

Imports and Loading the Data#

import numpy as np
import pandas as pd

Let’s work with the Pets dataset that we created in the previous chapter. It’s small enough that you can easily see what is changing.

Run the following cell. It will create a subset of the Pets dataset.

dfPets = pd.DataFrame({'Name': ['Duke', 'Killer', 'Lucy', 'Oliver', 'Milo', 'Barbie'],
                       'Type': ['Dog', 'Dog', 'Dog', 'Cat', 'Cat', 'Horse'],
                       'Weight (kg)': [35, 25, 5, 2, 10, 500],
                       'Indoor Pet': [False, False, True, True, False, False]
                       })
dfPets
Name Type Weight (kg) Indoor Pet
0 Duke Dog 35 False
1 Killer Dog 25 False
2 Lucy Dog 5 True
3 Oliver Cat 2 True
4 Milo Cat 10 False
5 Barbie Horse 500 False

Dropping Columns#

In Pandas, it is very easy to drop columns from a DataFrame by calling the drop method. This method is very powerful and allows you to drop selected rows or columns. Let’s just focus on columns.

To drop the Weight (kg) column, you would type:

dfPets.drop(columns=['Weight (kg)'])

To drop the Weight (kg) and Indoor Pet columns, you would type:

dfPets.drop(columns=['Weight (kg)', 'Indoor Pet'])

In other words, you call the drop method and specify the columns argument. The value passed to the columns argument should be a list of column names.

Note that, like most Pandas methods, the drop method does not modify the original DataFrame. Instead, it returns a copy. To see this, consider the following two cells:

dfPets.drop(columns=['Weight (kg)'])
Name Type Indoor Pet
0 Duke Dog False
1 Killer Dog False
2 Lucy Dog True
3 Oliver Cat True
4 Milo Cat False
5 Barbie Horse False
dfPets
Name Type Weight (kg) Indoor Pet
0 Duke Dog 35 False
1 Killer Dog 25 False
2 Lucy Dog 5 True
3 Oliver Cat 2 True
4 Milo Cat 10 False
5 Barbie Horse 500 False

Notice that the DataFrame was not affected by the call to drop. To make the change permanent, you can either write code like this:

dfPets = dfPets.drop(columns=['Weight (kg)'])

or you can use the optional inplace argument and set it to True, like this:

dfPets.drop(columns=['Weight (kg)'], inplace=True)

Finally, note that you can retrieve a subset of the columns by using the method we showed you in an earlier chapter. For example, say you only want the Name and Type columns. Effectively, this is the same thing as dropping the other columns. You could get the columns you want by typing:

dfPets[['Name', 'Type']]

That code will return a copy of the DataFrame with only the columns you request.

So why bother with the drop method? Because sometimes you will have a DataFrame with hundreds of columns. It’s easier to drop a few through the drop methods than it is to type out all the columns you want!

Renaming Columns#

To rename columns in a DataFrame, you will use the … wait for it … rename method. What a surprise!

There are many ways to call rename. We recommend you pass a dictionary to the columns argument. Let’s illustrate with an example. Say I want to rename the Type and Weight (kg) columns. I could write this code:

dfPets.rename(columns={'Type': 'Species',
                       'Weight (kg)': 'Mass (kg)'
                      }
             )
Name Species Mass (kg) Indoor Pet
0 Duke Dog 35 False
1 Killer Dog 25 False
2 Lucy Dog 5 True
3 Oliver Cat 2 True
4 Milo Cat 10 False
5 Barbie Horse 500 False

Thus, in the dictionary, each key-value pair has the form old name : new name. The keys are the columns that you want to rename and the values are the new names.

Like most Pandas methods, the rename method does not modify the original DataFrame. Instead, it returns a copy. To make a change in place, use the inplace=True argument.

Appending Rows to a DataFrame#

Say that our animal lover buys a new pet, a skunk that she names “Skunky”. Believe it or not, some people keep skunks as pets.

Anyway, she wants to add Skunky to her DataFrame. To do that, she needs to create a new DataFrame and append it to the existing DataFrame. This is kind of a pain, and in our opinion, a deficiency of Pandas.

First, let’s create the new row as a DataFrame. Look at the code in the cell below. Notice that we used double brackets, indicating a list containing a list. We do that because when creating a DataFrame from a list, each sublist is treated as a row.

Also notice that, instead of typing out the column names, we simply created the list in the same column order as dfPets and then passed the column names from dfPets directly.

newrow = pd.DataFrame([['Skunky', 'Skunk', 1.0, False]],
                      columns=dfPets.columns)
newrow
Name Type Weight (kg) Indoor Pet
0 Skunky Skunk 1.0 False

The next cell shows the concat function. When you concatenate multiple DataFrames , Pandas simply pastes each new DataFrame at the end of the previous one.

We passed the ignore_index=True argument so that Pandas would not preserve the index from the DataFrames. Instead, it creates a new index for the merged dataframe, beginning at zero.

pd.concat([dfPets, newrow], ignore_index=True)
Name Type Weight (kg) Indoor Pet
0 Duke Dog 35.0 False
1 Killer Dog 25.0 False
2 Lucy Dog 5.0 True
3 Oliver Cat 2.0 True
4 Milo Cat 10.0 False
5 Barbie Horse 500.0 False
6 Skunky Skunk 1.0 False

The next section discusses concatenation in more detail.

Concatenating DataFrames#

Sometimes, you have multiple DataFrames that you want to append to each other.

Consider the following code cells, which create 4 DataFrames.

df1 = pd.DataFrame({'x': [1,2], 'y': [1,4]})
df1
x y
0 1 1
1 2 4
df2 = pd.DataFrame({'x': [3,4], 'y': [9,16]})
df2
x y
0 3 9
1 4 16
df3 = pd.DataFrame({'x': [5,6], 'y': [25,36]})
df3
x y
0 5 25
1 6 36
df4 = pd.DataFrame({'x': [7,8], 'y': [49,64]})
df4
x y
0 7 49
1 8 64

To concatenate these, pass a list of DataFrames to the concat function:

pd.concat([df1, df2, df3, df4])
x y
0 1 1
1 2 4
0 3 9
1 4 16
0 5 25
1 6 36
0 7 49
1 8 64

Ugh! Look at the index of our new DataFrame. It’s hideous! It contains the indexes of the original DataFrames. To make Pandas create a new index, pass the ignore_index=True argument.

pd.concat([df1, df2, df3, df4], ignore_index=True)
x y
0 1 1
1 2 4
2 3 9
3 4 16
4 5 25
5 6 36
6 7 49
7 8 64

That’s much better!