Pivot Tables in Pandas#

Learning Objectives#

  • Explain the concept of unique values and why it is so important.
  • Create a simple pivot table in Pandas that computes one statistic for the unique values of one column.
  • Create a pivot table in Pandas that computes multiple statistics for the unique values of one column.
  • Create a pivot table in Pandas that groups by multiple variables within rows.
  • Create a pivot table in Pandas that groups by two variables organized into rows and columns.
  • Create a pivot table in Pandas that computes multiple statistics for the unique values of one column.
  • Explain how to create more complicated pivot tables by using the index, columns, values, and aggfunc arguments.
  • Use the len function to count the number of rows that exist for unique values in different columns.
  • Explain how a Pandas pivot table is a DataFrame, and can be manipulated just like any other DataFrame.

Overview of Pivot Tables in Pandas#

In previous units, we showed you how to prepare your data for analysis. You loaded, cleaned, filtered, and manipulated data. In the previous chapter, we showed you how to begin data analysis by computing descriptive statistics. In this chapter, we will continue our journey into analyzing data by teaching you about pivot tables in Pandas.

A pivot table is a table that provides statistics for unique values of one or more variables. For example, what is the average salary in each department of a company? You will find that pivot tables will be one of the most useful tools in your toolkit. If you are adept at making pivot tables, you will be a star in your job. However, as you learn about pivot tables, realize that the skill lies not in writing code, but in thinking about what you want to learn from your data and interpreting the results. If you can envision the pivot table needed for your business question before you write any code, then the act of creating the table will seem simple.

Imports and Loading the Data#

import numpy as np
import pandas as pd

In this chapter, we will work with salary data for employees of the City of New York (link).

The cell below loads salary data for employees of the City of New York. Each row represents one employee in the one fiscal year. For a list of the columns and their meanings, click here.

NOTE: You need either the fastparquet library or the pyarrow libraryto load the data file used in these notes.

You should have installed one of these libraries at the beginning of the semester when you installed Python. If you did not, or if you get errors, don’t panic. If you just want to read these notes, you do not need to run these cells and you do not need to install either library.

If you’re wondering what these libraries do, read this paragraph. The data file is quite large, approximately 376 MB as a CSV. That makes it difficult to share. It has too many rows to store in an Excel worksheet, and if we gave you a CSV, it would take many minutes to load, and you would have to clean many of the columns by converting them from text to other data types. To avoid all this, we chose to store the data in Apache Parquet format. Parquet is a super-cool format that stores the data by column. That allows it to easily compress the data, because you tend to find the same value repeated many times in each column. This compression reduces the file size (the parquet file is just under 50 MB), and greatly speeds up the loading process. We love the Parquet format and frequently use it when working with larger datasets. It is compatible with many big data and cloud technologies. And Pandas provides support for Parquet.

dfPay = pd.read_parquet('data/NYC_Payroll_data.parquet')
# Make a fix to the Agency Start Date column
dfPay['Agency Start Date'] = pd.to_datetime(dfPay['Agency Start Date'], utc=True)
dfPay.head()
Fiscal Year Payroll Number Agency Name Last Name First Name Mid Init Agency Start Date Work Location Borough Title Description Leave Status as of June 30 Base Salary Pay Basis Regular Hours Regular Gross Paid OT Hours Total OT Paid Total Other Pay
0 2016 NaN DISTRICT ATTORNEY-MANHATTAN ABA'AHMID RAHASHEEM E 2003-07-14 00:00:00+00:00 MANHATTAN COMMUNITY ASSOCIATE ACTIVE 47678.0 PER ANNUM 1830.00 47166.03 2.25 57.28 0.0
1 2016 NaN DISTRICT ATTORNEY-MANHATTAN ABENSUR MARGARET 1995-06-12 00:00:00+00:00 MANHATTAN ADMINISTRATIVE ACCOUNTANT ACTIVE 119959.0 PER ANNUM 1831.00 119042.50 0.00 0.00 0.0
2 2016 NaN DISTRICT ATTORNEY-MANHATTAN ABOUNAOUM ANDREA L 2011-10-11 00:00:00+00:00 MANHATTAN COMMUNITY ASSOCIATE ON LEAVE 39966.0 PER ANNUM 1181.68 27452.62 1.00 21.34 33.0
3 2016 NaN DISTRICT ATTORNEY-MANHATTAN ABRAHAM JONATHAN J 2014-12-01 00:00:00+00:00 MANHATTAN COMPUTER SYSTEMS MANAGER ACTIVE 116000.0 PER ANNUM 1831.00 115104.57 0.00 0.00 0.0
4 2016 NaN DISTRICT ATTORNEY-MANHATTAN ABRAMS JOSEPH 2015-05-21 00:00:00+00:00 MANHATTAN COLLEGE AIDE CEASED 1.0 PER HOUR 0.00 3500.00 0.00 0.00 0.0
dfPay.shape
(2748818, 17)

Unique / Distinct Values#

Before we can teach you about pivot tables, we need to teach you an important underlying concept, unique values.

Consider the following randomly chosen values from the Agency Name column of our dataset:

dfPay[['Work Location Borough']].sample(5, random_state=1)
Work Location Borough
1686946 BRONX
535165 MANHATTAN
3128672 MANHATTAN
1632295 QUEENS
3092208 MANHATTAN

Notice that certain values are repeated. For example, “MANHATTAN” appears 3 times. What if we want to know all values that appear at least once in the column? We don’t want to see repeats. We simply want to know which values appear at least once in the column. Why would we care to know this? First of all, because we might want a list of the boroughs in New York City! Second, when we create a pivot table, we might want to report statistics by borough. That means that each borough should appear only once in our pivot table.

What we are looking for are the unique values in the column (note that the term distinct, which is used in SQL, is synonymous with unique). Pandas makes it very easy to compute the unique values in a column. Simply call the unique method:

dfPay['Work Location Borough'].unique()
array(['MANHATTAN', 'STATEN ISLAND', 'QUEENS', 'BRONX', 'BROOKLYN'],
      dtype=object)

Another way to get unique values in a column is the value_counts method. That method returns the unique values as well as a count of each value.

dfPay['Work Location Borough'].value_counts()
Work Location Borough
MANHATTAN        1975186
QUEENS            314960
BROOKLYN          270173
BRONX             150150
STATEN ISLAND      38349
Name: count, dtype: int64

We prefer the value_counts method to the unique method for two reasons. First, value_counts returns counts in addition to unique values and these are often useful. For example, we just learned that most of the employees in the dataset work in Manhattan. Second, value_counts returns a Pandas Series while unique returns a NumPy array. Since we are doing most of our work in Pandas, it’s often easier to work directly with Pandas objects.

The concept of unique values is super important. Not only does it underlie the pivot tables that we are about to create, but it is also used when merging data tables.

Let’s see one more example of uniqueness. Let’s retrieve the unique departments in the NYC government. From this list, which is quite long (there appear to be 160 departments in the NYC government!), we see that education is largest, followed by police.

dfPay['Agency Name'].value_counts()
Agency Name
DEPT OF ED PEDAGOGICAL            471072
DEPT OF ED PER SESSION TEACHER    426514
POLICE DEPARTMENT                 307717
DEPT OF ED PARA PROFESSIONALS     176199
BOARD OF ELECTION POLL WORKERS    170968
                                   ...  
BROOKLYN COMMUNITY BOARD #3           15
MANHATTAN COMMUNITY BOARD #7          15
BRONX COMMUNITY BOARD #5              14
BRONX COMMUNITY BOARD #1              11
BRONX COMMUNITY BOARD #3              10
Name: count, Length: 160, dtype: int64

Pivot Tables in Pandas#

A pivot table computes something for unique values. That’s all there is to it! :)

A Simple Pivot Table#

Let’s begin with a simple example: let’s compute the average salary paid to workers in each of NYC’s boroughs. We’ll first filter it to include only fiscal year 2018:

# Let's get cute and display the numbers in accounting format. :)
pd.set_option('float_format', lambda x: f'${x:,.0f}')

pd.pivot_table(data=dfPay[dfPay['Fiscal Year']==2018],
               index='Work Location Borough',
               values='Base Salary',
               aggfunc='mean')
Base Salary
Work Location Borough
BRONX $49,584
BROOKLYN $52,777
MANHATTAN $40,780
QUEENS $52,138
STATEN ISLAND $53,919

What the pivot table shows and how it computes its values#

Before we examine the code, let’s look at the pivot table. There’s one row per borough. The pivot table computed the unique values in the Work Location Borough column and created one row per unique borough in the resulting pivot table. Then, for each borough, it filtered the data for that borough and computed the average of the base salary for that borough. To see this, consider the following code that produces the exact same values as our pivot table above.

# Get only the 2018 rows. Drop any rows with missing values
df2018 = dfPay[dfPay['Fiscal Year'] == 2018].dropna(subset=['Base Salary'])
# Now get the unique boroughs as a list
boroughs = df2018['Work Location Borough'].unique().tolist()
# Sort the list of boroughs
boroughs.sort()

# For each borough in the list of unique boroughs
for borough in boroughs:
    # Filter the 2018 data on that borough
    df = df2018[df2018['Work Location Borough'] == borough]
    # Compute the mean salary for that borough in 2018
    meanSalary = df['Base Salary'].mean()
    
    print(f"{borough}: ${meanSalary:,.0f}")
BRONX: $49,584
BROOKLYN: $52,777
MANHATTAN: $40,780
QUEENS: $52,138
STATEN ISLAND: $53,919

To reiterate, the pivot table computed the unique values in the Work Location Borough column. For each unique borough, it filtered the data for that borough and then computed the average of the Base Salary for that borough.

Understanding the code to generate the pivot table#

Above, we used the following code to create our pivot table:

pd.pivot_table(data=dfPay[dfPay['Fiscal Year']==2018],
               index='Work Location Borough',
               values='Base Salary',
               aggfunc='mean')

Let’s examine the code and understand what each argument does.

  • pd.pivot_table. This is the Pandas function for creating a pivot table.

  • data=dfPay[dfPay['Fiscal Year']==2018]. You must pass a DataFrame to the data argument. This argument is required. The pivot_table function will use this data to create a pivot table.

  • index='Work Location Borough'. The index argument tells the pivot table which values to use as rows in the resulting pivot table. In this case, we told it to take the Work Location Borough column of our source data and extract its unique values.

  • values='Base Salary. The pivot table will take the values from this column and use them in its computations.

  • aggfunc='mean'. “aggfunc” stands for “aggregation function”. By passing the string 'mean', we told Pandas to use the Numpy function numpy.mean as our aggregation function. Thus, the pivot table will average the values in the Base Salary column.

In sum, the code tells Pandas to take the 2018 data, extract the unique boroughs and, for each borough, compute the mean of the Base Salary column.

Computing More Than One Statistic#

In the above example, we compute the mean salary paid to workers in each borough in 2018. We will extend that example and show how to compute the mean and median salary for each borough. Look at the following code and its output.

pd.pivot_table(data=dfPay[dfPay['Fiscal Year']==2018],
               index='Work Location Borough',
               values='Base Salary',
               aggfunc=['mean', 'median'])
mean median
Base Salary Base Salary
Work Location Borough
BRONX $49,584 $48,666
BROOKLYN $52,777 $50,604
MANHATTAN $40,780 $35,977
QUEENS $52,138 $50,604
STATEN ISLAND $53,919 $55,024

Notice that the first data column, mean Base Salary, contains the exact same information as our table above. However, this new table contains a second column that contains the median salaries.

The only change to the code was to the aggfunc argument. We changed that to aggfunc=['mean', 'median']. Instead of passing a function, we passed a list of functions. The pivot_table function then created one column per function. It did everything else identically. It first retrieved the unique boroughs (from the index argument), used values from the Base Salary column in its computations. For each borough, it computed the mean and the median.

Grouping By Multiple Variables in Rows#

Often, your business question implies a logical hierarchy among your variables. For example, say you want to see the average pay in each department, but you want to break it out by borough. Department heads might want to see this information to know if they are paying appropriate amounts for employees in different parts of the city. Another example is to show the pay by borough and break it out by hourly and salaried employees. This might be useful if one group of employees is unionized and the other is not. Or if the hourly and salaried employees belong to different unions, or if they have different contracts.

In this section, let’s use the example of pay by department and by borough. To make this tractable, we will restrict our analysis to (a) salaried employees (b) in the top 5 departments (c) in 2018.

First, let’s filter the data so we only have rows for salaried employees in 2018 for the top 5 departments. We’ll save the filtered data in a data frame called dfTop5Dept2018.

# Condition 1: fiscal year == 2018
# Condition 2: pay basis == 'PER ANNUM'
# Condition 3: get the top 5 departments, as measured by number of employees.
#     Save the top 5 departments in the variable top5depts
top5depts =  dfPay[(dfPay['Fiscal Year']==2018) & 
                   (dfPay['Pay Basis'] == 'PER ANNUM')]['Agency Name'].value_counts().head(5).index

# Filter the data frame dfPay using the 3 conditions above.
dfTop5Dept2018 = dfPay[(dfPay['Fiscal Year']==2018) & 
                       (dfPay['Pay Basis'] == 'PER ANNUM') & 
                       (dfPay['Agency Name'].isin(top5depts))]

Now let’s create a pivot table showing, for each of the top 5 departments in 2018, the average pay in each borough.

pd.pivot_table(data=dfTop5Dept2018, 
               index=['Agency Name', 'Work Location Borough'], 
               values='Base Salary', 
               aggfunc='mean')
Base Salary
Agency Name Work Location Borough
DEPT OF ED PARA PROFESSIONALS MANHATTAN $36,139
DEPT OF ED PEDAGOGICAL MANHATTAN $87,796
FIRE DEPARTMENT BRONX $72,784
BROOKLYN $77,556
MANHATTAN $76,143
QUEENS $71,143
STATEN ISLAND $80,794
HRA/DEPT OF SOCIAL SERVICES BRONX $47,447
BROOKLYN $54,658
MANHATTAN $61,390
QUEENS $46,781
STATEN ISLAND $49,353
POLICE DEPARTMENT BRONX $71,268
BROOKLYN $68,800
MANHATTAN $78,636
QUEENS $68,304
STATEN ISLAND $73,843

Take a look at the above pivot table. Notice that the Agency Name and Work Location Borough columns are organized into a hierarchy. For each unique level of Agency Name, the pivot table shows all unique values of Work Location Borough. Notice that Pandas omitted boroughs with no employees.

From this table, we can see that two of the top departments in the city, Department of Ed Para Professionals and Dept of Ed Pedagogical, have all of their employees in Manhattan. For the other three departments, we can easily compare average salaries across boroughs. However, we cannot conclude from this that employees in some boroughs get paid more for the same job than their counterparts in other boroughs. To examine that, we would need to compare job titles and years of experience.

Understanding the Unique Values when Grouping By Multiple Variables#

When you group by multiple variables, you need to think of unique values differently. In the pivot table above, the unique values are (DEPT OF ED PARA PROFESSIONALS, MANHATTAN), (DEPT OF ED PEDAGOGICAL, MANHATTAN), (FIRE DEPARTMENT, BRONX), (FIRE DEPARTMENT, BROOKLYN), etc. In other words, take the unique values from Agency Name and the unique values from Work Location Borough; form every possible combination of unique values. The pivot table computes the mean salary for each unique combination.

Understanding the Code to Create a Pivot Table that Groups By Multiple Variables#

The code to create the pivot table above was:

pd.pivot_table(data=dfTop5Dept2018, 
               index=['Agency Name', 'Work Location Borough'], 
               values='Base Salary', 
               aggfunc='mean')

If you compare this code to the code from our simplest example, you will discover only one difference. For the index argument, we passed a list of column names. That tells Pandas to group the data first by Agency Name and then by Work Location Borough.

Grouping By Multiple Variables in Rows and Columns#

Often, you want to compute something for two variables but there is no logical hierarchy among the variables. For example, say you want to show the average pay by borough and by fiscal year. We could use a format like that in the previous section, but should we group first by borough and then fiscal year? Or by fiscal year and then borough? There’s no obvious answer. Also, the format shown above does not facilitate comparisons across the second variable. When you find yourself in a situation like this, consider organizing the data into rows and columns.

Take a look at the pivot table below:

pd.pivot_table(data=dfPay, 
               index='Work Location Borough', 
               columns='Fiscal Year', 
               values='Base Salary', 
               aggfunc='mean', 
               margins=True)
Fiscal Year 2014 2015 2016 2017 2018 2019 All
Work Location Borough
BRONX $58,788 $44,626 $46,477 $44,402 $49,584 $51,025 $47,251
BROOKLYN NaN $48,105 $50,254 $49,710 $52,777 $53,834 $50,901
MANHATTAN $62,937 $32,795 $37,827 $38,471 $40,780 $39,591 $37,889
QUEENS $71,759 $47,722 $50,650 $48,483 $52,138 $52,358 $50,339
STATEN ISLAND $69,804 $49,004 $51,869 $53,531 $53,919 $54,322 $52,599
All $63,752 $36,720 $41,248 $41,382 $43,979 $43,116 $41,311

The above table lays out the data by borough (rows) and fiscal year (columns). Notice that we included an average for each row (the All column), and an average for each column (the All row). This format allows for easy comparison across boroughs, across fiscal years, and within boroughs. It is a very efficient way to present data.

Understanding the Unique Values When Grouping By Multiple Variables in Rows and Columns#

The unique values for the pivot table are all combinations of unique values of the rows and columns. Thus, (BRONX, 2014) is a unique value, as is (STATEN ISLAND, 2019). The pivot table will compute the mean of the base salary for each of these unique values.

Understanding the Code to Create a Pivot Table that Groups By Multiple Variables in Rows and Columns#

Let’s examine the code used to create the pivot table:

pd.pivot_table(data=dfPay, 
               index='Work Location Borough', 
               columns='Fiscal Year', 
               values='Base Salary', 
               aggfunc='mean', 
               margins=True)

When you compare this to the simplest pivot table above, you will discover two differences:

  1. We added a columns argument. This tells Pandas to get the unique values from the Fiscal Year column and use them as column headings in the pivot table.

  2. We added the argument margins=True. That tells Pandas to include summary rows and columns in the pivot table (the ones with the heading All).

Computing Statistics for Multiple Variables#

Another use case for pivot tables is to compute the same statistics for multiple variables. As a simple example, let’s compute the mean salary and mean years of experience by borough.

First, we need to compute the years of experience. Let’s use New York’s 2018 fiscal year, which ran from July 1, 2017 through June 30, 2018. We will compute the number of years of experience at the end of the fiscal year. That way, people who start in the middle of the fiscal year will show up as having a positive amount of work experience. We will also focus on salaried employees only.

# Filter on fiscal year 2018, salaried employees only
dfFilt = dfPay[(dfPay['Fiscal Year'] == 2018) & 
               (dfPay['Pay Basis'] == 'PER ANNUM')].copy()
# Create a date constant containing the end of the fiscal year
endOfFY = pd.to_datetime('6-30-2018 00:00:00 UTC', format='%m-%d-%Y %H:%M:%S %Z', utc=True)

dfFilt['Years of Experience'] = ((endOfFY - dfFilt['Agency Start Date']) / np.timedelta64(1,'D')) / 365.0
# Reset the Pandas display options
pd.reset_option('float_format')

pd.pivot_table(data=dfFilt,
               index='Work Location Borough',
               values=['Base Salary', 'Years of Experience'],
               aggfunc='mean')
Base Salary Years of Experience
Work Location Borough
BRONX 65145.129919 11.126592
BROOKLYN 67400.887592 11.648366
MANHATTAN 75038.574755 12.030538
QUEENS 68833.177272 10.573341
STATEN ISLAND 70285.517079 12.282185

In the above code, we passed a list of column names to the Values argument. That told Pandas to compute the mean of each Values column for each unique value in Work Location Borough.

More Complicated Pivot Tables#

In this chapter, we showed you how to create simple pivot tables in Pandas. You can create much more complicated tables by passing lists to multiple arguments in the pivot_table function. For example, say we want to recreate a pivot table that breaks out Pay Basis by Borough (rows). In each column, we want to show a different fiscal year. And we want to do this for two variables, Base Salary and Total Other Pay. You could do that as follows. Notice that all we’re doing is passing more columns to the arguments than we did previously.

pd.pivot_table(data=dfPay,
               index=['Work Location Borough', 'Pay Basis'],
               columns='Fiscal Year',
               values=['Base Salary', 'Total Other Pay'],
               aggfunc='mean')
Base Salary Total Other Pay
Fiscal Year 2014 2015 2016 2017 2018 2019 2014 2015 2016 2017 2018 2019
Work Location Borough Pay Basis
BRONX PER ANNUM 58914.094319 60474.766639 62271.607774 64138.005775 65145.129919 65869.988143 691.695520 6114.452525 6088.658542 6196.082143 6418.858816 6075.399860
PER DAY NaN 192.062425 209.050852 174.560338 242.845655 224.684363 NaN 1405.022760 1876.941646 1612.088610 2993.338527 2901.112367
PER HOUR 29.265000 22.159489 23.570725 27.410545 24.212095 23.177066 1348.675000 588.600548 512.666901 475.857755 624.052476 679.708500
PRORATED ANNUAL NaN 26774.639752 24266.495614 26549.752066 27333.981685 26009.958333 NaN 808.867733 96.959737 218.843554 263.200293 117.033906
BROOKLYN PER ANNUM NaN 61959.211599 63611.989467 66203.370088 67400.887592 68356.687433 NaN 5772.251540 5494.510776 6003.883035 6497.080991 5841.043359
PER DAY NaN 228.706633 250.435743 230.142034 273.271984 277.464870 NaN 2150.187101 2829.792890 2690.540262 3987.387382 4541.090950
PER HOUR NaN 19.998751 21.914079 23.865114 22.591124 23.095998 NaN 847.972633 778.960101 672.425582 817.215062 779.015431
PRORATED ANNUAL NaN 27528.794118 25489.415011 32011.370213 28857.308000 29393.553125 NaN 746.439614 141.365894 341.782043 263.126220 184.399125
MANHATTAN PER ANNUM 70302.205923 66926.799450 69219.929993 72014.764937 75038.574755 77066.139305 554.636012 1950.544238 1798.279692 1888.983719 1903.893006 1818.831188
PER DAY 278.545333 60.920634 66.200247 66.778423 68.588313 61.599935 3886.392000 80.420521 155.444418 145.750868 173.816300 164.031062
PER HOUR 3.790839 8.753896 9.441464 1167.428688 1405.911154 1295.316307 90.803226 258.536014 102.568193 193.662038 98.376645 97.677109
PRORATED ANNUAL NaN 30113.815237 32451.825311 33278.138812 35156.646578 36080.140584 NaN 638.141782 283.669740 287.708828 192.155898 239.800706
QUEENS PER ANNUM 72638.769939 64872.518101 66468.231356 67518.476146 68833.177272 70574.424486 640.292439 6113.675569 5684.608201 5539.324965 5759.776885 5555.284024
PER DAY NaN 254.609522 274.408341 261.250382 271.026471 284.399149 NaN 2612.646116 3717.570491 3392.948769 4054.715595 4678.670515
PER HOUR 28.836250 26.367431 30.445100 32.692165 31.694309 31.842055 73.767500 639.035013 722.053218 587.611901 698.913643 644.501699
PRORATED ANNUAL NaN 29193.782209 29777.313043 29397.394958 29777.990566 31113.146893 NaN 976.124172 262.186652 395.754790 328.796509 302.280734
STATEN ISLAND PER ANNUM 73064.887850 65302.354034 67092.591946 69204.875021 70285.517079 70944.995782 586.933925 7431.757451 7584.251536 7647.716819 8086.601002 7448.884692
PER DAY NaN 306.503156 326.049903 346.949620 353.915345 362.141402 NaN 4887.980225 6559.638149 6528.085054 8152.936790 8834.284530
PER HOUR 31.514000 18.891888 21.512435 22.148017 23.313177 23.942032 1020.262000 507.116870 672.432472 756.017677 829.246372 802.790677
PRORATED ANNUAL NaN 26722.350649 23334.315789 29979.666667 29518.311475 29970.100000 NaN 929.768182 117.790000 500.000000 336.795082 247.122000

Other Tips and Useful Information#

Counting the Number of Rows for Each Unique Value#

Say we want to know how many salaried employees work in each borough in each fiscal year. From this description, we know we need to pass Work Location Borough to the index argument and Fiscal Year to the columns argument. How do we tell Pandas to count the number of employees in each unique combination of borough and year? Simply pass the string 'count' as the aggfunc.

pd.pivot_table(data=dfPay[dfPay['Pay Basis'] == 'PER ANNUM'],
               index='Work Location Borough',
               columns='Fiscal Year',
               values='Base Salary',
               aggfunc='count')
Fiscal Year 2014 2015 2016 2017 2018 2019
Work Location Borough
BRONX 933.0 21182.0 21423.0 22431.0 22388.0 22699.0
BROOKLYN NaN 42027.0 43208.0 41476.0 40716.0 41630.0
MANHATTAN 1452.0 197712.0 204959.0 204614.0 204775.0 215398.0
QUEENS 652.0 43211.0 45569.0 47918.0 47701.0 48372.0
STATEN ISLAND 107.0 5654.0 5786.0 5889.0 5861.0 6013.0

Let’s take a look at the above code. The index and columns arguments should be self-explanatory. We passed the string 'count' to the aggfunc argument. The only confusing thing might be the values argument. Why did I pass Base Salary to that? Because Pandas needs to count something. Anything. In this code, we told Pandas to count the number of rows in the Base Salary column for each unique value of ‘Work Location Borough’ and Fiscal Year. In theory, we could have counted any other column. To see this, consider the following code. It’s identical to the code above, except for the values argument.

pd.pivot_table(data=dfPay[dfPay['Pay Basis'] == 'PER ANNUM'],
               index='Work Location Borough',
               columns='Fiscal Year',
               values='Last Name',
               aggfunc='count')
Fiscal Year 2014 2015 2016 2017 2018 2019
Work Location Borough
BRONX 933.0 21182.0 21423.0 22431.0 22388.0 22699.0
BROOKLYN NaN 42027.0 43208.0 41476.0 40716.0 41630.0
MANHATTAN 1452.0 197712.0 204959.0 204614.0 204775.0 215398.0
QUEENS 652.0 43211.0 45569.0 47918.0 47701.0 48372.0
STATEN ISLAND 107.0 5654.0 5786.0 5889.0 5861.0 6013.0

The pivot_table function returns a DataFrame#

The pivot_table function returns a DataFrame. That means you can save the pivot_table to a variable, change its contents, and use it in future computations. To see this, consider the following code:

myPivotTable = pd.pivot_table(data=dfPay[dfPay['Pay Basis'] == 'PER ANNUM'],
                              index='Work Location Borough',
                              columns='Fiscal Year',
                              values='Last Name',
                              aggfunc='count')
myPivotTable
Fiscal Year 2014 2015 2016 2017 2018 2019
Work Location Borough
BRONX 933.0 21182.0 21423.0 22431.0 22388.0 22699.0
BROOKLYN NaN 42027.0 43208.0 41476.0 40716.0 41630.0
MANHATTAN 1452.0 197712.0 204959.0 204614.0 204775.0 215398.0
QUEENS 652.0 43211.0 45569.0 47918.0 47701.0 48372.0
STATEN ISLAND 107.0 5654.0 5786.0 5889.0 5861.0 6013.0
# Get the type of the variable myPivotTable
type(myPivotTable)
pandas.core.frame.DataFrame
# Change the column names of myPivotTable
myPivotTable.columns = ['FY2014', 'FY2015', 'FY2016', 'FY2017', 'FY2018', 'FY2019']
myPivotTable
FY2014 FY2015 FY2016 FY2017 FY2018 FY2019
Work Location Borough
BRONX 933.0 21182.0 21423.0 22431.0 22388.0 22699.0
BROOKLYN NaN 42027.0 43208.0 41476.0 40716.0 41630.0
MANHATTAN 1452.0 197712.0 204959.0 204614.0 204775.0 215398.0
QUEENS 652.0 43211.0 45569.0 47918.0 47701.0 48372.0
STATEN ISLAND 107.0 5654.0 5786.0 5889.0 5861.0 6013.0
# Add a projection for FY2020
myPivotTable['FY2020 (projected)'] = myPivotTable['FY2019'] * 1.05
myPivotTable
FY2014 FY2015 FY2016 FY2017 FY2018 FY2019 FY2020 (projected)
Work Location Borough
BRONX 933.0 21182.0 21423.0 22431.0 22388.0 22699.0 23833.95
BROOKLYN NaN 42027.0 43208.0 41476.0 40716.0 41630.0 43711.50
MANHATTAN 1452.0 197712.0 204959.0 204614.0 204775.0 215398.0 226167.90
QUEENS 652.0 43211.0 45569.0 47918.0 47701.0 48372.0 50790.60
STATEN ISLAND 107.0 5654.0 5786.0 5889.0 5861.0 6013.0 6313.65