Pivot Tables in Pandas

Pivot Tables in Pandas#

Learning Objectives#

Explain the concept of unique values and why it is so important.
Create a simple pivot table in Pandas that computes one statistic for the unique values of one column.
Create a pivot table in Pandas that computes multiple statistics for the unique values of one column.
Create a pivot table in Pandas that groups by multiple variables within rows.
Create a pivot table in Pandas that groups by two variables organized into rows and columns.
Create a pivot table in Pandas that computes multiple statistics for the unique values of one column.
Explain how to create more complicated pivot tables by using the index, columns, values, and aggfunc arguments.
Use the len function to count the number of rows that exist for unique values in different columns.
Explain how a Pandas pivot table is a DataFrame, and can be manipulated just like any other DataFrame.

Overview of Pivot Tables in Pandas#

In previous units, we showed you how to prepare your data for analysis. You loaded, cleaned, filtered, and manipulated data. In the previous chapter, we showed you how to begin data analysis by computing descriptive statistics. In this chapter, we will continue our journey into analyzing data by teaching you about pivot tables in Pandas.

A pivot table is a table that provides statistics for unique values of one or more variables. For example, what is the average salary in each department of a company? You will find that pivot tables will be one of the most useful tools in your toolkit. If you are adept at making pivot tables, you will be a star in your job. However, as you learn about pivot tables, realize that the skill lies not in writing code, but in thinking about what you want to learn from your data and interpreting the results. If you can envision the pivot table needed for your business question before you write any code, then the act of creating the table will seem simple.

Imports and Loading the Data#

import numpy as np
import pandas as pd

In this chapter, we will work with salary data for employees of the City of New York (link).

The cell below loads salary data for employees of the City of New York. Each row represents one employee in the one fiscal year. For a list of the columns and their meanings, click here.

NOTE: You need either the fastparquet library or the pyarrow libraryto load the data file used in these notes.

You should have installed one of these libraries at the beginning of the semester when you installed Python. If you did not, or if you get errors, don’t panic. If you just want to read these notes, you do not need to run these cells and you do not need to install either library.

If you’re wondering what these libraries do, read this paragraph. The data file is quite large, approximately 376 MB as a CSV. That makes it difficult to share. It has too many rows to store in an Excel worksheet, and if we gave you a CSV, it would take many minutes to load, and you would have to clean many of the columns by converting them from text to other data types. To avoid all this, we chose to store the data in Apache Parquet format. Parquet is a super-cool format that stores the data by column. That allows it to easily compress the data, because you tend to find the same value repeated many times in each column. This compression reduces the file size (the parquet file is just under 50 MB), and greatly speeds up the loading process. We love the Parquet format and frequently use it when working with larger datasets. It is compatible with many big data and cloud technologies. And Pandas provides support for Parquet.

dfPay = pd.read_parquet('data/NYC_Payroll_data.parquet')
# Make a fix to the Agency Start Date column
dfPay['Agency Start Date'] = pd.to_datetime(dfPay['Agency Start Date'], utc=True)
dfPay.head()

	Fiscal Year	Payroll Number	Agency Name	Last Name	First Name	Mid Init	Agency Start Date	Work Location Borough	Title Description	Leave Status as of June 30	Base Salary	Pay Basis	Regular Hours	Regular Gross Paid	OT Hours	Total OT Paid	Total Other Pay
0	2016	NaN	DISTRICT ATTORNEY-MANHATTAN	ABA'AHMID	RAHASHEEM	E	2003-07-14 00:00:00+00:00	MANHATTAN	COMMUNITY ASSOCIATE	ACTIVE	47678.0	PER ANNUM	1830.00	47166.03	2.25	57.28	0.0
1	2016	NaN	DISTRICT ATTORNEY-MANHATTAN	ABENSUR	MARGARET		1995-06-12 00:00:00+00:00	MANHATTAN	ADMINISTRATIVE ACCOUNTANT	ACTIVE	119959.0	PER ANNUM	1831.00	119042.50	0.00	0.00	0.0
2	2016	NaN	DISTRICT ATTORNEY-MANHATTAN	ABOUNAOUM	ANDREA	L	2011-10-11 00:00:00+00:00	MANHATTAN	COMMUNITY ASSOCIATE	ON LEAVE	39966.0	PER ANNUM	1181.68	27452.62	1.00	21.34	33.0
3	2016	NaN	DISTRICT ATTORNEY-MANHATTAN	ABRAHAM	JONATHAN	J	2014-12-01 00:00:00+00:00	MANHATTAN	COMPUTER SYSTEMS MANAGER	ACTIVE	116000.0	PER ANNUM	1831.00	115104.57	0.00	0.00	0.0
4	2016	NaN	DISTRICT ATTORNEY-MANHATTAN	ABRAMS	JOSEPH		2015-05-21 00:00:00+00:00	MANHATTAN	COLLEGE AIDE	CEASED	1.0	PER HOUR	0.00	3500.00	0.00	0.00	0.0

dfPay.shape

(2748818, 17)

Unique / Distinct Values#

Before we can teach you about pivot tables, we need to teach you an important underlying concept, unique values.

Consider the following randomly chosen values from the Agency Name column of our dataset:

dfPay[['Work Location Borough']].sample(5, random_state=1)

	Work Location Borough
1686946	BRONX
535165	MANHATTAN
3128672	MANHATTAN
1632295	QUEENS
3092208	MANHATTAN

Notice that certain values are repeated. For example, “MANHATTAN” appears 3 times. What if we want to know all values that appear at least once in the column? We don’t want to see repeats. We simply want to know which values appear at least once in the column. Why would we care to know this? First of all, because we might want a list of the boroughs in New York City! Second, when we create a pivot table, we might want to report statistics by borough. That means that each borough should appear only once in our pivot table.

What we are looking for are the unique values in the column (note that the term distinct, which is used in SQL, is synonymous with unique). Pandas makes it very easy to compute the unique values in a column. Simply call the unique method:

dfPay['Work Location Borough'].unique()

array(['MANHATTAN', 'STATEN ISLAND', 'QUEENS', 'BRONX', 'BROOKLYN'],
      dtype=object)

Another way to get unique values in a column is the value_counts method. That method returns the unique values as well as a count of each value.

dfPay['Work Location Borough'].value_counts()

Work Location Borough
MANHATTAN        1975186
QUEENS            314960
BROOKLYN          270173
BRONX             150150
STATEN ISLAND      38349
Name: count, dtype: int64

We prefer the value_counts method to the unique method for two reasons. First, value_counts returns counts in addition to unique values and these are often useful. For example, we just learned that most of the employees in the dataset work in Manhattan. Second, value_counts returns a Pandas Series while unique returns a NumPy array. Since we are doing most of our work in Pandas, it’s often easier to work directly with Pandas objects.

The concept of unique values is super important. Not only does it underlie the pivot tables that we are about to create, but it is also used when merging data tables.

Let’s see one more example of uniqueness. Let’s retrieve the unique departments in the NYC government. From this list, which is quite long (there appear to be 160 departments in the NYC government!), we see that education is largest, followed by police.

dfPay['Agency Name'].value_counts()

Agency Name
DEPT OF ED PEDAGOGICAL            471072
DEPT OF ED PER SESSION TEACHER    426514
POLICE DEPARTMENT                 307717
DEPT OF ED PARA PROFESSIONALS     176199
BOARD OF ELECTION POLL WORKERS    170968
                                   ...  
BROOKLYN COMMUNITY BOARD #3           15
MANHATTAN COMMUNITY BOARD #7          15
BRONX COMMUNITY BOARD #5              14
BRONX COMMUNITY BOARD #1              11
BRONX COMMUNITY BOARD #3              10
Name: count, Length: 160, dtype: int64

Pivot Tables in Pandas#

A pivot table computes something for unique values. That’s all there is to it! :)

A Simple Pivot Table#

Let’s begin with a simple example: let’s compute the average salary paid to workers in each of NYC’s boroughs. We’ll first filter it to include only fiscal year 2018:

# Let's get cute and display the numbers in accounting format. :)
pd.set_option('float_format', lambda x: f'${x:,.0f}')

pd.pivot_table(data=dfPay[dfPay['Fiscal Year']==2018],
               index='Work Location Borough',
               values='Base Salary',
               aggfunc='mean')

	Base Salary
Work Location Borough
BRONX	$49,584
BROOKLYN	$52,777
MANHATTAN	$40,780
QUEENS	$52,138
STATEN ISLAND	$53,919

What the pivot table shows and how it computes its values#

Before we examine the code, let’s look at the pivot table. There’s one row per borough. The pivot table computed the unique values in the Work Location Borough column and created one row per unique borough in the resulting pivot table. Then, for each borough, it filtered the data for that borough and computed the average of the base salary for that borough. To see this, consider the following code that produces the exact same values as our pivot table above.

# Get only the 2018 rows. Drop any rows with missing values
df2018 = dfPay[dfPay['Fiscal Year'] == 2018].dropna(subset=['Base Salary'])
# Now get the unique boroughs as a list
boroughs = df2018['Work Location Borough'].unique().tolist()
# Sort the list of boroughs
boroughs.sort()

# For each borough in the list of unique boroughs
for borough in boroughs:
    # Filter the 2018 data on that borough
    df = df2018[df2018['Work Location Borough'] == borough]
    # Compute the mean salary for that borough in 2018
    meanSalary = df['Base Salary'].mean()
    
    print(f"{borough}: ${meanSalary:,.0f}")

BRONX: $49,584
BROOKLYN: $52,777
MANHATTAN: $40,780
QUEENS: $52,138
STATEN ISLAND: $53,919

To reiterate, the pivot table computed the unique values in the Work Location Borough column. For each unique borough, it filtered the data for that borough and then computed the average of the Base Salary for that borough.

Understanding the code to generate the pivot table#

Above, we used the following code to create our pivot table:

pd.pivot_table(data=dfPay[dfPay['Fiscal Year']==2018],
               index='Work Location Borough',
               values='Base Salary',
               aggfunc='mean')

Let’s examine the code and understand what each argument does.

pd.pivot_table. This is the Pandas function for creating a pivot table.
data=dfPay[dfPay['Fiscal Year']==2018]. You must pass a DataFrame to the data argument. This argument is required. The pivot_table function will use this data to create a pivot table.
index='Work Location Borough'. The index argument tells the pivot table which values to use as rows in the resulting pivot table. In this case, we told it to take the Work Location Borough column of our source data and extract its unique values.
values='Base Salary. The pivot table will take the values from this column and use them in its computations.
aggfunc='mean'. “aggfunc” stands for “aggregation function”. By passing the string 'mean', we told Pandas to use the Numpy function numpy.mean as our aggregation function. Thus, the pivot table will average the values in the Base Salary column.

In sum, the code tells Pandas to take the 2018 data, extract the unique boroughs and, for each borough, compute the mean of the Base Salary column.

Computing More Than One Statistic#

In the above example, we compute the mean salary paid to workers in each borough in 2018. We will extend that example and show how to compute the mean and median salary for each borough. Look at the following code and its output.

pd.pivot_table(data=dfPay[dfPay['Fiscal Year']==2018],
               index='Work Location Borough',
               values='Base Salary',
               aggfunc=['mean', 'median'])

	mean	median
	Base Salary	Base Salary
Work Location Borough
BRONX	$49,584	$48,666
BROOKLYN	$52,777	$50,604
MANHATTAN	$40,780	$35,977
QUEENS	$52,138	$50,604
STATEN ISLAND	$53,919	$55,024

Notice that the first data column, mean Base Salary, contains the exact same information as our table above. However, this new table contains a second column that contains the median salaries.

The only change to the code was to the aggfunc argument. We changed that to aggfunc=['mean', 'median']. Instead of passing a function, we passed a list of functions. The pivot_table function then created one column per function. It did everything else identically. It first retrieved the unique boroughs (from the index argument), used values from the Base Salary column in its computations. For each borough, it computed the mean and the median.

Grouping By Multiple Variables in Rows#

Often, your business question implies a logical hierarchy among your variables. For example, say you want to see the average pay in each department, but you want to break it out by borough. Department heads might want to see this information to know if they are paying appropriate amounts for employees in different parts of the city. Another example is to show the pay by borough and break it out by hourly and salaried employees. This might be useful if one group of employees is unionized and the other is not. Or if the hourly and salaried employees belong to different unions, or if they have different contracts.

In this section, let’s use the example of pay by department and by borough. To make this tractable, we will restrict our analysis to (a) salaried employees (b) in the top 5 departments (c) in 2018.

First, let’s filter the data so we only have rows for salaried employees in 2018 for the top 5 departments. We’ll save the filtered data in a data frame called dfTop5Dept2018.

# Condition 1: fiscal year == 2018
# Condition 2: pay basis == 'PER ANNUM'
# Condition 3: get the top 5 departments, as measured by number of employees.
#     Save the top 5 departments in the variable top5depts
top5depts =  dfPay[(dfPay['Fiscal Year']==2018) & 
                   (dfPay['Pay Basis'] == 'PER ANNUM')]['Agency Name'].value_counts().head(5).index

# Filter the data frame dfPay using the 3 conditions above.
dfTop5Dept2018 = dfPay[(dfPay['Fiscal Year']==2018) & 
                       (dfPay['Pay Basis'] == 'PER ANNUM') & 
                       (dfPay['Agency Name'].isin(top5depts))]

Now let’s create a pivot table showing, for each of the top 5 departments in 2018, the average pay in each borough.

pd.pivot_table(data=dfTop5Dept2018, 
               index=['Agency Name', 'Work Location Borough'], 
               values='Base Salary', 
               aggfunc='mean')

		Base Salary
Agency Name	Work Location Borough
DEPT OF ED PARA PROFESSIONALS	MANHATTAN	$36,139
DEPT OF ED PEDAGOGICAL	MANHATTAN	$87,796
FIRE DEPARTMENT	BRONX	$72,784
	BROOKLYN	$77,556
	MANHATTAN	$76,143
	QUEENS	$71,143
	STATEN ISLAND	$80,794
HRA/DEPT OF SOCIAL SERVICES	BRONX	$47,447
	BROOKLYN	$54,658
	MANHATTAN	$61,390
	QUEENS	$46,781
	STATEN ISLAND	$49,353
POLICE DEPARTMENT	BRONX	$71,268
	BROOKLYN	$68,800
	MANHATTAN	$78,636
	QUEENS	$68,304
	STATEN ISLAND	$73,843

Take a look at the above pivot table. Notice that the Agency Name and Work Location Borough columns are organized into a hierarchy. For each unique level of Agency Name, the pivot table shows all unique values of Work Location Borough. Notice that Pandas omitted boroughs with no employees.

From this table, we can see that two of the top departments in the city, Department of Ed Para Professionals and Dept of Ed Pedagogical, have all of their employees in Manhattan. For the other three departments, we can easily compare average salaries across boroughs. However, we cannot conclude from this that employees in some boroughs get paid more for the same job than their counterparts in other boroughs. To examine that, we would need to compare job titles and years of experience.

Understanding the Unique Values when Grouping By Multiple Variables#

When you group by multiple variables, you need to think of unique values differently. In the pivot table above, the unique values are (DEPT OF ED PARA PROFESSIONALS, MANHATTAN), (DEPT OF ED PEDAGOGICAL, MANHATTAN), (FIRE DEPARTMENT, BRONX), (FIRE DEPARTMENT, BROOKLYN), etc. In other words, take the unique values from Agency Name and the unique values from Work Location Borough; form every possible combination of unique values. The pivot table computes the mean salary for each unique combination.

Understanding the Code to Create a Pivot Table that Groups By Multiple Variables#

The code to create the pivot table above was:

pd.pivot_table(data=dfTop5Dept2018, 
               index=['Agency Name', 'Work Location Borough'], 
               values='Base Salary', 
               aggfunc='mean')

If you compare this code to the code from our simplest example, you will discover only one difference. For the index argument, we passed a list of column names. That tells Pandas to group the data first by Agency Name and then by Work Location Borough.

Grouping By Multiple Variables in Rows and Columns#

Often, you want to compute something for two variables but there is no logical hierarchy among the variables. For example, say you want to show the average pay by borough and by fiscal year. We could use a format like that in the previous section, but should we group first by borough and then fiscal year? Or by fiscal year and then borough? There’s no obvious answer. Also, the format shown above does not facilitate comparisons across the second variable. When you find yourself in a situation like this, consider organizing the data into rows and columns.

Take a look at the pivot table below:

pd.pivot_table(data=dfPay, 
               index='Work Location Borough', 
               columns='Fiscal Year', 
               values='Base Salary', 
               aggfunc='mean', 
               margins=True)

Fiscal Year	2014	2015	2016	2017	2018	2019	All
Work Location Borough
BRONX	$58,788	$44,626	$46,477	$44,402	$49,584	$51,025	$47,251
BROOKLYN	NaN	$48,105	$50,254	$49,710	$52,777	$53,834	$50,901
MANHATTAN	$62,937	$32,795	$37,827	$38,471	$40,780	$39,591	$37,889
QUEENS	$71,759	$47,722	$50,650	$48,483	$52,138	$52,358	$50,339
STATEN ISLAND	$69,804	$49,004	$51,869	$53,531	$53,919	$54,322	$52,599
All	$63,752	$36,720	$41,248	$41,382	$43,979	$43,116	$41,311

The above table lays out the data by borough (rows) and fiscal year (columns). Notice that we included an average for each row (the All column), and an average for each column (the All row). This format allows for easy comparison across boroughs, across fiscal years, and within boroughs. It is a very efficient way to present data.

Understanding the Unique Values When Grouping By Multiple Variables in Rows and Columns#

The unique values for the pivot table are all combinations of unique values of the rows and columns. Thus, (BRONX, 2014) is a unique value, as is (STATEN ISLAND, 2019). The pivot table will compute the mean of the base salary for each of these unique values.

Understanding the Code to Create a Pivot Table that Groups By Multiple Variables in Rows and Columns#

Let’s examine the code used to create the pivot table:

pd.pivot_table(data=dfPay, 
               index='Work Location Borough', 
               columns='Fiscal Year', 
               values='Base Salary', 
               aggfunc='mean', 
               margins=True)

When you compare this to the simplest pivot table above, you will discover two differences:

We added a columns argument. This tells Pandas to get the unique values from the Fiscal Year column and use them as column headings in the pivot table.
We added the argument margins=True. That tells Pandas to include summary rows and columns in the pivot table (the ones with the heading All).

Computing Statistics for Multiple Variables#

Another use case for pivot tables is to compute the same statistics for multiple variables. As a simple example, let’s compute the mean salary and mean years of experience by borough.

First, we need to compute the years of experience. Let’s use New York’s 2018 fiscal year, which ran from July 1, 2017 through June 30, 2018. We will compute the number of years of experience at the end of the fiscal year. That way, people who start in the middle of the fiscal year will show up as having a positive amount of work experience. We will also focus on salaried employees only.

# Filter on fiscal year 2018, salaried employees only
dfFilt = dfPay[(dfPay['Fiscal Year'] == 2018) & 
               (dfPay['Pay Basis'] == 'PER ANNUM')].copy()
# Create a date constant containing the end of the fiscal year
endOfFY = pd.to_datetime('6-30-2018 00:00:00 UTC', format='%m-%d-%Y %H:%M:%S %Z', utc=True)

dfFilt['Years of Experience'] = ((endOfFY - dfFilt['Agency Start Date']) / np.timedelta64(1,'D')) / 365.0

# Reset the Pandas display options
pd.reset_option('float_format')

pd.pivot_table(data=dfFilt,
               index='Work Location Borough',
               values=['Base Salary', 'Years of Experience'],
               aggfunc='mean')

	Base Salary	Years of Experience
Work Location Borough
BRONX	65145.129919	11.126592
BROOKLYN	67400.887592	11.648366
MANHATTAN	75038.574755	12.030538
QUEENS	68833.177272	10.573341
STATEN ISLAND	70285.517079	12.282185

In the above code, we passed a list of column names to the Values argument. That told Pandas to compute the mean of each Values column for each unique value in Work Location Borough.

More Complicated Pivot Tables#

In this chapter, we showed you how to create simple pivot tables in Pandas. You can create much more complicated tables by passing lists to multiple arguments in the pivot_table function. For example, say we want to recreate a pivot table that breaks out Pay Basis by Borough (rows). In each column, we want to show a different fiscal year. And we want to do this for two variables, Base Salary and Total Other Pay. You could do that as follows. Notice that all we’re doing is passing more columns to the arguments than we did previously.

pd.pivot_table(data=dfPay,
               index=['Work Location Borough', 'Pay Basis'],
               columns='Fiscal Year',
               values=['Base Salary', 'Total Other Pay'],
               aggfunc='mean')

		Base Salary						Total Other Pay
	Fiscal Year	2014	2015	2016	2017	2018	2019	2014	2015	2016	2017	2018	2019
Work Location Borough	Pay Basis
BRONX	PER ANNUM	58914.094319	60474.766639	62271.607774	64138.005775	65145.129919	65869.988143	691.695520	6114.452525	6088.658542	6196.082143	6418.858816	6075.399860
	PER DAY	NaN	192.062425	209.050852	174.560338	242.845655	224.684363	NaN	1405.022760	1876.941646	1612.088610	2993.338527	2901.112367
	PER HOUR	29.265000	22.159489	23.570725	27.410545	24.212095	23.177066	1348.675000	588.600548	512.666901	475.857755	624.052476	679.708500
	PRORATED ANNUAL	NaN	26774.639752	24266.495614	26549.752066	27333.981685	26009.958333	NaN	808.867733	96.959737	218.843554	263.200293	117.033906
BROOKLYN	PER ANNUM	NaN	61959.211599	63611.989467	66203.370088	67400.887592	68356.687433	NaN	5772.251540	5494.510776	6003.883035	6497.080991	5841.043359
	PER DAY	NaN	228.706633	250.435743	230.142034	273.271984	277.464870	NaN	2150.187101	2829.792890	2690.540262	3987.387382	4541.090950
	PER HOUR	NaN	19.998751	21.914079	23.865114	22.591124	23.095998	NaN	847.972633	778.960101	672.425582	817.215062	779.015431
	PRORATED ANNUAL	NaN	27528.794118	25489.415011	32011.370213	28857.308000	29393.553125	NaN	746.439614	141.365894	341.782043	263.126220	184.399125
MANHATTAN	PER ANNUM	70302.205923	66926.799450	69219.929993	72014.764937	75038.574755	77066.139305	554.636012	1950.544238	1798.279692	1888.983719	1903.893006	1818.831188
	PER DAY	278.545333	60.920634	66.200247	66.778423	68.588313	61.599935	3886.392000	80.420521	155.444418	145.750868	173.816300	164.031062
	PER HOUR	3.790839	8.753896	9.441464	1167.428688	1405.911154	1295.316307	90.803226	258.536014	102.568193	193.662038	98.376645	97.677109
	PRORATED ANNUAL	NaN	30113.815237	32451.825311	33278.138812	35156.646578	36080.140584	NaN	638.141782	283.669740	287.708828	192.155898	239.800706
QUEENS	PER ANNUM	72638.769939	64872.518101	66468.231356	67518.476146	68833.177272	70574.424486	640.292439	6113.675569	5684.608201	5539.324965	5759.776885	5555.284024
	PER DAY	NaN	254.609522	274.408341	261.250382	271.026471	284.399149	NaN	2612.646116	3717.570491	3392.948769	4054.715595	4678.670515
	PER HOUR	28.836250	26.367431	30.445100	32.692165	31.694309	31.842055	73.767500	639.035013	722.053218	587.611901	698.913643	644.501699
	PRORATED ANNUAL	NaN	29193.782209	29777.313043	29397.394958	29777.990566	31113.146893	NaN	976.124172	262.186652	395.754790	328.796509	302.280734
STATEN ISLAND	PER ANNUM	73064.887850	65302.354034	67092.591946	69204.875021	70285.517079	70944.995782	586.933925	7431.757451	7584.251536	7647.716819	8086.601002	7448.884692
	PER DAY	NaN	306.503156	326.049903	346.949620	353.915345	362.141402	NaN	4887.980225	6559.638149	6528.085054	8152.936790	8834.284530
	PER HOUR	31.514000	18.891888	21.512435	22.148017	23.313177	23.942032	1020.262000	507.116870	672.432472	756.017677	829.246372	802.790677
	PRORATED ANNUAL	NaN	26722.350649	23334.315789	29979.666667	29518.311475	29970.100000	NaN	929.768182	117.790000	500.000000	336.795082	247.122000

Other Tips and Useful Information#

Counting the Number of Rows for Each Unique Value#

Say we want to know how many salaried employees work in each borough in each fiscal year. From this description, we know we need to pass Work Location Borough to the index argument and Fiscal Year to the columns argument. How do we tell Pandas to count the number of employees in each unique combination of borough and year? Simply pass the string 'count' as the aggfunc.

pd.pivot_table(data=dfPay[dfPay['Pay Basis'] == 'PER ANNUM'],
               index='Work Location Borough',
               columns='Fiscal Year',
               values='Base Salary',
               aggfunc='count')

Fiscal Year	2014	2015	2016	2017	2018	2019
Work Location Borough
BRONX	933.0	21182.0	21423.0	22431.0	22388.0	22699.0
BROOKLYN	NaN	42027.0	43208.0	41476.0	40716.0	41630.0
MANHATTAN	1452.0	197712.0	204959.0	204614.0	204775.0	215398.0
QUEENS	652.0	43211.0	45569.0	47918.0	47701.0	48372.0
STATEN ISLAND	107.0	5654.0	5786.0	5889.0	5861.0	6013.0

Let’s take a look at the above code. The index and columns arguments should be self-explanatory. We passed the string 'count' to the aggfunc argument. The only confusing thing might be the values argument. Why did I pass Base Salary to that? Because Pandas needs to count something. Anything. In this code, we told Pandas to count the number of rows in the Base Salary column for each unique value of ‘Work Location Borough’ and Fiscal Year. In theory, we could have counted any other column. To see this, consider the following code. It’s identical to the code above, except for the values argument.

pd.pivot_table(data=dfPay[dfPay['Pay Basis'] == 'PER ANNUM'],
               index='Work Location Borough',
               columns='Fiscal Year',
               values='Last Name',
               aggfunc='count')

Fiscal Year	2014	2015	2016	2017	2018	2019
Work Location Borough
BRONX	933.0	21182.0	21423.0	22431.0	22388.0	22699.0
BROOKLYN	NaN	42027.0	43208.0	41476.0	40716.0	41630.0
MANHATTAN	1452.0	197712.0	204959.0	204614.0	204775.0	215398.0
QUEENS	652.0	43211.0	45569.0	47918.0	47701.0	48372.0
STATEN ISLAND	107.0	5654.0	5786.0	5889.0	5861.0	6013.0

The `pivot_table` function returns a DataFrame#

The pivot_table function returns a DataFrame. That means you can save the pivot_table to a variable, change its contents, and use it in future computations. To see this, consider the following code:

myPivotTable = pd.pivot_table(data=dfPay[dfPay['Pay Basis'] == 'PER ANNUM'],
                              index='Work Location Borough',
                              columns='Fiscal Year',
                              values='Last Name',
                              aggfunc='count')
myPivotTable

Fiscal Year	2014	2015	2016	2017	2018	2019
Work Location Borough
BRONX	933.0	21182.0	21423.0	22431.0	22388.0	22699.0
BROOKLYN	NaN	42027.0	43208.0	41476.0	40716.0	41630.0
MANHATTAN	1452.0	197712.0	204959.0	204614.0	204775.0	215398.0
QUEENS	652.0	43211.0	45569.0	47918.0	47701.0	48372.0
STATEN ISLAND	107.0	5654.0	5786.0	5889.0	5861.0	6013.0

# Get the type of the variable myPivotTable
type(myPivotTable)

pandas.core.frame.DataFrame

# Change the column names of myPivotTable
myPivotTable.columns = ['FY2014', 'FY2015', 'FY2016', 'FY2017', 'FY2018', 'FY2019']
myPivotTable

	FY2014	FY2015	FY2016	FY2017	FY2018	FY2019
Work Location Borough
BRONX	933.0	21182.0	21423.0	22431.0	22388.0	22699.0
BROOKLYN	NaN	42027.0	43208.0	41476.0	40716.0	41630.0
MANHATTAN	1452.0	197712.0	204959.0	204614.0	204775.0	215398.0
QUEENS	652.0	43211.0	45569.0	47918.0	47701.0	48372.0
STATEN ISLAND	107.0	5654.0	5786.0	5889.0	5861.0	6013.0

# Add a projection for FY2020
myPivotTable['FY2020 (projected)'] = myPivotTable['FY2019'] * 1.05
myPivotTable

	FY2014	FY2015	FY2016	FY2017	FY2018	FY2019	FY2020 (projected)
Work Location Borough
BRONX	933.0	21182.0	21423.0	22431.0	22388.0	22699.0	23833.95
BROOKLYN	NaN	42027.0	43208.0	41476.0	40716.0	41630.0	43711.50
MANHATTAN	1452.0	197712.0	204959.0	204614.0	204775.0	215398.0	226167.90
QUEENS	652.0	43211.0	45569.0	47918.0	47701.0	48372.0	50790.60
STATEN ISLAND	107.0	5654.0	5786.0	5889.0	5861.0	6013.0	6313.65

Pivot Tables in Pandas

Contents

Pivot Tables in Pandas#

Learning Objectives#

Overview of Pivot Tables in Pandas#

Imports and Loading the Data#

Unique / Distinct Values#

Pivot Tables in Pandas#

A Simple Pivot Table#

What the pivot table shows and how it computes its values#

Understanding the code to generate the pivot table#

Computing More Than One Statistic#

Grouping By Multiple Variables in Rows#

Understanding the Unique Values when Grouping By Multiple Variables#

Understanding the Code to Create a Pivot Table that Groups By Multiple Variables#

Grouping By Multiple Variables in Rows and Columns#

Understanding the Unique Values When Grouping By Multiple Variables in Rows and Columns#

Understanding the Code to Create a Pivot Table that Groups By Multiple Variables in Rows and Columns#

Computing Statistics for Multiple Variables#

More Complicated Pivot Tables#

Other Tips and Useful Information#

Counting the Number of Rows for Each Unique Value#

The pivot_table function returns a DataFrame#

The `pivot_table` function returns a DataFrame#