Pivot Tables in Pandas#
Learning Objectives#
- Explain the concept of unique values and why it is so important.
- Create a simple pivot table in Pandas that computes one statistic for the unique values of one column.
- Create a pivot table in Pandas that computes multiple statistics for the unique values of one column.
- Create a pivot table in Pandas that groups by multiple variables within rows.
- Create a pivot table in Pandas that groups by two variables organized into rows and columns.
- Create a pivot table in Pandas that computes multiple statistics for the unique values of one column.
- Explain how to create more complicated pivot tables by using the index, columns, values, and aggfunc arguments.
- Use the len function to count the number of rows that exist for unique values in different columns.
- Explain how a Pandas pivot table is a DataFrame, and can be manipulated just like any other DataFrame.
Overview of Pivot Tables in Pandas#
In previous units, we showed you how to prepare your data for analysis. You loaded, cleaned, filtered, and manipulated data. In the previous chapter, we showed you how to begin data analysis by computing descriptive statistics. In this chapter, we will continue our journey into analyzing data by teaching you about pivot tables in Pandas.
A pivot table is a table that provides statistics for unique values of one or more variables. For example, what is the average salary in each department of a company? You will find that pivot tables will be one of the most useful tools in your toolkit. If you are adept at making pivot tables, you will be a star in your job. However, as you learn about pivot tables, realize that the skill lies not in writing code, but in thinking about what you want to learn from your data and interpreting the results. If you can envision the pivot table needed for your business question before you write any code, then the act of creating the table will seem simple.
Imports and Loading the Data#
import numpy as np
import pandas as pd
In this chapter, we will work with salary data for employees of the City of New York (link).
The cell below loads salary data for employees of the City of New York. Each row represents one employee in the one fiscal year. For a list of the columns and their meanings, click here.
NOTE: You need either the fastparquet library or the pyarrow libraryto load the data file used in these notes.
You should have installed one of these libraries at the beginning of the semester when you installed Python. If you did not, or if you get errors, don’t panic. If you just want to read these notes, you do not need to run these cells and you do not need to install either library.
If you’re wondering what these libraries do, read this paragraph. The data file is quite large, approximately 376 MB as a CSV. That makes it difficult to share. It has too many rows to store in an Excel worksheet, and if we gave you a CSV, it would take many minutes to load, and you would have to clean many of the columns by converting them from text to other data types. To avoid all this, we chose to store the data in Apache Parquet format. Parquet is a super-cool format that stores the data by column. That allows it to easily compress the data, because you tend to find the same value repeated many times in each column. This compression reduces the file size (the parquet file is just under 50 MB), and greatly speeds up the loading process. We love the Parquet format and frequently use it when working with larger datasets. It is compatible with many big data and cloud technologies. And Pandas provides support for Parquet.
dfPay = pd.read_parquet('data/NYC_Payroll_data.parquet')
# Make a fix to the Agency Start Date column
dfPay['Agency Start Date'] = pd.to_datetime(dfPay['Agency Start Date'], utc=True)
dfPay.head()
Fiscal Year | Payroll Number | Agency Name | Last Name | First Name | Mid Init | Agency Start Date | Work Location Borough | Title Description | Leave Status as of June 30 | Base Salary | Pay Basis | Regular Hours | Regular Gross Paid | OT Hours | Total OT Paid | Total Other Pay | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016 | NaN | DISTRICT ATTORNEY-MANHATTAN | ABA'AHMID | RAHASHEEM | E | 2003-07-14 00:00:00+00:00 | MANHATTAN | COMMUNITY ASSOCIATE | ACTIVE | 47678.0 | PER ANNUM | 1830.00 | 47166.03 | 2.25 | 57.28 | 0.0 |
1 | 2016 | NaN | DISTRICT ATTORNEY-MANHATTAN | ABENSUR | MARGARET | 1995-06-12 00:00:00+00:00 | MANHATTAN | ADMINISTRATIVE ACCOUNTANT | ACTIVE | 119959.0 | PER ANNUM | 1831.00 | 119042.50 | 0.00 | 0.00 | 0.0 | |
2 | 2016 | NaN | DISTRICT ATTORNEY-MANHATTAN | ABOUNAOUM | ANDREA | L | 2011-10-11 00:00:00+00:00 | MANHATTAN | COMMUNITY ASSOCIATE | ON LEAVE | 39966.0 | PER ANNUM | 1181.68 | 27452.62 | 1.00 | 21.34 | 33.0 |
3 | 2016 | NaN | DISTRICT ATTORNEY-MANHATTAN | ABRAHAM | JONATHAN | J | 2014-12-01 00:00:00+00:00 | MANHATTAN | COMPUTER SYSTEMS MANAGER | ACTIVE | 116000.0 | PER ANNUM | 1831.00 | 115104.57 | 0.00 | 0.00 | 0.0 |
4 | 2016 | NaN | DISTRICT ATTORNEY-MANHATTAN | ABRAMS | JOSEPH | 2015-05-21 00:00:00+00:00 | MANHATTAN | COLLEGE AIDE | CEASED | 1.0 | PER HOUR | 0.00 | 3500.00 | 0.00 | 0.00 | 0.0 |
dfPay.shape
(2748818, 17)
Unique / Distinct Values#
Before we can teach you about pivot tables, we need to teach you an important underlying concept, unique values.
Consider the following randomly chosen values from the Agency Name column of our dataset:
dfPay[['Work Location Borough']].sample(5, random_state=1)
Work Location Borough | |
---|---|
1686946 | BRONX |
535165 | MANHATTAN |
3128672 | MANHATTAN |
1632295 | QUEENS |
3092208 | MANHATTAN |
Notice that certain values are repeated. For example, “MANHATTAN” appears 3 times. What if we want to know all values that appear at least once in the column? We don’t want to see repeats. We simply want to know which values appear at least once in the column. Why would we care to know this? First of all, because we might want a list of the boroughs in New York City! Second, when we create a pivot table, we might want to report statistics by borough. That means that each borough should appear only once in our pivot table.
What we are looking for are the unique values in the column (note that the term distinct, which is used in SQL, is synonymous with unique). Pandas makes it very easy to compute the unique values in a column. Simply call the unique
method:
dfPay['Work Location Borough'].unique()
array(['MANHATTAN', 'STATEN ISLAND', 'QUEENS', 'BRONX', 'BROOKLYN'],
dtype=object)
Another way to get unique values in a column is the value_counts
method. That method returns the unique values as well as a count of each value.
dfPay['Work Location Borough'].value_counts()
Work Location Borough
MANHATTAN 1975186
QUEENS 314960
BROOKLYN 270173
BRONX 150150
STATEN ISLAND 38349
Name: count, dtype: int64
We prefer the value_counts
method to the unique
method for two reasons. First, value_counts
returns counts in addition to unique values and these are often useful. For example, we just learned that most of the employees in the dataset work in Manhattan. Second, value_counts
returns a Pandas Series while unique
returns a NumPy array. Since we are doing most of our work in Pandas, it’s often easier to work directly with Pandas objects.
The concept of unique values is super important. Not only does it underlie the pivot tables that we are about to create, but it is also used when merging data tables.
Let’s see one more example of uniqueness. Let’s retrieve the unique departments in the NYC government. From this list, which is quite long (there appear to be 160 departments in the NYC government!), we see that education is largest, followed by police.
dfPay['Agency Name'].value_counts()
Agency Name
DEPT OF ED PEDAGOGICAL 471072
DEPT OF ED PER SESSION TEACHER 426514
POLICE DEPARTMENT 307717
DEPT OF ED PARA PROFESSIONALS 176199
BOARD OF ELECTION POLL WORKERS 170968
...
BROOKLYN COMMUNITY BOARD #3 15
MANHATTAN COMMUNITY BOARD #7 15
BRONX COMMUNITY BOARD #5 14
BRONX COMMUNITY BOARD #1 11
BRONX COMMUNITY BOARD #3 10
Name: count, Length: 160, dtype: int64
Pivot Tables in Pandas#
A pivot table computes something for unique values. That’s all there is to it! :)
A Simple Pivot Table#
Let’s begin with a simple example: let’s compute the average salary paid to workers in each of NYC’s boroughs. We’ll first filter it to include only fiscal year 2018:
# Let's get cute and display the numbers in accounting format. :)
pd.set_option('float_format', lambda x: f'${x:,.0f}')
pd.pivot_table(data=dfPay[dfPay['Fiscal Year']==2018],
index='Work Location Borough',
values='Base Salary',
aggfunc='mean')
Base Salary | |
---|---|
Work Location Borough | |
BRONX | $49,584 |
BROOKLYN | $52,777 |
MANHATTAN | $40,780 |
QUEENS | $52,138 |
STATEN ISLAND | $53,919 |
What the pivot table shows and how it computes its values#
Before we examine the code, let’s look at the pivot table. There’s one row per borough. The pivot table computed the unique values in the Work Location Borough column and created one row per unique borough in the resulting pivot table. Then, for each borough, it filtered the data for that borough and computed the average of the base salary for that borough. To see this, consider the following code that produces the exact same values as our pivot table above.
# Get only the 2018 rows. Drop any rows with missing values
df2018 = dfPay[dfPay['Fiscal Year'] == 2018].dropna(subset=['Base Salary'])
# Now get the unique boroughs as a list
boroughs = df2018['Work Location Borough'].unique().tolist()
# Sort the list of boroughs
boroughs.sort()
# For each borough in the list of unique boroughs
for borough in boroughs:
# Filter the 2018 data on that borough
df = df2018[df2018['Work Location Borough'] == borough]
# Compute the mean salary for that borough in 2018
meanSalary = df['Base Salary'].mean()
print(f"{borough}: ${meanSalary:,.0f}")
BRONX: $49,584
BROOKLYN: $52,777
MANHATTAN: $40,780
QUEENS: $52,138
STATEN ISLAND: $53,919
To reiterate, the pivot table computed the unique values in the Work Location Borough column. For each unique borough, it filtered the data for that borough and then computed the average of the Base Salary for that borough.
Understanding the code to generate the pivot table#
Above, we used the following code to create our pivot table:
pd.pivot_table(data=dfPay[dfPay['Fiscal Year']==2018],
index='Work Location Borough',
values='Base Salary',
aggfunc='mean')
Let’s examine the code and understand what each argument does.
pd.pivot_table
. This is the Pandas function for creating a pivot table.data=dfPay[dfPay['Fiscal Year']==2018]
. You must pass a DataFrame to thedata
argument. This argument is required. Thepivot_table
function will use this data to create a pivot table.index='Work Location Borough'
. Theindex
argument tells the pivot table which values to use as rows in the resulting pivot table. In this case, we told it to take the Work Location Borough column of our source data and extract its unique values.values='Base Salary
. The pivot table will take the values from this column and use them in its computations.aggfunc='mean'
. “aggfunc” stands for “aggregation function”. By passing the string'mean'
, we told Pandas to use the Numpy function numpy.mean as our aggregation function. Thus, the pivot table will average the values in the Base Salary column.
In sum, the code tells Pandas to take the 2018 data, extract the unique boroughs and, for each borough, compute the mean of the Base Salary column.
Computing More Than One Statistic#
In the above example, we compute the mean salary paid to workers in each borough in 2018. We will extend that example and show how to compute the mean and median salary for each borough. Look at the following code and its output.
pd.pivot_table(data=dfPay[dfPay['Fiscal Year']==2018],
index='Work Location Borough',
values='Base Salary',
aggfunc=['mean', 'median'])
mean | median | |
---|---|---|
Base Salary | Base Salary | |
Work Location Borough | ||
BRONX | $49,584 | $48,666 |
BROOKLYN | $52,777 | $50,604 |
MANHATTAN | $40,780 | $35,977 |
QUEENS | $52,138 | $50,604 |
STATEN ISLAND | $53,919 | $55,024 |
Notice that the first data column, mean Base Salary, contains the exact same information as our table above. However, this new table contains a second column that contains the median salaries.
The only change to the code was to the aggfunc argument. We changed that to aggfunc=['mean', 'median']
. Instead of passing a function, we passed a list of functions. The pivot_table function then created one column per function. It did everything else identically. It first retrieved the unique boroughs (from the index argument), used values from the Base Salary column in its computations. For each borough, it computed the mean and the median.
Grouping By Multiple Variables in Rows#
Often, your business question implies a logical hierarchy among your variables. For example, say you want to see the average pay in each department, but you want to break it out by borough. Department heads might want to see this information to know if they are paying appropriate amounts for employees in different parts of the city. Another example is to show the pay by borough and break it out by hourly and salaried employees. This might be useful if one group of employees is unionized and the other is not. Or if the hourly and salaried employees belong to different unions, or if they have different contracts.
In this section, let’s use the example of pay by department and by borough. To make this tractable, we will restrict our analysis to (a) salaried employees (b) in the top 5 departments (c) in 2018.
First, let’s filter the data so we only have rows for salaried employees in 2018 for the top 5 departments. We’ll save the filtered data in a data frame called dfTop5Dept2018.
# Condition 1: fiscal year == 2018
# Condition 2: pay basis == 'PER ANNUM'
# Condition 3: get the top 5 departments, as measured by number of employees.
# Save the top 5 departments in the variable top5depts
top5depts = dfPay[(dfPay['Fiscal Year']==2018) &
(dfPay['Pay Basis'] == 'PER ANNUM')]['Agency Name'].value_counts().head(5).index
# Filter the data frame dfPay using the 3 conditions above.
dfTop5Dept2018 = dfPay[(dfPay['Fiscal Year']==2018) &
(dfPay['Pay Basis'] == 'PER ANNUM') &
(dfPay['Agency Name'].isin(top5depts))]
Now let’s create a pivot table showing, for each of the top 5 departments in 2018, the average pay in each borough.
pd.pivot_table(data=dfTop5Dept2018,
index=['Agency Name', 'Work Location Borough'],
values='Base Salary',
aggfunc='mean')
Base Salary | ||
---|---|---|
Agency Name | Work Location Borough | |
DEPT OF ED PARA PROFESSIONALS | MANHATTAN | $36,139 |
DEPT OF ED PEDAGOGICAL | MANHATTAN | $87,796 |
FIRE DEPARTMENT | BRONX | $72,784 |
BROOKLYN | $77,556 | |
MANHATTAN | $76,143 | |
QUEENS | $71,143 | |
STATEN ISLAND | $80,794 | |
HRA/DEPT OF SOCIAL SERVICES | BRONX | $47,447 |
BROOKLYN | $54,658 | |
MANHATTAN | $61,390 | |
QUEENS | $46,781 | |
STATEN ISLAND | $49,353 | |
POLICE DEPARTMENT | BRONX | $71,268 |
BROOKLYN | $68,800 | |
MANHATTAN | $78,636 | |
QUEENS | $68,304 | |
STATEN ISLAND | $73,843 |
Take a look at the above pivot table. Notice that the Agency Name and Work Location Borough columns are organized into a hierarchy. For each unique level of Agency Name, the pivot table shows all unique values of Work Location Borough. Notice that Pandas omitted boroughs with no employees.
From this table, we can see that two of the top departments in the city, Department of Ed Para Professionals and Dept of Ed Pedagogical, have all of their employees in Manhattan. For the other three departments, we can easily compare average salaries across boroughs. However, we cannot conclude from this that employees in some boroughs get paid more for the same job than their counterparts in other boroughs. To examine that, we would need to compare job titles and years of experience.
Understanding the Unique Values when Grouping By Multiple Variables#
When you group by multiple variables, you need to think of unique values differently. In the pivot table above, the unique values are (DEPT OF ED PARA PROFESSIONALS, MANHATTAN), (DEPT OF ED PEDAGOGICAL, MANHATTAN), (FIRE DEPARTMENT, BRONX), (FIRE DEPARTMENT, BROOKLYN), etc. In other words, take the unique values from Agency Name and the unique values from Work Location Borough; form every possible combination of unique values. The pivot table computes the mean salary for each unique combination.
Understanding the Code to Create a Pivot Table that Groups By Multiple Variables#
The code to create the pivot table above was:
pd.pivot_table(data=dfTop5Dept2018,
index=['Agency Name', 'Work Location Borough'],
values='Base Salary',
aggfunc='mean')
If you compare this code to the code from our simplest example, you will discover only one difference. For the index argument, we passed a list of column names. That tells Pandas to group the data first by Agency Name and then by Work Location Borough.
Grouping By Multiple Variables in Rows and Columns#
Often, you want to compute something for two variables but there is no logical hierarchy among the variables. For example, say you want to show the average pay by borough and by fiscal year. We could use a format like that in the previous section, but should we group first by borough and then fiscal year? Or by fiscal year and then borough? There’s no obvious answer. Also, the format shown above does not facilitate comparisons across the second variable. When you find yourself in a situation like this, consider organizing the data into rows and columns.
Take a look at the pivot table below:
pd.pivot_table(data=dfPay,
index='Work Location Borough',
columns='Fiscal Year',
values='Base Salary',
aggfunc='mean',
margins=True)
Fiscal Year | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | All |
---|---|---|---|---|---|---|---|
Work Location Borough | |||||||
BRONX | $58,788 | $44,626 | $46,477 | $44,402 | $49,584 | $51,025 | $47,251 |
BROOKLYN | NaN | $48,105 | $50,254 | $49,710 | $52,777 | $53,834 | $50,901 |
MANHATTAN | $62,937 | $32,795 | $37,827 | $38,471 | $40,780 | $39,591 | $37,889 |
QUEENS | $71,759 | $47,722 | $50,650 | $48,483 | $52,138 | $52,358 | $50,339 |
STATEN ISLAND | $69,804 | $49,004 | $51,869 | $53,531 | $53,919 | $54,322 | $52,599 |
All | $63,752 | $36,720 | $41,248 | $41,382 | $43,979 | $43,116 | $41,311 |
The above table lays out the data by borough (rows) and fiscal year (columns). Notice that we included an average for each row (the All column), and an average for each column (the All row). This format allows for easy comparison across boroughs, across fiscal years, and within boroughs. It is a very efficient way to present data.
Understanding the Unique Values When Grouping By Multiple Variables in Rows and Columns#
The unique values for the pivot table are all combinations of unique values of the rows and columns. Thus, (BRONX, 2014) is a unique value, as is (STATEN ISLAND, 2019). The pivot table will compute the mean of the base salary for each of these unique values.
Understanding the Code to Create a Pivot Table that Groups By Multiple Variables in Rows and Columns#
Let’s examine the code used to create the pivot table:
pd.pivot_table(data=dfPay,
index='Work Location Borough',
columns='Fiscal Year',
values='Base Salary',
aggfunc='mean',
margins=True)
When you compare this to the simplest pivot table above, you will discover two differences:
We added a columns argument. This tells Pandas to get the unique values from the Fiscal Year column and use them as column headings in the pivot table.
We added the argument
margins=True
. That tells Pandas to include summary rows and columns in the pivot table (the ones with the heading All).
Computing Statistics for Multiple Variables#
Another use case for pivot tables is to compute the same statistics for multiple variables. As a simple example, let’s compute the mean salary and mean years of experience by borough.
First, we need to compute the years of experience. Let’s use New York’s 2018 fiscal year, which ran from July 1, 2017 through June 30, 2018. We will compute the number of years of experience at the end of the fiscal year. That way, people who start in the middle of the fiscal year will show up as having a positive amount of work experience. We will also focus on salaried employees only.
# Filter on fiscal year 2018, salaried employees only
dfFilt = dfPay[(dfPay['Fiscal Year'] == 2018) &
(dfPay['Pay Basis'] == 'PER ANNUM')].copy()
# Create a date constant containing the end of the fiscal year
endOfFY = pd.to_datetime('6-30-2018 00:00:00 UTC', format='%m-%d-%Y %H:%M:%S %Z', utc=True)
dfFilt['Years of Experience'] = ((endOfFY - dfFilt['Agency Start Date']) / np.timedelta64(1,'D')) / 365.0
# Reset the Pandas display options
pd.reset_option('float_format')
pd.pivot_table(data=dfFilt,
index='Work Location Borough',
values=['Base Salary', 'Years of Experience'],
aggfunc='mean')
Base Salary | Years of Experience | |
---|---|---|
Work Location Borough | ||
BRONX | 65145.129919 | 11.126592 |
BROOKLYN | 67400.887592 | 11.648366 |
MANHATTAN | 75038.574755 | 12.030538 |
QUEENS | 68833.177272 | 10.573341 |
STATEN ISLAND | 70285.517079 | 12.282185 |
In the above code, we passed a list of column names to the Values argument. That told Pandas to compute the mean of each Values column for each unique value in Work Location Borough.
More Complicated Pivot Tables#
In this chapter, we showed you how to create simple pivot tables in Pandas. You can create much more complicated tables by passing lists to multiple arguments in the pivot_table
function. For example, say we want to recreate a pivot table that breaks out Pay Basis by Borough (rows). In each column, we want to show a different fiscal year. And we want to do this for two variables, Base Salary and Total Other Pay. You could do that as follows. Notice that all we’re doing is passing more columns to the arguments than we did previously.
pd.pivot_table(data=dfPay,
index=['Work Location Borough', 'Pay Basis'],
columns='Fiscal Year',
values=['Base Salary', 'Total Other Pay'],
aggfunc='mean')
Base Salary | Total Other Pay | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Fiscal Year | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
Work Location Borough | Pay Basis | ||||||||||||
BRONX | PER ANNUM | 58914.094319 | 60474.766639 | 62271.607774 | 64138.005775 | 65145.129919 | 65869.988143 | 691.695520 | 6114.452525 | 6088.658542 | 6196.082143 | 6418.858816 | 6075.399860 |
PER DAY | NaN | 192.062425 | 209.050852 | 174.560338 | 242.845655 | 224.684363 | NaN | 1405.022760 | 1876.941646 | 1612.088610 | 2993.338527 | 2901.112367 | |
PER HOUR | 29.265000 | 22.159489 | 23.570725 | 27.410545 | 24.212095 | 23.177066 | 1348.675000 | 588.600548 | 512.666901 | 475.857755 | 624.052476 | 679.708500 | |
PRORATED ANNUAL | NaN | 26774.639752 | 24266.495614 | 26549.752066 | 27333.981685 | 26009.958333 | NaN | 808.867733 | 96.959737 | 218.843554 | 263.200293 | 117.033906 | |
BROOKLYN | PER ANNUM | NaN | 61959.211599 | 63611.989467 | 66203.370088 | 67400.887592 | 68356.687433 | NaN | 5772.251540 | 5494.510776 | 6003.883035 | 6497.080991 | 5841.043359 |
PER DAY | NaN | 228.706633 | 250.435743 | 230.142034 | 273.271984 | 277.464870 | NaN | 2150.187101 | 2829.792890 | 2690.540262 | 3987.387382 | 4541.090950 | |
PER HOUR | NaN | 19.998751 | 21.914079 | 23.865114 | 22.591124 | 23.095998 | NaN | 847.972633 | 778.960101 | 672.425582 | 817.215062 | 779.015431 | |
PRORATED ANNUAL | NaN | 27528.794118 | 25489.415011 | 32011.370213 | 28857.308000 | 29393.553125 | NaN | 746.439614 | 141.365894 | 341.782043 | 263.126220 | 184.399125 | |
MANHATTAN | PER ANNUM | 70302.205923 | 66926.799450 | 69219.929993 | 72014.764937 | 75038.574755 | 77066.139305 | 554.636012 | 1950.544238 | 1798.279692 | 1888.983719 | 1903.893006 | 1818.831188 |
PER DAY | 278.545333 | 60.920634 | 66.200247 | 66.778423 | 68.588313 | 61.599935 | 3886.392000 | 80.420521 | 155.444418 | 145.750868 | 173.816300 | 164.031062 | |
PER HOUR | 3.790839 | 8.753896 | 9.441464 | 1167.428688 | 1405.911154 | 1295.316307 | 90.803226 | 258.536014 | 102.568193 | 193.662038 | 98.376645 | 97.677109 | |
PRORATED ANNUAL | NaN | 30113.815237 | 32451.825311 | 33278.138812 | 35156.646578 | 36080.140584 | NaN | 638.141782 | 283.669740 | 287.708828 | 192.155898 | 239.800706 | |
QUEENS | PER ANNUM | 72638.769939 | 64872.518101 | 66468.231356 | 67518.476146 | 68833.177272 | 70574.424486 | 640.292439 | 6113.675569 | 5684.608201 | 5539.324965 | 5759.776885 | 5555.284024 |
PER DAY | NaN | 254.609522 | 274.408341 | 261.250382 | 271.026471 | 284.399149 | NaN | 2612.646116 | 3717.570491 | 3392.948769 | 4054.715595 | 4678.670515 | |
PER HOUR | 28.836250 | 26.367431 | 30.445100 | 32.692165 | 31.694309 | 31.842055 | 73.767500 | 639.035013 | 722.053218 | 587.611901 | 698.913643 | 644.501699 | |
PRORATED ANNUAL | NaN | 29193.782209 | 29777.313043 | 29397.394958 | 29777.990566 | 31113.146893 | NaN | 976.124172 | 262.186652 | 395.754790 | 328.796509 | 302.280734 | |
STATEN ISLAND | PER ANNUM | 73064.887850 | 65302.354034 | 67092.591946 | 69204.875021 | 70285.517079 | 70944.995782 | 586.933925 | 7431.757451 | 7584.251536 | 7647.716819 | 8086.601002 | 7448.884692 |
PER DAY | NaN | 306.503156 | 326.049903 | 346.949620 | 353.915345 | 362.141402 | NaN | 4887.980225 | 6559.638149 | 6528.085054 | 8152.936790 | 8834.284530 | |
PER HOUR | 31.514000 | 18.891888 | 21.512435 | 22.148017 | 23.313177 | 23.942032 | 1020.262000 | 507.116870 | 672.432472 | 756.017677 | 829.246372 | 802.790677 | |
PRORATED ANNUAL | NaN | 26722.350649 | 23334.315789 | 29979.666667 | 29518.311475 | 29970.100000 | NaN | 929.768182 | 117.790000 | 500.000000 | 336.795082 | 247.122000 |
Other Tips and Useful Information#
Counting the Number of Rows for Each Unique Value#
Say we want to know how many salaried employees work in each borough in each fiscal year. From this description, we know we need to pass Work Location Borough to the index argument and Fiscal Year to the columns argument. How do we tell Pandas to count the number of employees in each unique combination of borough and year? Simply pass the string 'count'
as the aggfunc.
pd.pivot_table(data=dfPay[dfPay['Pay Basis'] == 'PER ANNUM'],
index='Work Location Borough',
columns='Fiscal Year',
values='Base Salary',
aggfunc='count')
Fiscal Year | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
---|---|---|---|---|---|---|
Work Location Borough | ||||||
BRONX | 933.0 | 21182.0 | 21423.0 | 22431.0 | 22388.0 | 22699.0 |
BROOKLYN | NaN | 42027.0 | 43208.0 | 41476.0 | 40716.0 | 41630.0 |
MANHATTAN | 1452.0 | 197712.0 | 204959.0 | 204614.0 | 204775.0 | 215398.0 |
QUEENS | 652.0 | 43211.0 | 45569.0 | 47918.0 | 47701.0 | 48372.0 |
STATEN ISLAND | 107.0 | 5654.0 | 5786.0 | 5889.0 | 5861.0 | 6013.0 |
Let’s take a look at the above code. The index and columns arguments should be self-explanatory. We passed the string 'count'
to the aggfunc argument. The only confusing thing might be the values argument. Why did I pass Base Salary to that? Because Pandas needs to count something. Anything. In this code, we told Pandas to count the number of rows in the Base Salary column for each unique value of ‘Work Location Borough’ and Fiscal Year. In theory, we could have counted any other column. To see this, consider the following code. It’s identical to the code above, except for the values argument.
pd.pivot_table(data=dfPay[dfPay['Pay Basis'] == 'PER ANNUM'],
index='Work Location Borough',
columns='Fiscal Year',
values='Last Name',
aggfunc='count')
Fiscal Year | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
---|---|---|---|---|---|---|
Work Location Borough | ||||||
BRONX | 933.0 | 21182.0 | 21423.0 | 22431.0 | 22388.0 | 22699.0 |
BROOKLYN | NaN | 42027.0 | 43208.0 | 41476.0 | 40716.0 | 41630.0 |
MANHATTAN | 1452.0 | 197712.0 | 204959.0 | 204614.0 | 204775.0 | 215398.0 |
QUEENS | 652.0 | 43211.0 | 45569.0 | 47918.0 | 47701.0 | 48372.0 |
STATEN ISLAND | 107.0 | 5654.0 | 5786.0 | 5889.0 | 5861.0 | 6013.0 |
The pivot_table
function returns a DataFrame#
The pivot_table
function returns a DataFrame. That means you can save the pivot_table to a variable, change its contents, and use it in future computations. To see this, consider the following code:
myPivotTable = pd.pivot_table(data=dfPay[dfPay['Pay Basis'] == 'PER ANNUM'],
index='Work Location Borough',
columns='Fiscal Year',
values='Last Name',
aggfunc='count')
myPivotTable
Fiscal Year | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
---|---|---|---|---|---|---|
Work Location Borough | ||||||
BRONX | 933.0 | 21182.0 | 21423.0 | 22431.0 | 22388.0 | 22699.0 |
BROOKLYN | NaN | 42027.0 | 43208.0 | 41476.0 | 40716.0 | 41630.0 |
MANHATTAN | 1452.0 | 197712.0 | 204959.0 | 204614.0 | 204775.0 | 215398.0 |
QUEENS | 652.0 | 43211.0 | 45569.0 | 47918.0 | 47701.0 | 48372.0 |
STATEN ISLAND | 107.0 | 5654.0 | 5786.0 | 5889.0 | 5861.0 | 6013.0 |
# Get the type of the variable myPivotTable
type(myPivotTable)
pandas.core.frame.DataFrame
# Change the column names of myPivotTable
myPivotTable.columns = ['FY2014', 'FY2015', 'FY2016', 'FY2017', 'FY2018', 'FY2019']
myPivotTable
FY2014 | FY2015 | FY2016 | FY2017 | FY2018 | FY2019 | |
---|---|---|---|---|---|---|
Work Location Borough | ||||||
BRONX | 933.0 | 21182.0 | 21423.0 | 22431.0 | 22388.0 | 22699.0 |
BROOKLYN | NaN | 42027.0 | 43208.0 | 41476.0 | 40716.0 | 41630.0 |
MANHATTAN | 1452.0 | 197712.0 | 204959.0 | 204614.0 | 204775.0 | 215398.0 |
QUEENS | 652.0 | 43211.0 | 45569.0 | 47918.0 | 47701.0 | 48372.0 |
STATEN ISLAND | 107.0 | 5654.0 | 5786.0 | 5889.0 | 5861.0 | 6013.0 |
# Add a projection for FY2020
myPivotTable['FY2020 (projected)'] = myPivotTable['FY2019'] * 1.05
myPivotTable
FY2014 | FY2015 | FY2016 | FY2017 | FY2018 | FY2019 | FY2020 (projected) | |
---|---|---|---|---|---|---|---|
Work Location Borough | |||||||
BRONX | 933.0 | 21182.0 | 21423.0 | 22431.0 | 22388.0 | 22699.0 | 23833.95 |
BROOKLYN | NaN | 42027.0 | 43208.0 | 41476.0 | 40716.0 | 41630.0 | 43711.50 |
MANHATTAN | 1452.0 | 197712.0 | 204959.0 | 204614.0 | 204775.0 | 215398.0 | 226167.90 |
QUEENS | 652.0 | 43211.0 | 45569.0 | 47918.0 | 47701.0 | 48372.0 | 50790.60 |
STATEN ISLAND | 107.0 | 5654.0 | 5786.0 | 5889.0 | 5861.0 | 6013.0 | 6313.65 |