Descriptive Statistics#

Learning Objectives#

  • Define the term descriptive statistics and explain why they are useful.
  • Mean:
    • Define the mean of a variable and be able to compute it by hand.
    • Interpret the mean as a measure of centrality of a variable's distribution.
  • Median:
    • Define the median of a variable and be able to compute it by hand.
    • Interpret the median as a measure of centrality of a variable's distribution.
    • Explain when a median may be more appropriate than a mean.
  • Standard deviation:
    • Define the standard deviation of a variable.
    • Interpret the standard deviation as a measure of the spread in a variable's distribution.
  • Correlation:
    • Define, in qualitative terms, the correlation between two variables.
    • Explain the difference between positive and negative correlation.
    • State the range of possible correlation values.
    • Interpret a correlation matrix.

Overview of Descriptive Statistics#

In previous units, we showed you how to prepare your data for analysis. You loaded, cleaned, filtered, and manipulated data. Now it’s time to start analyzing it.

The first step in data analysis is usually plotting (graphical exploration). Plots are incredibly useful because they communicate considerable information about the data. However, you often want to summarize what you see in the plots. You might want some summary measures that you can use in a computation. Or you might want to provide some numbers that tell you something about your plot. This is where descriptive statistics come in.

Descriptive statistics are, first and foremost, statistics: they are computed from your data, and they have well-defined statistical properties. But more importantly, they describe your data.

In this chapter, we will learn about four commonly used descriptive statistics: the mean, median, standard deviation, and correlation. You have likely seen these before in statistics classes. In this class, we will focus on the uses and interpretation of the measures. As you will see, Pandas computes these measures very quickly, which will allow us to focus on interpretation.

Imports and Loading the Data#

import numpy as np
import pandas as pd

We will work with two datasets in this chapter.

  • Salary data for employees of the City of Chicago (link).

  • Selected income statement data from Compustat.

City of Chicago Salary Data#

The cell below loads salary data for employees of the City of Chicago. Each row represents one employee in the previous fiscal year. Here is a list of the columns and their meanings:

Column              Meaning
------              -------
Name                Name of employee
Job Titles          Title of the employee at the time the data was updated
Department          Department where the employee worked
Full or Part-Time   Whether the employee was employed full-time (F) or part-time (P)
Salary or Hourly    Whether the employee is paid on an hourly or salary basis
Typical Hours       Typical amount of work for hourly employees; does not apply to salaried employees
Annual Salary       Annual salary rate; only applies to employees whose pay frequency is “Salary”

dfSalary = pd.read_excel('data/Chicago_Payroll_Data.xlsx', sheet_name='Data')
dfSalary.head()
Name Job Titles Department Full or Part-Time Salary or Hourly Typical Hours Annual Salary Hourly Rate
0 AARON, JEFFERY M SERGEANT POLICE F Salary NaN 101442.0 NaN
1 AARON, KARINA POLICE OFFICER (ASSIGNED AS DETECTIVE) POLICE F Salary NaN 94122.0 NaN
2 AARON, KIMBERLEI R CHIEF CONTRACT EXPEDITER GENERAL SERVICES F Salary NaN 111024.0 NaN
3 ABAD JR, VICENTE M CIVIL ENGINEER IV WATER MGMNT F Salary NaN 114780.0 NaN
4 ABARCA, EMMANUEL CONCRETE LABORER TRANSPORTN F Hourly 40.0 NaN 43.72

Compustat Data#

The cell below loads some data from the Compustat database. Compustat contains all 10-K data for U.S. companies. We filtered the data so it only includes selected companies in fiscal year 2018, and we saved the filtered data to an Excel file. There is one row per company. Here is a list of the columns and their meanings:

Column   Meaning
------   -------
GVKEY    Unique Compustat identifier
TIC      Stock ticker symbol
FYEAR    Fiscal year
SALE     Annual revenue for this company, in millions
IB       Income before extraordinary items, in millions
ACC      Accruals, computed as IB - OANCF, in millions
OANCF    Cash flows from operations, in millions

dfCompustat = pd.read_excel('data/Compustat.xlsx')
dfCompustat.head()
GVKEY TIC FYEAR SALE IB ACC OANCF
0 12825 JJSF 2018 1138.265 103.596 -19.771 123.367
1 12096 DGSE 2018 54.056 0.658 0.283 0.375
2 12945 PLXS 2018 2873.508 13.040 -53.791 66.831
3 16456 HNGR 2018 1048.760 -0.858 -79.385 78.527
4 12141 MSFT 2018 110360.000 16571.000 -27313.000 43884.000

Mean#

You have likely seen the term mean, or average, many times before. It’s a seemingly simple concept: you take a bunch of numbers, add them up, and divide by the number of numbers. But what does it mean (no pun intended)?

Let’s first plot our salary data and then revisit the question. Run the code cell below:

from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.models import NumeralTickFormatter, Span
output_notebook()
# Notice we drop the missing values. In this dataset, hourly employees
# have missing values in the Annual Salary column
hist, edges = np.histogram(dfSalary['Annual Salary'].dropna(), bins=44)

p = figure(width=500, height=400,
           x_axis_label='Salary ($)', y_axis_label='Count',
           title='Histogram of City of Chicago Employee Salaries')
p.xaxis[0].formatter = NumeralTickFormatter(format="$0,0")
p.quad(bottom=0, top=hist,
       left=edges[:-1], right=edges[1:],
       fill_color='navy', line_color='white', alpha=0.5)
show(p)

Looking at the graph above, we see a “spread” in the salaries. Unsurprisingly, people are paid different amounts. There are many people with similar, and even identical, salaries, but not everybody is paid the same.

A city executive, or the mayor, or a city council member, might wonder if there is a number that is representative of the distribution shown above. Specifically, if we took everybody’s salary, pooled it, and then divided it up evenly among the employees, how much would each person get? That number is called the mean or average of the distribution. The mean is a measure of “centrality”. It represents the center of the distribution. A middle number, if you will.

What would you do with such a number? You might wonder how people’s salaries are changing over time. If you compare the 2019 mean to the 2018 mean to the 2017 mean, you can get a sense for whether the employees’ pay is rising over time. Alternatively, you could compare the average salary of Chicago employees to that of New York employees and Los Angeles employees. That’s very difficult to do with the entire distribution, but is a meaningful comparison when you use the average.

Computing the Mean Mathematically#

To compute the mean of a set of \(n\) numbers, you add them up and divide by \(n\). More formally, say you have numbers \(x_1, x_2, \ldots, x_n\). The mean, or average, written using the Greek letter \(\mu\) (mu), is defined as:

\[\mu = \frac{\sum_{i=1}^n x_i}{n}\]
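As a quick sanity check of the formula, we can compute the mean of a few made-up numbers by hand and compare it to what Pandas reports (the salaries below are hypothetical, not taken from the Chicago data):

```python
import pandas as pd

# Five hypothetical salaries (made-up numbers, not from the Chicago data)
salaries = pd.Series([50000, 60000, 70000, 80000, 90000])

# By hand, following the formula: sum the values and divide by n
manual_mean = salaries.sum() / len(salaries)

# Pandas' built-in method
pandas_mean = salaries.mean()

print(manual_mean, pandas_mean)  # 70000.0 70000.0
```

Both approaches give the same answer, which is exactly what the formula promises.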

Computing the Mean Using Pandas#

In Pandas, it’s super easy to compute the mean of a Series (remember that each DataFrame column is a Series). Just use the mean method:

dfSalary['Annual Salary'].mean()
np.float64(87845.37224432483)

Finally, let’s plot the mean on our histogram so we can see where it lies on the distribution. In the graph below, the mean is displayed as a dashed, red line.

# Notice we drop the missing values. In this dataset, hourly employees
# have missing values in the Annual Salary column
hist, edges = np.histogram(dfSalary['Annual Salary'].dropna(), bins=44)

p = figure(width=500, height=400,
           x_axis_label='Salary ($)', y_axis_label='Count',
           title='Histogram of City of Chicago Employee Salaries')
p.xaxis[0].formatter = NumeralTickFormatter(format="$0,0")
p.quad(bottom=0, top=hist,
       left=edges[:-1], right=edges[1:],
       fill_color='navy', line_color='white', alpha=0.5)

meanSalaryLine = Span(location = dfSalary['Annual Salary'].mean(), 
                      dimension='height',
                      line_color='red', line_dash='dashed', line_width=2)
p.add_layout(meanSalaryLine)

show(p)

Median#

The mean is probably the most commonly-reported descriptive statistic. However, sometimes using the mean can be problematic. If the distribution of your data is not “symmetric”, or if you have a lot of outliers, the mean can be misleading.

To see this, let’s work with our income data from Compustat. Remember that the IB column contains income before extraordinary items for our selected companies in fiscal year 2018. Let’s plot the histogram of that column. We’ll also plot the mean.

Because the data has such a large spread, we’ll filter the data so it only contains incomes less than \$5 billion and greater than negative \$1 billion.

ib = dfCompustat[(dfCompustat['IB'] < 5000) & (dfCompustat['IB'] > -1000)]['IB']

# The filter above also removes rows with missing IB values,
# so there is no need to call dropna() here
hist, edges = np.histogram(ib, bins=75)

p = figure(width=500, height=400,
           x_axis_label='Income before extraordinary items ($ mil.)', y_axis_label='Count',
           title='Histogram of IB')
p.xaxis[0].formatter = NumeralTickFormatter(format="$0,0")
p.quad(bottom=0, top=hist,
       left=edges[:-1], right=edges[1:],
       fill_color='navy', line_color='white', alpha=0.5)

meanIBLine = Span(location = ib.mean(), 
                  dimension='height',
                  line_color='red', line_dash='dashed', line_width=2)
p.add_layout(meanIBLine)

show(p)

Look at the histogram above. Does the mean line appear to be in the center of the distribution? Not really. Because the distribution has a long “tail” to the right, the mean line is farther to the right than we would like. If we report the mean of this data, it might be misleading to some. The mean might be considered artificially high since a majority of companies earn less than this amount.

In situations like this, where the distribution has a long tail, people often report the median. The median is literally the middle number. To compute the median, you take all the numbers, sort them, and then choose the middle number (if there is an even number of observations, the median is the average of the two middle numbers).

The median is much less sensitive to outliers than the mean. Importantly, roughly 50% of observations are less than the median, and roughly 50% are greater than the median. It truly is the middle of the data.
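A tiny made-up example shows how a single outlier drags the mean far to the right while leaving the median untouched:

```python
import pandas as pd

# Five hypothetical numbers; the last one is an extreme outlier
incomes = pd.Series([10, 20, 30, 40, 10000])

# The median is the middle value after sorting (the 3rd of 5 here),
# so the outlier does not move it
print(incomes.median())  # 30.0

# The mean, by contrast, is pulled way up by the single outlier
print(incomes.mean())    # 2020.0
```

This is the same effect we will see with the Compustat incomes below, just in miniature.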

Computing the Median Using Pandas#

To compute the median of a Series in Pandas, simply use the median method.

ib.median()
np.float64(21.896)

Let’s compare the median to the mean for this data:

ib.mean()
np.float64(203.05486953184956)

Wow, look at that! The mean is nearly ten times the median for this data!

Let’s plot the median on the same graph as the mean to illustrate the difference. Run the code cell below. The mean is the red dashed line, and the median is the black dashed line.

ib = dfCompustat[(dfCompustat['IB'] < 5000) & (dfCompustat['IB'] > -1000)]['IB']

# The filter above also removes rows with missing IB values,
# so there is no need to call dropna() here
hist, edges = np.histogram(ib, bins=75)

p = figure(width=500, height=400,
           x_axis_label='Income before extraordinary items ($ mil.)', y_axis_label='Count',
           title='Histogram of IB')
p.xaxis[0].formatter = NumeralTickFormatter(format="$0,0")
p.quad(bottom=0, top=hist,
       left=edges[:-1], right=edges[1:],
       fill_color='navy', line_color='white', alpha=0.5)

meanIBLine = Span(location = ib.mean(), 
                  dimension='height',
                  line_color='red', line_dash='dashed', line_width=2)
p.add_layout(meanIBLine)

medianIBLine = Span(location = ib.median(), 
                    dimension='height',
                    line_color='black', line_dash='dashed', line_width=2)
p.add_layout(medianIBLine)

show(p)

Look at that graph! The median is much closer to what we would think of as the center of the distribution.

Standard Deviation#

When you look at a histogram, you probably notice that the data is spread out. It is often useful to quantify that spread.

Statisticians have developed the variance as a measure of spread in the data. Zero variance means that every number is the same. The greater the variance, the greater the spread in the data. Variance is great, but has one drawback. The units of variance are not easy to work with. For example, if we compute the variance of the City of Chicago salary data, the variance would be measured in dollars squared. That’s annoying. To make it easier to work with, many people take the square root of the variance; this is called the standard deviation.

Computing the Standard Deviation Mathematically#

To compute the standard deviation of numbers \(x_1, \ldots, x_n\), use the following formula:

\[ \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2} \]

In other words, take the difference between each observation and the mean and square that difference. Sum these squared differences across all observations, divide by \(n\), and take the square root. (One note: by default, Pandas’ std method divides by \(n-1\) rather than \(n\), the so-called sample standard deviation; for large datasets the difference is negligible.)
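We can verify the formula on a handful of made-up numbers, comparing the by-hand computation to Pandas (passing ddof=0 so Pandas divides by \(n\), matching the formula):

```python
import numpy as np
import pandas as pd

# Eight hypothetical values
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# By hand, following the formula: average squared deviation
# from the mean, then take the square root
mu = x.mean()
manual_sigma = np.sqrt(((x - mu) ** 2).sum() / len(x))

# Pandas: ddof=0 divides by n, matching the formula; the
# default (ddof=1) divides by n-1 instead
print(manual_sigma)   # 2.0
print(x.std(ddof=0))  # 2.0
```

Both computations agree; calling x.std() with the default ddof=1 would give a slightly larger number.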

Computing the Standard Deviation Using Pandas#

Once again, Pandas makes this easy:

dfSalary['Annual Salary'].std()
np.float64(20827.714107106418)

Let’s quickly interpret this number. The standard deviation of the annual salaries is about \$21,000. We don’t learn too much from the number itself, but we can track the standard deviation over time, or compare it to other cities. For example, NYC employees might make more on average, but have a much greater spread in their salaries. That would tell us that NYC has more employees earning a lot, but also more employees earning relatively little.

Correlation#

Correlation is a measure of how variables move together. Consider the Compustat data. We would expect income and cash flow to be positively correlated: as a company’s income rises, we would expect its cash flows to rise (if that’s not the case, then I’d be worried about accounting fraud!). Similarly, we would expect accruals and cash flows to be negatively correlated: because accruals are computed as income minus cash flows, a company with a big cash flow is likely to have a small (or negative) accrual. Another interpretation is that accruals predict cash flows. If a company has large accruals today, we should expect the cash to arrive in the future.

The correlation between two variables lies between -1.0 and +1.0. A correlation of +1.0 means that the two variables move perfectly together: for each unit increase in one variable, the other increases by a fixed amount. A correlation of -1.0 also means that the two variables move perfectly together, but in opposite directions: as one variable increases, the other decreases by a fixed amount. In practice, you will only see perfect correlations when one variable is an exact linear function of the other.

A correlation near 0.0 means that there’s no linear relationship between two variables. You cannot predict the movement in one variable from the other.
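To see the extremes, we can correlate a hypothetical variable with exact linear functions of itself (made-up numbers; the sign of the slope determines the sign of the correlation):

```python
import pandas as pd

# A hypothetical variable and two exact linear functions of it
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Positive slope: perfect positive correlation
print(round(x.corr(2 * x + 3), 6))   # 1.0

# Negative slope: perfect negative correlation
print(round(x.corr(-2 * x + 3), 6))  # -1.0
```

Any other relationship between two variables yields a correlation strictly between these two extremes.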

Pandas makes it very easy to compute correlations for columns in a DataFrame. Let’s perform the computation and then interpret the results.

dfCompustat[['SALE','IB','ACC','OANCF']].corr()
SALE IB ACC OANCF
SALE 1.000000 0.681791 -0.658824 0.793711
IB 0.681791 1.000000 -0.431227 0.876442
ACC -0.658824 -0.431227 1.000000 -0.812050
OANCF 0.793711 0.876442 -0.812050 1.000000

First, look at the table above (a table like this is called a correlation matrix; each entry is the correlation between its row variable and its column variable). You will see that the diagonal entries are 1.000000. That will always be the case because, by definition, a variable is perfectly correlated with itself. Also notice that the table is symmetric: the correlation between IB and SALE (second row, first column) must exactly equal the correlation between SALE and IB (first row, second column).

Now let’s look at the correlation. Revenue (SALE) and income (IB) are positively correlated (0.68). That is unsurprising. As revenue increases, we would expect income to increase. Similarly, income (IB) and operating cash flows (OANCF) are highly correlated (0.876). Again, this is unsurprising.

Notice that accruals (ACC) are negatively correlated with the other variables. Accruals are computed as IB minus OANCF. It is therefore unsurprising that accruals are negatively correlated with cash flows. It is somewhat less intuitive that accruals are negatively correlated with income.