Viewing a DataFrame#
Learning Objectives#
- Write the import statements needed to work with Pandas, in their most common form.
- View an entire DataFrame in a Jupyter notebook.
- Use the head and tail methods to view the first or last rows of a DataFrame.
- Use the sample method to retrieve random rows from a DataFrame.
- View a DataFrame in Variable Inspector.
Overview#
Before we teach you how to load data, we’re going to show you how to view data that has been loaded into a Pandas DataFrame. After you load your data, the first thing you will do is view it. Therefore, it makes sense to know how to view the data before learning how to load it.
In this chapter and the next (chapter 3-6), we’ll show you different ways to view data stored in Pandas DataFrames. We’ll also teach you the “anatomy” of a DataFrame. That way, after you learn how to load data, you can immediately start working with your data.
Required Import Statements#
Before you can work with Pandas, you need to import the Pandas library. The most common way to write the import statement is:
import pandas as pd
Most people worldwide use the alias pd for the Pandas library. We recommend you do the same. That way, when you search online for help with Pandas, you won’t be surprised to see pd. all over the place, and you can easily copy and paste the online code into your notebook.
The Pandas library is built on top of a numerical analysis library called NumPy. When working with DataFrames, you will often find yourself calling functions from NumPy. Therefore, most people include the following two import statements at the top of all of their notebooks:
import numpy as np
import pandas as pd
Let’s load these libraries now. Run the cell below.
import numpy as np
import pandas as pd
Loading a Sample DataFrame#
Run the code below to load some sample data. For now, don’t worry about how the code works. It does assume that the folder where this notebook is saved contains a subfolder named data, and that this subfolder contains a file named Admission_Predict.csv. This file contains data on graduate admissions that we got from Kaggle (see here).
df = pd.read_csv('data/Admission_Predict.csv')
type(df)
pandas.core.frame.DataFrame
Viewing the Entire DataFrame#
The code above loaded the data from a CSV file and stored it in a Pandas DataFrame. The output of the type statement tells us that the variable df is of type pandas.core.frame.DataFrame.
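If the CSV file is not at hand, the same type check can be tried on a small DataFrame built in memory. The sketch below is only an illustration: small_df is a hypothetical stand-in that reuses the first three rows shown in this chapter, not the full data set.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: the first three rows shown in this chapter.
small_df = pd.DataFrame({
    "Serial No.": [1, 2, 3],
    "GRE Score": [337, 324, 316],
    "TOEFL Score": [118, 107, 104],
})

# type() reports the class of the object; isinstance() tests for it directly.
print(type(small_df))
is_frame = isinstance(small_df, pd.DataFrame)
```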
How do we view this data? Well, the simplest way to see the value of any variable is to type the variable’s name into an empty cell and run the cell. If we do that, here’s what we see:
df
| | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
1 | 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
2 | 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.00 | 1 | 0.72 |
3 | 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.80 |
4 | 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
395 | 396 | 324 | 110 | 3 | 3.5 | 3.5 | 9.04 | 1 | 0.82 |
396 | 397 | 325 | 107 | 3 | 3.0 | 3.5 | 9.11 | 1 | 0.84 |
397 | 398 | 330 | 116 | 4 | 5.0 | 4.5 | 9.45 | 1 | 0.91 |
398 | 399 | 312 | 103 | 3 | 3.5 | 4.0 | 8.78 | 0 | 0.67 |
399 | 400 | 333 | 117 | 4 | 5.0 | 4.0 | 9.66 | 1 | 0.95 |
400 rows × 9 columns
Let’s examine the output of the last code cell. You can see something that looks like an Excel worksheet. The column names are bolded, and at the bottom we are told that there are 400 rows and 9 columns. However, notice that Jupyter didn’t show us all the rows. By default, it shows only the first five and last five rows of a large DataFrame. It uses “…” to indicate that some rows weren’t displayed. This is the default behavior because many DataFrames contain millions of rows, and displaying all of them would be time-consuming and of little use to a data analyst.
It is possible to change the maximum number of displayed rows by changing one of Pandas’ options (e.g. see here), but we don’t recommend that. There are better ways to view your data.
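For reference, here is a minimal sketch of what changing that option can look like. It uses the display.max_rows option inside pd.option_context, so the change is temporary. The demo DataFrame is a hypothetical stand-in, and the limit of 10 is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical stand-in data: 100 rows, enough to trigger truncated output.
demo = pd.DataFrame({"value": range(100)})

default_limit = pd.get_option("display.max_rows")

# Inside the with-block the row limit is 10; it reverts when the block ends.
with pd.option_context("display.max_rows", 10):
    inside_limit = pd.get_option("display.max_rows")
    print(demo)

after_limit = pd.get_option("display.max_rows")
```

Using a context manager rather than pd.set_option keeps the change from leaking into the rest of the notebook.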
Viewing the First or Last Rows of a DataFrame#
You can usually get a sense for your data by looking at only a few rows. For example, if your data set didn’t load correctly, you will see that in the first few rows. If your CSV file was dirty, you will see that in the first few rows.
The head Method of DataFrame#
To see the first 5 rows of your dataset, use the head method of DataFrame. We show you this in the code cell below.
df.head()
| | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
1 | 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
2 | 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.00 | 1 | 0.72 |
3 | 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.80 |
4 | 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |
The head method has one optional argument, n. By default, n is 5 and the head method will show the first 5 rows. If you want a different number, just pass a single integer argument to head. See the code cells below for examples:
df.head(2)
| | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
1 | 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
df.head(n=6)
| | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
1 | 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
2 | 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.00 | 1 | 0.72 |
3 | 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.80 |
4 | 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |
5 | 6 | 330 | 115 | 5 | 4.5 | 3.0 | 9.34 | 1 | 0.90 |
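Two less obvious behaviors of n are worth a quick sketch: asking for more rows than exist, and passing a negative number. The example assumes a hypothetical six-row DataFrame named demo, built from the rows displayed above, so it runs without the CSV file.

```python
import pandas as pd

# Hypothetical stand-in: two columns from the six rows displayed above.
demo = pd.DataFrame({
    "Serial No.": [1, 2, 3, 4, 5, 6],
    "GRE Score": [337, 324, 316, 322, 314, 330],
})

first_two = demo.head(2)           # the rows with index labels 0 and 1
all_rows = demo.head(10)           # n larger than the frame: every row is returned
all_but_last_two = demo.head(-2)   # negative n: everything except the last 2 rows
```

In each case head returns a new DataFrame, so the result can be stored in a variable as well as displayed.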
The tail Method of DataFrame#
The tail method behaves like head, but shows the last rows of your data. It takes the same optional argument, n, which defaults to 5.
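As a minimal sketch, the calls below mirror the head examples. The demo DataFrame is a hypothetical stand-in with 400 rows and a single Serial No. column, chosen so that the index labels match those of the admissions data.

```python
import pandas as pd

# Hypothetical stand-in: 400 rows with serial numbers 1 through 400.
demo = pd.DataFrame({"Serial No.": range(1, 401)})

last_five = demo.tail()      # default n=5: index labels 395 through 399
last_two = demo.tail(n=2)    # index labels 398 and 399

print(last_five)
print(last_two)
```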
Viewing a Random Sample of the Rows of a DataFrame#
Sometimes you want to see randomly chosen rows from your DataFrame. Pandas makes this very easy. Simply call the sample method.
df.sample()
| | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
---|---|---|---|---|---|---|---|---|---|
376 | 377 | 297 | 96 | 2 | 2.5 | 2.0 | 7.43 | 0 | 0.34 |
By default, sample returns one randomly chosen row from the data. If you want more, pass the optional argument n.
df.sample(5)
| | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
---|---|---|---|---|---|---|---|---|---|
181 | 182 | 305 | 107 | 2 | 2.5 | 2.5 | 8.42 | 0 | 0.71 |
2 | 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.00 | 1 | 0.72 |
231 | 232 | 319 | 106 | 3 | 3.5 | 2.5 | 8.33 | 1 | 0.74 |
297 | 298 | 320 | 120 | 3 | 4.0 | 4.5 | 9.11 | 0 | 0.86 |
197 | 198 | 310 | 106 | 2 | 3.5 | 2.5 | 8.33 | 0 | 0.73 |
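Notice the row numbers in the left-hand column. Each sampled row keeps the row number it had in the original DataFrame, so the numbers are neither consecutive nor sorted.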
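Because the rows are chosen at random, running the cell again will usually show different rows. If you need the same rows every time, one option is the random_state argument of sample. The sketch below assumes a hypothetical 400-row DataFrame named demo, and the seed value 42 is arbitrary.

```python
import pandas as pd

# Hypothetical stand-in: 400 rows with serial numbers 1 through 400.
demo = pd.DataFrame({"Serial No.": range(1, 401)})

# The same seed produces the same draw, so the two results are identical.
first_draw = demo.sample(n=5, random_state=42)
second_draw = demo.sample(n=5, random_state=42)

print(first_draw)
```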
The sample method has many powerful features. For example, you can return a random subset of the columns. You can sample with or without replacement. We will not describe those features at this time. If you’re interested, type the following into an empty cell and evaluate it, or click here:
?pd.core.frame.DataFrame.sample
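For a quick taste of those features, here is a minimal sketch using the axis, replace, and frac arguments of sample. The demo DataFrame is a hypothetical stand-in built from the first four rows shown earlier.

```python
import pandas as pd

# Hypothetical stand-in: three columns from the first four rows shown earlier.
demo = pd.DataFrame({
    "GRE Score": [337, 324, 316, 322],
    "TOEFL Score": [118, 107, 104, 110],
    "CGPA": [9.65, 8.87, 8.00, 8.67],
})

# Sample columns instead of rows: two randomly chosen columns, all four rows.
two_columns = demo.sample(n=2, axis="columns")

# Sampling with replacement can repeat rows, so n may exceed the row count.
with_replacement = demo.sample(n=6, replace=True)

# frac requests a fraction of the rows instead of a fixed count.
half_of_rows = demo.sample(frac=0.5)
```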