Overview of Working with Data#
Overview#
In this unit, we will teach you the basics of working with data. As in previous units, we want you to focus on learning processes, not code. Understanding the processes will make you extremely valuable to your organization, and processes are easily transferable to new programming languages and packages.
This unit is designed around two tasks, loading and cleaning data. These tasks are likely to occupy a majority of your time when you work with data. Don’t believe us? Read this article that surveys data scientists; they find that 60% of data scientists’ time is spent on these two tasks! In units 1 and 2 of ACCY 570, we insulated you from these tasks. For example, we cleaned all of the datasets that you used for visualization before giving them to you. That allowed you to focus on the plots. In this unit, you will see how to transform data from its raw form into the proper form for your analyses.
We will use the Pandas package in Python to work with data for the remainder of the course. Pandas is a very popular library that is used by data analysts worldwide. It is extremely powerful and can probably handle any task that you throw at it. Importantly, to use Pandas, you must first think through the transformations you want to perform with your data before you write any code. If you try to “hunt and peck” with data like you do in Excel, you will likely get errors or garbage results.
This unit is the heart of ACCY 570. By the end of this unit, you should be able to take a dataset saved as an Excel or CSV file, load it into Pandas, remove unwanted rows and columns, clean the individual data values, create new columns, and save the data to a file.
Loading Data#
Loading data is the process of taking data saved somewhere and importing it into a software package for analysis. For example, in unit 2, you had to load data into Tableau before you could create graphics. In this course, you will work mostly with Excel files and CSV (comma-separated value) files. Small datasets are often stored in these file formats and you are likely to encounter them frequently in your careers. Another common way to store data is in a database. You will learn about databases in ACCY 575.
When you load data, your software package must read the data from your input file and convert it to an internal representation that it understands. We’ll give you two examples of this.
Example 1: Loading a CSV file. A CSV file is a text file. If you have the number 103.86 in a CSV file, it’s stored as text. When Excel or Pandas reads it in, it initially reads the string value
'103.86'
and must decide whether to convert it to a number.If Pandas finds that all string values in a column can be converted to numbers, it will automatically convert the entire column to numbers. However, if some values in a column cannot be converted to numbers, Pandas will save the entire column as strings.
If Excel can convert a string to a number, it will. It does not matter what other values are in the same column. Sounds convenient, right? Not always. Imagine column with 500,000 values, some of which are strings and some of which are numbers. You might not realize there are some strings in there. You have to identify the strings and then decide how to convert them.
Example 2: Dates. Excel and Pandas store dates in different ways. Internally, Excel stores a date as the number of days since January 1, 1900. By contrast, Pandas stores dates as the number of seconds since January 1, 1970. Therefore, when Pandas reads an Excel file, it must convert Excel date values into Pandas date values.
In many cases, software will automate the loading process. However, sometimes you need to tell your software exactly what you want done. In this unit, we will show you common behaviors that arise when loading data and teach you how to handle them.
Cleaning Data#
As we say above, data scientists spend upwards of 60% cleaning data (see this article for an overview of dirty data). This is because data often contains misspellings, duplicate rows, and invalid characters. Often, you will encounter data values that violate your company’s business rules. Legacy data is particularly problematic. As a data analyst, you need to know how to work with dirty data.
Sometimes you can convert the data into a valid format. For example, if you see the string '($400.00)'
, you can convert it to the value -400. However, what if you see the string '-($400.00)'
? What should you do? Does that mean negative four hundred dollars? Or is it a double-negative? Should you discard it because it is corrupt? The answer is that it depends on the circumstances and the history of that data. We cannot give you hard-and-fast rules. We can show you how to identify such data and how to convert it to a valid format.
In this unit, we will show you how to perform basic data transformations in Pandas. You will use these transformations to clean data, and to create new data values for use in visualizations and statistical analyses. The ideas behind these transformations are universal: you should be able to apply them in Excel, Tableau, R, or the like.
Topics in Unit 3#
In this unit, we will cover these topics in approximately this order:
Working with files and folders on your computer.
CSV files
Overview of Pandas
Viewing a Pandas DataFrame
Anatomy of a Pandas data frame
Loading data into Pandas
From Excel
From CSV
Cleaning data (strings to numbers, dates)
Selecting and filtering data
Working with missing data
Creating new columns
Sorting data
Modifying data