As frequent R users may know, R ships with many datasets (those bundled with the famous MASS package, the built-in mtcars, and so on) ready for you to practice your analysis skills on. These datasets are all “clean and neat” in some way. Now I am going to reveal some real-world sh*t (PG-13).
In the real world, whether for business or for scientific research, data is entered in a somewhat manual way. For example, sales records are entered by each individual sales representative. Some may argue that a system can record certain information automatically after appropriate setup, but we all seem to agree that nothing is 100% reliable; otherwise we would expect to see it win a Nobel Prize. There are several things that routinely have to be considered when handling raw data sets. As one speaker told USC students in a talk, “80% of the time is spent on data cleaning and preparation work.”
“The real world is: 80% of the time is spent on data cleaning and preparation work.”
Missing data points might be the first thing one notices in a new data set. In R, NAs are the values you may want to ignore, but certain analyses are not so forgiving of them. For example, if you are comparing two data sets that should have the same number of observations, missing data in one set will make a paired test give you a difficult time.
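A minimal sketch of the paired-comparison problem (in plain Python rather than R, with made-up sample values): before any paired test can run, you have to keep only the positions where both series actually have a value.

```python
def complete_pairs(xs, ys):
    """Return only the (x, y) pairs where neither value is missing (None)."""
    return [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]

# Hypothetical before/after measurements; None marks a missing observation.
before = [10.0, None, 12.5, 9.8, None]
after  = [11.0, 10.2, None, 10.5, 9.9]

pairs = complete_pairs(before, after)
# Only positions 0 and 3 have both measurements.
print(pairs)  # [(10.0, 11.0), (9.8, 10.5)]
```

The cost is visible here: five observations shrink to two usable pairs, which is exactly why missing values hurt this kind of test.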
Another reason missing data is annoying is when a large portion of the data points is missing. For example, the City of LA publishes public safety data with crime records for 2015. If you filter on the Location 1 field, you will see that many records are missing their geographic location information. For records where a detailed street address is recorded, this may be fine for some analyses. However, records whose locations are recorded only as the intersection of two streets make life very difficult for anyone who wants to visualize the data or analyze it with spatial tools.
Different kinds of missing data call for different solutions. Simple cases, like one skipped question in a survey response, can simply be disregarded. More sophisticated cases, like the data set with missing geographic information, may need more complex tools to help, such as ArcGIS. Nevertheless, leaving missing data points in a data set without carefully examining them is never a good idea.
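Before choosing a remedy, it helps to measure how bad the gaps actually are. Here is an illustrative Python sketch (the records and figures are invented, loosely modeled on the LA data described above) that computes the missing rate per field:

```python
def missing_rate(rows):
    """Fraction of missing (None) values per field -- worth checking
    before deciding whether to drop, impute, or geocode."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            total, missing = counts.get(field, (0, 0))
            counts[field] = (total + 1, missing + (value is None))
    return {f: miss / total for f, (total, miss) in counts.items()}

# Hypothetical crime records: location is missing in 2 of 3 rows.
records = [
    {"area": "Van Nuys",   "location": "(34.18, -118.44)"},
    {"area": "West Hills", "location": None},
    {"area": "Westlake",   "location": None},
]
print(missing_rate(records))  # {'area': 0.0, 'location': 0.666...}
```

A 67% missing rate in one field is a very different problem from a stray skipped survey question, and the two deserve different treatments.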
Modeling can also be useful for handling missing data. If you discover a pattern, you can treat the variable with missing values as the response (Y), use the other variables as explanatory variables (X), and fill in the gaps with predicted values. The risk of this operation is that the relationship between the response and the explanatory variables may not actually exist, or the fitted model may be inaccurate. Assessing the data carefully is required for this method. Moreover, when more models are built in further analysis, the model used to fill in the missing data may decrease the power of that later analysis. Therefore, this option is not a universal way to deal with missing data points.
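To make the idea concrete, here is a toy regression-imputation sketch in plain Python (not the author's actual workflow; the numbers are invented): fit a line on the complete cases, then predict the missing response values.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x on complete cases."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

def impute(xs, ys):
    """Fill missing ys (None) with predictions from the complete pairs."""
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b = fit_line([x for x, _ in complete], [y for _, y in complete])
    return [y if y is not None else a + b * x for x, y in zip(xs, ys)]

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, None, 8.0]
print(impute(x, y))  # the gap at x=3.0 is filled with the fitted value (~6.0)
```

Every caveat from the paragraph above applies: if X and Y are not really related, this quietly invents data, and any later model inherits that invented certainty.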
Data is recorded in different ways. Anyone with experience in relational databases knows that data types play a very important role when variables move across different tables. It is the same in data analysis. The ideal data set for a data analyst is one with every possible answer within it, i.e., all possible explanatory variables included. How likely that is in real life, I can't say for sure, but I have never been exposed to such an ideal data set yet. This means analysts must constantly look for new information and inputs to feed the analysis.
But not all data sets play along with each other. In the same LA public safety data set, the area names and other fields are not recorded in a consistent format: some columns are all uppercase, while others capitalize only the first letter. Such differences require formatting work before you can join in other data sets. Imagine you have population data that records the population of every LA area by area name. If the area names in that data set are written differently from the ones in the public safety data set, you will need some time to figure out how to make the two match.
A more or less standard approach is to index. For example, you can create a reference list that maps every LA area name to a number, and label both data sets with the same index. When joining the two data sets, matching on index numbers instead of string values not only improves accuracy, but is also faster, since computers can look up and compare numbers much more quickly than strings.
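A small Python sketch of that indexing idea (area names invented for illustration): build one shared name-to-number reference list, then encode each data set with it so joins compare integers, not strings.

```python
def build_index(names):
    """Assign a stable integer ID to each distinct name (sorted for
    reproducibility) -- a shared reference list for all data sets."""
    return {name: i for i, name in enumerate(sorted(set(names)))}

# Hypothetical area-name columns from two data sets (already normalized).
crime_areas  = ["Van Nuys", "West Hills", "Van Nuys"]
census_areas = ["West Hills", "Westlake", "Van Nuys"]

index = build_index(crime_areas + census_areas)
crime_ids  = [index[a] for a in crime_areas]
census_ids = [index[a] for a in census_areas]

print(index)      # {'Van Nuys': 0, 'West Hills': 1, 'Westlake': 2}
print(crime_ids)  # [0, 1, 0]
```

From here, joining the two data sets is a matter of matching small integers, and the reference list doubles as documentation of every area name in play.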
These are just some of the typical data problems one faces in data analysis. As my projects progress, I will share more on my blog.