As frequent R users may know, there are many data sets available in R (mtcars, those in the famous MASS package, etc.), ready for you to practice your analysis skills on. These data sets are all “clean and neat” in some way. Now I am going to reveal some real-world sh*t (PG-13).
In the real world, whether for business or scientific research, data is usually entered in a somewhat manual way. For example, sales records are entered by each individual sales representative. Some may argue that a properly configured system can record some information automatically, but we can all agree that nothing is 100% reliable; if it were, it would probably be worth a Nobel Prize. So there are several things people routinely check when handling raw data sets. As one speaker told USC students, “80% of the time is spent on data cleaning and preparation work.”