This post is a revised version of a presentation given at the Hack Santa Monica Meetup on November 17th, 2016. To learn more about the Hack Santa Monica Meetup, please look here.
The Hack Santa Monica Meetup was initiated by the non-profit SixThirty Group (formerly known as Team SixThirty). For more information about SixThirty Group, please check here.
In the Business Intelligence field, people often say that the first thing to do is to visualize the data. However, with so many visualization methods available, it is easy to be unsure which one will actually reveal insights. And, more precisely, what counts as an insight?
Insights, as the word literally suggests, are things embedded in the data, waiting for you to discover them. That is a somewhat romantic way to put it, and the process of discovering one takes a lot of patience and caution. Different people look at different things: accountants may look at ledgers and balance sheets, whereas economists may look at annual labor data or stock market data.
Here I use data from the Santa Monica Open Data Portal: a data set of Santa Monica’s fire report records.
- Look for patterns
- Look for anomalies
- Know the context
When I moved to Code-mania (CA) to start my graduate education, I was shocked and amazed by how many people here can code. For someone who comes from a business background and is not tech savvy, this transition can be hard at the start. Though I have had some formal university education in programming, I taught myself most of the basic skills through online tutorials. So here I want to introduce some useful concepts and resources for beginners.
I learned Python mainly by myself through online tutorials. Online tutorials tend to cover the same material: data types, indexing, and using libraries. When you need to write your own functions or methods, knowing some useful libraries helps, but you still need to understand the big picture. In my experience, programming languages in general have three major components you need to understand: value assignment, logic, and loops (sometimes called iteration).
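As a minimal sketch of how these three components fit together, here is a short Python example. The numbers and names (`response_times`, `threshold`, `slow_calls`) are made up for illustration and do not come from any real data set.

```python
# Value assignment: store data and settings under names.
# These response times (in minutes) are hypothetical.
response_times = [4, 7, 12, 5, 9]
threshold = 8
slow_calls = []

# Loop: visit each value in the list, one at a time.
for minutes in response_times:
    # Logic: keep only the values that exceed the threshold.
    if minutes > threshold:
        slow_calls.append(minutes)

print(slow_calls)  # [12, 9]
```

Most beginner programs are some combination of these three pieces; libraries mainly save you from writing the loops and conditions yourself.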
“Over half of the time, analysts are trying to import or clean the data.”
— By numerous John/Jane Does of data analysts
Data these days can flow in from various sources: the web, databases, local files, user input, and so on. Analysts often have to work with several input formats and make them compatible with each other before any analysis can begin. Though it is sometimes considered a data engineer’s job, data preparation is still an essential skill for all data analysts, especially those who work in small to medium-sized firms (as I do now).
I am going to introduce data reading and manipulation with the pandas library in Python 3. I have recently worked extensively with pandas and have started to realize how powerful its components are. In this post, I will cover the one I use most frequently: groupby().
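To give a rough idea of what groupby() does, here is a small sketch. The table and its column names (`incident_type`, `response_minutes`) are hypothetical stand-ins, not the actual columns of any data set mentioned in this post.

```python
import pandas as pd

# Hypothetical records, invented for illustration only.
records = pd.DataFrame({
    "incident_type": ["Fire", "Medical", "Fire", "Medical", "Alarm"],
    "response_minutes": [6.0, 4.0, 8.0, 5.0, 3.0],
})

# Split the rows by incident type, then compute the number of
# records and the average response time within each group.
summary = (
    records
    .groupby("incident_type")["response_minutes"]
    .agg(["count", "mean"])
)

print(summary)
```

The result has one row per incident type, so a question like "which type takes longest on average?" becomes a single sort on the `mean` column.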
As frequent R users may know, R ships with many datasets (the famous MASS package, mtcars, etc.) that are ready for you to practice your analysis skills on. These data sets are all “clean and neat” in some way. Now I am going to reveal some real-world sh*t (PG-13).
In the real world, whether for business or scientific research, data is entered in a more or less manual way. For example, sales records are entered by each individual sales representative. Some may argue that a properly configured system can record some information automatically, but we can all agree that nothing is 100% reliable; otherwise it would probably have earned a Nobel Prize. There are several things people routinely check when handling raw data sets. As one speaker noted in a talk for USC students, “80% of the time is spent on data cleaning and preparation work.”
“The real world is, 80% of the time is spent on data cleaning and preparation work.”
No matter what kind of data work you are doing, at the end of every task your audience will want to see some graphs. Even people with the most sophisticated data and programming skills want a simple illustration that delivers the message immediately. This is why data visualization is so important.
I use R as my primary language for analyzing data. R is a very powerful language, and for most people without programming experience it is easier to pick up than most other languages. I base this conclusion on my experience of teaching myself Python and C++.
R has a very powerful package called ggplot2. The “gg” stands for Grammar of Graphics, an idea developed by Leland Wilkinson. He wrote a book, The Grammar of Graphics, on the idea that the package builds on. If you are interested in learning more about the concepts behind this grammar, you can find the book online.
Here I will give a simple illustration of what typical ggplot code looks like and how each part works.
Esri is a company specializing in Geographic Information System (GIS) tools. ArcGIS and ArcMap are two of the most powerful and commonly used GIS tools available. As more and more data incorporates geo-coded information to describe where events happen, GIS has become increasingly popular among data analysts and business intelligence professionals.
ArcGIS and ArcMap are available for download on arcgis.com. These tools are very expensive. If you belong to an educational institution, you may want to check with the relevant departments to see whether the tools are provided free of charge within your institution. For businesses, contacting Esri to arrange a contract might be a better idea, depending on the scale of the business. Here I am using ArcMap under a free trial; you can sign up to use the software free of charge for a limited period of time.