This post is a revised version of a presentation given at the Hack Santa Monica Meetup on November 17th, 2016. To learn more about the Hack Santa Monica Meetup, please look here.
The Hack Santa Monica Meetup was initiated by the non-profit SixThirty Group (formerly known as Team SixThirty). For more information about SixThirty Group, please check here.
People in the Business Intelligence field often say that the first thing to do is to visualize the data. However, one might be confused: among the many visualization methods, which ones will actually reveal insights? And more precisely, what counts as an insight?
Insights, as the word literally suggests, are things embedded in the data, waiting for you to discover them. That is a somewhat romantic way to put it, and the process of discovering one takes a lot of patience and caution. Different people look for different things: accountants may look at ledgers and balance sheets, whereas economists may look at annual labor data or stock market data.
Here I use data from the Santa Monica Open Data Portal: a data set of Santa Monica's fire call records.
- Look for patterns
- Look for anomalies
- Know the context
I often compare the process of data analysis to forensic work or detective stories. There are three main things to bear in mind when digging into data: Pattern, Anomaly, and Context.
```python
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# TODO: change working directory
os.chdir("/Users/willacheng/GitHub/blog")

# The data contains 2 years of fire call records
# for the city of Santa Monica, CA,
# from 4/24/2015 to 4/24/2017
data = pd.read_csv(r"Fire_Calls_for_Service.csv", sep=",")
data.describe()
```
The result shows 16 columns, some of which we might not need. After this initial review, we can select the useful columns and drop the rest. (Not all columns are shown.)
```python
# Select the useful columns.
# We remove the census columns as we are not looking into those,
# and drop latitude and longitude since the map point column
# already carries that information.
data = data.drop(['Census Block 2000 GeoId', 'Census Tract 2000 GeoId',
                  'Census Block 2010 GeoId', 'Census Tract 2010 GeoId',
                  'Latitude', 'Longitude'], axis=1)
print(data.head(6))
```
```
   Incident Number Incident Date            Call Type Description
0         16006582    05/20/2016                    Automatic Alarm
1         16006583    05/20/2016                      Public Assist
2         16007082    05/30/2016                     Structure Fire
3         15014343    10/27/2015                    Broken Gas Main
4         15015451    11/18/2015    Emergency Medical Service (EMS)
5         15015942    11/30/2015                    Automatic Alarm
```
```python
# Visualize incident frequency by station
group_station = data.groupby("Station")
freq_station_table = group_station.count()[['Incident Number']]
print(freq_station_table)

# get station names for the x coordinate
stations = freq_station_table.index.values
x_bar = np.arange(len(stations))

# get values to plot
y_val = freq_station_table['Incident Number']

# bar chart of incident frequency by station:
# plt.bar sets the bar heights and positions,
# plt.xticks places the station names on the x-axis,
# plt.ylabel and plt.title label the axis and the chart
plt.bar(x_bar, y_val, align='center', alpha=0.5)
plt.xticks(x_bar, stations)
plt.ylabel('Incident frequency')
plt.title('Incident Frequency by Station')
plt.show()
```
The result is shown below:
As we can see, Station 1 clearly shows a higher number of reported incidents. And although there are four stations in total, Station 4 is missing from the data.
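A missing category like this is easy to check for programmatically. The sketch below uses made-up records, not the actual data set, but the idea carries over: compare the stations present in the data against the full set you expect.

```python
import pandas as pd

# Hypothetical incident records; the real data has a "Station" column
data = pd.DataFrame({
    "Incident Number": [1, 2, 3, 4, 5],
    "Station": ["Station 1", "Station 1", "Station 2",
                "Station 3", "Station 1"],
})

# Stations we expect to exist (Santa Monica has four numbered stations)
expected = {"Station 1", "Station 2", "Station 3", "Station 4"}

# Any expected station that never appears in the data is flagged
present = set(data["Station"].unique())
missing = expected - present
print("Missing stations:", sorted(missing))
```

A check like this is worth running before plotting, since a bar chart silently omits categories that never occur.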
Here comes the important concept I mentioned: Context.
Context means knowing what you are studying and what you are looking at. For a data set coming from the government open data portal of the Santa Monica Fire Department, some context we need before digging in:
- Where is Santa Monica, CA?
- What is it like to live there?
- The building type distribution and population info in the area
- Weather of the area
The list can be very long depending on what you want to know. With the help of SixThirty Group, many of these questions can be answered by people from the local government. For example, I learned that Station 4 used to exist but has been closed for some time, and the fire department might be considering reopening it. The importance of context is that it can point you in the right direction when looking for useful information.
Then I plot the data by day.
```python
# Visualize incidents by date
# convert the date strings to datetimes so the index
# sorts and plots chronologically
data["Incident Date"] = pd.to_datetime(data["Incident Date"])
group_date = data.groupby("Incident Date")
freq_date_table = group_date.count()[['Incident Number']]
print("Number of days in the data is", freq_date_table.shape[0], ".")

# get the date index for the x-axis
date_index = freq_date_table.index.values

# plot the daily counts
plt.figure(figsize=(16, 8), dpi=100)
plt.plot_date(x=date_index, y=freq_date_table['Incident Number'], fmt="r-")
plt.title("Daily Number of Incidents")
plt.ylabel("Number of Incidents")
plt.grid(True)
plt.show()
```
Number of days in the data is 732 .
The x-axis shows the date, and the line shows how the number of fire calls received changed from day to day.
Looking at this graph, the maximum and minimum points can easily be spotted by eye. The maximum occurred around February 2016, and the minimum at the beginning of 2017.
```python
# look for the maximum and minimum
print("The daily maximum number of fire calls received is",
      int(freq_date_table['Incident Number'].max()))
print("The maximum happens at",
      freq_date_table['Incident Number'].idxmax())
print("The daily minimum number of fire calls received is",
      int(freq_date_table['Incident Number'].min()))
print("The minimum happens at",
      freq_date_table['Incident Number'].idxmin())
```
The daily maximum number of fire calls received is 90
The maximum happens at 2016-02-14 00:00:00
The daily minimum number of fire calls received is 2
The minimum happens at 2017-01-01 00:00:00
What an anomaly tells us is that something out of the norm must have happened to create these data points. It might be good or it might be bad. For example, we found that on Valentine's Day 2016 the number of incidents skyrocketed. We would then need to know what exactly happened that day to understand the cause. Was there an event going on? Were people chasing each other and setting off fireworks on the beach? Did people keep kissing for so long that they suffered hypoxia? Many things could cause such extreme numbers, and learning what did could help prevent it from happening again; that is, learning from history.
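Spotting such out-of-norm days by eye works for one chart, but it can also be automated. The sketch below (using synthetic daily counts, not the actual fire call data) flags any day whose count falls more than three standard deviations from the mean, a simple z-score rule rather than anything from the original analysis:

```python
import numpy as np
import pandas as pd

# Synthetic daily call counts: roughly 20 per day over a year
rng = np.random.default_rng(0)
counts = pd.Series(rng.poisson(20, size=365),
                   index=pd.date_range("2016-01-01", periods=365))
counts["2016-02-14"] = 90  # inject a spike like the Valentine's Day one

# Flag days more than 3 standard deviations from the mean
z = (counts - counts.mean()) / counts.std()
anomalies = counts[z.abs() > 3]
print(anomalies)
```

The threshold of three standard deviations is a common starting point; for real data you would tune it, or use a robust statistic like the median absolute deviation so the spike itself does not inflate the cutoff.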
And the last part is looking for patterns: things that occur repeatedly. Patterns can follow a certain cycle or story, and having no pattern at all is a pattern as well.
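One quick way to test for a repeating cycle is autocorrelation: if a weekly rhythm exists, each day's count correlates strongly with the count seven days earlier. The sketch below demonstrates this on synthetic data with a built-in weekend bump; it is an illustration of the technique, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Synthetic daily counts with a weekly cycle:
# the last two days of each week run higher than the rest
days = np.arange(364)
weekly = np.where(days % 7 >= 5, 30.0, 20.0)
rng = np.random.default_rng(1)
counts = pd.Series(weekly + rng.normal(0, 1, size=days.size))

# Autocorrelation at a 7-day lag is close to 1 for a strong
# weekly cycle; the 1-day lag is much weaker by comparison
lag7 = counts.autocorr(lag=7)
lag1 = counts.autocorr(lag=1)
print(f"lag-7 autocorrelation: {lag7:.2f}, lag-1: {lag1:.2f}")
```

On the fire call data, a lag-7 autocorrelation near zero would support the "no weekly pattern" reading that the weekend plot suggests.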
```python
# Extract the first 120 days (roughly four months) of data;
# .copy() avoids modifying a view of the original frame
month_freq_date = freq_date_table[:120].copy()
month_index = month_freq_date.index.values
month_freq_date.columns = ['Num']

# add a weekend dummy variable
month_freq_date['weekend'] = 'no'
month_freq_date.loc[(month_freq_date.index.weekday == 5) |
                    (month_freq_date.index.weekday == 6),
                    'weekend'] = 'yes'

# Plot the 4-month data with weekends highlighted
plt.figure(figsize=(12, 7), dpi=100)
plt.plot_date(x=month_index, y=month_freq_date['Num'], fmt="r-")
plt.plot(month_freq_date[month_freq_date.weekend == 'yes'].index,
         month_freq_date[month_freq_date.weekend == 'yes'].Num, 'ro')
plt.title("Daily Number of Incidents")
plt.ylabel("Number of Incidents")
plt.grid(True)
plt.show()
```
Even with weekends and weekdays distinguished from each other, we still cannot make out a pattern in this trend. Therefore, we conclude that, on its own, the day-to-day trend in this data looks random.
This might sound disappointing at first, but notice that we came to this conclusion without any other data sources: we did not consider changes in population, weather, and so on. In reality, most data will look random at first glance, but once you examine it in combination with other data sources, it starts to make sense.
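Bringing in another data source is usually a one-line merge on a shared date column. The sketch below joins made-up daily call counts with a hypothetical weather table; the column names and values are invented for illustration.

```python
import pandas as pd

# Daily incident counts (made-up numbers)
calls = pd.DataFrame({
    "date": pd.to_datetime(["2016-02-13", "2016-02-14", "2016-02-15"]),
    "num_calls": [22, 90, 25],
})

# Hypothetical external weather data for the same dates
weather = pd.DataFrame({
    "date": pd.to_datetime(["2016-02-13", "2016-02-14", "2016-02-15"]),
    "high_temp_f": [68, 75, 66],
})

# Left-merge on the date so each day's count sits next to
# that day's weather; days with no weather record get NaN
combined = calls.merge(weather, on="date", how="left")
print(combined)
```

With counts and candidate explanatory variables side by side, correlations or regressions become straightforward to compute.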
This wraps up my short introduction to what are called data insights. Data may not always hold the answer you want, but if you know what you are looking for and pay close attention to anomalies and patterns, it will always bring you fresh thoughts.