Data Transformation Using the pandas Library in Python

“Over half of the time, analysts are importing and cleaning the data.”

— countless John/Jane Does among data analysts

Data these days can flow in from various sources: the web, databases, local files, user input, etc. Analysts often have to work with various formats of input data in order to make them compatible with each other for analysis. Though sometimes considered a data engineer’s job, data preparation is still an essential skill for all data analysts, especially those who work in small to medium-sized firms (as I do now).

I am going to introduce data reading and manipulation with the pandas library in Python 3. I have recently worked extensively with pandas and started to realize how powerful its components are. In this post, I will cover the one I use most frequently: groupby().

Here I will use the Boston HMDA Data Set as an illustration. The data set is one of the pre-defined data sets in R. You can find more data here.

import pandas as pd
df = pd.read_csv("Hdma.csv")
print(df)

The result will look something like this (not all columns are shown in this table):

Unnamed: 0 dir hir lvr ccs mcs pbcr dmi \
0 1 0.221000 0.221000 0.800000 5.000000 2.000000 no no
1 2 0.265000 0.265000 0.921875 2.000000 2.000000 no no
2 3 0.372000 0.248000 0.920398 1.000000 2.000000 no no
3 4 0.320000 0.250000 0.860465 1.000000 2.000000 no no
4 5 0.360000 0.350000 0.600000 1.000000 1.000000 no no
5 6 0.240000 0.170000 0.510526 1.000000 1.000000 no no
6 7 0.350000 0.290000 0.746667 1.000000 2.000000 no no
7 8 0.280000 0.220000 0.850000 2.000000 2.000000 no no
8 9 0.310000 0.240000 0.897260 2.000000 2.000000 no yes
9 10 0.180000 0.170000 0.358744 2.000000 1.000000 no no
10 11 0.280000 0.270000 0.223881 1.000000 2.000000 no no
11 12 0.310000 0.300000 0.688889 1.000000 2.000000 no no
12 13 0.280000 0.270000 0.537634 1.000000 2.000000 no yes

Note that the default data structure in the pandas library is the DataFrame. There will be times when the data type is something else, but usually when you read a 2-D csv file into the program, a DataFrame is what you get.

type(df)
pandas.core.frame.DataFrame

The DataFrame is one of the most frequently used data types among data analysts, thanks to the ton of functions you can apply to it. Here I want to demonstrate the groupby() function.

Group by is useful when you want to examine the differences/similarities between distinct groups. Data analysts usually group by categorical variables.
For example, in this data set, the categorical variables pbcr and dmi are ones we can easily group the data by: yes and no. We can certainly group numerical data as well, by creating meaningful buckets for ranges of values. The most commonly seen example is age groups: marketing agencies usually use buckets like 18-24, 25-34, 35-44, 45-64, etc. When creating such buckets or groups, it is always useful to look at the distribution of the data, to understand the relative size of each group.
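Bucketing numeric values like this can be done with pandas’ pd.cut(). A minimal sketch, using made-up ages (the HMDA data set below has no age column, so this is purely an illustration):

```python
import pandas as pd

# hypothetical ages, for illustration only
ages = pd.Series([19, 23, 31, 40, 52, 64])

# pd.cut assigns each value to a labelled bucket;
# bins are right-inclusive by default, e.g. (18, 24]
age_group = pd.cut(ages, bins=[18, 24, 34, 44, 64],
                   labels=["18-24", "25-34", "35-44", "45-64"])

# value_counts() shows the distribution across buckets
print(age_group.value_counts())
```

The resulting Series can then be passed straight to groupby(), just like a categorical column.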

groupby_pbcr = df.groupby('pbcr')
print("Type of groupby object is", type(groupby_pbcr))


Type of groupby object is class 'pandas.core.groupby.DataFrameGroupBy'

This creates a DataFrameGroupBy object. There are a lot of pandas functions you can apply to a groupby() result. You can check the pandas documentation here. Some examples are written below:

# calculate grouped mean
mean_table = groupby_pbcr.mean()
print(mean_table)


Unnamed: 0 dir hir lvr ccs mcs uria \
pbcr
no 1176.395918 0.327723 0.253809 0.732696 1.981406 1.712472 3.768118
yes 1368.211429 0.369754 0.274708 0.801777 3.817143 1.828571 3.854857

comdominiom
pbcr
no 0.284354
yes 0.337143
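Besides mean(), a groupby object also supports agg(), which computes several statistics per group in one call. A small sketch on made-up numbers (the column names mirror the HMDA set, the values do not):

```python
import pandas as pd

# toy data with the same column names as the HMDA example
df = pd.DataFrame({
    "pbcr": ["no", "no", "yes", "yes"],
    "dir":  [0.22, 0.27, 0.37, 0.31],
})

# agg() with a list of function names returns one column per statistic
summary = df.groupby("pbcr")["dir"].agg(["mean", "std", "count"])
print(summary)
```

This is often more convenient than calling mean(), std(), etc. separately and joining the results.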

You can even apply linear models to different groups:

from scipy.stats import linregress

# linregress returns (slope, intercept, rvalue, pvalue, stderr);
# wrapping the result in a Series with a named index gives
# readable column names in the grouped output
results = groupby_pbcr.apply(
    lambda x: pd.Series(linregress(x['dir'], x['uria']),
                        index=['slope', 'intercept', 'rvalue',
                               'pvalue', 'stderr']))
print(results)


          slope  intercept    rvalue    pvalue    stderr
pbcr
no     1.174097   3.383339  0.059923  0.004882  0.416701
yes   -1.101972   4.262315 -0.075264  0.322217  1.110016

For those who are used to element-wise operations in the numpy package, a similar approach exists in the pandas library, just in a different form. Instead of operating directly on the elements/values of each data point, pandas offers methods like apply and map to operate on data points across different data types.

For example, say we want to calculate a scaled score of the dir variable. The first thing we need to check is the data type the variable is stored in. Since we are still using the df data frame, the data type remains DataFrame. A scaled score can be calculated as (point - mu)/std, where mu is the mean and std is the standard deviation of the series.

std = df.std(axis=0, skipna=True, numeric_only=True)
# axis=0 computes the statistic down each column,
# so each column gets one standard deviation
# skipna=True excludes missing values
# numeric_only=True skips the string columns (required in newer pandas)

print("Type of std object", type(std))
print(std)


Type of std object class 'pandas.core.series.Series'
Unnamed: 0 687.479818
dir 0.107235
hir 0.096635
lvr 0.178715
ccs 1.666370
mcs 0.537169
uria 2.026636
comdominiom 0.452980
dtype: float64

We can think of the pandas Series data type as a labelled list for now, accessing the value we want by its label. And we can get the mean the same way.

# label-based indexing is safer than positional (std[1]),
# which is deprecated on a labelled Series
dir_std = std['dir']

# calculate mean
mean = df.mean(numeric_only=True)
dir_mean = mean['dir']
print("Mean value is", dir_mean, "\nStandard deviation is", dir_std)

Mean value is 0.330813567212
Standard deviation is 0.107234749192

And now we have both mean value and standard deviation, we can calculate the scaled value of dir variable.

# using a custom function
def scale(x, mean, std):
    scale_x = (x - mean) / std
    return scale_x

# apply the function with apply() on the column/series you want;
# additional arguments are passed via the args=() parameter
scaled_val_dir = df['dir'].apply(scale, args=(dir_mean, dir_std))

# using a lambda
# a lambda is an anonymous one-line function: everything after the
# "x:" is the calculation applied to each element x,
# and in this case x runs over the values of df['dir']
scaled_val_dir2 = df['dir'].apply(lambda x: (x - dir_mean) / dir_std)
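Because pandas Series support element-wise arithmetic directly (as noted above for numpy users), the same scaling can also be written without apply() at all. A sketch on made-up numbers standing in for df['dir']:

```python
import pandas as pd

dir_col = pd.Series([0.22, 0.27, 0.37, 0.31])  # stand-in for df['dir']

# arithmetic on a Series broadcasts element-wise,
# so the scaled score is a one-liner
scaled = (dir_col - dir_col.mean()) / dir_col.std()
print(scaled)
```

After scaling, the series has mean 0 and standard deviation 1, which is a quick sanity check on the result.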

As you can see, groupby() and the apply methods are powerful tools within the pandas library, and they work across multiple data types. I adopted Python as a data exploration tool recently and have found it powerful and useful. I hope you enjoyed the explanation. Not the most tech-savvy one, but it should be enough to cover the basics. 🙂
