Machine Learning with Swift

Exploratory data analysis

First, we want to see how many individuals of each class we have. This is important because if the class distribution is very imbalanced (1 to 100, for example), we will have problems training our classification models. You can access data frame columns using dot notation. For example, df.label returns the label column as a pandas Series. Both Series and data frames have many useful methods for calculating summary statistics. The value_counts() method returns the number of occurrences of each distinct value in the column:

In []: 
df.label.value_counts() 
Out[]: 
platyhog       520 
rabbosaurus    480 
Name: label, dtype: int64 
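
To judge balance at a glance, it can help to look at class shares rather than raw counts. The following is a minimal sketch, assuming a hypothetical stand-in data frame named toy that only reproduces the counts shown above; imbalance_ratio is an illustrative name, not something from the original notebook:

```python
import pandas as pd

# Hypothetical stand-in for the real data set: only the label column,
# with the same class counts as in the output above.
toy = pd.DataFrame({"label": ["platyhog"] * 520 + ["rabbosaurus"] * 480})

counts = toy["label"].value_counts()                # absolute counts per class
shares = toy["label"].value_counts(normalize=True)  # fractions that sum to 1

# Ratio of the largest class to the smallest; values close to 1 mean balanced.
imbalance_ratio = counts.max() / counts.min()

print(shares)
print(round(imbalance_ratio, 2))
```

A ratio close to 1 suggests that no special handling of class imbalance is needed.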

The class distribution looks okay for our purposes. Now let's explore the features.

We need to group our data by classes, and calculate feature statistics separately to see the difference between the creature classes. This can be done using the groupby() method. It takes the label of the column by which you want to group your data:

In []: 
grouped = df.groupby('label') 

The grouped object exposes most of the same methods and column labels as the original data frame, but applies them to each group separately. Let's look at the descriptive statistics of the length feature:

In []: 
grouped.length.describe() 
Out[]: 

What can we learn from this table? Platyhogs have a mean length of about 20 meters, with a standard deviation of about 5. Rabbosauruses are 30 meters long on average, also with a standard deviation of about 5. The smallest platyhog is about 4 meters long, and the largest rabbosaurus is about 48 meters long. That's a lot, but less than the biggest Earth life forms (see Amphicoelias fragillimus, for example).
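
To make the layout of such a table concrete, here is a small sketch on made-up numbers. The toy data frame and its six length values are assumptions chosen so that the means and standard deviations come out round; they are not taken from the real data set:

```python
import pandas as pd

# Made-up lengths, three individuals per class (illustrative only).
toy = pd.DataFrame({
    "label": ["platyhog"] * 3 + ["rabbosaurus"] * 3,
    "length": [15.0, 20.0, 25.0, 25.0, 30.0, 35.0],
})

# One row per class; columns are count, mean, std, min, quartiles, and max.
summary = toy.groupby("label")["length"].describe()

print(summary[["count", "mean", "std", "min", "max"]])
```

Each row of summary describes one class, so the two classes can be compared column by column.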

Color distribution can be viewed using the familiar value_counts() method:

In []: 
grouped.color.value_counts() 
Out[]: 
label        color            
platyhog     light black         195 
             purple polka-dot    174 
             pink gold           151 
rabbosaurus  light black         168 
             pink gold           156 
             space gray          156 
Name: color, dtype: int64 

We can represent this in a more readable form, using the unstack() and plot() methods:

In []: 
plot = grouped.color.value_counts().unstack().plot(kind='barh', stacked=True, figsize=[16,6], colormap='autumn') 
Out[]: 
Figure 2.2: Color distribution
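
Before plotting, it may help to see what unstack() actually does to the grouped counts. The sketch below uses a tiny invented data frame (toy, with five rows); fill_value=0 is an addition here so that color and class combinations that never occur show up as zeros rather than missing values:

```python
import pandas as pd

# Invented sample: five individuals with a class label and a color.
toy = pd.DataFrame({
    "label": ["platyhog", "platyhog", "platyhog", "rabbosaurus", "rabbosaurus"],
    "color": ["light black", "purple polka-dot", "purple polka-dot",
              "light black", "space gray"],
})

# Series indexed by (label, color) pairs, as in the output above.
counts = toy.groupby("label")["color"].value_counts()

# Move the inner index level (color) into columns: one row per class.
table = counts.unstack(fill_value=0)

print(table)
```

The resulting table has one row per class and one column per color, which is the shape that plot(kind='barh', stacked=True) turns into one stacked bar per class.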

It looks like purple polka-dot is a strong predictor of the platyhog class. But if we see a space-gray individual, we can be sure we should run quickly.

In a similar manner, fluffiness distribution can be visualized using:

In []: 
plot = grouped.fluffy.value_counts().unstack().plot(kind='barh', stacked=True, figsize=[16,6], colormap='winter') 
Out[]: 
Figure 2.3: Fluffiness distribution

Rabbosauruses come in three colors: light black, pink gold, and space gray. 90% of them are fluffy (the remaining 10% are probably old and bald). Platyhogs, on the other hand, can be light black, pink gold, or purple polka-dot. 30% of them are fluffy (mutants, maybe?).
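
Percentages like these can also be computed directly instead of being read off a chart. This is a rough sketch that assumes the fluffy column holds booleans (the real encoding is not shown in the text) and uses an invented 20-row data frame built to match the 30% and 90% figures:

```python
import pandas as pd

# Invented sample: 10 individuals per class, fluffy stored as True/False.
toy = pd.DataFrame({
    "label": ["platyhog"] * 10 + ["rabbosaurus"] * 10,
    "fluffy": [True] * 3 + [False] * 7 + [True] * 9 + [False] * 1,
})

# The mean of a boolean column is the fraction of True values in each group.
fluffy_share = toy.groupby("label")["fluffy"].mean()

print(fluffy_share * 100)  # percentage of fluffy individuals per class
```

If the column stored strings or 0/1 integers instead, the same idea would apply after converting it to booleans.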

For more complex data visualization, we need the matplotlib plotting library:

In []: 
import matplotlib.pyplot as plt 

Drawing the histogram of length distribution:

In []: 
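plt.figure() 
# density=True normalizes each histogram so its area is 1; it replaces
# the normed argument, which was removed in Matplotlib 3.1.
plt.hist(df[df.label == 'rabbosaurus'].length, bins=15, density=True) 
plt.hist(df[df.label == 'platyhog'].length, bins=15, density=True) 
plt.title("Length Distribution Histogram") 
plt.xlabel("Length") 
plt.ylabel("Frequency") 
fig = plt.gcf()  # keep a handle to the current figure, for example to save it later
plt.show() 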
Out[]: 
Figure 2.4: Length distribution

In general, platyhogs are smaller, but there is a significant overlap between approximately 20 and 30 meters, where length alone is not enough to discriminate between the two classes.
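
The size of that overlap can be estimated numerically as well. The sketch below is only an illustration: the eight lengths are invented, and the 20 to 30 meter bounds are simply the approximate range mentioned above, not a value derived from the real data:

```python
import pandas as pd

# Invented lengths, four individuals per class.
toy = pd.DataFrame({
    "label": ["platyhog"] * 4 + ["rabbosaurus"] * 4,
    "length": [12.0, 18.0, 22.0, 27.0, 24.0, 29.0, 33.0, 41.0],
})

# True where the length falls inside the ambiguous range (both ends inclusive).
in_overlap = toy["length"].between(20, 30)

# Fraction of each class that lies inside the overlap range.
share_in_overlap = in_overlap.groupby(toy["label"]).mean()

print(share_in_overlap)
```

The larger these fractions are, the less a classifier can rely on length alone and the more it needs the other features.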