Exploratory data analysis
First, we want to see how many individuals of each class we have. This is important because if the class distribution is very imbalanced (1 to 100, for example), we will have problems training our classification models. You can access data frame columns using dot notation. For example, df.label returns the label column as a pandas Series. Both Series and data frames have all kinds of useful methods for calculating summary statistics. The value_counts() method returns the count of each distinct value in the column:
In []: df.label.value_counts()
Out[]:
platyhog       520
rabbosaurus    480
Name: label, dtype: int64
The class distribution looks okay for our purposes. Now let's explore the features.
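To put a number on "looks okay", the counts can be turned into shares and a simple imbalance ratio. The following is a minimal sketch, not the chapter's own code: the data frame is a stand-in rebuilt from the counts printed above, and imbalance_ratio is a name introduced here for illustration.

```python
import pandas as pd

# Stand-in for the chapter's data frame: only the label column, rebuilt from
# the counts printed above (520 platyhogs, 480 rabbosauruses).
df = pd.DataFrame({"label": ["platyhog"] * 520 + ["rabbosaurus"] * 480})

column = df.label                              # attribute access returns a pandas Series
counts = column.value_counts()                 # absolute counts per class
shares = column.value_counts(normalize=True)   # fractions that sum to 1

# Ratio of the most common class to the rarest one; 1.0 means perfectly balanced.
# (imbalance_ratio is a hypothetical helper name, not part of pandas.)
imbalance_ratio = counts.max() / counts.min()

print(type(column).__name__)
print(shares)
print(round(imbalance_ratio, 3))
```

A ratio close to 1, as here, suggests that no resampling or class weighting is needed before training.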
We need to group our data by class and calculate the feature statistics for each group separately to see how the creature classes differ. This can be done using the groupby() method. It takes the label of the column by which you want to group your data:
In []: grouped = df.groupby('label')
The grouped object exposes the same column labels and most of the same summary methods as the original data frame. Let's look at the descriptive statistics of the length feature:
In []: grouped.length.describe()
Out[]:
What can we learn from this table? Platyhogs have a mean length of about 20 meters, with a standard deviation of about 5 meters. Rabbosauruses are 30 meters long on average, also with a standard deviation of about 5 meters. The smallest platyhog is about 4 meters long, and the largest rabbosaurus is about 48 meters long. That's a lot, but less than the biggest Earth life forms (see Amphicoelias fragillimus, for example).
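The grouped summary can be reproduced in a self-contained way. The sketch below uses a synthetic stand-in, not the chapter's data set: lengths are drawn from normal distributions whose means and standard deviations are taken from the text, so the exact minimum and maximum will differ from the values quoted above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic stand-in, NOT the chapter's data: means and standard deviations
# are taken from the text, everything else is an assumption.
df = pd.DataFrame({
    "label": ["platyhog"] * 520 + ["rabbosaurus"] * 480,
    "length": np.concatenate([
        rng.normal(20, 5, 520),
        rng.normal(30, 5, 480),
    ]),
})

grouped = df.groupby("label")

# describe() returns one row per class with count, mean, std, min,
# quartiles, and max as columns.
summary = grouped["length"].describe()
print(summary[["count", "mean", "std", "min", "max"]].round(1))

# agg() is an alternative when only a few statistics are needed.
print(grouped["length"].agg(["mean", "std"]).round(1))
```

Bracket access (grouped["length"]) is used here instead of dot notation because it also works for column names that are not valid Python identifiers.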
Color distribution can be viewed using the familiar value_counts() method:
In []: grouped.color.value_counts()
Out[]:
label        color
platyhog     light black         195
             purple polka-dot    174
             pink gold           151
rabbosaurus  light black         168
             pink gold           156
             space gray          156
Name: color, dtype: int64
We can represent this in a more appealing form using the unstack() and plot() methods:
In []: plot = grouped.color.value_counts().unstack().plot(kind='barh', stacked=True, figsize=[16,6], colormap='autumn')
Out[]:
It looks like purple polka-dot is a strong predictor of the platyhog class. But if we see a space-gray individual, we can be sure it is a rabbosaurus, and we should run quickly.
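It may help to see what unstack() does before the result is handed to plot(). The sketch below rebuilds the grouped counts by hand from the value_counts() output above (the variable names are introduced here for illustration) and pivots them into a table with one row per class and one column per color.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
index = pd.MultiIndex.from_tuples(
    [
        ("platyhog", "light black"),
        ("platyhog", "purple polka-dot"),
        ("platyhog", "pink gold"),
        ("rabbosaurus", "light black"),
        ("rabbosaurus", "pink gold"),
        ("rabbosaurus", "space gray"),
    ],
    names=["label", "color"],
)
color_counts = pd.Series([195, 174, 151, 168, 156, 156], index=index)

# unstack() pivots the inner index level (color) into columns,
# turning the two-level Series into a data frame.
table = color_counts.unstack()
print(table)

# Combinations that never occur show up as NaN; fill_value replaces them.
table_filled = color_counts.unstack(fill_value=0)
print(table_filled)
```

Each row of the resulting table becomes one stacked bar in the plot, and each column becomes one colored segment.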
In a similar manner, fluffiness distribution can be visualized using:
In []: plot = grouped.fluffy.value_counts().unstack().plot(kind='barh', stacked=True, figsize=[16,6], colormap='winter')
Out[]:
Rabbosauruses come in three colors: light black, pink gold, and space gray. 90% of them are fluffy (the remaining 10% are probably old and bald). Platyhogs, on the other hand, can be light black, pink gold, or purple polka-dot. 30% of them are fluffy (mutants, maybe?).
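Percentages like these can be computed directly rather than read off the chart. The following is a minimal sketch under two assumptions: the fluffy column is encoded as 0/1 (its encoding is not shown in this section), and the rows are synthetic, constructed only to reproduce the percentages quoted above.

```python
import pandas as pd

# Synthetic stand-in: fluffy is ASSUMED to be 0/1, and the row counts are
# chosen so that 30% of platyhogs and 90% of rabbosauruses are fluffy.
df = pd.DataFrame({
    "label": ["platyhog"] * 520 + ["rabbosaurus"] * 480,
    "fluffy": [1] * 156 + [0] * 364 + [1] * 432 + [0] * 48,
})

# The mean of a 0/1 column is the fraction of ones in each group.
fluffy_share = df.groupby("label")["fluffy"].mean()
print(fluffy_share)

# crosstab with normalize="index" gives both fractions per class,
# so every row sums to 1.
share_table = pd.crosstab(df["label"], df["fluffy"], normalize="index")
print(share_table)
```

For a categorical feature with more than two values, such as color, the crosstab form generalizes without changes.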
For more complex data visualization, we need the matplotlib plotting library:
In []: import matplotlib.pyplot as plt
Drawing the histogram of length distribution:
In []:
plt.figure()
# density=True replaces the normed argument, which was removed in Matplotlib 3.1
plt.hist(df[df.label == 'rabbosaurus'].length, bins=15, density=True)
plt.hist(df[df.label == 'platyhog'].length, bins=15, density=True)
plt.title("Length Distribution Histogram")
plt.xlabel("Length")
plt.ylabel("Frequency")
fig = plt.gcf()
plt.show()
Out[]:
In general, platyhogs are smaller, but there is a significant range of overlap, approximately between 20 and 30 meters, where length alone is not enough to discriminate between the two classes.
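The size of that overlap can be estimated by counting how many individuals of each class fall into the 20 to 30 meter band. The sketch below again uses a synthetic stand-in rather than the chapter's data: lengths are drawn from normal distributions with the means and standard deviations reported earlier, so the resulting shares are only indicative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic stand-in, NOT the chapter's data set.
df = pd.DataFrame({
    "label": ["platyhog"] * 520 + ["rabbosaurus"] * 480,
    "length": np.concatenate([
        rng.normal(20, 5, 520),
        rng.normal(30, 5, 480),
    ]),
})

# Boolean mask: True where the length lies in the overlap band (inclusive).
in_overlap = df["length"].between(20, 30)

# Averaging the mask per class gives the share of each class inside the band.
share_in_overlap = in_overlap.groupby(df["label"]).mean()
print(share_in_overlap.round(2))
```

A large share in both classes means that a classifier using length alone would be unreliable in that band, which is why the color and fluffiness features matter.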