Machine Learning with Swift

Loading the dataset

Create and open a new IPython notebook. In the chapter's supplementary materials, you will find the file extraterrestrials.csv. Copy it to the same folder where you created your notebook. In the first cell of your notebook, execute the magic command:

In []: 
%matplotlib inline 

This makes plots render inline, right in the notebook, from this point on.

The library we are using for dataset loading and manipulation is pandas. Let's import it and load the .csv file:

In []: 
import pandas as pd 
df = pd.read_csv('extraterrestrials.csv', sep='\t', encoding='utf-8', index_col=0) 

The object df is a data frame. This is a table-like data structure designed for efficient manipulation of different data types. To see what's inside, execute:

In []: 
df.head() 
Out[]: 

This prints the first five rows of the table. The first three columns (length, color, and fluffy) are features, and the last one is the class label.
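
To make the loading step concrete without the real file, here is a minimal sketch that reads a few tab-separated rows from an in-memory string. The rows, the color values, and the column name label are made up for illustration; only the feature names length, color, and fluffy come from the text above.

```python
import io
import pandas as pd

# Hypothetical tab-separated content; the real extraterrestrials.csv may differ.
# The first (unnamed) column holds the row index.
raw = (
    "\tlength\tcolor\tfluffy\tlabel\n"
    "0\t27.5\tpink gold\t1\trabbosaurus\n"
    "1\t12.0\tlight black\t0\tplatyhog\n"
    "2\t30.1\tpurple polka-dot\t1\trabbosaurus\n"
)

# sep='\t' splits fields on tab characters;
# index_col=0 uses the first column as the row index instead of a feature.
demo = pd.read_csv(io.StringIO(raw), sep="\t", index_col=0)

print(demo.head())          # head() shows at most the first five rows
print(list(demo.columns))   # the remaining columns: three features and the label
```

Because index_col=0 consumes the first column, only the three features and the class label remain as regular columns.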

How many samples do we have in total? Run this code to find out:

In []: 
len(df) 
Out[]: 
1000 
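
As a side note, len() reports only the number of rows. The shape attribute of a data frame gives both dimensions at once. The sketch below uses a small hypothetical frame rather than the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for df: three samples, three features plus a label.
demo = pd.DataFrame({
    "length": [27.5, 12.0, 30.1],
    "color": ["pink gold", "light black", "purple polka-dot"],
    "fluffy": [1, 0, 1],
    "label": ["rabbosaurus", "platyhog", "rabbosaurus"],
})

# shape is a (rows, columns) tuple; len() returns only the row count.
n_rows, n_cols = demo.shape
print(n_rows, n_cols)
print(len(demo) == n_rows)
```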

It looks like most of the samples at the beginning are rabbosauruses. Let's fetch five samples at random to see whether this holds true in other parts of the dataset:

In []: 
df.sample(5) 
Out[]: 

Well, this isn't helpful, as it would be too tedious to analyze the table content in this way. We need some more advanced tools to perform descriptive statistics computations and data visualization.
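
As a rough preview of what such tools look like in pandas, the sketch below counts samples per class and summarizes the numeric columns. It runs on a small hypothetical frame; the values and the column name label are assumptions, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for df; values are made up for illustration.
demo = pd.DataFrame({
    "length": [27.5, 12.0, 30.1, 25.4, 11.2, 29.9],
    "fluffy": [1, 0, 1, 1, 0, 1],
    "label": ["rabbosaurus", "platyhog", "rabbosaurus",
              "rabbosaurus", "platyhog", "rabbosaurus"],
})

# Count how many samples belong to each class,
# instead of eyeballing rows from head() or sample().
counts = demo["label"].value_counts()
print(counts)

# Summary statistics (count, mean, std, min, quartiles, max)
# for the numeric columns.
summary = demo.describe()
print(summary)
```

On the full dataset, the same two calls applied to df would answer the question about class balance directly.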