Understanding model capacity trade-offs
Let's train trees of increasing depth, from a depth of 1 (a single split) up to a depth of 22:
In []:
import matplotlib.pyplot as plt

train_losses = []
test_losses = []
depths = range(1, 23)
for depth in depths:
    # Refit the same tree with a larger depth limit and record the error
    # (1 - accuracy) on both the training and the test sets.
    tree_model.max_depth = depth
    tree_model = tree_model.fit(X_train, y_train)
    train_losses.append(1 - tree_model.score(X_train, y_train))
    test_losses.append(1 - tree_model.score(X_test, y_test))

figure = plt.figure()
plt.plot(depths, train_losses, label="training loss", linestyle='--')
plt.plot(depths, test_losses, label="test loss")
plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=3, ncol=2,
           mode="expand", borderaxespad=0.)
Out[]:
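Besides plotting the curves, we can also locate the best-performing depth programmatically. A minimal sketch, reusing the depths and test_losses lists from the cell above and assuming NumPy is imported as np:
In []:
import numpy as np

# Index of the smallest held-out error; depths[i] corresponds to test_losses[i].
best_depth = depths[int(np.argmin(test_losses))]
best_depth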
On the x axis, we've plotted the tree depth, and on the y axis, the model's error. The phenomenon we observe here is familiar to any machine learning practitioner: as the model gets more complex, it becomes more prone to overfitting. At first, as the model's capacity grows, both training and test loss (error) decrease, but then something strange happens: while the error on the training set continues to go down, the test error starts growing. This means the model fits the training examples so well that it can no longer generalize to unseen data. That's why it is so important to have a held-out dataset and to perform your model validation on it. From the plot, we can see that our more-or-less random choice of max_depth=4 was a lucky one: at this depth, the test error was even lower than the training error.
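In practice, reading the best depth off the test curve would itself leak information from the test set into the model choice. A more disciplined approach is to tune max_depth with cross-validation on the training data only, keeping the test set untouched until the final evaluation. The following is a minimal sketch, not the approach used above: it assumes tree_model, X_train, and y_train are the same objects as in the earlier listings and that scikit-learn 0.18 or newer is installed (for the sklearn.model_selection module):
In []:
from sklearn.model_selection import GridSearchCV

# Score each candidate depth with 5-fold cross-validation on the training data.
param_grid = {'max_depth': list(range(1, 23))}
search = GridSearchCV(tree_model, param_grid, cv=5)
search = search.fit(X_train, y_train)
search.best_params_
The depth picked this way may differ slightly from the minimum of the test curve, but it is chosen without ever looking at the test set.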