Exploring the Iris dataset with scikit-learn and IPython
Posted on Tue 02 July 2013 in misc
Today I'll share some of the goodies I've found while exploring scikit-learn's tutorial, and practice embedding code snippets in posts. I'll try to avoid just rehashing the tutorial. To get started, scikit-learn (and NumPy, which it depends on) should be installed. I also assume we're using IPython, since it makes exploring objects more fun. Any '?' suffixes or '%' prefixes in the code snippets below are IPython features, not plain Python.
from sklearn.datasets import load_iris
iris = load_iris()
The iris dataset is commonly used in the machine-learning and data-mining communities. Enough is known about the properties of the data that practitioners are confident using it as a test case for new algorithms and the like. I like that it comes built in! To start, let's pretend we don't know anything about the iris data.
iris? # Not helpful--what attributes are available?
iris.<TAB> # Tab-complete: .DESCR, .data, .target, .feature_names ...
print(iris.DESCR) # Data summary, features and classes...
iris.data? # Ah, a numpy ndarray
iris.data.shape # (data_points, features) == (150, 4)
iris.target # 1-D array of 0's, 1's, and 2's
iris.target.size # There are 150; these must be the classes
iris.feature_names # Names of the four measurement columns
iris.target_names # Maps the 0/1/2's to species names
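As a sanity check on those targets, NumPy can count the labels for us (a quick aside; we'll lean on NumPy more below):
import numpy as np
np.bincount(iris.target) # array([50, 50, 50]): three balanced classes
np.unique(iris.target) # array([0, 1, 2])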
NumPy's arrays are really nice for indexing and slicing.
row, col = 5, 3
iris.data[row] # Row 5
iris.data[:,col] # Column 3 (there are only 4 columns, 0-3)
iris.data[row,col] # Element at row 5, column 3
iris.data[[1,2,3]] # Return array of rows 1, 2, and 3
iris.data.argmax() # Index of the largest element in the flattened array
iris.data.flatten()[iris.data.argmax()] # 7.9, the largest value
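Boolean masks are another slicing trick worth knowing. This is a small aside beyond the tutorial, but it pairs the data with the targets nicely:
iris.data[iris.target == 0] # All 50 rows whose class is 0
iris.data[iris.target == 0].mean(axis=0) # Per-feature means for that class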
That was just exploring the data with IPython. If we wanted to skip that step entirely, we could jump right in and start learning some rules.
X,y = iris.data, iris.target # Training set and classes
from sklearn.svm import LinearSVC # Support Vector Classifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
# Instantiate each learner
l1 = LinearSVC()
l2 = LogisticRegression()
l3 = KNeighborsClassifier()
new_data = [[ 5.0, 3.6, 1.3, 0.25]] # Predict its class
for learner in [l1, l2, l3]:
    learner.fit(X, y)
    print(learner.predict(new_data)) # Each says class-0
I like the standard interface to train and use each classifier. In addition, the LogisticRegression learner assigns a probability to each possible class.
l2.predict_proba(new_data) # An array of three floats
# Roughly [[ .90, .09, .00]]
l2.predict_proba(new_data).sum() # ~ 1.0
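Those probabilities line up with the classifier's l2.classes_ attribute, which here matches the order of iris.target_names, so we can pair them up (another quick aside):
list(zip(iris.target_names, l2.predict_proba(new_data)[0]))
# Roughly [('setosa', .90), ('versicolor', .09), ('virginica', .00)]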
Each learner also comes with a "score" method for measuring accuracy against provided test data and labels. It's straightforward to randomize the data, split it into 'training' and 'test' sets, and compare the learners.
import numpy as np # We want to use its arrays
rng = np.random.RandomState(42) # Seeded generator, for reproducibility
order = np.arange(150) # Flat array, 0..149
# Shuffle the data, determine the training/testing split
rng.shuffle(order) # Randomize order in place
X = iris.data[order]
y = iris.target[order]
split = 150 * 2 // 3 # 2/3 training, 1/3 testing (integer division)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
# Now train and test
for learner in [l1, l2, l3]:
    learner.fit(X_train, y_train)
    print(learner.score(X_test, y_test))
The resulting scores are .92, .88, and .94 respectively. It appears Nearest Neighbors is the champ today.
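Postscript: scikit-learn can also do the shuffle-and-split dance in one call. Here's a minimal sketch assuming the train_test_split helper (it lives in sklearn.cross_validation as of this writing; newer releases move it to sklearn.model_selection):
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=1.0/3, random_state=42)
for learner in [l1, l2, l3]:
    learner.fit(X_train, y_train)
    print(learner.score(X_test, y_test)) # Scores will vary with the split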