Sunday, 21 October 2012

Learn scikit and never use R again

Hallelujah, my brothers and sisters. Free yourselves from the <- and the $, the "how do I select a row again? I only did this last week!", and the arcane differences between a table and a data.frame. Join with me in using scikit-learn, and never use R again.

Could it be true? No need to ever use R again? Well, that's how it looks to me. Scikit-learn is a Python module for machine learning which seems to replicate almost all of the multivariate analysis modules I used to use in R. Thanks to Nikolas Fechner at the RDKit UGM for tuning me in to this.

Let's see it in action with a simple example that uses an SVM to classify irises (not the eyeball type). First, the R:
library(e1071)
library(MASS)
data(iris)

mysvm <- svm(Species ~ ., iris)
mysvm.pred <- predict(mysvm, iris)
table(mysvm.pred,iris$Species)
# mysvm.pred   setosa versicolor virginica
#   setosa     50      0          0
#   versicolor  0     48          2
#   virginica   0      2         48

And now the Python:
from sklearn import svm, datasets
from sklearn.metrics import confusion_matrix
iris = datasets.load_iris()

mysvm = svm.SVC().fit(iris.data, iris.target)
mysvm_pred = mysvm.predict(iris.data)
# Rows are predictions, columns are true classes, as in the R table
print(confusion_matrix(mysvm_pred, iris.target))
# [[50  0  0]
#  [ 0 48  2]
#  [ 0  0 50]]

This library is fairly new, but there seems to be a lot of momentum in the Python data processing space right now. See also Statsmodels and Pandas. These videos from PyData 2012 give an overview of some of these projects.
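
One thing the R table has that the scikit-learn output lacks is the species labels. Here is a minimal sketch of one way to get them back, assuming pandas is installed alongside scikit-learn. The names predicted, actual and labelled are just my own choices for illustration, and the exact counts may differ between scikit-learn versions because the SVC defaults have changed over time.

```python
import pandas as pd
from sklearn import datasets, svm

iris = datasets.load_iris()
mysvm = svm.SVC().fit(iris.data, iris.target)

# Map the integer class codes back to species names
predicted = pd.Series(iris.target_names[mysvm.predict(iris.data)], name="predicted")
actual = pd.Series(iris.target_names[iris.target], name="actual")

# Rows are predictions, columns are true species, like table() in R
labelled = pd.crosstab(predicted, actual)
print(labelled)
```

As in the R example, this scores the model on the same data it was trained on, so it shows the API rather than how well the classifier generalises.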

2 comments:

Pat Walters said...

Great post, Noel, but I don't think I'm ready to walk away from R just yet ... Although I'm a huge fan of Python and have been using scikit-learn and pandas, there are a few reasons to maintain my allegiance to R:

R is a fantastic tool for interactive data analysis. There are so many great tools for slicing and dicing data, and packages like plyr give R capabilities that I can't find anywhere else. It's possible that all of this can be done with pandas, and I just need to get better at it.

There is incredible breadth in what's available in R. Over the last few years, R has become the standard for academic statistics and machine learning. When I go looking for an implementation of a new method, I can usually find an R package.

R has excellent plotting capabilities, especially with the addition of lattice and ggplot. I haven't found any other package that gives me the power and control over plots that I get with R.

I agree that R can be syntactically strange and that things are not implemented in a consistent fashion. After 10 years of using R, my brain is sufficiently warped that I'm starting to get the hang of it.

Noel O'Boyle said...

I don't disagree with any of this, but I have always found the process of using R frustrating. Unfortunately, R seemed to be so well established that I had despaired of ever being able to dispense with it.