Sunday, 21 October 2012

Learn scikit and never use R again

Hallelujah, my brothers and sisters. Free yourselves from the <- and the $, the "how do I select a row again? I only did this last week!", and the arcane differences between a table and a data.frame. Join with me in using scikit-learn, and never use R again.

Could it be true? No need to ever use R again? Well, that's how it looks to me. Scikit-learn is a Python module for machine learning which seems to replicate almost all of the multivariate analysis modules I used to use in R. Thanks to Nikolas Fechner at the RDKit UGM for tuning me in to this.

Let's see it in action for a simple example that uses SVM to classify irises (not the eyeball type). First, the R:

library(e1071)  # provides svm()
mysvm <- svm(Species ~ ., iris)
mysvm.pred <- predict(mysvm, iris)
table(mysvm.pred, iris$Species)
# mysvm.pred   setosa versicolor virginica
#   setosa     50      0          0
#   versicolor  0     48          2
#   virginica   0      2         48

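And now the Python:

from sklearn import svm, datasets
from sklearn.metrics import confusion_matrix
iris = datasets.load_iris()

# Fit on the four measurements, then predict back on the same data
mysvm = svm.SVC().fit(iris.data, iris.target)
mysvm_pred = mysvm.predict(iris.data)
# Rows are predictions, columns are true species (same layout as the R table)
print(confusion_matrix(mysvm_pred, iris.target))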
# [[50  0  0]
#  [ 0 48  2]
#  [ 0  0 50]]

This library is fairly new, but there seems to be a lot of momentum in the Python data-processing space right now. See also Statsmodels and Pandas. These videos from PyData 2012 give an overview of some of these projects.
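
As a taste of how Pandas fits in with this, here is a minimal sketch that puts the iris data into a DataFrame, pulls out a row, and summarises the measurements by species. It is only an illustration and makes a few assumptions: a reasonably recent pandas and scikit-learn, and the names df, first_row and species_means are just ones I picked for the example.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# One column per measurement, plus the species name as a string.
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target_names[iris.target]

# Select a single row by position.
first_row = df.iloc[0]

# Mean of each measurement for each species.
species_means = df.groupby("species").mean()

print(first_row)
print(species_means)
```

Selecting by position is always df.iloc[...], and selecting by label is df.loc[...], which is at least easy to remember from one week to the next.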


molhacker said...

Great post, Noel, but I don't think I'm ready to walk away from R just yet ... Although I'm a huge fan of Python and have been using scikit-learn and pandas, there are a few reasons to maintain my allegiance to R:

R is a fantastic tool for interactive data analysis. There are so many great tools for slicing and dicing data, and packages like plyr give R capabilities that I can't find anywhere else. It's possible that all of this can be done with pandas, and I just need to get better at it.

There is incredible breadth in what's available in R. Over the last few years, R has become the standard for academic statistics and machine learning. When I go looking for an implementation of a new method, I can usually find an R package.

R has excellent plotting capabilities, especially with the addition of lattice and ggplot. I haven't found any other package that gives me the power and control over plots that I get with R.

I agree that R can be syntactically strange and that things are not implemented in a consistent fashion. After 10 years of using R, my brain is sufficiently warped that I'm starting to get the hang of it.

baoilleach said...

I don't disagree with any of this, but I have always found the process of using R frustrating. Unfortunately, R seemed to be so well established that I had despaired of ever being able to dispense with it.