Monday, 13 January 2014

Convert distance matrix to 2D projection with Python

In my continuing quest to never use R again, I've been trying to figure out how to embed points described by a distance matrix into 2D. Several of the manifold embeddings provided by scikit-learn can do this. The diagram below was generated using metric multidimensional scaling (MDS) on a matrix of pairwise distances between European cities (docs here and here).
import csv
import numpy as np
import matplotlib.pyplot as plt
from sklearn import manifold

# Distance file available from RMDS project:
#    https://github.com/cheind/rmds/blob/master/examples/european_city_distances.csv
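with open("european_city_distances.csv", "r") as f:
    data = list(csv.reader(f, delimiter=';'))

dists = []
cities = []
for d in data:
    # First field is the city name; the rest of the row holds the distances
    cities.append(d[0])
    # Build a real list: on Python 3, map() returns an iterator that
    # np.array cannot turn into a numeric 2D array
    dists.append([float(x) for x in d[1:-1]])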

adist = np.array(dists)
amax = np.amax(adist)
adist /= amax

mds = manifold.MDS(n_components=2, dissimilarity="precomputed", random_state=6)
results = mds.fit(adist)

coords = results.embedding_

plt.subplots_adjust(bottom = 0.1)
plt.scatter(
    coords[:, 0], coords[:, 1], marker = 'o'
    )
for label, x, y in zip(cities, coords[:, 0], coords[:, 1]):
    plt.annotate(
        label,
        xy = (x, y), xytext = (-20, 20),
        textcoords = 'offset points', ha = 'right', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))

plt.show()
Notes: If you don't specify a random_state, a slightly different embedding may be generated each time, with an arbitrary rotation in the 2D plane. If it's slow, you can use multiple CPUs via n_jobs=N.
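To see why a differently rotated embedding is just as valid, here is a minimal sketch using only NumPy. The coordinates are made up for illustration (they are not the city data), and pairwise_distances and rotate are hypothetical helper names. It checks that rotating or mirroring a set of 2D points leaves all the pairwise distances unchanged, and those distances are the only thing the embedding is fitted to.

```python
import numpy as np

def pairwise_distances(points):
    """Return the matrix of Euclidean distances between all rows of points."""
    diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def rotate(points, angle):
    """Rotate 2D points anticlockwise about the origin by angle (radians)."""
    c, s = np.cos(angle), np.sin(angle)
    rotation = np.array([[c, -s], [s, c]])
    return points @ rotation.T

# Made-up 2D coordinates standing in for an MDS embedding
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
rotated = rotate(coords, np.pi / 3)
mirrored = coords * np.array([-1.0, 1.0])  # reflect in the y axis

original_dists = pairwise_distances(coords)
rotation_preserves = np.allclose(original_dists, pairwise_distances(rotated))
reflection_preserves = np.allclose(original_dists, pairwise_distances(mirrored))
print(rotation_preserves, reflection_preserves)
```

So if the map comes out upside down or mirrored for a given random_state, nothing is wrong with it; trying another seed or flipping an axis before plotting is enough.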

6 comments:

Björn Grüning said...

Hi Noel,

thanks for sharing. We did something similar in ChemicalToolBox to display an NxN matrix with mds.
https://github.com/bgruening/galaxytools/blob/master/chemicaltoolbox/ctb_machine_learning/mds_plot.py

The input matrix on the other hand can be created for example with chemfp.

Cheers,
Bjoern

Noel O'Boyle said...

Nice. One more reason for me to hurry up and try out ChemicalToolBox.

I remember with R that I used to use cmdscale, classical MDS, which is slightly different. I wonder if scikit-learn intends to implement this...
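For anyone curious about the difference: classical MDS gets its coordinates from an eigendecomposition of the double-centred squared distances, rather than by iteratively minimising stress. Below is a rough NumPy sketch of that textbook procedure. It is not the cmdscale source and not anything from scikit-learn; classical_mds is a hypothetical name, and the rectangle is toy data chosen so the result is easy to check.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Classical (Torgerson) MDS of a square, symmetric distance matrix."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-centre the squared distances to get a Gram (inner product) matrix
    centring = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centring @ (dist ** 2) @ centring
    # eigh returns eigenvalues in ascending order; keep the largest ones
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip negative eigenvalues (non-Euclidean input or rounding error)
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Toy data: the four corners of a 3 x 4 rectangle
points = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0], [0.0, 4.0]])
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

coords = classical_mds(dist)
new_diff = coords[:, None, :] - coords[None, :, :]
recovered = np.sqrt((new_diff ** 2).sum(axis=-1))
print(np.allclose(recovered, dist))
```

Because the toy distances really do come from points in a plane, the recovered distances match the input; with real-world distances such as the city data the match would only be approximate.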

Sixtine Vervial said...

Hi Noel,

nice post, very inspiring!
I'm currently looking for a dataset like yours, displaying distance between cities in Europe. May I ask where you found yours? Thxs in advance!

Noel O'Boyle said...

I give the link to the source in the blogpost.

Jean-Baptiste Pressac said...

Hello,
Thank you for sharing this code. Do you know the difference between a Euclidean and a precomputed dissimilarity? How do I choose between them?
Thanks,

Noel O'Boyle said...

The input to 'fit' depends on the choice. If precomputed, you pass a distance matrix; if euclidean, you pass a set of feature vectors and it uses the Euclidean distance between them as the distances. (To my mind, this is just confusing.)
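To make the two kinds of input concrete, here is a small NumPy-only sketch. The feature vectors X are made up for illustration, and the distance calculation is my own, not scikit-learn's internal code. X is the sort of array you would pass with the euclidean setting; D is the square distance matrix you would pass with precomputed.

```python
import numpy as np

# Hypothetical feature vectors: 4 samples, 3 features each
X = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 2.0],
])

# Pairwise Euclidean distances between the rows of X
diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

print(X.shape)  # (4, 3): one row per sample
print(D.shape)  # (4, 4): one row and one column per sample
```

In other words, with precomputed you do the distance calculation yourself (or load it from a file, as in the post), which is the only option when your distances are not Euclidean distances between feature vectors.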