Monday 13 January 2014

Convert distance matrix to 2D projection with Python

In my continuing quest to never use R again, I've been trying to figure out how to embed points described by a distance matrix into 2D. This can be done with several manifold embeddings provided by scikit-learn. The diagram below was generated using metric multi-dimensional scaling based on a distance matrix of pairwise distances between European cities (docs here and here).
import csv
import numpy as np
import matplotlib.pyplot as plt
from sklearn import manifold

# Distance file available from RMDS project:
#    https://github.com/cheind/rmds/blob/master/examples/european_city_distances.csv
with open("european_city_distances.csv") as csvfile:
    data = list(csv.reader(csvfile, delimiter=';'))

dists = []
cities = []
for d in data:
    cities.append(d[0])
    # list(...) is needed on Python 3, where map() returns a lazy iterator.
    dists.append(list(map(float, d[1:-1])))

adist = np.array(dists)
amax = np.amax(adist)
adist /= amax

mds = manifold.MDS(n_components=2, dissimilarity="precomputed", random_state=6)
results = mds.fit(adist)

coords = results.embedding_

plt.subplots_adjust(bottom = 0.1)
plt.scatter(
    coords[:, 0], coords[:, 1], marker = 'o'
    )
for label, x, y in zip(cities, coords[:, 0], coords[:, 1]):
    plt.annotate(
        label,
        xy = (x, y), xytext = (-20, 20),
        textcoords = 'offset points', ha = 'right', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))

plt.show()
Notes: If you don't specify a random_state, a slightly different embedding may be generated each time (with an arbitrary rotation in the 2D plane). If it's slow, you can use multiple CPUs via n_jobs=N.
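For example, a minimal sketch (random_state and n_jobs are both standard scikit-learn MDS arguments; 4 CPUs is an arbitrary choice here):

mds = manifold.MDS(n_components=2, dissimilarity="precomputed",
                   random_state=6, n_jobs=4)  # parallelise the SMACOF restarts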

11 comments:

Unknown said...

Hi Noel,

thanks for sharing. We did something similar in ChemicalToolBox to display an NxN matrix with mds.
https://github.com/bgruening/galaxytools/blob/master/chemicaltoolbox/ctb_machine_learning/mds_plot.py

The input matrix, on the other hand, can be created with chemfp, for example.

Cheers,
Bjoern

Noel O'Boyle said...

Nice. One more reason for me to hurry up and try out ChemicalToolBox.

I remember with R that I used to use cmdscale, classical MDS, which is slightly different. I wonder if scikit-learn intends to implement this...
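Classical MDS (what cmdscale does) is just an eigendecomposition of the double-centred squared distance matrix, so a few lines of numpy cover it. A sketch only; the helper name classical_mds is made up, and it hasn't been checked against cmdscale's exact output:

import numpy as np

def classical_mds(D, k=2):
    # D is a symmetric n x n distance matrix; returns n x k coordinates.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centred squared distances
    eigvals, eigvecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:k]        # keep the k largest
    scale = np.sqrt(np.maximum(eigvals[idx], 0))
    return eigvecs[:, idx] * scale

# e.g. coords = classical_mds(adist)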

Sixtine Vervial said...

Hi Noel,

nice post, very inspiring!
I'm currently looking for a dataset like yours, with distances between cities in Europe. May I ask where you found yours? Thanks in advance!

Noel O'Boyle said...

I give the link to the source in the blogpost.

Jean-Baptiste said...

Hello,
Thank you for sharing this code. Do you know the difference between a Euclidean and a precomputed dissimilarity? How do you choose between them?
Thanks,

Noel O'Boyle said...

The input to 'fit' depends on the choice. If precomputed, you pass a distance matrix; if euclidean, you pass a set of feature vectors and it uses the Euclidean distance between them as the distances. (To my mind, this is just confusing.)
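A minimal sketch of the two call patterns (the toy data X here is made up purely for illustration):

import numpy as np
from sklearn import manifold
from scipy.spatial.distance import pdist, squareform

X = np.random.RandomState(0).rand(10, 5)    # toy data: 10 samples, 5 features

# dissimilarity="euclidean" (the default): pass the feature vectors themselves
# and MDS computes the Euclidean distances internally.
coords_euc = manifold.MDS(n_components=2, random_state=6).fit_transform(X)

# dissimilarity="precomputed": pass an n x n symmetric distance matrix instead.
D = squareform(pdist(X))
coords_pre = manifold.MDS(n_components=2, dissimilarity="precomputed",
                          random_state=6).fit_transform(D)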

Anonymous said...

Hi Noel,
Very nice post; I found something I was looking for.
I have used PCA for my analysis and would like to know what the difference is between PCA and NMDS. How do you choose between them?

Thanks!

Noel O'Boyle said...

The method described here reproduces a distance matrix in a lower dimension. PCA leaves the points where they are (at all the same distances - many people seem unaware of this) but rotates the axes so that the first one points along the direction of greatest variance, the second one along the next direction of variance, and so on.

Which one is right for you depends on what you want to do.
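That point about PCA preserving all the pairwise distances is easy to check numerically; a sketch on made-up toy data:

import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial.distance import pdist

X = np.random.RandomState(0).rand(20, 6)     # toy data: 20 points in 6 dimensions

# A full PCA is just a (centred) rotation, so every pairwise distance survives.
rotated = PCA(n_components=6).fit_transform(X)
print(np.allclose(pdist(X), pdist(rotated)))  # True

# Keeping only the first two components then drops the low-variance directions,
# whereas MDS instead searches for 2D coordinates that best reproduce the
# full distance matrix.
pca_2d = PCA(n_components=2).fit_transform(X)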

Unknown said...

Hi everyone, I have a doubt; please clarify it if anyone knows the answer. Based on which metric does the conversion of a distance matrix to coordinates happen? One more question: what do the 'features' mean here?

Ulyses Rico Rea said...

Hello, I'm kind of new to these data science topics. I was asked at school to do some work but I don't know how to start. Here's my problem.
I was asked to apply two methods, 1) "rule and compass" and 2) "eigenvectors & eigenvalues", to a collection of 10 documents (so a 10 x 10 matrix). I have already built the tf-idf matrix for all those documents, so how can I apply methods 1) and 2) to visualize the information?

Am I okay, or am I kind of lost?

Thanks!

Maggie Don Lau said...

Hi Noel, thanks for sharing the code. I have a problem running the code.

amax = np.amax(adist)
TypeError: '>=' not supported between instances of 'map' and 'map'


results = mds.fit(adist)
TypeError: float() argument must be a string or a real number, not 'map'

I am using Python 3.7.
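Those two errors come from map() on Python 3, where it returns a lazy iterator rather than a list, so np.array never sees the numbers; wrapping the call in list() is the one-line fix:

    dists.append(list(map(float, d[1:-1])))   # materialise the floats before np.array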