```python
import csv

import numpy as np
import matplotlib.pyplot as plt
from sklearn import manifold

# Distance file available from the RMDS project:
# https://github.com/cheind/rmds/blob/master/examples/european_city_distances.csv
reader = csv.reader(open("european_city_distances.csv", "r"), delimiter=';')
data = list(reader)

dists = []
cities = []
for d in data:
    cities.append(d[0])
    # Note: in Python 3, a bare map() is a lazy iterator, which np.array
    # cannot turn into a numeric matrix - materialise it as a list of floats
    dists.append([float(x) for x in d[1:-1]])

# Scale the distance matrix to the range [0, 1]
adist = np.array(dists)
amax = np.amax(adist)
adist /= amax

mds = manifold.MDS(n_components=2, dissimilarity="precomputed", random_state=6)
results = mds.fit(adist)
coords = results.embedding_

plt.subplots_adjust(bottom=0.1)
plt.scatter(coords[:, 0], coords[:, 1], marker='o')
for label, x, y in zip(cities, coords[:, 0], coords[:, 1]):
    plt.annotate(
        label,
        xy=(x, y), xytext=(-20, 20),
        textcoords='offset points', ha='right', va='bottom',
        bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5),
        arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))

plt.show()
```

**Notes:** If you don't specify a `random_state`, then a slightly different embedding may be generated each time (with an arbitrary rotation in the 2D plane). If it's slow, you can use multiple CPUs via `n_jobs=N`.

## 8 comments:

Hi Noel,

thanks for sharing. We did something similar in ChemicalToolBox to display an NxN matrix with mds.

https://github.com/bgruening/galaxytools/blob/master/chemicaltoolbox/ctb_machine_learning/mds_plot.py

The input matrix on the other hand can be created for example with chemfp.

Cheers,

Bjoern

Nice. One more reason for me to hurry up and try out ChemicalToolBox.

I remember with R that I used to use cmdscale, classical MDS, which is slightly different. I wonder if scikit-learn intends to implement this...
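For what it's worth, classical MDS (what R's cmdscale does) is simple enough to sketch directly in NumPy: double-centre the matrix of squared distances and take the top eigenvectors. This is just an illustrative sketch (the `cmdscale` function name and the toy distance matrix are made up here), not scikit-learn code:

```python
import numpy as np

def cmdscale(D, k=2):
    """Classical (Torgerson) MDS: embed an n x n distance matrix in k dimensions."""
    n = D.shape[0]
    # Double-centre the matrix of squared distances
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # Eigendecomposition; keep the k largest eigenvalues (clipped at zero)
    evals, evecs = np.linalg.eigh(B)
    idx = np.argsort(evals)[::-1][:k]
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0))

# Toy example: three collinear points at positions 0, 3 and 5 on a line
D = np.array([[0., 3., 5.],
              [3., 0., 2.],
              [5., 2., 0.]])
coords = cmdscale(D, k=1)
```

For a Euclidean distance matrix like this one, the pairwise distances of the embedding reproduce the input exactly.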

Hi Noel,

nice post, very inspiring!

I'm currently looking for a dataset like yours, displaying distance between cities in Europe. May I ask where you found yours? Thxs in advance!

I give the link to the source in the blogpost.

Hello,

Thank you for sharing this code. Do you know the difference between a Euclidean and a precomputed dissimilarity? How to choose between them?

Thanks,

The input to 'fit' depends on the choice. If precomputed, you pass a distance matrix; if euclidean, you pass a set of feature vectors and it uses the Euclidean distance between them as the distances. (To my mind, this is just confusing.)
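To make this concrete, here is a minimal sketch (the feature vectors are made up) showing that the two options agree when the precomputed matrix is the Euclidean distance matrix of the vectors:

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])  # feature vectors

# Option 1: pass the vectors and let MDS compute Euclidean distances itself
mds_euc = MDS(n_components=2, dissimilarity="euclidean", random_state=6)
emb1 = mds_euc.fit(X).embedding_

# Option 2: compute the distance matrix yourself and mark it as precomputed
D = pairwise_distances(X)  # 4 x 4 symmetric distance matrix
mds_pre = MDS(n_components=2, dissimilarity="precomputed", random_state=6)
emb2 = mds_pre.fit(D).embedding_
```

Since the distances and the random seed are the same, the two embeddings come out identical; "precomputed" is what you want when all you have is a distance matrix (as with the city distances above).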

Hi Noel,

Very nice post and I found something I was looking for.

I have used PCA for my analysis and would like to know if you have any idea what the difference is between PCA and NMDS. How to choose between them?

Thanks!

The method described here reproduces a distance matrix in a lower dimension. PCA leaves the points where they are (at all the same distances - many people seem unaware of this) but rotates the axes so that the first one points along the direction of greatest variance, the second one along the next direction of variance, and so on.
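The point about PCA leaving all pairwise distances unchanged is easy to check numerically; a small sketch with made-up random data (keeping all components, so PCA is just a centring plus rotation):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))  # 10 made-up points in 3D

# Keep all 3 components: the transform is then only a rotation of centred data
X_rot = PCA(n_components=3).fit_transform(X)

# Every pairwise distance survives the rotation unchanged
print(np.allclose(pairwise_distances(X), pairwise_distances(X_rot)))  # True
```

Dropping components (e.g. keeping only the first two) is where PCA starts to distort distances, whereas MDS optimises the low-dimensional layout to reproduce them as well as possible.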

Which one is right for you depends on what you want to do.
