import csv

import numpy as np
import matplotlib.pyplot as plt
from sklearn import manifold

# Distance file available from RMDS project:
#    https://github.com/cheind/rmds/blob/master/examples/european_city_distances.csv
with open("european_city_distances.csv", "r") as f:
    data = list(csv.reader(f, delimiter=';'))

dists = []
cities = []
for d in data:
    cities.append(d[0])
    # Build a real list of floats: in Python 3, map() returns a lazy iterator
    dists.append([float(x) for x in d[1:-1]])

# Scale the distances so that the largest is 1
adist = np.array(dists)
amax = np.amax(adist)
adist /= amax

mds = manifold.MDS(n_components=2, dissimilarity="precomputed", random_state=6)
results = mds.fit(adist)

coords = results.embedding_

plt.subplots_adjust(bottom = 0.1)
plt.scatter(
    coords[:, 0], coords[:, 1], marker = 'o'
    )
for label, x, y in zip(cities, coords[:, 0], coords[:, 1]):
    plt.annotate(
        label,
        xy = (x, y), xytext = (-20, 20),
        textcoords = 'offset points', ha = 'right', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))

plt.show()

Notes: If you don't specify a random_state, then a slightly different embedding may be generated each time (with arbitrary rotation) in the 2D plane. If it's slow, you can use multiple CPUs via n_jobs=N.
Monday, 13 January 2014
Convert distance matrix to 2D projection with Python
In my continuing quest to never use R again, I've been trying to figure out how to embed points described by a distance matrix into 2D. This can be done with several manifold embeddings provided by scikit-learn. The diagram below was generated using metric multi-dimensional scaling based on a distance matrix of pairwise distances between European cities (docs here and here).
Hi Noel,
thanks for sharing. We did something similar in ChemicalToolBox to display an NxN matrix with mds.
https://github.com/bgruening/galaxytools/blob/master/chemicaltoolbox/ctb_machine_learning/mds_plot.py
The input matrix, on the other hand, can be created with chemfp, for example.
Cheers,
Bjoern
Nice. One more reason for me to hurry up and try out ChemicalToolBox.
I remember with R that I used to use cmdscale, classical MDS, which is slightly different. I wonder if scikit-learn intends to implement this...
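For comparison, classical MDS can be sketched in a few lines of NumPy: double-center the squared distances, then take the top eigenvectors. This is only a rough sketch and not part of scikit-learn; the function name classical_mds and the toy points are made up for illustration, and it assumes a symmetric distance matrix.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Classical (Torgerson) MDS on a symmetric distance matrix (hypothetical helper)."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-center the squared distances: B = -1/2 * J * D^2 * J
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (dist ** 2) @ centering
    # eigh returns eigenvalues in ascending order, so pick the largest ones
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip small negative eigenvalues (round-off or non-Euclidean input)
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Toy check with four made-up 2D points: the layout should be
# recovered up to rotation/reflection, so the distances should match.
points = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0], [0.0, 4.0]])
toy_dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
embedded = classical_mds(toy_dist)
recovered = np.linalg.norm(embedded[:, None, :] - embedded[None, :, :], axis=-1)
print(np.allclose(recovered, toy_dist))
```

Unlike the iterative metric MDS used in the post, this has no random starting point, so it gives the same layout on every run.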
Hi Noel,
nice post, very inspiring!
I'm currently looking for a dataset like yours, giving the distances between cities in Europe. May I ask where you found yours? Thanks in advance!
I give the link to the source in the blogpost.
Hello,
Thank you for sharing this code. Do you know the difference between a Euclidean and a precomputed dissimilarity? How do I choose between them?
Thanks,
The input to 'fit' depends on the choice. If precomputed, you pass a distance matrix; if euclidean, you pass a set of feature vectors and it uses the Euclidean distance between them as the distances. (To my mind, this is just confusing.)
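To make the distinction concrete, here is a NumPy-only sketch of the step that the euclidean setting does for you, namely turning feature vectors into a distance matrix. The feature values are made up, and the last two comments just restate what gets passed to fit in each case.

```python
import numpy as np

# Made-up feature vectors: 4 samples with 3 features each
features = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 2.0],
])

# Pairwise Euclidean distances via broadcasting:
# dist[i, j] = ||features[i] - features[j]||
diff = features[:, None, :] - features[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# dissimilarity="euclidean":   pass `features` (n_samples x n_features) to fit
# dissimilarity="precomputed": pass `dist` (n_samples x n_samples) to fit
print(features.shape, dist.shape)
```

So if all you have is a table of distances, as with the cities here, precomputed is the only option that applies.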
Hi Noel,
Very nice post, and I found something I was looking for.
I have used PCA for my analysis and would like to know if you have any idea what the difference is between PCA and NMDS. How do I choose between them?
Thanks!
The method described here tries to reproduce a distance matrix in a lower dimension. PCA leaves the points where they are (at all the same distances - many people seem unaware of this) but rotates the axes so that the first one points along the direction of greatest variance, the second one along the direction of next-greatest variance, and so on.
Which one is right for you depends on what you want to do.
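The point about PCA keeping the distances can be checked with a small NumPy sketch. This does PCA by hand via an SVD of the centered data rather than through any particular library class, and the data is random and made up. Note that the distances are only preserved while all components are kept; once you drop some, they change.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: 50 points in 3D with very different spread along each axis
x = rng.normal(size=(50, 3)) * np.array([5.0, 2.0, 0.5])

def pairwise(a):
    """All pairwise Euclidean distances between the rows of a."""
    return np.linalg.norm(a[:, None, :] - a[None, :, :], axis=-1)

# PCA via SVD of the centered data; the rows of vt are the principal axes
centered = x - x.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
rotated = centered @ vt.T  # coordinates on the new axes, all 3 kept

# With every component kept, the pairwise distances are unchanged
full_ok = np.allclose(pairwise(x), pairwise(rotated))
# The variance along the new axes is in decreasing order
variances = rotated.var(axis=0)
# Dropping the last component is where the distances start to change
dropped_ok = np.allclose(pairwise(x), pairwise(rotated[:, :2]))

print(full_ok, dropped_ok)
```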
Hi everyone, I have a question; please clarify it if anyone knows the answer. Based on which metric does the conversion of the distance matrix to coordinates happen? One more question: what does "features" mean here?
Hello, I'm kind of new to these data science topics. I was asked at school to do some work, but I don't know how to start. Here's my problem.
I was asked to apply 1) the "Rule and compass" and 2) the "eigenvectors & eigenvalues" methods to a collection of 10 documents (so a 10 x 10 matrix). I have already made the tf-idf matrix for all those documents, so how can I apply methods 1) and 2) to visualize the information?
Am I okay, or am I kind of lost?
Thanks!
Hi Noel, thanks for sharing the code. I have a problem running the code.
amax = np.amax(adist)
TypeError: '>=' not supported between instances of 'map' and 'map'
results = mds.fit(adist)
TypeError: float() argument must be a string or a real number, not 'map'
I am using Python 3.7.
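Both errors are what you get in Python 3 when the result of map(float, ...) is stored without first being turned into a list: map returns a lazy iterator there, so NumPy ends up with an array of map objects instead of numbers. A minimal sketch of the difference, using a made-up row (the values and the trailing empty field are assumptions about the layout, not copied from the CSV):

```python
# Made-up row: city name, three distances, and a trailing empty field
row = ["Athens", "0", "3313", "2963", ""]

# Python 3: map() gives a lazy map object, not a list of numbers
lazy = map(float, row[1:-1])

# A list comprehension gives a real list of floats that NumPy can convert
as_list = [float(x) for x in row[1:-1]]

print(type(lazy).__name__)  # map
print(as_list)              # [0.0, 3313.0, 2963.0]
```

Appending lists built this way (or wrapping the call as list(map(float, ...))) should make both TypeErrors go away.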