Noel O'Blog

Friday, 31 January 2014

Turning vim into an IDE: Part 1

Probably like many other users of vim (or gvim), I have never taken the time to set it up so that it works like a modern IDE with auto-completion, syntax checking, and whatever else. I've decided to start sorting this out.

My starting point is the following _gvimrc (in C:\Users\noel) which sets the behaviour of the tab key and indentation for several different file formats. I also highlight whitespace at the end of lines so that I don't add it by accident.

autocmd FileType python set expandtab
autocmd FileType python set sw=4
autocmd FileType python set ts=4
autocmd FileType python set autoindent

autocmd FileType cmake set expandtab
autocmd FileType cmake set sw=2
autocmd FileType cmake set ts=2
autocmd FileType cmake set autoindent

autocmd FileType html set expandtab
autocmd FileType html set sw=2
autocmd FileType html set ts=2
autocmd FileType html set autoindent

autocmd FileType javascript set expandtab
autocmd FileType javascript set sw=2
autocmd FileType javascript set ts=2
autocmd FileType javascript set autoindent

autocmd FileType css set expandtab
autocmd FileType css set sw=2
autocmd FileType css set ts=2
autocmd FileType css set autoindent

" Highlight whitespace at end of line
highlight ExtraWhitespace ctermbg=red guibg=red
autocmd Syntax * syn match ExtraWhitespace /\s\+$\| \+\ze\t/

" Don't create those annoying backup files ending with ~
set nobackup

Testing webapps from Python with Selenium

I'm a great believer in tests to ensure code quality, although naturally I don't always practise what I preach. Recently I've been working on a webapp with a Python server backend. I can test the server independently of the webapp, but testing the webapp itself involves clicking on various things, waiting for the server to respond, and then checking the result. Doing this manually quickly becomes a chore and that's without even considering different browsers.

The solution is Python of course (as usual on this blog). The aptly-named Selenium allows you to automate this sort of thing from the point of view of writing tests. It's available in several languages (I think they all use the same server backend) but I'm going to focus here on the Python. "pip install selenium" on Windows installs the basics; this will allow you to test against Firefox and Safari. For Chrome and IE, you need to download two additional files and add them to the PATH.

The Selenium docs are a bit confusing because (a) they keep referring back to an earlier version of the software (Selenium 1) which had different behaviour and (b) the commands for the Selenium IDE are very similar but not the same as the commands for the Python bindings.

Anyhoo, here's a modified version of the Selenium script that I'm using just to give some idea of how it works. When you run this at the command-line, a new Firefox window pops up, resizes to the specified dimensions, flies through the test, shuts down, and then I get the usual unittest output from the Python script. Specifying a browser name (e.g. Chrome) runs the test against Chrome instead.

import sys
import unittest

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

import selenium.webdriver.support.ui as ui

browsers = {"firefox": webdriver.Firefox, "ie": webdriver.Ie, "chrome": webdriver.Chrome, "safari": webdriver.Safari}

class Tests(unittest.TestCase):
    def setUp(self):
        self.driver = browser()
        self.driver.set_window_size(1380, 880)
        self.driver.get("file:///C:/Tools/myfile.htm")

        # Convenience functions
        self.driver.by_id = self.driver.find_element_by_id
        self.driver.by_xpath = self.driver.find_element_by_xpath
        self.wait = ui.WebDriverWait(self.driver, 5)

    def test_the_webpage(self):
        d = self.driver
        d.refresh()

        self.drag_and_drop(d.by_id("ID1"), d.by_xpath("//div[@id='things']/img[3]"))
        self.wait.until(lambda d: d.by_xpath("//table[@id='results']"))
        self.assertEqual("45", d.by_xpath("//table[@id='results']/tbody/tr/td[2]").text)

        d.by_id("thingy").click()
        self.wait.until(lambda d: d.by_xpath("//table[@id='results']"))
        self.assertEqual("64", d.by_xpath("//table[@id='results']/tbody/tr/td[2]").text)

    def drag_and_drop(self, elem, target):
        ActionChains(self.driver).drag_and_drop(elem, target).perform()

    def tearDown(self):
        self.driver.quit()

if __name__ == "__main__":
    browsername = "firefox"
    if len(sys.argv) == 2:
        browsername = sys.argv[1].lower()
    browser = browsers[browsername]

    suite = unittest.TestLoader().loadTestsFromTestCase(Tests)
    unittest.TextTestRunner().run(suite)

Note: For another Python project related to browser automation, see also twill.

Monday, 13 January 2014

Convert distance matrix to 2D projection with Python

In my continuing quest to never use R again, I've been trying to figure out how to embed points described by a distance matrix into 2D. This can be done with several manifold embeddings provided by scikit-learn. The diagram below was generated using metric multi-dimensional scaling based on a distance matrix of pairwise distances between European cities (docs here and here).

import csv
import numpy as np
import matplotlib.pyplot as plt
from sklearn import manifold

# Distance file available from RMDS project:
#    https://github.com/cheind/rmds/blob/master/examples/european_city_distances.csv
reader = csv.reader(open("european_city_distances.csv", "r"), delimiter=';')
data = list(reader)

dists = []
cities = []
for d in data:
    cities.append(d[0])
    dists.append(map(float , d[1:-1]))

adist = np.array(dists)
amax = np.amax(adist)
adist /= amax

mds = manifold.MDS(n_components=2, dissimilarity="precomputed", random_state=6)
results = mds.fit(adist)

coords = results.embedding_

plt.subplots_adjust(bottom = 0.1)
plt.scatter(
    coords[:, 0], coords[:, 1], marker = 'o'
    )
for label, x, y in zip(cities, coords[:, 0], coords[:, 1]):
    plt.annotate(
        label,
        xy = (x, y), xytext = (-20, 20),
        textcoords = 'offset points', ha = 'right', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))

plt.show()

Notes: If you don't specify a random_state, then a slightly different embedding may be generated each time (with arbitary rotation) in the 2D plane. If it's slow, you can use multiple CPUs via n_jobs=N.

Wednesday, 1 January 2014

QM Speed Test: NWChem

When I started the QM Speed Test I was afraid that I'd lose steam (and face!) before getting too far along. At the same time, I hoped that others might be inspired to send in some results for their systems and QM packages. This would make it easier for me (since half the work is setting up the input file), verify that my results were not pure invention, and extend the speed test to packages to which I don't have access.

I've just checked in the results for NWChem 6.3 compiled on Centos (also 6.3). But in the meanwhile, others have shot ahead by forking the speed test and checking in their own results:

Eric Berquist, a grad student at the Uni of Pittsburgh, has checked in results for Dalton, ORCA, and Q-Chem.
Michael Banck, Debichem guru, has checked in results for all those packages available on Debian 7 (and presumably Ubuntu and other derivatives) namely NWChem, MPQC, and PSI4, as well as ERKALE.

Dudes, I salute you.

Let's start with my results. Rather than repeat the table here, you can check out the figures on the Github page. The summary is that NWChem is quite a bit faster at geo-opts than ERKALE. This is a combination of more steps in the geo-opt, and faster energy calculations. So if you are using some of ERKALE's other features, it would probably make sense to geo-opt first with NWChem. I did recompile ERKALE with "-O3 -ffast-math -funroll-loops -fPIC -march=native" and got a 10% improvement but that's not enough to change things.

From the other calculations, what do we now know? Well all of the packages tested can actually do geo-opts at the HF/6-31G and B3LYP/6-31G level of theory. Figuring this out by looking at their websites is not always as easy as you might think. Also worth noting is the relative slowdown when using B3LYP versus HF. In some cases the slowdown is of the order of 30-40%, in other cases the DFT calculations are twice as slow.

Michael's show that NWChem is faster than any of the other Open Source QM packages tested. Very useful information; again, it points at a large time-saving by using NWChem for the initial geo-opt before turning to whatever package you need for the subsequent analysis. Eric has compared several proprietary packages (I was going to say commercial but I can't tell if ORCA is available for purchase) and Q-Chem is much faster than ORCA and Dalton.

What we haven't looked at is scaling. Maybe the orders all change when we double the size of the molecule and increase the basis set. But that's a question for another day. Right now, the goal is to get as much breadth as possible on the original question.

Notes on compiling NWChem

1. Boy were there some quirks with this one. First of all, although the instructions on the website give everything in terms of the CShell (remember csh? - big on Unix, not so big these days), you can just use the bash equivalents.
2. I wanted only single-CPU calculations so I tried to turn off MPI support. My first attempt "USE_MPI=n" caused build failures, and the only way to get around build failures appeared to be to delete the folder and untar the source again. Following the instructions on this post, I used "export ARMCI_NETWORK=SOCKETS" instead.
3. My favourite quirk was that when you untar the source, the generated folder has a really long name which causes a compile failure after some time as follows:

The directory name chosen for NWCHEM_TOP is longer than
the maximum allowed value of 65 characters
current NWCHEM_TOP=/home/noel/Tools/Quantum/nwchem-6.3.revision2-src.2013-10-17/src

Renaming to "nwchem" also failed (different error - can't find something or other I think). Renaming to "nwchem-6.3" worked.
4. I couldn't figure out how to get NWChem to use the system lapack library but got it to link to blas as described in the docs.

Saturday, 21 December 2013

Top 5 favourite blogs of chemistry bloggers

I'm always looking for ways to free up time on the webs, and my New Year's resolution is to avoid reading all Top N lists. These ubiquitous lists somehow compel me to read the contents no matter how trivial. No more. Well, no more after this, my own top N list.

A list of all chemistry blogs is maintained by Peter Maas over at Chemical Blogspace (CB). You can subscribe to one of the feeds over there to keep on top of all of them at the same time (or a relevant subset). If you have a chemistry blog and it's not on there, you should submit it to the site and your readership will balloon overnight.

Keeping on top of new chemistry blogs is tricky though. One way I thought to find new blogs was to collate the blog rolls of all of the existing blogs on CB; to a first approximation one can do this by collating all of the links on each blog's front page. This gives some idea of which blogs other bloggers read.

The Top 5 are Derek Lowe's "In the Pipeline", Egon Willighagen's "Chem-bla-ics", Paul Bracher's "ChemBark", Milkshake's "Orp Prep Daily" and Nature Chemistry's "The Sceptical Chymist".

And here are the raw results (code at the end of post) sorted by frequency of occurrence.

34 http://pipeline.corante.com/
31 https://subscribe.wordpress.com/
27 http://wordpress.org/
19 http://chem-bla-ics.blogspot.com/
18 http://blog.chembark.com/
17 http://wordpress.com/
16 http://orgprepdaily.wordpress.com/
16 http://blogs.nature.com/thescepticalchymist/
15 http://www.chemspider.com/blog/
12 http://gaussling.wordpress.com/
12 http://ashutoshchemist.blogspot.com/
12 http://www.blogger.com/
11 http://depth-first.com/
11 http://wwmm.ch.cam.ac.uk/blogs/murrayrust/
11 http://www.chemistry-blog.com/
10 http://www.thechemblog.com/
10 http://curlyarrow.blogspot.com/
10 http://usefulchem.blogspot.com/
10 http://www.statcounter.com/
9 http://miningdrugs.blogspot.com/
9 http://cultureofchemistry.blogspot.com/
9 http://chemjobber.blogspot.com/
9 http://www.sciencebase.com/science-blog/
9 http://scienceblogs.com/moleculeoftheday/
9 http://www.chemspider.com/
8 http://kinasepro.wordpress.com/
8 http://cenblog.org/
8 http://liquidcarbon.livejournal.com/
8 http://chemistrylabnotebook.blogspot.com/
7 http://www.chemicalforums.com/
7 http://omicsomics.blogspot.com/
7 http://transitionstate.wordpress.com/
7 http://blog.chemicalforums.com/
7 http://homebrewandchemistry.blogspot.com/
6 http://www.fieldofscience.com/
6 http://www.coronene.com/blog/
6 http://blog.everydayscientist.com/
6 http://totallymechanistic.wordpress.com/
6 http://scienceblogs.com/insolence/
6 http://www.bccms.uni-bremen.de/cms/
6 http://www.researchblogging.org/
6 http://www.nature.com/
6 http://outonalims.wordpress.com/
6 http://www.totallysynthetic.com/blog/
6 http://www.tiberlab.com/
6 http://www.rscweb.org/blogs/cw/
5 http://wiki.cubic.uni-koeln.de/cb/
5 http://organometallics.blogspot.com/
5 http://baoilleach.blogspot.com/
5 http://wavefunction.fieldofscience.com/
5 http://atompusher.blogspot.com/
5 http://feedburner.google.com/
5 http://chemical-quantum-images.blogspot.com/
5 http://propterdoc.blogspot.com/
5 http://allthingsmetathesis.com/
5 http://cb.openmolecules.net/
5 http://the-half-decent-pharmaceutical-chemistry-blog.chemblogs.org/
5 http://www.ch.ic.ac.uk/rzepa/blog/
5 http://usefulchem.wikispaces.com/
4 http://www.rsc.org/chemistryworld/
4 http://justlikecooking.blogspot.com/
4 http://synthreferee.wordpress.com/
4 http://mndoci.com/blog/
4 http://scienceblogs.com/pharyngula/
4 http://jmgs.wordpress.com/
4 http://pubchem.ncbi.nlm.nih.gov/
4 http://walkerma.wordpress.com/
4 http://gmc2007.blogspot.com/
4 http://purl.org/dc/elements/1.1/
4 http://coronene.blogspot.com/
4 http://naturalproductman.wordpress.com/
4 http://blog.rguha.net/
4 http://syntheticnature.wordpress.com/
4 http://www.organic-chemistry.org/
4 http://www.emolecules.com/
4 http://www.livejournal.com/
4 http://carbontet.blogspot.com/
4 http://therealmoforganicsynthesis.blogspot.com/
4 http://syntheticenvironment.blogspot.com/
4 http://totallymedicinal.wordpress.com/
4 http://comporgchem.com/blog/
4 http://www.jungfreudlich.de/
4 http://graphiteworks.wordpress.com/
4 http://chemicalcrystallinity.blogspot.com/
4 http://www.ebi.ac.uk/
4 http://scienceblogs.com/principles/
4 http://sanjayat.wordpress.com/
4 http://greenchemtech.blogspot.com/
4 http://www.feedburner.com/
4 https://www.ebi.ac.uk/chembl/
3 http://blogs.nature.com/
3 http://chiraljones.wordpress.com/
3 http://pubs.acs.org/journals/joceah/
3 http://cen07.wordpress.com/
3 http://www.natureasia.com/
3 http://masterorganicchemistry.com/
3 http://chem-eng.blogspot.com/
3 http://pubs.acs.org/journals/orlef7/
3 http://molecularmodelingbasics.blogspot.com/
3 http://chemistswithoutborders.blogspot.com/
3 http://www.typepad.com/
3 http://wordpress.org/extend/ideas/
3 http://www.sciencebase.com/
3 http://codex.wordpress.org/
3 http://chemicalmusings.blogspot.com/
3 http://chemicalblogspace.blogspot.com/
3 http://wordpress.org/extend/themes/
3 http://drexel-coas-elearning.blogspot.com/
3 http://cenblog.org/terra-sigillata/
3 http://planet.wordpress.org/
3 http://l-stat.livejournal.com/
3 http://wordpress.org/extend/plugins/
3 http://jmol.sourceforge.net/
3 http://chembl.blogspot.com/
3 http://theme.wordpress.com/themes/regulus/
3 http://scientopia.org/blogs/ethicsandscience/
3 http://www.google.com/
3 http://youngfemalescientist.blogspot.com/
3 http://www.badscience.net/
3 http://paulingblog.wordpress.com/
3 http://l-api.livejournal.com/
3 http://waterinbiology.blogspot.com/
3 http://gmpg.org/xfn/
3 http://totallysynthetic.com/blog/
3 http://chemicalmusings.wordpress.com/
3 http://liberalchemistry.blogspot.com/
3 http://www.hdreioplus.de/wordpress/
3 http://pubs.acs.org/
3 http://wordpress.org/news/
3 http://verpa.wordpress.com/
3 http://www.opentox.org/
3 http://pubs.acs.org/journals/jacsat/
3 http://www.chemical-chimera.blogspot.com/
3 http://www.kilomentor.com/
3 http://invivoblog.blogspot.com/
3 http://blog.khymos.org/
3 http://www.livejournal.com/search/
3 http://chemicalsabbatical.blogspot.com/
3 http://wordpress.org/support/
3 http://scienceblogs.com/clock/
3 http://profmaster.blogspot.com/
3 http://www.orgsyn.org/
3 http://www.surechembl.org/
3 http://www.ebi.ac.uk/chebi/
3 http://scienceblogs.com/aetiology/
3 http://www.mazepath.com/uncleal/
3 http://syntheticremarks.com/
3 http://www.nature.com/nchem/
2 http://www.syntheticpages.org/
2 http://www.eyeonfda.com/
2 http://genchemist.wordpress.com/
2 http://cenblog.org/newscripts/2013/12/amusing-news-aliquots-128/
2 http://blogs.scientificamerican.com/the-curious-wavefunction/2013/10/09/computational-chemistry-wins-2013-nobel-prize-in-chemistry/
2 http://cenblog.org/just-another-electron-pusher/
2 http://www.uu.se/
2 http://bugs.bioclipse.net/
2 http://www.ch.cam.ac.uk/magnus/
2 http://archive.tenderbutton.com/
2 http://cenblog.org/the-safety-zone/2013/12/csb-report-on-chevron-refinery-fire-urges-new-regulatory-approach/
2 http://pubs.acs.org/cen/
2 http://interfacialscience.blogspot.com/
2 http://www.blogtopsites.com/
2 http://www.amazingcounters.com/
2 http://www.chemheritage.org/
2 http://drexelisland.wikispaces.com/
2 http://intermolecular.wordpress.com/
2 http://www.aldaily.com/
2 http://www.plos.org/
2 http://kashthealien.wordpress.com/
2 http://web.expasy.org/groups/swissprot/
2 http://theme.wordpress.com/themes/enterprise/
2 http://cenboston.wordpress.com/
2 http://infiniflux.blogspot.com/
2 http://laserjock.wordpress.com/
2 http://bkchem.zirael.org/
2 http://www.qdinformation.com/qdisblog/
2 http://syntheticorganic.blogspot.com/
2 http://www.sciencebasedmedicine.org/
2 http://scienceblogs.com/pontiff/
2 http://www.chemistryguide.org/
2 http://pubs.acs.org/journals/jmcmar/
2 http://blog.openwetware.org/scienceintheopen/
2 http://cenblog.org/transition-states/
2 http://theme.wordpress.com/themes/contempt/
2 http://www.fiercebiotech.com/
2 http://www.paulbracher.com/blog/
2 http://scienceblogs.com/ethicsandscience/
2 http://tripod.nih.gov/
2 http://sciencegeist.net/
2 http://spectroscope.blogspot.com/
2 http://kilomentor.chemicalblogs.com/
2 http://pharmagossip.blogspot.com/
2 http://chembioinfo.com/
2 http://philipball.blogspot.com/
2 http://browsehappy.com/
2 http://www.realclimate.org/
2 http://chem.vander-lingen.nl/
2 http://cenblog.org/the-safety-zone/2013/12/lab-safety-is-critical-in-high-school-too/
2 http://joaquinbarroso.com/
2 http://eristocracy.co.uk/brsm/
2 http://daneelariantho.wordpress.com/
2 http://blogs.discovermagazine.com/cosmicvariance/
2 http://johnirwin.docking.org/
2 http://www.scienceblog.com/cms/
2 http://www.chemistry-blog.com/2013/12/11/number-11-hydrogen/
2 http://thebioenergyblog.blogspot.com/
2 http://www.eclipse.org/
2 http://www.ebyte.it/stan/
2 http://scienceblogs.com/
2 http://boscoh.com/
2 http://www.nature.com/nchem/journal/v6/n1/
2 http://cdavies.wordpress.com/
2 http://chemistandcook.blogspot.com/
2 http://www.rheothing.blogspot.com/
2 http://openbabel.org/
2 http://www.metabolomics2012.org/
2 http://cenblog.org/grand-central/
2 http://www.scilogs.es/
2 http://openflask.blogspot.com/
2 http://d3js.org/
2 http://stuartcantrill.com/
2 http://blogs.discovermagazine.com/gnxp/
2 http://www.bioclipse.net/
2 http://rajcalab.wordpress.com/
2 http://bacspublish.blogspot.com/
2 https://cszamudio.wordpress.com/
2 http://scienceblogs.com/sciencewoman/
2 http://theorganicsolution.wordpress.com/2013/12/12/my-top-10-chemistry-papers-of-2013/
2 http://chem242.wikispaces.com/
2 http://icpmassspectrometry.blogspot.com/
2 http://theme.wordpress.com/themes/ocean-mist/
2 http://theme.wordpress.com/themes/andreas09/
2 http://impactstory.org/
2 http://blog.metamolecular.com/
2 http://lamsonproject.org/
2 http://agilemolecule.wordpress.com/
2 http://proteinsandwavefunctions.blogspot.com/
2 http://cenblog.org/newscripts/2013/12/heirloom-chemistry-set/
2 http://beautifulphotochemistry.wordpress.com/
2 http://www.etracker.com/
2 http://chem241.wikispaces.com/
2 http://practicalfragments.blogspot.com/
2 http://www.nobelprize.org/nobel_prizes/chemistry/laureates/2013/
2 http://researchblogging.org/
2 http://retractionwatch.wordpress.com/
2 http://u-of-o-nmr-facility.blogspot.com/
2 http://www3.interscience.wiley.com/cgi-bin/jhome/26293/
2 http://scienceblogs.com/goodmath/
2 http://creativecommons.org/licenses/by-nc-sa/3.0/
2 http://cenblog.org/the-haystack/
2 http://www.simbiosys.com/
2 http://www.steinbeck-molecular.de/steinblog/
2 http://chemistry.about.com/
2 http://cen.acs.org/
2 http://cniehaus.livejournal.com/
2 http://chem.chem.rochester.edu/~nvd/
2 http://www.chemtube3d.com/
2 http://news.google.com/
2 http://theme.wordpress.com/themes/digg3/
2 http://theme.wordpress.com/themes/mistylook/
2 http://www.wordpress.org/
2 http://luysii.wordpress.com/
2 http://disqus.com/
2 http://openbabel.sourceforge.net/
2 http://networkedblogs.com/
2 http://www.chemspy.com/
2 http://www.openphacts.org/
2 http://cic-fachgruppe.blogspot.com/
2 http://weconsent.us/
2 http://cdktaverna.wordpress.com/
2 http://gilleain.blogspot.com/
2 http://scienceblogs.com/scientificactivist/
2 http://synchemist.blogspot.com/
2 http://www.compchemhighlights.org/
2 http://acdlabs.typepad.com/elucidation/
2 http://cdk.sf.net/
2 http://www.phds.org/
2 http://zinc.docking.org/
2 http://www.ch.imperial.ac.uk/rzepa/blog/
2 http://www.cas.org/
2 http://brsmblog.com/
2 http://www.sciencetext.com/
2 http://altchemcareers.wordpress.com/
2 http://theeccentricchemist.blogspot.com/
2 http://www.sciscoop.com/
2 http://www.agile2robust.com/
2 http://mwclarkson.blogspot.com/
2 http://www.jcheminf.com/
2 http://www.tns-counter.ru/V13a****sup_ru/ru/UTF-8/tmsec=lj_noncyr/
2 http://www.slideshare.net/
2 http://scienceblogs.com/eruptions/
2 http://www.scilogs.com/
2 http://scienceblogs.com/seejanecompute/
2 http://blog.tenderbutton.com/
2 http://www.amazingcounter.com/
2 http://creativecommons.org/licenses/by/3.0/
2 http://scienceblogs.com/greengabbro/
2 http://madskills.com/public/xml/rss/module/trackback/
2 http://edheckarts.wordpress.com/
2 http://www.scilogs.fr/
2 http://feed.informer.com/
2 http://scienceblogs.com/chaoticutopia/
2 http://www.chemaxon.com/
2 http://www.rsc.org/mpemba-competition/
2 http://neksa.blogspot.com/
2 http://orgchem.livejournal.com/
2 http://scienceblogs.com/transcript/
2 http://drexel-coas-elearning-transcripts.blogspot.com/
2 http://pmgb.wordpress.com/
2 http://www.acs.org/
2 http://www.scienceblogs.com/
2 http://www.milomuses.com/chemicalmusings/
2 http://www.the-scientist.com/
2 http://calvinus.wordpress.com/
2 http://www.pandasthumb.org/


import re
import os
import urllib
from collections import defaultdict

from bs4 import BeautifulSoup

def seedFromCB():
    if not os.path.isdir("CB"):
        os.mkdir("CB")
        url = "http://cb.openmolecules.net/blogs.php?skip=%d"
        N = 0
        while True:
            page = urllib.urlopen(url % N)
            # Need to remove apostrophes from tag URLs or BeautifulSoup will choke
            html = page.read().replace("you'll", "youll").replace("what's", "whats")
            if html.find("0 total") >= 0: break
            print >> open(os.path.join("CB", "CB_%d.html" % N), "w"), html
            N += 10
    N = 0
    allurls = []
    while True:
        filename = os.path.join("CB", "CB_%d.html" % N)
        if not os.path.isfile(filename): break

        soup = BeautifulSoup(open(filename).read())
        divs = soup.find_all("div", class_="blogbox_byline")
        urls = []
        for div in divs:
            children = div.find_all("a")
            anchor = children[2]
            urls.append(anchor['href'])

        print filename, len(urls)
        allurls.extend(urls)
        N += 10

    notfound = ["http://imagingchemistry.com", "http://www.caspersteinmann.dk"]
    allurls = [url for url in allurls if url not in notfound]
    return allurls

def url2file(url):
    for x in "/:?":
        url = url.replace(x, "_")
    return url

def norm(url):
    for x in [".org", ".com", ".es", ".fr", ".net"]:
        if url.endswith(x):
            return url + "/"
    for txt in ["index.html", "index.php", "index.htm", "blog.html"]:
        if url.endswith(txt):
            return url[:-len(txt)]
    if url == "http://www.corante.com/pipeline/":
        return "http://pipeline.corante.com/"
    return url

def getAllLinks(urls):
    if not os.path.isdir("Blogs"):
        os.mkdir("Blogs")
    for url in urls:
        filename = os.path.join("Blogs", url2file(url))
        print filename
        if not os.path.isfile(filename):
            html = urllib.urlopen(url).read()
            print >> open(filename, "w"), html

    countlinks = defaultdict(int)
    for url in urls:
        filename = os.path.join("Blogs", url2file(url))
        links = re.findall('"((http|ftp)s?://.*?)"', open(filename).read())
        links = set([x[0] for x in links if x[1]=='http'])
        for link in links:
            countlinks[norm(link)] += 1
    return countlinks

if __name__ == "__main__":
    blogURLs = seedFromCB()

    countlinks = getAllLinks(blogURLs)

    tmp = countlinks.items()
    tmp.sort(key=lambda x:x[1], reverse=True)
    err = open("err.txt", "w")
    for x, y in tmp:
        if y > 1:
            if x.endswith("/") and not (y==6 and "accelrys" in x) and not (y==3 and "fieldofscience" in x):
                print y, x
            else:
                print >> err, y, x

Sunday, 15 December 2013

QM Speed Test: ERKALE

I've checked in the first results from the speed test: ERKALE, an open-source QM package by Susi Lehtola. You can find full details at https://github.com/baoilleach/qmspeedtest but here's the summary:

HF

QM Package	Time (min)	Steps	per step	Total E	HOMO	LUMO
erkale	810	90	9	-644.67570139	-0.353712	0.074269

B3LYP

QM Package	Time (min)	Steps	per step	Total E	HOMO	LUMO
erkale	933	58	16.1	-686.49566820	-0.260899	-0.064457

ERKALE is a very interesting project as it shows how the existing ecosystem of open source QM and maths libraries can be exploited as part of a modern QM package. But is it fast or slow? Who knows - we'll have to measure some more programs first...

Feel free to play along by forking the project and checking in your own results.

Note to self: The ERKALE version tested was SVN r1013 (2013-10-21). I had to do an extra single point calculation afterwards to find out the energy of the HOMO/LUMO. This time was not included in the assessment (as it shouldn't really have been necessary).

Saturday, 30 November 2013

The shortest route to the longest path

Recently I had to quickly come up with Python code that found the longest path through a weighted DAG (directed acylic graph). This particular DAG had a single start node and a single end node, but there were multiple weighted paths from the start to the end. "Longest path" in this context meant the path with the highest sum of weights.

To begin with, I coded up an exhuastive algorithm for trying all possible paths. This wouldn't be fast enough in the general case, but it didn't take long to write and is useful for comparison.

Internet rumour (contradicted by Wikipedia I now see) had it that the well-known Dijkstra's algorithm for the shortest path could be adapted for longest paths (but only in a DAG) by using the negative of the weights. However, a test of a couple of libraries for doing so indicated that negative weights were not supported (segfault with R for example), and I didn't want to code Dijkstra myself, so I decided to hunt around the internet for a direct solution to the longest path problem.

Geeksforgeeks (appropriately enough I guess) had an lucid description of a 3-step algorithm which I implemented. The only snag was that Step 1 involved a topological sort. Again, I didn't want to spend any time figuring this out so I looked for an implementation in Python and found one at ActiveState's Python recipes. I note that another popular Python implementation is available at PyPI.

Here's the code (if you are using an RSS reader to read this, you may have to click through to the original blog post):