Friday, 5 May 2017

Using WebLogo3 to create a sequence logo from Python

A sequence logo is a way to display the variation at particular positions of multiple-aligned sequences. Here I used WebLogo 3 to do this. The documentation is not great, even once you find it, and working out how to do my specific use case required a certain amount of looking at the source code of the library. Hence this post for future me.

My sequences are peptides, and use lowercase to indicate D forms of the amino acids. So I needed to create my own 'alphabet' as all of those provided by the library uppercase everything. I also wanted to highlight a reference sequence in a particular colour. This is made easy by the RefSeqColor rule, but I needed to override it, as again it wanted to uppercase everything.
import os
import StringIO
import weblogolib as w

class RefSeqColor(w.ColorRule):
    """
    Color the given reference sequence in its own color, so you can easily see 
    which positions match that sequence and which don't.
    """

    def __init__(self, ref_seq, color, description=None):
        self.ref_seq = ref_seq
        self.color = w.Color.from_string(color)
        self.description = description

    def symbol_color(self, seq_index, symbol, rank):
        if symbol == self.ref_seq[seq_index]:
            return self.color

baserules = [
            w.SymbolColor("GSTYC", "green", "polar"),
            w.SymbolColor("NQ", "purple", "neutral"),
            w.SymbolColor("KRH", "blue", "basic"),
            w.SymbolColor("DE", "red", "acidic"),
            w.SymbolColor("PAWFLIMV", "black", "hydrophobic")
        ]

protein_alphabet = w.Alphabet('ACDEFGHIKLMNOPQRSTUVWYBJZX*-adefghiklmnopqrstuvwybjzx', [])

def plotseqlogo(refseq, mseqs, name):
    fasta = "> \n" + "\n> \n".join(mseqs)
    seqs = w.read_seq_data(StringIO.StringIO(fasta), alphabet=protein_alphabet)

    colorscheme = w.ColorScheme([RefSeqColor(refseq, "orange", "refseq")] + baserules,
                                alphabet = protein_alphabet)

    data = w.LogoData.from_seqs(seqs)
    options = w.LogoOptions()
    # options.logo_title = name
    options.show_fineprint = False
    options.yaxis_label = ""
    options.color_scheme = colorscheme
    mformat = w.LogoFormat(data, options)

    fname = "%s.pdf" % name
    with open(fname, "wb") as f:
        f.write(w.pdf_formatter(data, mformat))

if __name__ == "__main__":
    testdata = ["ACDF", "ACDF", "ACDE", "CCDE"]
    plotseqlogo("ACDF", testdata, "testdata")
Notes: The code above writes the logo out to a PDF file, which I subsequently converted to SVG with Inkscape at the commandline:
"C:\Program Files (x86)\Inkscape\inkscape.com" --without-gui --file=fname.pdf --export-plain-svg=svg/fname.svg

No comments: