Bioinformatics: Multiple Sequence Alignment and the MAFFT software (Week 4: Laplace and Fourier Transform)

According to bioinfo.org.cn

A sequence alignment is a scheme of writing one sequence on top of another where the residues in one position are deemed to have a common evolutionary origin.

In oversimplified terms, if you have two strings like ASPOTATOLOL and ASDTOMATOLL, you want to align them in such a way that they are most “similar”. In sequence alignment the strings aren’t just random strings, they are representations of biological sequences like DNA.

Analyzing the “similarities” in the DNA could give insights regarding the sequences: perhaps two similar protein sequences share similar functions? I’m not a biologist, so I’m not really sure about the applications of this in that field, however it seems to be a big thing.

What I’m more interested on is the perspective from a Computer Science background. From what I remember in high school biology, there are different ways for genes to mutate. For example, a sequence may be inserted, deleted, repeated, or even changed completely. This adds a lot of complexity to the problem, you don’t just “move” one string against another and compare, you have to consider other mutations.

For example, a “most similar” way of sequencing the two strings above might be:

A S __ P O T A T O L O L

A S D T O M A T O L _ L

The “_” above represents an insertion. In terms of complexity the search space is really large (I don’t know how much exactly), it scales with the length of the strings. If you add even more sequences as in Multiple Sequence Alignment, it would scale even more.

There are different ways of computing how “similar” sequences are, there may be different results depending on how you “score” similarities between them.

Because of the huge time complexity of this problem it isn’t feasible to compute for the most optimal solution (and it’s even hard to define “optimal” in the first place), so an approximation or heuristic may be used.

One software for Multiple Sequence Alignment is MAFFT (Multiple Alignment using Fast Fourier Transform ). The have a paper about their software and it’s performance measures (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC135756/) and their software can be downloaded for free (http://mafft.cbrc.jp/alignment/software/).

As the meaning of the acronym implies, this software uses FFT. I don’t know the exact details, but from what I understand the process is somewhat similar to using FFT for audio compression. In audio compression, FFT can be used to identify “important” parts of a sound file, the “unimportant” parts are usually scraped with minimal loss to quality. Similarly, MAFFT uses FFT to find the “important” parts of the sequence and instead of analyzing the whole sequence the more “important” parts are analyzed since they’re much shorter than the whole string.

References:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC135756/

http://www.bioinfo.org.cn/lectures/index-13.html

https://en.wikipedia.org/wiki/Multiple_sequence_alignment

Advertisements