## Eigen quasispecies model and isometry groups

Quite some time ago Yura Semenov and I uploaded yet another paper on quasispecies theory (this is a continuation of this research); here is an archive link. The title of the paper is “On Eigen’s quasispecies model, two-valued fitness landscapes, and isometry groups acting on finite metric spaces”, which means that we use some group-theoretic methods to analyze the quasispecies model, whose description I gave in several posts. The paper went through several rounds of review and has still not been accepted. However, one of the reviewers’ requests was to explain in non-technical terms what is done in the paper. We added quite a long explanation, which is not in the arXiv text, so I decided to put this section on my blog (a brief introduction that can help in reading the text can be found in this post).

### A biologist’s guide to the mathematical results and biological implications

The quasispecies model has a unique position in mathematical biology because of its transparent and powerful theoretical predictions at a qualitative level and, at the same time, because of its intrinsic complexity that allows us to use it as a quantitative tool. There are a number of detailed reviews of the biological implications of the quasispecies model; see, e.g., the recent volume Quasispecies: From Theory to Experimental Systems in the series Current Topics in Microbiology and Immunology for many impressive connections with various biological systems. The mathematical theory of quasispecies has become even more important biologically because of next-generation sequencing data, which allow direct assessment of the complete fitness landscape of RNA-like molecules. That is one of the main reasons we feel that the mathematical results we present in this text have much more than purely mathematical value.

To be specific, the vast majority of the old and recent analytical results in quasispecies theory, including the so-called maximum principle, which is probably one of the main tools nowadays for analyzing particular incarnations of the quasispecies model, rely on two simplifying assumptions: first, that the fitness landscape is permutation invariant, and second, that the limiting form of the fitness, which becomes a function of a real number when the genome length tends to infinity, is a continuous function. The latter assumption is convenient analytically but is known to yield erroneous conclusions if not carefully checked. Our approach, which may not look as analytically attractive as the existing ones, is exact: we do not require any approximations, at least at the initial stages of the analysis, and hence it can be used when other approximations fail. We still do not know the exact characteristics of real fitness landscapes. However, the general consensus that, at least in some cases, evolution proceeds in huge leaps implies that the underlying fitness landscapes are essentially discontinuous.

More importantly, we are also able to weaken, at the expense of concentrating on some special landscapes, the assumption that the fitness landscape is permutation invariant.

The assumption that the fitness landscape is permutation invariant may look like a reasonable first approximation if one considers the genome as a “bag of genes,” each of which can be in one of two states (“on” and “off”), but it undeniably breaks down under the classical molecular interpretation of the genome as a sequence of nucleotides (purines and pyrimidines for the binary representation). In this case all the intricate machinery developed for the permutation invariant case is of little use. The biologically very important question “Can we use the intuition we gained working with permutation invariant fitness landscapes to discuss realistic non-permutation invariant landscapes?” cannot be answered within the existing analytical approaches because there are very few examples to compare with. In this text we present an algebraic technique to tackle quasispecies models with a specific general family of permutation non-invariant fitness landscapes (particular examples of the analysis of such models exist in the literature).

To explain non-technically what kind of fitness landscapes we are able to analyze, we start with the notion of a permutation, for which we use the most transparent two-line notation of the form

$\displaystyle \sigma=\begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 6 & 2 & 4 & 1 & 0 & 3 & 5 & 7 \\ \end{pmatrix},$

where the first line lists the elements of the set being permuted and the second line gives the images of these elements under the permutation. For example, if we deal with genomes of length ${N=8}$ composed of zeroes and ones, then for the genome ${g=[1,1,1,1,0,0,0,0]}$ we get
$\displaystyle \sigma(g)=[0,1,1,0,1,0,1,0].$

Note that the ones originally occupied the positions ${0,1,2,3}$ in the genome ${g}$. Now, in ${\sigma(g)}$, they are at the positions ${\sigma(0)=6, \sigma(1)=2, \sigma(2)=4,\sigma(3)=1}$, in accordance with ${\sigma}$. Our permutations can act on all possible genomes of the given length.
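The action just described can be sketched in a few lines of Python. This is only an illustration: the function name `apply_permutation` and the representation of ${\sigma}$ as a list (the second row of the two-line notation) are my choices here, not notation from the paper.

```python
def apply_permutation(sigma, genome):
    """Move the entry at position p of `genome` to position sigma[p]."""
    image = [0] * len(genome)
    for position, value in enumerate(genome):
        image[sigma[position]] = value
    return image


# Second row of the two-line notation for sigma given above.
sigma = [6, 2, 4, 1, 0, 3, 5, 7]
g = [1, 1, 1, 1, 0, 0, 0, 0]

print(apply_permutation(sigma, g))  # [0, 1, 1, 0, 1, 0, 1, 0]
```

The ones sitting at positions 0, 1, 2, 3 land at positions 6, 2, 4, 1, which reproduces ${\sigma(g)}$ from the text.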
The important case for us is when the permutations we consider form a group. A set of permutations ${G=\{\sigma_1,\ldots,\sigma_k\}}$ forms a group if for any ${\sigma_i,\sigma_j\in G}$ the composition ${\sigma_i\circ\sigma_j}$ (consecutive application: the permutation ${\sigma_j}$ is applied first, ${\sigma_i}$ next) also belongs to ${G}$, the identity permutation ${\sigma_{\textrm{Id}}}$ (this permutation leaves all the elements in their places) is in ${G}$, and for any ${\sigma_i\in G}$ there is an inverse permutation ${\sigma_j\in G}$ such that ${\sigma_i\circ\sigma_j=\sigma_{\textrm{Id}}}$.

Here are several examples.

Example 1 Let ${G=\{\sigma_{\textrm{Id}}\}}$. Obviously this is a (trivial) example of a group.
Example 2 All possible permutations (there are ${N!}$) on a set of ${N}$ elements form, by definition, the symmetric group ${S_N}$.
Example 3 Consider the following two permutations
$\displaystyle \sigma_i=\begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 2 & 3 & 1 & 0 & 6 & 7 & 5 & 4 \\ \end{pmatrix},\quad \sigma_j=\begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 4 & 5 & 7 & 6 & 1 & 0 & 2 & 3 \\ \end{pmatrix}.$

Form the permutations
$\displaystyle \sigma_{-1}=\sigma_i\circ\sigma_i,\quad \sigma_k=\sigma_i\circ\sigma_j,\quad \sigma_{-i}=\sigma_i\circ\sigma_{-1},\quad \sigma_{-j}=\sigma_j\circ\sigma_{-1},\quad \sigma_{-k}=\sigma_k\circ\sigma_{-1}.$

The set of 8 permutations
$\displaystyle Q=\{\sigma_{\textrm{Id}},\sigma_{-1},\sigma_{\pm i},\sigma_{\pm j},\sigma_{\pm k}\}$

is indeed a group (we leave checking this fact as a simple but tedious exercise).
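For readers who prefer to let a computer do the tedious part, here is a minimal sketch of the check for Example 3. The helper `compose` and the variable names are hypothetical; permutations are stored as tuples holding the second row of the two-line notation, and `compose(a, b)` follows the convention above (apply `b` first, then `a`).

```python
from itertools import product


def compose(a, b):
    """Return the composition a∘b: apply b first, then a."""
    return tuple(a[b[x]] for x in range(len(a)))


identity = tuple(range(8))
s_i = (2, 3, 1, 0, 6, 7, 5, 4)
s_j = (4, 5, 7, 6, 1, 0, 2, 3)

# The remaining permutations, built exactly as in the text.
s_m1 = compose(s_i, s_i)
s_k = compose(s_i, s_j)
s_mi = compose(s_i, s_m1)
s_mj = compose(s_j, s_m1)
s_mk = compose(s_k, s_m1)

Q = {identity, s_m1, s_i, s_mi, s_j, s_mj, s_k, s_mk}

# Closure: every composition of two elements stays inside Q.
closed = all(compose(a, b) in Q for a, b in product(Q, repeat=2))
# Inverses: every element has a partner that composes to the identity.
has_inverses = all(any(compose(a, b) == identity for b in Q) for a in Q)

print(len(Q), closed, has_inverses)  # 8 True True
```

The identity is in the set by construction, so closure and inverses are the only two properties left to test.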
Now, having at our disposal the notion of a permutation group, we can define the fitness landscapes for which our theory works. If ${g}$ is some particular genome, then we define the orbit of ${g}$ under the action of a group ${G}$ as the set of all genomes obtained from ${g}$ by applying all the permutations from ${G}$. For example, in the case of Example 1 the orbit is the genome ${g}$ itself, while for Example 2 the orbit of ${g}$ consists of all possible rearrangements of the zeroes and ones that ${g}$ is composed of. For Example 3, if we start with ${g=[1,1,1,0,0,0,0,0]}$ then the orbit will consist of eight genomes (again, the details are left to the reader):

${[1, 1, 1, 0, 0, 0, 0, 0]}$,
${[1, 1, 0, 1, 0, 0, 0, 0]}$,
${[1, 0, 1, 1, 0, 0, 0, 0]}$,
${[0, 1, 1, 1, 0, 0, 0, 0]}$,
${[0, 0, 0, 0, 1, 1, 1, 0]}$,
${[0, 0, 0, 0, 1, 1, 0, 1]}$,
${[0, 0, 0, 0, 1, 0, 1, 1]}$,
${[0, 0, 0, 0, 0, 1, 1, 1]}$.
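The orbit listed above can be reproduced with a short sketch. It assumes the same action as before (the entry at position ${p}$ moves to position ${\sigma(p)}$) and uses the fact that, for a finite group, repeatedly applying a set of generators reaches the whole orbit; the function names are mine.

```python
def apply_permutation(sigma, genome):
    """Move the entry at position p of `genome` to position sigma[p]."""
    image = [0] * len(genome)
    for position, value in enumerate(genome):
        image[sigma[position]] = value
    return tuple(image)


def orbit(generators, genome):
    """Collect every genome reachable from `genome` via the generators."""
    seen = {tuple(genome)}
    frontier = [tuple(genome)]
    while frontier:
        current = frontier.pop()
        for sigma in generators:
            image = apply_permutation(sigma, current)
            if image not in seen:
                seen.add(image)
                frontier.append(image)
    return seen


# sigma_i and sigma_j from Example 3 generate the whole group Q.
s_i = (2, 3, 1, 0, 6, 7, 5, 4)
s_j = (4, 5, 7, 6, 1, 0, 2, 3)

orbit_q = orbit([s_i, s_j], [1, 1, 1, 0, 0, 0, 0, 0])
for genome in sorted(orbit_q, reverse=True):
    print(list(genome))
print(len(orbit_q))  # 8
```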

Definition 1 We call a fitness landscape two-valued if for a given permutation group ${G}$ and fixed genome ${g}$ all the genomes in the orbit of ${g}$ under the action of ${G}$ have the fitness ${w+s}$ and all other possible genomes have fitness ${w}$, where ${w\geq 0,s>0}$.
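To make Definition 1 concrete, here is a small sketch that builds a two-valued landscape on all ${2^8}$ binary genomes, using the orbit from Example 3 typed in by hand. The function name and the numerical values ${w=1}$, ${s=2}$ are illustrative assumptions only.

```python
from itertools import product


def two_valued_landscape(orbit, n, w, s):
    """Fitness w + s on the orbit, w on every other binary genome of length n."""
    orbit = {tuple(genome) for genome in orbit}
    return {
        genome: (w + s if genome in orbit else w)
        for genome in product((0, 1), repeat=n)
    }


# The eight genomes of the orbit from Example 3.
orbit_q = [
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 1],
]

# w = 1 and s = 2 are arbitrary illustrative values.
landscape = two_valued_landscape(orbit_q, n=8, w=1.0, s=2.0)
fit_genomes = sum(1 for fitness in landscape.values() if fitness == 3.0)
print(len(landscape), fit_genomes)  # 256 8
```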

It turns out that this mathematical definition yields quasispecies models for which we can make definite progress in the analysis. We remark that we choose the fitness landscape to be two-valued to simplify the computational side of the theory; in fact, the situation in which each orbit has its own fitness value can also be considered (see below).

Note that the single-peaked permutation invariant fitness landscapes are a particular case of the fitness landscapes defined in terms of the action of permutation groups (Example 2). In addition to these landscapes, our definition also includes more general ones. For example, the orbit in Example 3 is not permutation invariant (it does not include, e.g., the genome ${[0,0,1,1,1,0,0,0]}$). Our first result in the present text (prepared and proved in Sections 2 and 3) shows that for such fitness landscapes the analysis of the population of sequences of length ${N}$ boils down to the analysis of the roots of a polynomial of degree at most ${N}$. We also present the explicit form of this polynomial (see a number of examples in Section 4). Theoretically, this is a significant step forward because for an arbitrary fitness landscape the degree of this polynomial is generally ${2^N}$.

Here we also remark that general permutation invariant fitness landscapes, even those that have only two fitness values, are not covered by the presented theory. For example, the so-called mesa landscapes, which have been analyzed in the literature, are permutation invariant and defined in terms of the number of mismatches: the sequences that have fewer than ${k_0}$ mismatches have the fitness ${w+s}$ and all the other sequences have fitness ${w}$. If ${k_0>1}$ then these fitness landscapes are defined on several orbits of permutation groups, and formally are not covered by the presented theory. We announce here that the approach we present in this text is a first step in the analysis of more general models. To wit, if we assume that the whole space ${X}$ of sequences can be decomposed as

$\displaystyle X=A_0\sqcup \left(\bigsqcup_{i=1}^k A_i\right),$

where ${A_j}$ are some orbits under the action of a given group, and the fitnesses are defined as ${w(A_0)=w,\,w(A_i)=w+s_i,\,i=1,\ldots,k}$ (in particular, the mesa landscapes are of this form), then a similar approach works (under preparation).
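As an illustration of this decomposition, consider the full symmetric group ${S_N}$: its orbits on binary genomes are the classes with a fixed number of ones. The sketch below builds a mesa landscape from these orbits, under the assumption (mine, for the sake of the example) that mismatches are counted against the all-zero genome; the function names and the values ${N=4}$, ${k_0=2}$, ${w=1}$, ${s=0.5}$ are hypothetical.

```python
from itertools import product


def hamming_weight_orbits(n):
    """Orbits of S_N on binary genomes of length n, keyed by the number of ones."""
    orbits = {k: [] for k in range(n + 1)}
    for genome in product((0, 1), repeat=n):
        orbits[sum(genome)].append(genome)
    return orbits


def multi_orbit_landscape(n, w, advantages):
    """Fitness w + s_k on the orbit with k ones if k is listed, w elsewhere (A_0)."""
    fitness = {}
    for k, genomes in hamming_weight_orbits(n).items():
        for genome in genomes:
            fitness[genome] = w + advantages.get(k, 0.0)
    return fitness


# Mesa landscape: fewer than k0 mismatches with the all-zero genome.
n, k0, w, s = 4, 2, 1.0, 0.5
mesa = multi_orbit_landscape(n, w, {k: s for k in range(k0)})

elevated = [genome for genome, f in mesa.items() if f > w]
print(len(elevated))  # 5: one genome with no ones plus four with a single one
```

With ${k_0=2}$ the elevated set is the union of two orbits, which is why such a landscape falls outside Definition 1 but fits the decomposition above.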

How biologically relevant are the fitness landscapes we analyze? It is probably naive to expect that specific examples of such highly symmetric fitness landscapes will be found in nature. However, we hope that these examples will help to “ask the right questions” (“There has been a considerable amount of study of systems where the community matrix has diagonal symmetry or antisymmetry or has other rather special properties, where general results can be given about the eigenvalues and hence the stability of the steady states. This has had very limited practical value since models of real situations do not have such simple properties. The stochastic element in assessing parameters mitigates against even approximations by such models. However, just as the classical Lotka–Volterra system is not relevant to the real world, these special models have often made people ask the right questions. Even so, a preoccupation with such models or their generalizations must be avoided if the basic aim is to understand the real world”, J.D. Murray, Mathematical Biology). As we also mentioned above, these fitness landscapes are a first step in the analysis of more general problems; they were chosen because they are transparent and simple enough to see how the algebraic methods work.

Additionally, along with introducing the first general family of non-permutation invariant fitness landscapes that can be analyzed analytically, our examples also highlight why the strategy in the existing literature of focusing on permutation invariant fitness landscapes was so successful. Our answer: exactly because such landscapes are built on orbits of some groups acting on the underlying geometry of the sequence space, which is an ${N}$-dimensional hypercube, where the vertices (the genomes) are connected if they differ at exactly one position.
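The geometric statement can be checked directly on a small hypercube. The following sketch (with ${N=4}$ and an arbitrarily chosen permutation, both assumptions made for illustration) builds the edges of the hypercube and verifies that permuting positions preserves the Hamming distance, i.e., that such a permutation acts as an isometry of the sequence space.

```python
from itertools import product


def hamming_distance(a, b):
    """Number of positions at which two genomes differ."""
    return sum(x != y for x, y in zip(a, b))


def apply_permutation(sigma, genome):
    """Move the entry at position p of `genome` to position sigma[p]."""
    image = [0] * len(genome)
    for position, value in enumerate(genome):
        image[sigma[position]] = value
    return tuple(image)


n = 4
sigma = (2, 0, 3, 1)  # an arbitrary permutation of the four positions
genomes = list(product((0, 1), repeat=n))

# Edges of the hypercube: pairs of genomes differing at exactly one position.
edges = [(a, b) for a in genomes for b in genomes
         if a < b and hamming_distance(a, b) == 1]

# Permuting positions never changes the distance between two genomes.
is_isometry = all(
    hamming_distance(a, b)
    == hamming_distance(apply_permutation(sigma, a), apply_permutation(sigma, b))
    for a in genomes for b in genomes
)

print(len(edges), is_isometry)  # 32 True
```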

Since the very beginning of the mathematical analysis of the quasispecies model, most analytical and numerical results have been obtained by simplifying the structure of the fitness landscapes alone. For example, assuming that the landscape is additive leads to a full analytical solution; assuming a single-peaked landscape yields very impressive results in numerical experiments; taking a quadratic function as a continuous approximation when the genome length tends to infinity allows an analytical treatment of epistatic effects; and so on. To our knowledge, no attempt has been made to simplify or replace the structure of the mutational landscape, which describes how one individual in the population can mutate into another. Having at our disposal the methods to analyze two-valued fitness landscapes for the classical quasispecies model, we are in a position to formulate mutational landscapes and corresponding fitness landscapes such that the resulting mathematical problem is amenable to analysis. This is the second main result of the present manuscript, where, biologically speaking, we consider various mutational landscapes whose geometry is defined in terms of some, now abstract, group. The details and definitions are given in Section 6.

To give just one example (analyzed in detail in Section 6.3), consider a tetrahedron (simplex), where all the individuals of a given population can mutate into each other (we use the term “mutation” here in a general sense, meaning “change of state”). Such a mutational landscape can describe, e.g., switching of antigenic variants in some bacteria. The results that we obtained for the classical quasispecies model allow us to analyze such a generalized quasispecies model immediately; in this case all the properties of the model are inferred from the solution of a quadratic equation (independently of the number of different individuals in the population), see Section 6.3.

Finally, concluding this non-technical section, we would like to note that the notion and appearance of the so-called error threshold, i.e., the phenomenon of delocalization of the quasispecies over the whole sequence space, can also be studied in the language of group theory. Since this particular analysis lies to the side of the main findings of the present text and is more technically involved, we decided to postpone it to the Appendix (Section 5 in the arXiv paper). However, we believe that this analysis, which helps to characterize the error threshold in terms of sequences of orbits of some groups, highlights the geometric nature of the error threshold.