How not to analyze data

About a month ago the biologist Alexander Markov, author of the book “Inception of complexity” (my translation of the Russian title “Рождение сложности”) and a well-known popularizer of modern theories of evolutionary biology, listed in his LiveJournal several papers that he published in various scientific journals in 2010. One of them attracted my attention, and in this post I’d like to say a few words about it: “Relationship Between Genome Size and Organismal Complexity in the Lineage Leading from Prokaryotes to Mammals” by A.V. Markov, V.A. Anisimov, A.V. Korotaev (Journal of Paleontology, 2010, 4:3–14).

The paper deals with the growth of genome complexity on Earth during the evolutionary process from the first prokaryotes to the present day. It is clear from the very beginning that there are numerous issues with the notion of complexity, as well as with the evolutionary historical data. However, this is not what I am going to discuss. In a nutshell, Markov et al. take 9 points that allegedly correspond to the real evolutionary dynamics of the minimal genome size, which I denote in this post by $MGS$ (this quantity is taken as the measure of genome complexity), and use these 9 points to speculate on the possible modes of evolution, concluding that the evolution of complexity is a self-accelerating process with a positive nonlinear feedback that produces hyper-exponential growth of complexity. Their data look like this (the data are the blue points in the figure):

I don’t discuss the validity of these points; it is obvious that we don’t have (and most probably won’t have) reliable estimates of the times and the minimal genome sizes. The orange line is the exponential fit to the data:

$\displaystyle MGS_1=A\exp\{-B(4000-t)\},$

where I found that $A=999.224,\,B=0.0019$ (which, by the way, is slightly different from Markov’s numbers, even allowing for rounding; this is very strange, because in this case any minimization procedure finds the global minimum: after taking logarithms it is just a linear system); the red line is the generalized hyperbolic function

$\displaystyle MGS_2=\frac{A}{(B+(4000-t))^C}\,,$

where for the data $A=1.57\times 10^{12},\,B=318.4,\,C=3.35$; the red line is the double exponential curve

$\displaystyle MGS_3=A\exp\{B\exp(-C(4000-t))\},$

where $A=1.066,\,B=8.41,\,C=0.000797$ (I presume there is a typo in the text of Markov et al. for the first number). The total residuals (square roots of the sums of squared residuals)

$\displaystyle R_k=\sqrt{\sum_i\bigl(\mathrm{data}_i-MGS_k(t_i)\bigr)^2}$

are

$R_1=1813,\,R_2=1539,\,R_3=1133.$
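The remark that the exponential fit reduces to a linear system can be sketched in a few lines. Taking logarithms of $MGS_1$ gives $\ln MGS_1=\ln A-B(4000-t)$, which is a straight line in $4000-t$, so ordinary least squares has a unique closed-form solution. The sketch below is only an illustration: the nine time points are hypothetical and the "data" are synthetic values generated from the exponential curve itself, not the points used in the paper. Note also that fitting in log space minimizes the residuals of $\ln MGS$, which is not the same as minimizing $R_k$ in the original units.

```python
import math

def fit_exponential(ts, ys, t_end=4000.0):
    """Fit y = A*exp(-B*(t_end - t)) by least squares on ln(y) (closed form)."""
    xs = [t_end - t for t in ts]
    ls = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_l = sum(ls) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxl = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, ls))
    slope = sxl / sxx                    # slope of ln(y) against (t_end - t) is -B
    intercept = mean_l - slope * mean_x  # intercept is ln(A)
    return math.exp(intercept), -slope

def residual_norm(ts, ys, model):
    """R_k as defined in the text: root of the sum of squared residuals."""
    return math.sqrt(sum((y - model(t)) ** 2 for t, y in zip(ts, ys)))

# Hypothetical time grid and synthetic data (NOT the points from the paper).
ts = [500.0 * i for i in range(9)]
ys = [999.224 * math.exp(-0.0019 * (4000.0 - t)) for t in ts]

A, B = fit_exponential(ts, ys)
R = residual_norm(ts, ys, lambda t: A * math.exp(-B * (4000.0 - t)))
print(A, B, R)
```

Because the problem is linear in $\ln A$ and $B$, two different programs applied to the same points in the same way should return the same coefficients up to rounding.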

Just a side remark: I was not able to find any of these numbers in the paper of Markov et al. After similar calculations and after inspecting the plot it was conjectured that the minimal genome size grows not exponentially but hyper-exponentially (the hyperbolic curve is very close to the hyper-exponential one, because it is just the first term in the Taylor series of the hyper-exponential function). Therefore, the genome complexity is also increasing hyper-exponentially, and hence the growth is self-accelerating (please read the original text for details). There are a lot of flaws in this approach, but I am going to comment on only two of them. I would like to mention, however, that the analysis in Markov’s paper follows the steps of the paper by A. Sharov, “Genome increase as a clock for the origin and evolution of life”, Biology Direct, 17, 2006.

Biology Direct is an unusual journal in that the reviewing process is open: you can find the reviews and the author’s answers at the end of the text, just before the References section. Please read the comments on Sharov’s paper along with the paper itself. They perfectly explain why you cannot use the discussed approach to test different hypotheses on the genome size increase, and I don’t want to reproduce them here. Actually, the title of this post is taken from one of the reviewers’ comments.

Here are my own five cents of critique. Why only these functions? Why was nothing else tested? Why is there no comparison between the models that takes into account that one model has two free parameters and the other two have three? And here is the major point: you are looking at the picture on a logarithmic scale, which is extremely counter-intuitive. Let us look at the same picture in the usual coordinates:

Do you see why the exponential fit is not that good? It is all about the first point! And, I have to say, the most unreliable one. Remove it, and the exponential fit becomes almost comparable with the hyper-exponential one. Can you see something else? If you have ever dealt with the logistic curve, it is quite obvious that this curve, given by

$\displaystyle MGS_4=\frac{A}{B+\exp\{C(4000-t)\}}\,,$

gives a much better fit. Indeed, having estimated $A=30034,\,B=11.7,\,C=0.0086$, I found $R_4=265$ (cf. the values above!). The mystery is solved. Not exponential, not hyper-exponential, not anything else: this is the logistic curve. I am pretty sure that Raymond Pearl would be happy. Let us look at the figure, where the logistic fit is shown in magenta.

What is the conclusion? The conclusion is extremely simple: you cannot infer from these 9 points what law governs the dynamics of the minimal genome size and, hence, of genome complexity. Or, in more exact terms, the null model of exponential growth (what we would see if genome complexity increased by means of a stochastic neutral process; please read Michael Lynch, who has already said a lot on this) cannot be rejected in favor of other, more sophisticated, hypotheses. The data show what they show: the genome size is indeed growing, and nothing more; and this conclusion is hardly surprising.
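One common way to compare least-squares fits with different numbers of free parameters is the Akaike information criterion with the small-sample correction (AICc). This is my own illustration and is not used in either of the discussed papers. The sketch assumes independent Gaussian errors, takes the residual sum of squares as $R_k^2$ with the values of $R_k$ quoted above, and counts only the curve parameters in `k` (counting the noise variance as well is another common convention and shifts the numbers).

```python
import math

def aicc(rss, n, k):
    """Small-sample corrected AIC for a least-squares fit with Gaussian errors.

    rss: residual sum of squares, n: number of data points,
    k: number of fitted curve parameters (noise variance not counted here).
    """
    aic = n * math.log(rss / n) + 2 * k
    return aic + 2 * k * (k + 1) / (n - k - 1)

N_POINTS = 9

# Model name -> (R_k as quoted in the text, number of free parameters).
models = {
    "exponential": (1813.0, 2),
    "hyperbolic": (1539.0, 3),
    "double exponential": (1133.0, 3),
    "logistic": (265.0, 3),
}

scores = {name: aicc(r ** 2, N_POINTS, k) for name, (r, k) in models.items()}

# Lower AICc means a better trade-off between fit quality and parameter count.
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:20s} AICc = {score:7.2f}")
```

With only nine points the correction term is large relative to the differences between models, so any ranking obtained this way is a rough guide at best and not evidence for a particular growth law.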

Now the second point I’d like to mention. Throughout the text Markov et al. discuss hyper-exponential growth and claim that it means there is a positive nonlinear feedback (this actually traces back to the works on human population growth, for that matter). Only one remark: hyper-exponential growth is not equivalent to an autocatalytic law of growth; there are other models that easily produce a hyper-exponential curve without any nonlinear feedback (see, e.g., the paper by G. Karev, “Dynamics of inhomogeneous populations and global demography models”).
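To make the last claim concrete, here is a minimal sketch of one standard construction; I do not claim that it reproduces the exact model of the cited paper. Suppose the population is a mixture of independent clones, each growing purely exponentially as $e^{at}$ with no feedback at all, and the growth rates $a$ are Gamma-distributed across clones. The total size is then the moment generating function of the Gamma distribution, $\bigl(\beta/(\beta-t)\bigr)^{k}$, which is a hyperbolic curve with a finite-time singularity at $t=\beta$. The shape $k$, the rate $\beta$ and the time points below are hypothetical values chosen only for illustration.

```python
import math

def mixture_size(t, shape, rate, steps=200_000):
    """Total size at time t of clones growing as exp(a*t), with a ~ Gamma(shape, rate).

    Trapezoid rule for the integral of exp(a*t) * pdf(a) over a >= 0.
    Requires t < rate and shape >= 1.
    """
    decay = rate - t          # integrand behaves like a**(shape-1) * exp(-decay * a)
    a_max = 60.0 / decay      # the tail beyond a_max is negligible
    h = a_max / steps
    norm = rate ** shape / math.gamma(shape)

    def integrand(a):
        return norm * a ** (shape - 1.0) * math.exp(-decay * a)

    total = 0.5 * (integrand(0.0) + integrand(a_max))
    for i in range(1, steps):
        total += integrand(i * h)
    return total * h

def closed_form(t, shape, rate):
    """Moment generating function of Gamma(shape, rate): a hyperbolic curve."""
    return (rate / (rate - t)) ** shape

SHAPE, RATE = 2.0, 1.0  # hypothetical parameters
for t in (0.0, 0.4, 0.8):
    numeric = mixture_size(t, SHAPE, RATE)
    exact = closed_form(t, SHAPE, RATE)
    per_capita_rate = SHAPE / (RATE - t)  # d/dt ln(total size), increases with t
    print(t, numeric, exact, per_capita_rate)
```

Each clone obeys a linear equation, yet the per capita growth rate of the total increases with time, simply because the fastest clones gradually dominate the mixture.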

This would be all, if not for one last remark. A. Sharov used his data (he had 5 points, not 9) to extrapolate the exponential curve into the past and concluded that life appeared 7 billion years ago. Therefore, he argued, we should seriously consider panspermia. No doubt we should seriously consider panspermia, but one cannot use five points to draw such conclusions (once again, read the reviews). Markov et al. promise in their paper to publish a separate paper on somewhat similar issues. I just hope that this won’t happen since, in my humble opinion, such scientific papers, which are grounded on basically zero scientific evidence, serve evolutionary theory worse than hundreds of stories by Intelligent Design proponents, whom A. Markov so productively fights in his popular lectures, articles and notes.

______________________________________________________________

References:

1. A. V. Markov, V. A. Anisimov, A. V. Korotaev. Relationship between genome size and organismal complexity in the lineage leading from prokaryotes to mammals // Paleontol. Zh. 2010 (in Russian).

2. A. Sharov. Genome increase as a clock for the origin and evolution of life, Biology Direct, 2006.

4. Some of the explanations on the hyperbolic growth, Raymond Pearl story and history of modeling of population growth can be found in my (overpriced) book: Bratus’, A.S., Novozhilov, A.S., Platonov, A.P.: Dynamical Systems and Models in Biology, Moscow: FizMatLit, 2010, 400 pages, in Russian, ISBN: 978-5-9221-1192-8