A Genetic Algorithm for DNA?

Introduction - Modifications - Steps - RNA and simple life - Possible applications - Conclusion



1. Introduction


With a background in microbiology and several years work experience in the IT sector, I stumbled upon Genetic Algorithms (GA). My first though was "hey, there are some researchers trying to model evolutionary processes!". Not so. GA and evolutionary computation researchers 'borrowed' some ideas from genetics, like crossover, mutation and fitness (in their own terms, not its biological meaning), but, much to my surprise, applications are in the field of timetabling schedules for organising an exam period, job-shop scheduling, game playing behaviour and attempts in management science.
Reading more about the details of GA, I have the idea that with some (minor?) changes to the algorithms used, it would be possible to use GA in the field of genetics. However, I'm not a mathematician and therefore I'd like to hear if the theoretical exercise outlined on this page could indeed be achieved with the proposed adjustments.


back to IT stuff

2. Modifications


First, one needs to be able to represent the genetic sequence as a string of bits. Due to difficulties with RNA (see below), I'll start with plain DNA. A chromosome is built up of a long chain of four different nucleotides. Genome databases at present contain parts or whole representations in the format of ACTG sequences, thus can be converted using a simple function by allocating two bits for each nucleotide: 00, 01, 10 and 11. This poses the first hurdle, because the present formulas used in GA mathematics take actions on single bits and not two at a time.

Question 1: To what extend is it reasonable to ask if two bits can be used as one 'atomic' entity?

You could argue that with representing the four nucleotides with two bits each, you would not need to group two bits. But in that case it is possible that a mutation occurs twice in the same nucleotide in one generation, which is incorrect from a biological point of view. In addition to that: with the one-bit approach the matter is 'if a 0 make it a 1', but in the 2-bits based situation, there are three alternatives. The second aspects involve crossover and mutation, which uses simple bit manipulation operations. Lets assume Q1 is answered with 'yes, it is possible to treat two bits as one entity'. This solves the GA-mutation computation implementation; rest the percentage of occurrence of mutation. Some organisms appear to have a higher degree of mutation than others (e.g. the more flexible bacteria) and on top of that, the current estimates of mutation occurrences vary slightly (certain mutations resulting in instant death). Translating this into 'practice': the degree of mutation entered as a parameter will be to some degree a trial-and-error process.
However, the mutation parameter is of minor importance compared to crossover. Crossover is certainly not a one-bit, or one-nucleotide, operation, or occurring more often than mutations. During crossover, a random group (varying in size) of nucleotides are swapped (see Figure 1).


Figure 1. Crossover. The red and green lines are DNA strings.


So this occurs less than mutation, but more bits/nucleotides are involved in the happening. One way is to keep the crossover as an atomic action and using a larger value of the parameter than mutation, but:

Question 2: is it possible to swap groups of adjacent bits for another group of randomly taken bits?

Where the group size varies, for example, between 100 and 200, or for the sake of simplicity a fixed group size and predefined sequence of bits.
In Genetic Programming (GP), the crossover operation is implemented by taking randomly selected sub-trees in the individuals (selected according to fitness) and exchanging them, or: "an integer position k along the string is selected uniformly at random between 1 and the string length less one [1,l-1]. Two new strings are created by swapping all characters between positions k+1 and l inclusively"(8). Although GP usually does not use any mutation as a genetic operator, this sounds, at least to me, more closely related to real crossover than the single-bit operations and 'might' be the answer to question 2.

Third, the fitness levels which are normally assigned to the strings. There exist so-called conservative regions and genes that code for vital proteins in a DNA sequence, they ought not to change, or very rarely. If changes do occur in those sections, they 'must' lead to 'string death', i.e. failing the fitness test. To me, this sounds as potential candidates to define the fitness levels on. However:

Question 3: Currently fitness levels are assigned to different possible strings, but ideally they should apply to parts of one string. Can that be coded?

A simplification could be, to take as input converted strings of e.g. the various forms of the influenza or HHV (human herpes virus) and assign fitness levels in accordance with the prevalence of those viruses in nature. In this way, it is possible to assign fitness levels to whole strings instead of parts of a string. Combining two levels of fitness levels would be really exciting: having a first fitness level defined for the DNA string as a whole, and a sub-level for regions within the string.

Though there are of course more factors ideally to be taken into account, like the situation of exchange and additions of genes, the ones mentioned above are the core aspects. If that is not possible at all, I think it will be extremely difficult to even remotely resemble any evolutionary process in a model.


back to IT stuff

3. Steps


To narrow down, or simplify, a practical attempt it would be easiest to start with a DNA virus like the T4 bacteriophage, SV40 and polyoma virus or to convert the influenza virus or TMV (tobacco mosaic virus) from RNA back to DNA and use that sequence, because they are relatively simple and well researched. TMV contains only 6400 nucleotides (6 genes) and the SV40 a mere 5243 base pairs, which can speed up the calculations. Further, to be able to check the outcome, a virus that exists in more than one flavour (i.e. more existing sequences in time) would be ideal.
So the process could be:
  1. Pick a DNA-based virus of choice from one of the genome databases. In case of RNA: calculate back to DNA (G for C and A for T and vice versa), see next subsection.
  2. Substitute the A, C, G and T for 00, 01, 10 and 11.
  3. In case of a DNA-based sequence set the value for crossover relatively higher than mutation, RNA-based viruses tend to be more prone to mutation.
  4. Set fitness levels in accordance with known conservative regions and regions that code for 'life critical' proteins.
  5. Run for different amount of generations and a statistically valid amount of times (after all, it is not sure if the 'evolution', or DNA changes, are truly random, whereas the GA is)
  6. Convert the bit-stream outputs back to the DNA sequence (and subsequently to RNA, if chosen, but not [yet] advisable) and compare the sequence with the genome database, which returns the percentages of (close) matches of your computed DNA strings.



back to IT stuff

4. RNA and 'simple' life forms


Well, this may sound simple and straight forward, but there are considerably more interfering parameters to be taken into account, and if it doesn't quite work out with DNA, there's not much sense in spending time on RNA or simple life forms like the 'humble' bacteria. First, RNA contains Uracil, instead of Thymine, both pairing with Adenine, which means that the 11 designated to represent Thymine (see paragraph 3 step 2) in DNA, has to be interpreted as Uracil when dealing with RNA. Secondly, there is a group called retroviruses, of which HIV is a striking example, which have a relatively very high mutation rate in certain regions (the same patient can have different 'versions' of HIV at different times), hence will likely complicate fine-tuning of the crossover and mutation parameters. Third, bacteria as not as simple as people would like to believe. Yes, they do have relatively short chains, but the problem lies in the fact that they are eager to swap genes amongst each other, regularly via plasmids (sizes can vary from 2k bases to several hundred) or transposons. Translated to modelling: you'll have to figure out a way, come up with equations, to expose a limited amount of generations to 'new' DNA (bit strings of at least 2 * 2000), devise a probability that the DNA sequence under test will incorporate the new bit string and that at least some of the resulting generations will pass the fitness test. 'Higher' life forms hardly show this type of behaviour, if at all. But the problem with higher life forms is that the DNA/bit strings will be very long, thus negatively affecting performance.


back to IT stuff

5. Possible future applications


This has more to do with using my imagination, because it is based on the assumptions that the hurdles as outlined in paragraphs 2 and 3 actually could become reality, within a scientifically acceptable margin at least. Maybe (likely) there are people who already have tried it, without fruitful results. Ignoring that for the moment: Then again, we may be too ignorant to even understand the basic processes involved in evolutionary changes to be able to capture it in (mathematical) models. Besides, all this is based on the presumption that there is some sort of structure in the observed changes in the molecular design of life, which may be wholly wrong. We just don't know; well, at least not yet.


back to IT stuff

6. Conclusion


I have not found any indication that GAs are being used with DNA as input data, but only in population dynamics within the field of biology. On this page I proposed two changes that would need to be made to the algorithms before one can make a reasonable attempt to test the GA with DNA: two bits will have to be treated as one atomic entity, with a knock-on effect of having three alternatives in a mutation, a crossover with a group of adjacent bits instead of the single bit swap currently implemented and different fitness levels to be assigned for different parts of the string. If these hurdles can be overcome, some highly interesting research results may be obtained.