Wednesday, May 13, 2009

Revision Notes: Evolutionary Computation

Introduction

Principles of Evolutionary Computing: · Evolution – Problem solving technique · Environment – The problem that is to be solved · Individual – A candidate solution · Fitness – Degree of quality Applications of Evolutionary Computing: · Optimization (Travelling Salesman · Modelling (Stock market prediction) · Simulation (Ecology) Evolutionary Algorithm Fitness Landscape Free Optimization Problem Genetic Programming Genotype Learning Classifier System Local Search Mutation Memetic Evolution No Free Lunch Theorem This addresses the efficiency of problem-solving procedures with respects to the specifics of the problem domain stating that on average two algorithms will perform well over all possible problem domains. Therefore a problem solving technique which incorporates knowledge of a specific problem domain may perform much better than other generic problem solvers, but with this mind, the technique may perform much worse on other problems. Objective Function Parameter Control Although static setting of EA parameters is simple, both the candidate solutions evaluated and the final solution is fairly stochastic in nature and so allowing parameters to change throughout the EA run allows suitable self-tuning to the current fitness landscape. Deterministic Parameter is altered independent of the performance of the EA and often as a function of time or epochs. There is no feedback from EA itself and so there lies the possibility of driving the EA away from an optimal solution. Mutation Control: Is the current generation, is the total number of generations allowed Adaptive Changes in parameters are often a function of both the current and historic performance of the EA i.e. poor performance, high mutation/crossover rates. Self-Adaptive Parameter is encoded into candidate chromosomes and thus is subject to the recombination operators like the solutions they encode.

Evolutionary computation uses principles from biological evolution which occurs by natural selection. Algorithms in this field maintain a population of candidate solutions which are iteratively refined by recombining and mutating existing solutions. By tending to select individuals which maintain a high degree of fitness, optimal characteristics will develop
Individuals can only populate if they survive to adulthood The fitness of an individual is a function of its environment Survival is probabilistic and not guaranteed

Genetic Algorithm

Genotype

The genotype encodes an internal representation of the candidate solution. Phenotypic mapping is required in order to produce a candidate solution. A suitable genotypic representation allows the encoding of all possible candidate solutions although it is possible to remove unfeasible solutions by careful design. Additional attention must be paid with regards to how the mutation and recombination operators interact with the genotype. Poor genotypic representation can lead to phenotypic mapping inefficiencies, unrestricted search space representation and the allowance of invalid candidates to exist. It is therefore important for significant study of the problem domain to take place.




Binary - Useful for representing natural binary or boolean (descision) functions but also for high-level data structures although complex mapping maybe required. Positions along the chromosome may have certain significance with regards to how they encode the phenotype and so it's recommended specific schemes like Gray Coding are employed to minimise the hamming distance.

Integer - Useful for reducing genotype-phenotype abstraction and also simplifying representation for enumerated types (e.g. ordinal sets). Genetic operators for recombination and mutation must be defined which maybe more complex than simply splicing sections of the chromosome together.

Real - Useful for representing continuous values such as vectors. One should respect that computers can only represent a finite range of real values and so not all possible candidate solutions maybe represented. As with integer representations, genetic operators specifically designed for dealing with real numbers are required.

Phenotype

For the purposes of discussion, the specifics of the phenotype don't matter, all that should be known is that the phenotype exists as an 'uncompressed' version of the genotype to which the environment directly interacts with. In other words it is the materialisation of a candidate solution.

Recombination Operators

The purpose of the recombination operators is to produce new offspring with the secondary aim of taking fitter characteristics from two or more parents to produce a child which outperforms them. This process is generally quite probablistic and greatly dependant on the method of population selection employed but has a tendancy to inhibit the reproductive capacity of sub-optimal candidates increasing the population fitness over several generations. Quite simply a child is created by merging the chromosomes of contributing parents using several techniques (although a specific implementation may require adaptions of these) :



N-Point Crossover - N partitions are located along the chromsome of each parent and then regions between partitions are swapped. To maintain chromosome length, the partition points are at the same location of each contributing chromosome.

Cut and Splice - This is a special case of N-Point Crossover where N = 1 and the partion points are not equal across the parental chromosomes. This method may not work for all implementations since it alters the length of the chromosome possibly rendering it invalid.

Uniform Crossover - Rather than swapping regions of the chromosome like the previous two methods, the uniform crossover swaps each position along the chromosome with probability p.

Cut and Splice - This is a special case of N-Point Crossover where N = 1 and the partion points are not equal across the parental chromosomes. This method may not work for all implementations since it alters the length of the chromosome possibly rendering it invalid.
Population Selection

Evolutionary Programming

Genetic Programming

Hybrid Algorithms I: Ant Colony Optimization
Hybrid Algorithms II: Swarm Intelligence

Motivated by the behaviour of insects, fish and birds where each individual dynamically optimizes its position within the swarm and the environment. Although the natural occurrence of swarm intelligence use positions and velocities rooted in two dimensional space, for practical usage they can represent any n-dimensional parameter space. Bi-directional communication between each individual and the swarm as a whole, allows the propagation of optimal solutions (positions). Candidate solutions are comprised of the following: · individuals current position and velocity · best solution it has seen so far by the individual · best solution in the individual’s neighbourhood · best solution in the entire swarm It should be noted that a random component helps the exploration of solution space remains stochastic. for k = 1 to number of iterations to dofor I = 1 to number of particles n dofor J = 1 to number of dimensions m doR1 = uniform random numberR2 = uniform random numberV[I][J] = w*V[I][J]+ C1*R1*(X_lbest[I][J] - X[I][J])+ C2*R2*(X_gbest[J] - X[I][J])X[I][J] = X[I][J] + V[I][J]enddoenddoenddo Case Study: Boids Boid was an application used to simulate flocking behaviour in birds which comprised of three basic rules for optimization: · Flock Centralization – Each Boid seeks to position itself in the middle of neighbouring Boids. · Collision Avoidance – Each Boid seeks to distance its self from all other Boids. · Velocity Consistency – Each Boid seeks to match the velocities of neighbouring Boids. Further advanced rules were also proposed to address the following conditions: · Wind · Positional Affinity · Velocity Constraints · Search Space Bounding · Perching · Anti-Flocking

Hybrid Algorithms III: Learning Classifier Systems

Zeroth Level Classifier
XCS Classifier

Wednesday, February 4, 2009

Snowdays

Those who live on the south of the England, will no doubt be aware of the abundance of white stuff which deposited itself at the begining of this week.

I thought I'd share with you a few fantastic photos taken by my good friend Phil around the University of Reading.




Sudoku Genetic Algorithm: Introduction

What Is Sudoku?
Sudoku is a recently introduced, Japanese number puzzle where the objective is to fill a 9x9 grid so that each row, column, and 3x3 sub grid, contains numbers 1 through 9. The grids are usually partially filled and so it's the job of the solver to complete the remaining empty cells, without violating these. Sudoku problems range from easy to hard, where easy grids have more than one possible solution, and hard, where the grids have only one unique solution.


Sudoku puzzle with solution numbers marked in red. Source: Wikipedia

Why Use A Genetic Algorithm?

...because curiosity killed the cat, that's why!

Genetic Algorithms to which I'll now abbreviate to GA, is an interesting means to solving a problem. Though in this domain, they certainly don't provide a performance increase, they're often used in engineering to solve problems to which the solution space is so vast, there exists no algorithmic approach and Monte Carlo methods simply skim the surface of all possible solutions.

Gross Over-Simplification:
We maintain a population of ideally diverse solutions, to which each individuals fitness (effectiveness of solving a given problem) is evaluated. The highest ranking individuals are chosen to seed the next generation, with low ranking individuals discarded. This process repeats until some given threshold has been breached. This process effectively refines good solutions in an attempt to increase their fitness, so much so that they solve the problem. Note that this is a Gross Over-Simplification and much work goes into the degree of seeding, mutating solutions and even the evaluation function... GAs are not a one-stop-shop to solve world hunger.

In general, we use a GA to solve Sudoku puzzles because we can and it provides a more holistic approach to solving a problem.

In this series of posts, we'll explore the various aspects of implementing a Sudoku GA including the pitfalls and misconceptions of GAs in general.

Friday, November 28, 2008

Shiny Workspace...


Recently I took the plunge and ordered my new PC with the aim to provide a powerful enough sandbox for CUDA development etc. Unfortunately, my vanity took over and spent a bit more than I should of done on the aesthetics (relative to what I would've payed).

For those of us who appreciate this stuff and will find it meaningful, here are the specs.

  • AMD Phenom Quad-core 9950 Black Edition
  • OCZ 4 GB DDR2 @ 1066MHz
  • Samsung SpinPoint F1 HD103UJ 1TB Hard Drive SATAII with 32MB Cache
  • MSI K9N2 Diamond nForce 780a SLI Socket AM2+
  • BFG GTX280 OC2 Edition 1GB
    Currently returned to supplier due to getting hot enough to cook breakfast on within 10 seconds of running 3D Mark 06.
  • Antec Fusion MAX VERIS Black ATX Media Center Case
Many of the people I've spoken to about my choice of processor and memory, responded with either distain or mostly 'you could've done better' kind of attitude. Let me take the opportunity to explain myself:
  1. I brought an AMD processor because all my previous processors have been Intel and I fancied a change irrespective of marginal performance losses.
  2. The memory was DDR2 because most the DDR3 mainboards are hideously expensive and you don't get many 'nice' extras with them. My choice of mainboard enabled me to have a high-end Nvidia chipset with a Creative Sound-Blaster X-Fi bundled in.
The rest of the components require no justification.




Tuesday, November 18, 2008

Empty Designs


As a massive proponent of minimalist design, I came across this article today on Smashing Magazine... stick it in your inspiration scapbook!

Sunday, November 16, 2008

Introduction to Phylogenetics: Preface

Series Contents
  1. Introduction to Phylogenetics: Preface
  2. Introduction to Phylogenetics Part I: Evolutionary Model
  3. Introduction to Phylogenetics Part II: Stationary Probabilities of Mutation
  4. Introduction to Phylogenetics Part III: Phylogenetic Construction by Maximum Likelihood
  5. Introduction to Phylogenetics Part IV: Multiple Sequence Alignments
  6. Introduction to Phylogenetics Appendix A: Entrez SOAPing
  7. Introduction to Phylogenetics Appendix B: CUDA in C#
  8. Introduction to Phylogenetics Appendix C: CUDA Optimisation
In this series of posts, I'm going to give the surface of bioinformatics a good scratching, written from the perspective of a non-biologist. The main focus is to implement many of the algorithms we'll encounter in this area, in parallel, or more specifically in CUDA.NET. During the journey I'll add code snipets where possible either in a specific language or in pseudocode. Pop over to Google Code to grab the latest version of the source which much of the research in this area has led to.

Friday, November 14, 2008

Introduction to Phylogenetics Part I: Evolutionary Model

Series Contents
  1. Introduction to Phylogenetics: Preface
  2. Introduction to Phylogenetics Part I: Evolutionary Model
  3. Introduction to Phylogenetics Part II: Stationary Probabilities of Mutation
  4. Introduction to Phylogenetics Part III: Phylogenetic Construction by Maximum Likelihood
  5. Introduction to Phylogenetics Part IV: Multiple Sequence Alignments
  6. Introduction to Phylogenetics Appendix A: Entrez SOAPing
  7. Introduction to Phylogenetics Appendix B: CUDA in C#
  8. Introduction to Phylogenetics Appendix C: CUDA Optimisation

Introduction


For constructing a phylogenetic tree based on DNA data, we need to have an idea of the evolutionary trends or how specific bases mutated over time. It is likeley that with high evolutionary rates result a tree with short branches will be produced, low evolutionary rates would result in a tree with long branches. Evolutionary models which don't impose an equal probability of mutation on all bases (not unlike the Jukes-Cantor model), require some parameters on the likelihood of such a mutation.


Scope of Article

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin in ante in sapien placerat accumsan. Sed non turpis nec ipsum dapibus molestie. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Donec sem neque, gravida sed, dapibus a, laoreet quis, justo. Ut semper neque vel ligula. In placerat consequat dui. Praesent consectetuer feugiat metus. Phasellus consequat semper diam. Suspendisse massa. Praesent vel risus. Donec sodales posuere risus. Donec scelerisque ante sagittis erat. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Morbi vitae lectus sit amet nisi ultricies ullamcorper. Duis lobortis. Aenean dignissim risus sit amet sem. Donec mollis diam id est. Vestibulum pulvinar leo id diam.

Predicting Evolution


We construct a Markov Chain representing all possible changes of state DNA can undergo (a generalized model is shown below). The assumption of a constant rate of change is heavily tied to the concept of a molecular clock and is highly contraversial. The idea that evolution occurs at a constant rate is a gross oversimplification of the process but to ease computation and also generate a sense of direction, it is often inherant in phylogenetic reconstruction.

Figure I -Markov Chain representing the change of states for all bases of DNA


Jukes-Cantor Model

Developed in 1969, is the simplest evolutionary model, available in descrete and continuous time flavours. The model assumes a constant rate of evolution which remain equal for all bases.


Kimura Model

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin in ante in sapien placerat accumsan. Sed non turpis nec ipsum dapibus molestie. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Donec sem neque, gravida sed, dapibus a, laoreet quis, justo. Ut semper neque vel ligula. In placerat consequat dui. Praesent consectetuer feugiat metus. Phasellus consequat semper diam. Suspendisse massa. Praesent vel risus. Donec sodales posuere risus. Donec scelerisque ante sagittis erat. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Morbi vitae lectus sit amet nisi ultricies ullamcorper. Duis lobortis. Aenean dignissim risus sit amet sem. Donec mollis diam id est. Vestibulum pulvinar leo id diam.


HKY Model
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin in ante in sapien placerat accumsan. Sed non turpis nec ipsum dapibus molestie. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Donec sem neque, gravida sed, dapibus a, laoreet quis, justo. Ut semper neque vel ligula. In placerat consequat dui. Praesent consectetuer feugiat metus. Phasellus consequat semper diam. Suspendisse massa. Praesent vel risus. Donec sodales posuere risus. Donec scelerisque ante sagittis erat. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Morbi vitae lectus sit amet nisi ultricies ullamcorper. Duis lobortis. Aenean dignissim risus sit amet sem. Donec mollis diam id est. Vestibulum pulvinar leo id diam.


Felsenstein Model

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin in ante in sapien placerat accumsan. Sed non turpis nec ipsum dapibus molestie. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Donec sem neque, gravida sed, dapibus a, laoreet quis, justo. Ut semper neque vel ligula. In placerat consequat dui. Praesent consectetuer feugiat metus. Phasellus consequat semper diam. Suspendisse massa. Praesent vel risus. Donec sodales posuere risus. Donec scelerisque ante sagittis erat. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Morbi vitae lectus sit amet nisi ultricies ullamcorper. Duis lobortis. Aenean dignissim risus sit amet sem. Donec mollis diam id est. Vestibulum pulvinar leo id diam.



References

  1. Ewens W. J., Grant G. R., Statistical Methods in Bioinformatics, Springer 2005