
Monday, April 11, 2011

Symbolic regression results

An important role of this blog is to act as a journal of my experiments. From now on I will seek to document my experiments in enough detail to allow others to compare results, and so that I can compare my own results at a later date. So here we go...

I managed to dig out some notes from my early Charlie tests - at first I experimented using the following setup:

Function nodes: Add, Subtract, Multiply and Divide
Terminal nodes: 0, 1, 2 and x
Crossover probability: 90%
Reproduction probability: 9%
Mutation probability: 1%
Mutation depth: 2
Max tree size (number of nodes): unlimited
Max tree depth: unlimited
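
For reference, here is a minimal sketch of how breeding probabilities like the ones above are typically applied when producing each new individual (hypothetical names, not Charlie's actual code):

    import random

    def breed(population, select, crossover, mutate):
        """Produce one offspring: 90% crossover, 9% reproduction, 1% mutation.
        `select` picks a parent (e.g. by tournament); `crossover` and `mutate`
        are the tree operators."""
        r = random.random()
        if r < 0.90:                   # crossover probability: 90%
            return crossover(select(population), select(population))
        elif r < 0.99:                 # reproduction (plain copy): 9%
            return select(population)
        else:                          # mutation probability: 1%
            return mutate(select(population))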

Below is a table of the experiments I ran with the above configuration. The function I was trying to evolve is on the left; population size and maximum generations are also listed, along with some comments on the results:

Function             Population Size   Generations   Result
7.2x^3 + x^2 - 3x    10000             50            No perfect solution and struggling.
7x^3 + x^2 - 3x      10000             50            Perfect solution in 16 generations.
7x + x^2 - 3x        10000             50            Perfect solution in 1 generation.
7.2x + x^2 - 3x      10000             50            No perfect solution.
7.2x                 10000             50            No perfect solution. Best fitness score: ~442.
7.2                  10000             50            No perfect solution.

I was surprised that the system struggled to generate a subtree equal to, or even close to, 7.2. It seemed to evolve integers just fine but struggled with fractions. I therefore tried introducing two further terminal nodes, the constants 4 and 8, in the hope of guiding the evolution toward the right ballpark figures. I repeated the last experiment above, but still got no perfect or even approximate solution:

Function             Population Size   Generations   Result
7.2                  10000             50            No perfect solution.

I got desperate and decided to introduce ephemeral constants, which had been on my to-do list anyway. I also removed the constants 4 and 8 to reduce the search space a bit again.
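
An ephemeral random constant is a terminal node that samples a random value once, when the node is first created, and then keeps that value fixed for the rest of its life. A minimal sketch of the idea (hypothetical names and value range, not Charlie's actual code):

    import random

    class EphemeralConstant:
        """Terminal node whose value is sampled once at creation and then
        frozen. The range [-10, 10] is an assumption for illustration."""
        def __init__(self):
            self.value = random.uniform(-10.0, 10.0)  # drawn once, never resampled

        def evaluate(self, x):
            return self.value  # ignores the input; always yields its constant

This resulted in the following figures: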

Function             Population Size   Generations   Result
7.2                  10000             50            No perfect solution, but very close (~0.0001).
7.2x^3 + x^2 - 3x    10000             50            No perfect solution, but no longer struggling.
7x^3 + x^2 - 3x      10000             50            No perfect solution.
7x + x^2 - 3x        10000             50            Perfect solution in 1 generation.
7.2x + x^2 - 3x      10000             50            No perfect solution.
7.2x                 10000             50            No perfect solution. Best fitness score: ~13.
7.2x                 100000            43            No perfect solution, but very close.
7.2x                 10000             100           No perfect solution. Best fitness score: ~12.
sqrt(x)              10000             50            No perfect solution. Best score: ~34.
7x + x^2 - 3x + x^3  10000             50            Perfect solution in 6 generations.

The results were better, but by no means amazing. I then came across a subtle bug that meant some arguments were not forwarded to subtrees; after fixing that, I got the following results:

Function             Population Size   Generations   Result
7.2                  10000             50            No perfect solution, but very close (~0.00001).
7.2x^3 + x^2 - 3x    10000             50            No perfect solution, but no longer struggling.
7x^3 + x^2 - 3x      10000             50            Perfect solution in 12 generations.
7x + x^2 - 3x        10000             50            Perfect solution in 1 generation.
7.2x + x^2 - 3x      10000             50            No perfect solution, but very close (~0.0001).
7.2x                 10000             50            No perfect solution, but very close (~10e-6).
7.2x                 100000            31            No perfect solution, but very close (~10e-6).
sqrt(x)              10000             50            No perfect solution. Best score: ~0.036.
7x + x^2 - 3x + x^3  10000             50            Perfect solution in 3 generations.

That was more like it! :) It was time to try the Tic-tac-toe experiments with Charlie...

Friday, February 25, 2011

Open BEAGLE

When I first started on my project a few years ago, I decided to save time by making use of an existing tree-based GP system. I researched the various alternatives, and by far the best framework I came across was Christian Gagné's Open BEAGLE. It is well designed, flexible, easily extensible and very efficient.

Open BEAGLE also has a Yahoo mailing list offering help and support along the way.

As an initial test of the system, and of my understanding of it, I played around with symbolic and numerical regression, where the input and output of a mathematical function are known but the function itself must be evolved.

All you need to do is loop over a series of input data (e.g. the integers 1 to 100), run some function (e.g. x^2 + x + 1) on the input, store the output for fitness evaluation, and then provide only the input data to your population, along with some basic operators (e.g. addition, subtraction, multiplication and division). Fitness is determined as the negated magnitude of the error between an individual's output and the known output: the closer the error is to zero, the fitter the individual. Thus, if the error is exactly zero, you have a perfect fitness case.
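
As a minimal sketch of that fitness scheme, treating an individual as a callable and using hypothetical names (plain Python, not Open BEAGLE's actual API):

    def target(x):
        return x * x + x + 1            # the known function, here x^2 + x + 1

    TRAINING_INPUTS = range(1, 101)     # the series of input data, e.g. 1 to 100

    def fitness(individual):
        """Negated total error against the known outputs; 0 is a perfect match."""
        return -sum(abs(individual(x) - target(x)) for x in TRAINING_INPUTS)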

If everything works, after a few generations of evolution you will end up with a program that matches the function exactly, e.g. for x^2 + x + 1.
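
Purely as an illustration (not actual output from a run), a perfect individual for x^2 + x + 1 built from the four basic operators could be the tree Add(Multiply(x, x), Add(x, 1)):

    def evolved(x):
        # Add(Multiply(x, x), Add(x, 1)) == x^2 + x + 1
        return (x * x) + (x + 1)

    # With the fitness sketch above, fitness(evolved) == 0: a perfect fitness case.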


Next, I tested whether this system could evolve more elaborate functions like 4x^3 + 2x - 27. I played around with how many generations were required and how large a population was necessary to get exact solutions. I also tried using a square root function without exposing a square root operator to the GP system, and was amazed when the GP system found very good approximate solutions within a few dozen generations, using a population of a few thousand individuals!