Nowadays we can witness the more and more growing interconnection between the research fields of genetics and computer science. The amazing results on human genome sequencing could not have been achieved without today's effective computers, the internet and the advanced data mining techniques. At the same time, both the concept and the working of artificial evolutionary systems imitate natural evolution.

Accordingly, on one side computers help natural genetics to develop, and on the other side, the lessons learned from natural evolution are applied in artificial evolutionary systems.

The field of evolutionary computation was formed in the 1970's when three independent research groups began to investigate the possibilities of creating artificial systems that solve certain problems through artificial evolution. However, the research in this area made a very rapid progress only in the last decade. Genetic programming is the newest form of evolutionary computation that was conceived in the late 1980's as a possible means for automatic programming. Genetic programming performs an evolutionary search in the space of computer programs and selects the program that solves a given task according to certain criteria.

The possibility of automatic programming has been a central question in computer science from the beginning. Genetic programming provides a partial answer. Genetic programming is an extension of genetic algorithms, where the structures undergoing adaptation are not strings but hierarchical computer programs of varying size and shape. Genetic programs are built from functions and terminals, referred together as nodes. Thus, the search space of genetic programming is the space of all trees that can be formed using the chosen function set and terminal set. Genetic programs are evaluated on a set of so-called fitness cases consisting of (input, output) pairs. Each program is tested against all the fitness cases and is assigned a fitness value accordingly. In a genetic programming system (like in every evolutionary method) one must carefully tune the representation, the evaluation method and the parameters in order to obtain worthwhile results.

Genetic programming raises many intriguing questions: How can one assure that evolution explores the best regions of the program space? How can one define and measure efficiently the diversity of programs? We do not know in advance the size and form of a good program. Then how can we restrict or guide the genetic search in order to keep it off from unnecessary extensive calculations? The benchmark problems seem quite simple. How can one apply the genetic programming paradigm to difficult, real world problems? In the case of complex tasks, one can encounter difficulties that did not cause much trouble in the case of simple tasks. How can the results on test problems be transmitted to more complex problems (i.e., is the method scalable)? Can genetic programming improve the performance of a system when added as a learning component?

In this work we try to provide answers to the above questions.

2. Outline of the dissertation

The dissertation is structured in three parts. In the first part we give an overview of evolutionary computation and in particular genetic programming. In Chapter 2 we give short descriptions for the different forms of evolutionary computation. Then we show in Chapter 3 the specific features of genetic programming. We raise key issues for genetic programming: code growth, diversity, real world applications.

In the second part we present our contribution to the theory of genetic programming. In Chapter 4 we demonstrate two methods for limiting the code growth. The first method consists in applying an additional mutation operator that simplifies the structure of a genetic program without altering its behavior. The second method applies multiobjective optimization for the objectives of fitness and program size. We show that both methods are successful in reducing code growth without significant loss of accuracy. In Chapter 5 we define a distance metric for genetic programs and use it for applying the fitness sharing technique.¹ We propose a simple diversity measure based on our metric and study the effects of fitness sharing with the help of this diversity measure.

In the third part we show the application of genetic programming in two complex real world problems. The first problem comes from mechanical engineering. Four bar mechanisms play a very important role in practical mechanism design.² We describe in Chapter 6 our four bar mechanism design system. We demonstrate how genetic programming can be a vital component of a complex design system. We integrate genetic programming with decision trees ³ into a powerful learning machine.

The second problem belongs to the decision support domain of economics. The decision-makers have to make many subjective decisions. Consequently, the final decision is sensitive to even small changes in these subjective values. We present in Chapter 7 our genetic programming system that helps the decision-makers to arrive at stable decisions. That is, for small variations in the values of the involved variables, the final decision remains unchanged.

3. New results

The contributions of this work to the advance of genetic programming are two-fold: we present here both theoretical and application results. Our results can be grouped in four distinct categories, the first two categories summing up theoretical developments and the last two categories showing application results.

Thesis I

Two new methods for moderating code growth in genetic programming have been developed: the simplifying mutation method and the Pareto selection method.

The first method has added a new simplification step to genetic programming systems. Simplification has been applied to probabilistically selected individuals in certain generations. Thus, the control of code growth has been independent of the fitness-based selection mechanism of genetic programming. It has been shown that code growth can be controlled without significant loss in performance.

The second method has been based on multiobjective optimization for the two objectives of fitness and size. We have used variants of the Pareto nondomination criterion in the selection step. We have shown that selection based on the Pareto nondomination criterion reduces code growth and processing time without significant loss of solution accuracy. Furthermore, the size of programs in the final population is independent of the maximum program size allowed in the initial population.

The new methods have been compared with original genetic programming. Both methods reduced code growth and produced solutions of similar accuracy as those of original genetic programming. In the simplification method the mechanism for controlling code growth has been independent of the fitness-based selection mechanism. The Pareto method has been very simple and it has not included program size in fitness computation.

Thesis II

A distance function reflecting the structural difference between genetic programs has been defined. The metric has allowed the application of fitness sharing between genetic programs and thus the control of the programs' diversity.

We have defined a new distance metric for genetic programs that can be efficiently computed. By using this metric, we have applied fitness sharing to genetic programming. The accuracy of solutions found by fitness sharing has been better than the accuracy of solutions found by original genetic programming.

We have applied fitness sharing for maintaining the diversity of the evolving population of genetic programs. For monitoring the diversity of the population throughout evolution, we have defined a new diversity measure for genetic programs. The new diversity measure has been based only on the structural differences of programs within the population, since the diversity of a population does not depend on the performance of its programs on a given data set.

By using tournament selection and fitness sharing, we have maintained the diversity of the population at a certain level depending on the size of program neighborhood.

Thesis III

Genetic programming has been combined with decision trees in a novel hybrid machine learning method for efficient learning of rules containing algebraic expressions. Decision trees have constituted the inductive learning engine and genetic programming has created new attributes.

This method has been successfully applied for finding meaningful descriptions of classes of four bar mechanisms. The data set has contained more than 7000 examples, from which less than 1% have belonged to the preferred class.

We have developed a preprocessing step for the best selection of the examples in the training set, and we have reduced the training set to 14% of the given data set with little loss in accuracy. This preprocessing step can also be applied in other machine learning problems where the data set contains very few examples of one class and many examples of other class(es).

Thesis IV

The genetic programming paradigm has been applied for generating stable decision principles in multicriteria decision problems. The new decision principles have evolved as genetic programs and our best decision principle has been more stable than the classical ones.

We have defined new decision principles as combinations of elementary decision principles. The evaluation of the new decision principles has consisted in the sensitivity analysis of the multicriteria decision problem.

We have analyzed the situation when the values of alternatives with respect to subjective criteria are allowed to vary. By using genetic programming, we have obtained decision principles with much better stability when compared to the widely used decision principles.

4. Publications on the topic of the dissertation

A. Ekárt: A New Trend in Machine Learning : Genetic Programming. CompNews'97, Information Technology Conference, Kolozsvár, October 1997, pp. 22-26.

A. Ekárt: Generating Class Descriptions of Four Bar Linkages. John R. Koza (Editor). Late Breaking Papers at the Genetic Programming 1998 Conference, University of Wisconsin, 22-25 July 1998, pp. 42-47.

A. Ekárt: Machine Learning Methods in Synthesis of Four Bar Linkages. Conference of PhD Students in Computer Science, Szeged, 18-22 July 1998, pp. 38-39.

A. Ekárt: Controlling Code Growth in Genetic Programming by Mutation. W. B. Langdon, R. Poli, P. Nordin, T. Fogarty (Eds.). Late Breaking Papers of EUROGP'99, Göteborg, 26-27 May 1999, pp. 3-12.

A. Ekárt; A. Márkus: Decision Trees and Genetic Programming in Synthesis of Four Bar Mechanisms. Life Cycle Approaches to Production Systems, Proceedings of the Advanced Summer Institute-ASI'99, Leuven, 22-24 September 1999, pp. 201-208.

A. Ekárt: A General Framework for Obtaining Useful Design Features of Mechanisms. B. Katalinic (Editor). Annals of DAAAM for 1999, Proceedings of the 10th International DAAAM Symposium, Vienna, 21-23 October 1999, pp. 149-150.

A. Ekárt: Shorter Fitness Preserving Genetic Programs. C. Fonlupt, J.-K. Hao, E. Lutton, E. Ronald, M. Schoenauer, (Eds.). Artificial Evolution. 4th European Conference, AE'99 Dunkerque, 3-5 November 1999. Selected Papers, LNCS volume 1829, pp. 73-83 , 2000.

A. Ekárt; S. Z. Németh: A Metric for Genetic Programs and Fitness Sharing. R. Poli, W. Banzhaf, W. B. Langdon, J. Miller, P. Nordin, T. Fogarty (Eds.). Genetic Programming, Proceedings of EUROGP'2000, Edinburgh, 15-16 April 2000, LNCS volume 1802, pp. 259-270.

A. Ekárt; S. Z. Németh: Evolving Decision Principles for Multicriteria Decision Problems. Working Paper of the Laboratory of Operation Research and Decision Systems, WP 2000-9, November 2000.

A. Ekárt; S. Z. Németh: Selection Based on the Pareto Nondomination Criterion for Controlling Code Growth in Genetic Programming. Journal of Genetic Programming and Evolvable Machines, volume 2, issue 1, 2001, pp. 61-73.

A. Ekárt; S. Z. Németh: A Noncontinuous Generalization of the Arithmetic-Geometric Mean. Applied Mathematics and Computation, 2001 (In print).