GA Vibes

[Image: chromosome structure, via https://en.wikipedia.org/wiki/Chromosome#/media/File:Chromosome_structure.png]

I have been interested in evolutionary computation (EC) for quite a while. Evolutionary computation is a family of algorithms for global optimization inspired by biological evolution. Genetic algorithms (GAs) are a form of EC, modeled on natural selection, commonly used to solve optimization problems.

In a genetic algorithm, a population of candidate solutions to the optimization problem is encoded as chromosomes, by analogy to biological chromosomes. Each chromosome encodes the parameters needed to generate a possible value of the function being optimized. The encoding might be a bitmap, a vector, program code, or any representation from which the objective function can be calculated. For each chromosome, a fitness or score is calculated: the value of the objective function for that chromosome. The full evaluation of the set of chromosomes is called a generation. A selection method chooses candidate chromosomes to form the next generation. A crossover operation is applied to pairs of chromosomes to form new child chromosomes, and a mutation operator can be applied to the new population to provide more variation in the solutions generated. The process is repeated for a fixed number of generations or until the objective function stops improving.

Genetic algorithms are not algorithms in the strict sense; they are heuristics. For any given problem, a representation must be chosen and mutation and crossover operations defined. Defining these can be the difficult part of applying a GA to a specific problem. In addition, there are hyperparameters that must be set, including the population size, the number of generations to run, and, depending on the encoding, other parameters.

GAs can be useful for problems that have a complex fitness landscape, because mutation and crossover can move the population away from local optima where a traditional algorithm might get stuck.

Change Point

The change point problem is finding the positions in a timeseries where the underlying probability distribution changes. I have looked at the change point problem previously, for example here. A genetic algorithm is not a good method for finding change points. If you really want to find the change points in a series, use something like the ruptures library for Python or one of the many R packages for change point detection, as in the sketch below. GAs tend to be more computationally expensive than methods designed specifically for change point detection.
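A minimal sketch with ruptures' PELT detector; the signal and the penalty value are illustrative, not from this post's code.

import numpy as np
import ruptures as rpt

# A step signal with one obvious change point at index 100.
rng = np.random.default_rng(0)
signal = np.concatenate([np.zeros(100), np.ones(100)]) + rng.normal(0, 0.1, 200)

algo = rpt.Pelt(model="l2").fit(signal)  # PELT search with a least-squares cost
print(algo.predict(pen=3))               # detected indices; the last is always len(signal)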

Nevertheless, I wanted to try two things: playing with a GA and seeing whether an LLM could generate satisfactory genetic algorithm code. I needed a test problem, and since I have experience with change point detection, it seemed like a relatively easy one to tackle.

In what follows, we will use the terms change point and breakpoint interchangeably. 

Vibe Coding

Vibe coding is a development method that uses AI to generate code from natural language prompts. It usually involves attacking a problem with less planning than traditional software development methods. It can be very effective for smaller self-contained projects, but it may be an inappropriate approach for complex multi-developer projects.

I asked ChatGPT 5 to generate Python code to find the change points in a timeseries using a GA. To my surprise, it generated working code. I made extensive changes to the original code, but the framework came from ChatGPT.

GA

Here is some pseudocode for a typical GA. We will need to fill in the details of the chromosomes, population size, operators, etc.

  initialize population
  gen = 0
  best = None
  best_fitness = -Inf
  while gen < total_generations
      calculate fitness for each chromosome
      new_pop = []
      while size(new_pop) < size(population)
          select 2 chromosomes
          apply crossover operator
          apply mutation operator
          add the 2 new chromosomes to new_pop
      population = new_pop
      if best fitness in population > best_fitness
          best_fitness = best fitness in population
          best = best chromosome in population
      gen = gen + 1
  report best

Data


We will test the GA with a timeseries that has some obvious breakpoints. We will assume that the timeseries is piecewise linear with normally distributed noise.

\[y_i = c_i + m_i x_i + \varepsilon_i\]

Here $i$ is the segment number, and ${c_i}$ and ${m_i}$ are the intercept and slope of the linear model on segment $i$. ${\varepsilon _i} \sim N(0,{\sigma ^2})$ is normally distributed noise.

If we simply assume that the model for each segment is linear, we could choose change points and then use linear regression to estimate the slopes and intercepts. If we instead require the fit to be continuous piecewise linear, we impose additional constraints on the intercepts.

\[c_j = c_{j-1} + \left( m_{j-1} - m_j \right) b_{j-1}, \qquad j > 0\]

where $b_{j-1}$ is the breakpoint between segments $j-1$ and $j$.

These constraints make the optimization problem slightly more difficult.
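To make the constraint concrete, here is a small illustrative helper (the function is mine, not part of the GA code below) that derives every intercept from the first one:

import numpy as np

def intercepts_from_constraints(breaks: np.ndarray, slopes: np.ndarray, c0: float) -> np.ndarray:
    """Derive segment intercepts from the continuity constraint.

    With y = c_j + m_j * x on each segment, continuity at the breakpoint b
    between segments j-1 and j forces c_j = c_{j-1} + (m_{j-1} - m_j) * b.
    """
    c = np.empty(len(slopes))
    c[0] = c0
    for j in range(1, len(slopes)):
        c[j] = c[j - 1] + (slopes[j - 1] - slopes[j]) * breaks[j - 1]
    return c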

Here's code to generate a time series subject to the constraints on the segment intercepts.


import time
import numpy as np

def generate_piecewise_linear(n: int = 200, K: int = 3, x_range: tuple[float, float] = (0, 10), 
                              slopes: np.ndarray | None = None, 
                              intercept: float = 0.0, 
                              noise_std: float = 0.1, 
                              random_state: int | None = None) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """
    Generate synthetic piecewise linear data with K segments.
    
    Partially generated by ChatGPT 5.
    
    Parameters
    ----------
    n : int
        Number of x points.
    K : int
        Number of breakpoints (=> K+1 segments).
    x_range : tuple
        Range of x values (min, max).
    slopes : list or None
        Slopes for each segment. If None, random slopes are drawn.
    intercept : float
        Starting intercept (at x = x_range[0]).
    noise_std : float
        Standard deviation of Gaussian noise.
    random_state : int or None
        Random seed. If None, the current time is used.
    
    Returns
    -------
    x : ndarray
        Sorted x values.
    y_true : ndarray
        True (noise-free) piecewise linear values.
    y : ndarray
        Noisy observations.
    breakpoints : ndarray
        Breakpoint x-locations (excluding endpoints).
    """
    seed = random_state
    if seed is None:
        seed = int(time.time())
    rng = np.random.default_rng(seed)

    x = np.linspace(x_range[0], x_range[1], n)
    
    # choose K breakpoints randomly within range
    breaks = np.sort(rng.uniform(x_range[0], x_range[1], size=K))
    
    # slopes
    if slopes is None:
        slopes = rng.uniform(-1.5, 1.5, size=K+1)
    
    # compute true y values segment by segment
    y_true = np.empty_like(x)
    current_intercept = intercept
    seg_start = x_range[0]
    
    for j in range(K+1):
        if j < K:
            seg_end = breaks[j]
        else:
            seg_end = x_range[1]
        idx = (x >= seg_start) & (x <= seg_end)
        y_true[idx] = slopes[j] * (x[idx] - seg_start) + current_intercept
        # update intercept for continuity
        current_intercept = slopes[j] * (seg_end - seg_start) + current_intercept
        seg_start = seg_end
    
    # add Gaussian noise
    y = y_true + rng.normal(0, noise_std, size=n)
    
    print("Breakpoints:", breaks)
    print("Slopes:", slopes)
    print("c0:", intercept)
    
    return x, y_true, y, breaks

Here is the generated data that we will use.


As you can see, there are three distinct change points in the series. There are 400 data points.

The Problem


We want to find a set of line segments that fits the data. The objective function will be the mean squared error (MSE).

\[\frac{1}{n}\sum_{i = 1}^{n} {\left( \hat{y}_i - y_i \right)^2} + \lambda K\]

${\hat y_i}$ is the value of y estimated by the linear model, ${y_i}$ is data element $i$, and $n$ is the number of data points. $\lambda$ is an optional penalty on the number of segments, and $K$ is the number of change points (breaks). In practice, we will usually set $\lambda = 0$.
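The loss function is called inside evolve below but never shown in this post. Here is a minimal sketch consistent with the formula above; it assumes the decode_genome and piecewise_predict helpers reconstructed later in the post, and the author's actual implementation may differ.

import numpy as np

def loss(g: np.ndarray, x: np.ndarray, y: np.ndarray, K: int, lam: float) -> float:
    # MSE of the decoded piecewise linear model, plus an optional penalty on K.
    b, m, c0 = decode_genome(g, K, float(x.min()), float(x.max()))
    y_hat = piecewise_predict(x, b, m, c0)
    return float(np.mean((y_hat - y) ** 2)) + lam * K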

Our problem is to find the locations of the breaks, the slope of the linear model for each segment, and the intercept of the initial segment. We can calculate the intercepts of the other segments from the constraints.

The Population

We will represent the chromosomes, i.e. possible solutions to the optimization problem, as arrays of real numbers. For a solution with K change points (K+1 segments), the chromosome is:

\[\left[ break_0, break_1, \ldots, break_{K-1}, slope_0, slope_1, \ldots, slope_K, c_0 \right]\]

To start the GA process, we generate an array of pop_size random chromosomes, as sketched below.
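The post doesn't show init_population, which evolve() calls below. Here is one plausible sketch, assuming breakpoint genes are stored as sorted fractions of the x range (the mutate operator below keeps those genes in (0, 1)) and slopes are scaled by the slope_scale heuristic from evolve:

import numpy as np

def init_population(rng: np.random.Generator, pop_size: int, K: int,
                    slope_scale: float) -> np.ndarray:
    breaks = np.sort(rng.uniform(0, 1, size=(pop_size, K)), axis=1)  # normalized breakpoints
    slopes = rng.normal(0, slope_scale, size=(pop_size, K + 1))      # one slope per segment
    c0 = rng.normal(0, 1, size=(pop_size, 1))                        # initial intercept
    return np.concatenate([breaks, slopes, c0], axis=1)              # shape (pop_size, 2K + 2)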

Evolution

The evolutionary process follows the pseudocode shown above. A model, stored as a dictionary, is returned at the end of the evolution.

def evolve(x: np.ndarray, y: np.ndarray, 
           rng: np.random.Generator,
           K: int = 3,
           lam: float = 0.0,
           pop_size: int = 200, 
           generations: int = 400, 
           elite: int = 5,
           eta: float = 100) -> dict:
    """Evolve a population of chromosomes

    Parameters
    ----------
    x : np.ndarray
        sequence of x values
    y : np.ndarray
        the sequence of y values
    rng : np.random.Generator
        the random number generator
    K : int, optional
        the number of change points (breaks), by default 3
    lam : float, optional
        penalty on the number of breakpoints, by default 0.0
    pop_size : int, optional
        population size, by default 200
    generations : int, optional
        number of generations to evolve, by default 400
    elite : int, optional
        the number of top chromosomes to carry to the next generation, by default 5
    eta : float, optional
        parameter for the SBX crossover, by default 100

    Returns
    -------
    dictionary with key-value pairs
            "breakpoints", 
            "slopes",
            "c0",
            "predict": a function for calculating the fitted values,
            "loss": mse loss (best fitness), 
            "BIC": Bayesian Information Criterion
    """
    slope_scale = float(np.std(y)) / max(float(np.std(x)), 1e-8)
    pop = init_population(rng, 
                          pop_size = pop_size,
                          K = K,
                          slope_scale = slope_scale)
    fitness = np.array([loss(g, x, y, K, lam) for g in pop])
    
    for _ in range(generations):
        # Elitism
        elite_idx = np.argsort(fitness)[:elite]
        new_pop = [pop[i].copy() for i in elite_idx]

        # Tournament selection
        def tournament():
            i, j = rng.integers(0, pop_size, 2)
            return pop[i] if fitness[i] < fitness[j] else pop[j]

        # Create offspring
        while len(new_pop) < pop_size:
            p1, p2 = tournament(), tournament()
            c1, c2 = crossover(p1, p2, rng, eta = eta)
            c1 = mutate(c1, rng, K=K)
            c2 = mutate(c2, rng, K=K)
            new_pop.extend([c1, c2])

        pop = np.array(new_pop[:pop_size])
        fitness = np.array([loss(g, x, y, K, lam) for g in pop])
        
        # print(fitness[np.argmin(fitness)], pop[np.argmin(fitness)])

    best = pop[np.argmin(fitness)]
    x_min, x_max = x.min(), x.max()
    b, m, c0 = decode_genome(best, K, x_min, x_max)
    mse = float(np.min(fitness))
    rss = mse * len(y)
    bic = len(y) * np.log(float(rss)/len(y)) + (2 * K + 2) * np.log(len(y))
    
    return {"breakpoints": b, 
            "slopes": m, 
            "c0": c0, 
            "predict": lambda X: piecewise_predict(np.asarray(X), b, m, c0),
            "loss": mse, 
            "BIC": bic}

predict is a reference to a function that calculates the fitted piecewise linear values. decode_genome unpacks a chromosome into its breakpoint, slope, and intercept components.
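Neither helper appears in the post. The sketches below are my reconstruction, consistent with how evolve uses them: breakpoint genes are treated as fractions of [x_min, x_max], and prediction carries the intercept forward segment by segment, exactly as the data generator does.

import numpy as np

def decode_genome(g: np.ndarray, K: int, x_min: float, x_max: float
                  ) -> tuple[np.ndarray, np.ndarray, float]:
    # Map the normalized breakpoint genes into x units; unpack slopes and c0.
    b = x_min + (x_max - x_min) * np.sort(g[:K])
    m = g[K:2*K + 1]
    c0 = float(g[-1])
    return b, m, c0

def piecewise_predict(x: np.ndarray, b: np.ndarray, m: np.ndarray, c0: float) -> np.ndarray:
    # Evaluate the continuous piecewise linear model segment by segment,
    # carrying the intercept forward with the continuity constraint.
    # Assumes x spans the same range the model was fit on.
    y = np.empty_like(x, dtype=float)
    seg_start = x.min()
    c = c0
    for j, seg_end in enumerate(np.append(b, x.max())):
        idx = (x >= seg_start) & (x <= seg_end)
        y[idx] = m[j] * (x[idx] - seg_start) + c
        c += m[j] * (seg_end - seg_start)  # intercept for the next segment
        seg_start = seg_end
    return y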

Crossover

The idea behind crossover is to exchange pieces of the chromosome in the hope that good parts of the two parent chromosomes will be combined in the children to improve the objective function. 

Crossover for bitmap-encoded chromosomes is simple: sample a point in the bitmap and swap the corresponding pieces to produce two new children, as in the sketch below. Encoding the chromosome as real numbers is convenient in this case, but it makes crossover a bit more difficult.
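For comparison, one-point crossover on a bit string is just a slice and swap (a minimal sketch, not used in this post's GA):

import numpy as np

def one_point_crossover(p1: np.ndarray, p2: np.ndarray,
                        rng: np.random.Generator) -> tuple[np.ndarray, np.ndarray]:
    # Pick a cut point and swap the tails of two bit-string parents.
    cut = int(rng.integers(1, len(p1)))
    c1 = np.concatenate([p1[:cut], p2[cut:]])
    c2 = np.concatenate([p2[:cut], p1[cut:]])
    return c1, c2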

Instead of a simple swap of corresponding pieces of a bit string, we will use Simulated Binary Crossover (SBX). The SBX operator produces offspring ${c_1},{c_2}$ similar to the parents ${p_1},{p_2}$, but still allows some diversity. We want two children that are symmetric around the midpoint of the parents, with some spread to provide diversity in the next generation.

To perform SBX, we generate a random number $u \sim U(0,1)$ and define a spread factor:

\[\beta = \begin{cases} {(2u)^{\frac{1}{\eta_c + 1}}} & \text{if } u < 0.5 \\ {\left( \frac{1}{2(1 - u)} \right)^{\frac{1}{\eta_c + 1}}} & \text{if } u \geqslant 0.5 \end{cases}\]

${\eta _c}$ is called the distribution index. 

We compute the children as:

\[{c_1} = 0.5\left[ {(1 + \beta ){p_1} + (1 - \beta ){p_2}} \right]\]
\[{c_2} = 0.5\left[ {(1 - \beta ){p_1} + (1 + \beta ){p_2}} \right]\]

Note that 
\[\frac{{{c_1} + {c_2}}}{2} = \frac{{{p_1} + {p_2}}}{2}\]

Large ${\eta _c}$ tends to generate children close to the parents. Smaller ${\eta _c}$ moves the children further from the parents. 

The following plot shows the effect of changing ${\eta _c}$.


Here's the Python code.


def crossover(p1: np.ndarray, p2: np.ndarray, 
              rng: np.random.Generator, 
              eta: float = 10.0, 
              prob: float = 0.8) -> tuple[np.ndarray, np.ndarray]:
    """SBX crossover

    Parameters
    ----------
    p1 : np.ndarray
        a chromosome
    p2 : np.ndarray
        a chromosome
    rng : np.random.Generator
        random number generator
    eta : float, optional
        distribution index, by default 10.0
    prob : float, optional
        crossover probability, by default 0.8

    Returns
    -------
    tuple[np.ndarray, np.ndarray]
        a pair of chromosomes, either the original pair or the new children
    """
    if rng.random() > prob:
        return p1.copy(), p2.copy()
    c1, c2 = p1.copy(), p2.copy()
    for i in range(len(p1)):
        u = rng.random()
        if u <= 0.5:
            beta = (2*u)**(1/(eta+1))
        else:
            beta = (1/(2*(1-u)))**(1/(eta+1))
        c1[i] = 0.5*((1+beta)*p1[i] + (1-beta)*p2[i])
        c2[i] = 0.5*((1-beta)*p1[i] + (1+beta)*p2[i])
    return c1, c2

Mutation


We will use a low probability of mutation. The mutations add Gaussian noise to the slopes and the intercept; breakpoint genes get a small Gaussian jitter and are reflected at the boundaries to stay in (0, 1).

def mutate(g: np.ndarray, 
           rng: np.random.Generator, 
           sigma_knots: float = 0.05, 
           sigma_slopes: float = 0.1, 
           sigma_c0: float = 0.1, 
           p: float = 0.2, 
           K: int = 1) -> np.ndarray:
    """Gaussian mutation.

    Parameters
    ----------
    g : np.ndarray
        chromosome
    rng : np.random.Generator
        random number generator
    sigma_knots : float, optional
        st dev of Gaussian jitter for breakpoints, by default 0.05
    sigma_slopes : float, optional
        st dev of Gaussian for slopes, by default 0.1
    sigma_c0 : float, optional
        st dev of Gaussian for c0, by default 0.1
    p : float, optional
        probability of a mutation at each gene, by default 0.2
    K : int, optional
        number of breakpoints, by default 1

    Returns
    -------
    np.ndarray
        a new chromosome
    """
    out = g.copy()
    for i in range(len(g)):
        if rng.random() < p:
            if i < K:  # breakpoint gene
                val = out[i] + rng.normal(0, sigma_knots)
                # reflect to keep in (0,1)
                if val < 0: val = -val
                if val > 1: val = 2 - val
                out[i] = np.clip(val, 0, 1)
            elif i < K + (K+1):  # slope gene
                out[i] += rng.normal(0, sigma_slopes)
            else:  # c0
                out[i] += rng.normal(0, sigma_c0)
    return out

Getting Stuck


GAs are heuristics and are not guaranteed to find a global optimum; they can get stuck in suboptimal solutions. One workaround is to run the GA several times and report the chromosome with the best fitness across all of the restarts. This approach is easy to run in parallel, since the individual runs do not need to communicate with one another.
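Here is one way the restarts might be parallelized; a minimal sketch with multiprocessing. run_one and its seeding scheme are mine, not from the post's code. Note that the predict lambda in the returned model has to be dropped because lambdas can't be pickled across processes.

import numpy as np
from multiprocessing import Pool

def run_one(seed: int) -> dict:
    # One independent GA run with its own generator. x and y are module-level
    # globals here only to keep the sketch short.
    rng = np.random.default_rng(seed)
    model = evolve(x, y, rng, K=3)
    model.pop("predict")  # lambdas aren't picklable across processes
    return model

if __name__ == "__main__":
    with Pool() as pool:
        models = pool.map(run_one, range(5))     # 5 restarts in parallel
    best = min(models, key=lambda m: m["loss"])  # report the best fitness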

How Many Change Points?


The analysis and code above assume that we know how many change points (the parameter K in the code) there are in the timeseries. In our toy data, it's obvious that there are three change points; we fixed the number of change points when we generated the data.

One simple approach to finding K is to run separate GA instances with varying values of K. Again, this is amenable to being run in parallel.

We can't just assume that the solution with the best fitness across runs with different values of K is the best solution. Increasing K increases the degrees of freedom, and we need to take that effect into account in order to compare solutions.

One approach is to calculate the Bayesian Information Criterion (BIC) for each solution and choose the solution with the minimum BIC. The BIC takes the change in the number of degrees of freedom into account. 

In our case, the BIC is defined as

\[RSS = n \cdot MSE\]
\[BIC = n\log \left( \frac{RSS}{n} \right) + p\log (n)\]

n is the length of the timeseries, and p is the number of free parameters. Here p = 2K + 2 (K breakpoints, K + 1 slopes, and the initial intercept), matching the calculation in evolve. Since RSS/n = MSE, the first term is just n log(MSE).
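Pulled out as a standalone helper (the function name is mine), this is the same bookkeeping evolve performs:

import numpy as np

def bic_from_mse(mse: float, n: int, K: int) -> float:
    # RSS = n * MSE, and p = 2K + 2 free parameters
    # (K breakpoints, K + 1 slopes, and c0), as in evolve() above.
    rss = mse * n
    return n * np.log(rss / n) + (2 * K + 2) * np.log(n)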


Results

I ran the GA with K values from 0 through 7 with 5 restarts at each K value. On the toy problem, it nailed the change points.




$ python src/ga_cp_bic.py /mnt/c/Users/bbth/OneDrive/analytic_garden/GA/data/test2.csv -R 5 -m 7

Num Breakpoints: 0
Restart: 1 MSE: 0.768200131276766 BIC: -93.49906777840822
Restart: 2 MSE: 0.768200131276766 BIC: -93.49906777840822
Restart: 3 MSE: 0.768200131276766 BIC: -93.49906777840822
Restart: 4 MSE: 0.768200131276766 BIC: -93.49906777840822
Restart: 5 MSE: 0.768200131276766 BIC: -93.49906777840822

Train MSE: 0.768200131276766
Random number seed: 1757530319
Breaks: []
Slopes: [0.87498511]
Initial slope: -1.062104184879057
BIC: -93.49906777840822

Num Breakpoints: 1
Restart: 1 MSE: 0.26713128422814697 BIC: -504.0401578908922
Restart: 2 MSE: 0.26713135999662774 BIC: -504.0400444358617
Restart: 3 MSE: 0.2671315303845627 BIC: -504.0397892986428
Restart: 4 MSE: 0.26713128688267707 BIC: -504.0401539160226
Restart: 5 MSE: 0.2671313067687818 BIC: -504.04012413874835

Train MSE: 0.26713128422814697
Random number seed: 1757530319
Breaks: [3.03255148]
Slopes: [-0.10502121  1.15256772]
Initial slope: 0.7821082655597978
BIC: -504.0401578908922

Num Breakpoints: 2
Restart: 1 MSE: 0.07598078790615184 BIC: -994.9611172618687
Restart: 2 MSE: 0.07603601788638055 BIC: -994.6704652783811
Restart: 3 MSE: 0.07604884651200713 BIC: -994.6029838720873
Restart: 4 MSE: 0.07597941263212685 BIC: -994.9683574419693
Restart: 5 MSE: 0.07601333506921201 BIC: -994.7898097774975

Train MSE: 0.07597941263212685
Random number seed: 1757530319
Breaks: [3.34024974 7.51526035]
Slopes: [-0.18064767  1.49689334  0.30740931]
Initial slope: 0.861732960505384
BIC: -994.9683574419693

Num Breakpoints: 3
Restart: 1 MSE: 0.010960306302913442 BIC: -1757.4583038258043
Restart: 2 MSE: 0.07604734720171356 BIC: -982.6279408939248
Restart: 3 MSE: 0.07596282888136476 BIC: -983.0727444260882
Restart: 4 MSE: 0.010562970908833194 BIC: -1772.2285653062725
Restart: 5 MSE: 0.010432261329197197 BIC: -1777.209172976474

Train MSE: 0.010432261329197197
Random number seed: 1757530319
Breaks: [1.33971739 3.00878892 7.51557945]
Slopes: [ 1.12909401 -1.00764137  1.49937372  0.30578179]
Initial slope: -0.07847355995515726
BIC: -1777.209172976474

Num Breakpoints: 4
Restart: 1 MSE: 0.07622839759249472 BIC: -969.6938400876858
Restart: 2 MSE: 0.011545398555128799 BIC: -1724.6726804846996
Restart: 3 MSE: 0.010440330102803444 BIC: -1764.9169857229363
Restart: 4 MSE: 0.07604291515918285 BIC: -970.6683244955417
Restart: 5 MSE: 0.07601372148631955 BIC: -970.82191817677

Train MSE: 0.010440330102803444
Random number seed: 1757530319
Breaks: [0.80667405 1.33855579 3.05499137 7.48528168]
Slopes: [ 0.93920416  1.18604746 -0.94027751  1.51162942  0.31542384]
Initial slope: 0.013945172373571815
BIC: -1764.9169857229363

Num Breakpoints: 5
Restart: 1 MSE: 0.010644732063791634 BIC: -1745.178485874649
Restart: 2 MSE: 0.012666979023629087 BIC: -1675.6051248217907
Restart: 3 MSE: 0.011837283458410761 BIC: -1702.702870828842
Restart: 4 MSE: 0.010654131471211147 BIC: -1744.8254376281354
Restart: 5 MSE: 0.01087133533109335 BIC: -1736.7527213735639

Train MSE: 0.010644732063791634
Random number seed: 1757530319
Breaks: [1.33809471 2.99997902 5.39033556 7.10730355 7.50918906]
Slopes: [ 1.15367754 -1.01715291  1.50646275  1.46105877  1.61661661  0.3051271 ]
Initial slope: -0.11025318081226192
BIC: -1745.178485874649

Num Breakpoints: 6
Restart: 1 MSE: 0.011239317804984892 BIC: -1711.4543482865681
Restart: 2 MSE: 0.013798197447378584 BIC: -1629.4064223827509
Restart: 3 MSE: 0.0134995356186553 BIC: -1638.1594934394984
Restart: 4 MSE: 0.019203054716150737 BIC: -1497.193861462054
Restart: 5 MSE: 0.011973759948817077 BIC: -1686.1345740932345

Train MSE: 0.011239317804984892
Random number seed: 1757530319
Breaks: [1.1595025  1.46699623 3.0337865  7.62772597 8.77150888 9.86875214]
Slopes: [ 0.97415186  0.58236226 -0.97489684  1.49773889  0.13340346  0.42822295
 -0.27696669]
Initial slope: -0.005901498216313833
BIC: -1711.4543482865681

Num Breakpoints: 7
Restart: 1 MSE: 0.012618147112265589 BIC: -1653.1842689184527
Restart: 2 MSE: 0.015023512816917873 BIC: -1583.3920808586547
Restart: 3 MSE: 0.01117985689526189 BIC: -1701.5932118075225
Restart: 4 MSE: 0.016076479643951244 BIC: -1556.2957539720474
Restart: 5 MSE: 0.015254660080210364 BIC: -1577.2846646775888

Train MSE: 0.01117985689526189
Random number seed: 1757530319
Breaks: [1.31181149 2.40743245 2.67010934 3.0885786  3.33149654 7.12396452
 7.68387703]
Slopes: [ 1.14648308 -0.93700508 -0.90226644 -0.61927955  1.38019676  1.51134006
  1.14414582  0.2976773 ]
Initial slope: -0.11001194593707966
BIC: -1701.5932118075225

====================================
Best Model
Train MSE: 0.010432261329197197
Random number seed: 1757530319
Breaks: [1.33971739 3.00878892 7.51557945]
Slopes: [ 1.12909401 -1.00764137  1.49937372  0.30578179]
Initial slope: -0.07847355995515726
BIC: -1777.209172976474
$

The code and data are available on GitHub.
