Image: Chromosome structure (https://en.wikipedia.org/wiki/Chromosome#/media/File:Chromosome_structure.png)
I have been interested in evolutionary computation (EC) for quite a while. Evolutionary computation is a family of algorithms for global optimization inspired by biological evolution. Genetic algorithms (GAs) are a form of EC commonly used to solve optimization problems; the method is inspired by natural selection.
In a genetic algorithm, a population of candidate solutions to the optimization problem is encoded as chromosomes, by analogy with biological chromosomes. Each chromosome is an encoding of the parameters needed to generate a possible value of the function being optimized. The encoding might be a bitmap, a vector, program code, or any representation that can be used to calculate the objective function. For each chromosome, a fitness or score is calculated; the score is the value of the objective function for that chromosome. The full evaluation of the set of chromosomes is called a generation. A selection method chooses candidate chromosomes to form the next generation. A crossover operation is applied to pairs of chromosomes to form new child chromosomes. A mutation operator can be applied to the new population to provide more variation in the solutions generated. The process is repeated for a fixed number of generations or until the objective function stops changing.
Genetic algorithms are not algorithms. They are heuristics. For any given problem, a representation must be chosen and mutation and crossover operations defined. Defining these can be the difficult part of applying a GA to a specific problem. In addition, several hyperparameters must be set, including the population size, the number of generations for the process to run, and, depending on the encoding method, other possible parameters.
GAs can be useful for problems with a complex fitness landscape, because mutation and crossover can move the population away from local optima where a traditional algorithm might get stuck.
Change Point
The change point problem is to find the positions in a timeseries where the underlying probability distribution changes. I have looked at the change point problem previously, for example here. A genetic algorithm is not a good method for finding change points. If you really want to find the change points in a series, use something like the ruptures library for Python or one of the many R packages for change point detection. GAs tend to be more computationally expensive than methods designed specifically for change point detection.
Nevertheless, I wanted to try two things: play with a GA and see whether an LLM could generate satisfactory genetic algorithm code. I needed a test problem, and since I have had experience with change point detection, it seemed like a relatively easy problem to tackle.
In what follows, we will use the terms change point and breakpoint interchangeably.
Vibe Coding
Vibe coding is a development method that uses AI to generate code from natural language prompts. It usually involves attacking a problem with less planning than traditional software development methods. It can be very effective for smaller self-contained projects. It may be an inappropriate approach for complex multi-developer projects.
I asked ChatGPT 5 to generate Python code to find the change points in a timeseries using a GA. To my surprise, it generated working code. I made extensive changes to the original code, but the framework came from ChatGPT.
GA
Here is some pseudocode for a typical GA. We will need to fill in the details of the chromosomes, population size, operators, etc.
initialize population
gen = 0
best = None
best_fitness = -Inf
while gen < total_generations:
    calculate fitness for each chromosome
    new_pop = []
    while size(new_pop) < size of previous generation:
        select 2 chromosomes
        apply crossover operator
        apply mutation operator
        add the 2 new chromosomes to new_pop
    population = new_pop
    if best fitness in population > best_fitness:
        best_fitness = best fitness in population
        best = best chromosome in population
    gen = gen + 1
report best
Data
We will test the GA with a timeseries that has some obvious breakpoints. We will assume that the timeseries is piecewise linear with normally distributed noise.
\[{{\mathbf{y}}_i} = {c_i} + {m_i}{{\mathbf{x}}_i} + {\varepsilon _i}\]
$i$ is the segment number; ${c_i}$ and ${m_i}$ are the intercept and slope of the linear model for segment $i$. ${\varepsilon _i} \sim N(0,{\sigma ^2})$ is normally distributed noise.
If we simply assume that the model for each segment is linear, we could choose change points and then use linear regression to estimate the slopes and intercepts. If we also require that the piecewise linear fit be continuous at the breakpoints, we impose additional constraints on the intercepts.
Continuity at breakpoint ${b_j}$ requires ${c_{j - 1}} + {m_{j - 1}}{b_j} = {c_j} + {m_j}{b_j}$, so for segment $j > 0$:
\[{c_j} = {c_{j - 1}} + \left( {{m_{j - 1}} - {m_j}} \right){b_j}\]
These constraints make the optimization problem slightly more difficult.
Here's code to generate a time series subject to the constraints on the segment intercepts.
import time

import numpy as np

def generate_piecewise_linear(n: int = 200, K: int = 3, x_range: tuple[float, float] = (0, 10),
                              slopes: np.ndarray | None = None,
                              intercept: float = 0.0,
                              noise_std: float = 0.1,
                              random_state: int | None = None) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """
    Generate synthetic piecewise linear data with K segments.
    Partially generated by ChatGPT 5.

    Parameters
    ----------
    n : int
        Number of x points.
    K : int
        Number of breakpoints (=> K+1 segments).
    x_range : tuple
        Range of x values (min, max).
    slopes : list or None
        Slopes for each segment. If None, random slopes are drawn.
    intercept : float
        Starting intercept (at x = x_range[0]).
    noise_std : float
        Standard deviation of Gaussian noise.
    random_state : int or None
        Random seed. If None, the current time is used.

    Returns
    -------
    x : ndarray
        Sorted x values.
    y_true : ndarray
        True (noise-free) piecewise linear values.
    y : ndarray
        Noisy observations.
    breakpoints : ndarray
        Breakpoint x-locations (excluding endpoints).
    """
    seed = random_state
    if seed is None:
        seed = int(time.time())
    rng = np.random.default_rng(seed)

    x = np.linspace(x_range[0], x_range[1], n)

    # choose K breakpoints randomly within range
    breaks = np.sort(rng.uniform(x_range[0], x_range[1], size=K))

    # slopes
    if slopes is None:
        slopes = rng.uniform(-1.5, 1.5, size=K+1)

    # compute true y values segment by segment
    y_true = np.empty_like(x)
    current_intercept = intercept
    seg_start = x_range[0]
    for j in range(K+1):
        if j < K:
            seg_end = breaks[j]
        else:
            seg_end = x_range[1]
        idx = (x >= seg_start) & (x <= seg_end)
        y_true[idx] = slopes[j] * (x[idx] - seg_start) + current_intercept
        # update intercept for continuity
        current_intercept = slopes[j] * (seg_end - seg_start) + current_intercept
        seg_start = seg_end

    # add Gaussian noise
    y = y_true + rng.normal(0, noise_std, size=n)

    print("Breakpoints:", breaks)
    print("Slopes:", slopes)
    print("c0:", intercept)

    return x, y_true, y, breaks
Here is the generated data that we will use.
As you can see, there are three distinct change points in the series. There are 400 data points.
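A call along the following lines would produce such a series; the seed and noise level here are illustrative, not the values used for the post's data.

x, y_true, y, breaks = generate_piecewise_linear(n=400, K=3,
                                                 noise_std=0.1,
                                                 random_state=42)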
The Problem
We will want to find a set of line segments that fits the data. The objective function will be the mean squared error (MSE).
\[\frac{1}{n}\sum\limits_i {{{\left( {{{\widehat y}_i} - {y_i}} \right)}^2}} + \lambda K\]
${{{\widehat y}_i}}$ is the estimated value of y from the linear model. ${{y_i}}$ is data element i. n is the number of data points. $\lambda $ is an optional penalty on the number of segments. K is the number of change points (breaks). In practice, we will usually set $\lambda = 0$.
Our problem is to find the locations of the breaks, the slopes of the linear model for each segment, and the intercept ${c_0}$ of the initial segment. We can calculate the intercepts for the other segments from the continuity constraints.
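The loss function itself is not shown in the listings below. Here is a minimal sketch of what it computes, assuming the decode_genome and piecewise_predict helpers sketched later in the post:

def loss(g: np.ndarray, x: np.ndarray, y: np.ndarray,
         K: int, lam: float) -> float:
    # Decode the chromosome, predict, and score with MSE + lambda * K.
    b, m, c0 = decode_genome(g, K, float(x.min()), float(x.max()))
    y_hat = piecewise_predict(x, b, m, c0)
    return float(np.mean((y_hat - y) ** 2)) + lam * K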
The Population
We will represent the chromosomes, i.e., the possible solutions to the optimization problem, as arrays of real numbers. For a solution with K change points (K+1 segments), the chromosome is:
\[\left[ \mathrm{break}_0, \mathrm{break}_1, \ldots, \mathrm{break}_{K - 1}, \mathrm{slope}_0, \mathrm{slope}_1, \ldots, \mathrm{slope}_K, c_0 \right]\]
To start the GA process, we generate an array of pop_size random chromosomes.
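init_population is not shown in the listings. Here is a minimal sketch of what it might look like, assuming the breakpoint genes are kept normalized in (0, 1) (the mutation operator below reflects them back into that interval) and the slopes are drawn at the scale of the data:

def init_population(rng: np.random.Generator,
                    pop_size: int,
                    K: int,
                    slope_scale: float) -> np.ndarray:
    # Each row is a chromosome: [K breakpoints in (0,1), K+1 slopes, c0].
    breaks = rng.uniform(0.0, 1.0, size=(pop_size, K))
    slopes = rng.normal(0.0, slope_scale, size=(pop_size, K + 1))
    c0 = rng.normal(0.0, 1.0, size=(pop_size, 1))
    return np.concatenate([breaks, slopes, c0], axis=1)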
Evolution
The evolutionary process follows the pseudocode shown above. A model, stored as a dictionary, is returned at the end of the evolution.
def evolve(x: np.ndarray, y: np.ndarray,
           rng: np.random.Generator,
           K: int = 3,
           lam: float = 0.0,
           pop_size: int = 200,
           generations: int = 400,
           elite: int = 5,
           eta: float = 100) -> dict:
    """Evolve a population of chromosomes.

    Parameters
    ----------
    x : np.ndarray
        sequence of x values
    y : np.ndarray
        sequence of y values
    rng : np.random.Generator
        the random number generator
    K : int, optional
        the number of change points (breaks), by default 3
    lam : float, optional
        penalty on the number of breakpoints, by default 0.0
    pop_size : int, optional
        population size, by default 200
    generations : int, optional
        number of generations to evolve, by default 400
    elite : int, optional
        the number of top chromosomes to carry to the next generation, by default 5
    eta : float, optional
        parameter for the SBX crossover, by default 100

    Returns
    -------
    dictionary with key-value pairs
        "breakpoints",
        "slopes",
        "c0",
        "predict": a function for calculating the fitted piecewise linear model,
        "loss": mse loss (best fitness),
        "BIC": Bayesian Information Criterion
    """
    slope_scale = float(np.std(y)) / max(float(np.std(x)), 1e-8)
    pop = init_population(rng,
                          pop_size=pop_size,
                          K=K,
                          slope_scale=slope_scale)
    fitness = np.array([loss(g, x, y, K, lam) for g in pop])

    for _ in range(generations):
        # Elitism: carry the best chromosomes over unchanged
        elite_idx = np.argsort(fitness)[:elite]
        new_pop = [pop[i].copy() for i in elite_idx]

        # Tournament selection: the fitter of two random chromosomes wins
        def tournament():
            i, j = rng.integers(0, pop_size, 2)
            return pop[i] if fitness[i] < fitness[j] else pop[j]

        # Create offspring
        while len(new_pop) < pop_size:
            p1, p2 = tournament(), tournament()
            c1, c2 = crossover(p1, p2, rng, eta=eta)
            c1 = mutate(c1, rng, K=K)
            c2 = mutate(c2, rng, K=K)
            new_pop.extend([c1, c2])

        pop = np.array(new_pop[:pop_size])
        fitness = np.array([loss(g, x, y, K, lam) for g in pop])

    best = pop[np.argmin(fitness)]
    x_min, x_max = x.min(), x.max()
    b, m, c0 = decode_genome(best, K, x_min, x_max)
    mse = float(np.min(fitness))
    rss = mse * len(y)
    bic = len(y) * np.log(float(rss) / len(y)) + (2 * K + 2) * np.log(len(y))
    return {"breakpoints": b,
            "slopes": m,
            "c0": c0,
            "predict": lambda X: piecewise_predict(np.asarray(X), b, m, c0),
            "loss": mse,
            "BIC": bic}
predict is a reference to a function that calculates the predicted piecewise linear fit. decode_genome unpacks a chromosome into its breakpoint locations, slopes, and initial intercept.
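decode_genome and piecewise_predict do not appear in the listings above. Here is a minimal sketch of both, assuming the breakpoint genes are stored normalized in (0, 1) and each segment's intercept is its value at the segment's left edge, matching the data generator:

def decode_genome(g: np.ndarray, K: int,
                  x_min: float, x_max: float) -> tuple[np.ndarray, np.ndarray, float]:
    # Split the chromosome and map normalized breakpoints back to x units.
    b = x_min + (x_max - x_min) * np.sort(g[:K])
    m = g[K:2*K + 1]     # K+1 slopes
    c0 = float(g[-1])    # intercept of the first segment
    return b, m, c0

def piecewise_predict(x: np.ndarray, b: np.ndarray,
                      m: np.ndarray, c0: float) -> np.ndarray:
    # Evaluate the continuous piecewise linear model segment by segment.
    y = np.empty_like(x, dtype=float)
    seg_start, c = x.min(), c0
    for j, seg_end in enumerate(np.concatenate([b, [x.max()]])):
        idx = (x >= seg_start) & (x <= seg_end)
        y[idx] = m[j] * (x[idx] - seg_start) + c
        c = m[j] * (seg_end - seg_start) + c  # continuity at the breakpoint
        seg_start = seg_end
    return y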
Crossover
The idea behind crossover is to exchange pieces of the chromosome in the hope that good parts of the two parent chromosomes will be combined in the children to improve the objective function.
Crossover for bitmap encoded chromosomes is simple; sample a point in the bitmap and swap the corresponding pieces to produce two new children. Encoding the chromosome as real numbers is convenient in this case, but makes crossover a bit more difficult.
Instead of a simple swap of corresponding pieces of a bitmap list, we will use Simulated Binary Crossover (SBX). The SBX operator produces offspring ${c_1},{c_2}$ similar to the parents ${p_1},{p_2}$, but still allows some diversity. We want two children that are symmetric around the midpoint of the parents, with some spread to provide diversity in the next generation.
To perform SBX, we generate a random number $u \sim U(0,1)$ and define a spread factor:
\[\beta = \begin{cases} {\left( {2u} \right)^{\frac{1}{{{\eta _c} + 1}}}} & \text{if } u < 0.5 \\ {\left( {\frac{1}{{2(1 - u)}}} \right)^{\frac{1}{{{\eta _c} + 1}}}} & \text{if } u \geqslant 0.5 \end{cases}\]
${\eta _c}$ is called the distribution index.
We compute the children as:
\[{c_1} = 0.5\left[ {(1 + \beta ){p_1} + (1 - \beta ){p_2}} \right]\]
\[{c_2} = 0.5\left[ {(1 - \beta ){p_1} + (1 + \beta ){p_2}} \right]\]
Note that
\[\frac{{{c_1} + {c_2}}}{2} = \frac{{{p_1} + {p_2}}}{2}\]
Large ${\eta _c}$ tends to generate children close to the parents. Smaller ${\eta _c}$ moves the children further from the parents.
The following plot shows the effect of changing ${\eta _c}$.
Here's the Python code.
def crossover(p1: np.ndarray, p2: np.ndarray,
              rng: np.random.Generator,
              eta: float = 10.0,
              prob: float = 0.8) -> tuple[np.ndarray, np.ndarray]:
    """SBX crossover.

    Parameters
    ----------
    p1 : np.ndarray
        a chromosome
    p2 : np.ndarray
        a chromosome
    rng : np.random.Generator
        random number generator
    eta : float, optional
        distribution index, by default 10.0
    prob : float, optional
        crossover probability, by default 0.8

    Returns
    -------
    tuple[np.ndarray, np.ndarray]
        a pair of chromosomes, either the original pair or the new children
    """
    if rng.random() > prob:
        return p1.copy(), p2.copy()
    c1, c2 = p1.copy(), p2.copy()
    for i in range(len(p1)):
        u = rng.random()
        # spread factor: large eta keeps children close to the parents
        if u <= 0.5:
            beta = (2*u)**(1/(eta+1))
        else:
            beta = (1/(2*(1-u)))**(1/(eta+1))
        c1[i] = 0.5*((1+beta)*p1[i] + (1-beta)*p2[i])
        c2[i] = 0.5*((1-beta)*p1[i] + (1+beta)*p2[i])
    return c1, c2
Mutation
We will use a low probability of mutation. The mutations add Gaussian noise to the slopes and the intercept; the breakpoints get a small Gaussian jitter and are reflected at the boundaries to keep them in (0, 1).
def mutate(g: np.ndarray,
           rng: np.random.Generator,
           sigma_knots: float = 0.05,
           sigma_slopes: float = 0.1,
           sigma_c0: float = 0.1,
           p: float = 0.2,
           K: int = 1) -> np.ndarray:
    """Gaussian mutation.

    Parameters
    ----------
    g : np.ndarray
        chromosome
    rng : np.random.Generator
        random number generator
    sigma_knots : float, optional
        std dev of the Gaussian noise for breakpoints, by default 0.05
    sigma_slopes : float, optional
        std dev of the Gaussian noise for slopes, by default 0.1
    sigma_c0 : float, optional
        std dev of the Gaussian noise for c0, by default 0.1
    p : float, optional
        probability of a mutation, by default 0.2
    K : int, optional
        number of breakpoints, by default 1

    Returns
    -------
    np.ndarray
        a new chromosome
    """
    out = g.copy()
    for i in range(len(g)):
        if rng.random() < p:
            if i < K:  # breakpoint gene
                val = out[i] + rng.normal(0, sigma_knots)
                # reflect to keep in (0, 1)
                if val < 0:
                    val = -val
                if val > 1:
                    val = 2 - val
                out[i] = np.clip(val, 0, 1)
            elif i < K + (K+1):  # slope gene
                out[i] += rng.normal(0, sigma_slopes)
            else:  # c0
                out[i] += rng.normal(0, sigma_c0)
    return out
Getting Stuck
GAs are heuristics and are not guaranteed to find a global optimum. They can get stuck in suboptimal solutions. One way to work around this problem is to run the GA several times and report the chromosome with the best fitness across all of the restarts. This approach is easy to run in parallel, since the individual runs do not need to communicate with one another.
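Here is a sketch of how the restarts might be run in parallel; the function names one_restart and best_of_restarts are illustrative, not from the post's code:

from concurrent.futures import ProcessPoolExecutor

import numpy as np

def one_restart(seed: int, x: np.ndarray, y: np.ndarray, K: int) -> dict:
    # Each restart gets its own independent random generator.
    return evolve(x, y, np.random.default_rng(seed), K=K)

def best_of_restarts(x: np.ndarray, y: np.ndarray, K: int,
                     restarts: int = 5, base_seed: int = 0) -> dict:
    seeds = [base_seed + r for r in range(restarts)]
    with ProcessPoolExecutor() as ex:
        models = list(ex.map(one_restart, seeds,
                             [x] * restarts, [y] * restarts, [K] * restarts))
    return min(models, key=lambda m: m["loss"])  # lower MSE is better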
How Many Change Points?
The analysis and code above assume that we know how many change points there are in the timeseries (the parameter K in the code). In our toy data, it's obvious that there are three change points. We fixed the number of change points when we generated the data.
One simple approach to finding K is to run separate GA instances with varying values of K. Again, this is amenable to being run in parallel.
We can't just assume that the solution with the best fitness across runs with different values of K is the best solution. Increasing K increases the degrees of freedom. We need to take that effect into consideration in order to compare solutions.
One approach is to calculate the Bayesian Information Criterion (BIC) for each solution and choose the solution with the minimum BIC. The BIC takes the change in the number of degrees of freedom into account.
In our case, the BIC is defined as
\[RSS = n \cdot MSE\]
\[BIC = n\log \left( {\frac{{RSS}}{n}} \right) + p\log (n)\]
n is the length of the timeseries, and p is the number of free parameters. With K breakpoints, K+1 slopes, and one intercept, p = 2K + 2, which is what the code uses.
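Putting it together, selecting K by minimum BIC might look like the following sketch (illustrative only; the actual script also does several restarts at each K):

# Run the GA for each candidate K and keep the model with the lowest BIC.
models = {K: evolve(x, y, np.random.default_rng(K), K=K) for K in range(8)}
best_K = min(models, key=lambda K: models[K]["BIC"])
best_model = models[best_K]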
Results
I ran the GA with K values from 0 through 7 with 5 restarts at each K value. On the toy problem, it nailed the change points.
$ python src/ga_cp_bic.py /mnt/c/Users/bbth/OneDrive/analytic_garden/GA/data/test2.csv -R 5 -m 7
Num Breakpoints: 0
Restart: 1 MSE: 0.768200131276766 BIC: -93.49906777840822
Restart: 2 MSE: 0.768200131276766 BIC: -93.49906777840822
Restart: 3 MSE: 0.768200131276766 BIC: -93.49906777840822
Restart: 4 MSE: 0.768200131276766 BIC: -93.49906777840822
Restart: 5 MSE: 0.768200131276766 BIC: -93.49906777840822
Train MSE: 0.768200131276766
Random number seed: 1757530319
Breaks: []
Slopes: [0.87498511]
Initial slope: -1.062104184879057
BIC: -93.49906777840822
Num Breakpoints: 1
Restart: 1 MSE: 0.26713128422814697 BIC: -504.0401578908922
Restart: 2 MSE: 0.26713135999662774 BIC: -504.0400444358617
Restart: 3 MSE: 0.2671315303845627 BIC: -504.0397892986428
Restart: 4 MSE: 0.26713128688267707 BIC: -504.0401539160226
Restart: 5 MSE: 0.2671313067687818 BIC: -504.04012413874835
Train MSE: 0.26713128422814697
Random number seed: 1757530319
Breaks: [3.03255148]
Slopes: [-0.10502121 1.15256772]
Initial slope: 0.7821082655597978
BIC: -504.0401578908922
Num Breakpoints: 2
Restart: 1 MSE: 0.07598078790615184 BIC: -994.9611172618687
Restart: 2 MSE: 0.07603601788638055 BIC: -994.6704652783811
Restart: 3 MSE: 0.07604884651200713 BIC: -994.6029838720873
Restart: 4 MSE: 0.07597941263212685 BIC: -994.9683574419693
Restart: 5 MSE: 0.07601333506921201 BIC: -994.7898097774975
Train MSE: 0.07597941263212685
Random number seed: 1757530319
Breaks: [3.34024974 7.51526035]
Slopes: [-0.18064767 1.49689334 0.30740931]
Initial slope: 0.861732960505384
BIC: -994.9683574419693
Num Breakpoints: 3
Restart: 1 MSE: 0.010960306302913442 BIC: -1757.4583038258043
Restart: 2 MSE: 0.07604734720171356 BIC: -982.6279408939248
Restart: 3 MSE: 0.07596282888136476 BIC: -983.0727444260882
Restart: 4 MSE: 0.010562970908833194 BIC: -1772.2285653062725
Restart: 5 MSE: 0.010432261329197197 BIC: -1777.209172976474
Train MSE: 0.010432261329197197
Random number seed: 1757530319
Breaks: [1.33971739 3.00878892 7.51557945]
Slopes: [ 1.12909401 -1.00764137 1.49937372 0.30578179]
Initial slope: -0.07847355995515726
BIC: -1777.209172976474
Num Breakpoints: 4
Restart: 1 MSE: 0.07622839759249472 BIC: -969.6938400876858
Restart: 2 MSE: 0.011545398555128799 BIC: -1724.6726804846996
Restart: 3 MSE: 0.010440330102803444 BIC: -1764.9169857229363
Restart: 4 MSE: 0.07604291515918285 BIC: -970.6683244955417
Restart: 5 MSE: 0.07601372148631955 BIC: -970.82191817677
Train MSE: 0.010440330102803444
Random number seed: 1757530319
Breaks: [0.80667405 1.33855579 3.05499137 7.48528168]
Slopes: [ 0.93920416 1.18604746 -0.94027751 1.51162942 0.31542384]
Initial slope: 0.013945172373571815
BIC: -1764.9169857229363
Num Breakpoints: 5
Restart: 1 MSE: 0.010644732063791634 BIC: -1745.178485874649
Restart: 2 MSE: 0.012666979023629087 BIC: -1675.6051248217907
Restart: 3 MSE: 0.011837283458410761 BIC: -1702.702870828842
Restart: 4 MSE: 0.010654131471211147 BIC: -1744.8254376281354
Restart: 5 MSE: 0.01087133533109335 BIC: -1736.7527213735639
Train MSE: 0.010644732063791634
Random number seed: 1757530319
Breaks: [1.33809471 2.99997902 5.39033556 7.10730355 7.50918906]
Slopes: [ 1.15367754 -1.01715291 1.50646275 1.46105877 1.61661661 0.3051271 ]
Initial slope: -0.11025318081226192
BIC: -1745.178485874649
Num Breakpoints: 6
Restart: 1 MSE: 0.011239317804984892 BIC: -1711.4543482865681
Restart: 2 MSE: 0.013798197447378584 BIC: -1629.4064223827509
Restart: 3 MSE: 0.0134995356186553 BIC: -1638.1594934394984
Restart: 4 MSE: 0.019203054716150737 BIC: -1497.193861462054
Restart: 5 MSE: 0.011973759948817077 BIC: -1686.1345740932345
Train MSE: 0.011239317804984892
Random number seed: 1757530319
Breaks: [1.1595025 1.46699623 3.0337865 7.62772597 8.77150888 9.86875214]
Slopes: [ 0.97415186 0.58236226 -0.97489684 1.49773889 0.13340346 0.42822295
-0.27696669]
Initial slope: -0.005901498216313833
BIC: -1711.4543482865681
Num Breakpoints: 7
Restart: 1 MSE: 0.012618147112265589 BIC: -1653.1842689184527
Restart: 2 MSE: 0.015023512816917873 BIC: -1583.3920808586547
Restart: 3 MSE: 0.01117985689526189 BIC: -1701.5932118075225
Restart: 4 MSE: 0.016076479643951244 BIC: -1556.2957539720474
Restart: 5 MSE: 0.015254660080210364 BIC: -1577.2846646775888
Train MSE: 0.01117985689526189
Random number seed: 1757530319
Breaks: [1.31181149 2.40743245 2.67010934 3.0885786 3.33149654 7.12396452
7.68387703]
Slopes: [ 1.14648308 -0.93700508 -0.90226644 -0.61927955 1.38019676 1.51134006
1.14414582 0.2976773 ]
Initial slope: -0.11001194593707966
BIC: -1701.5932118075225
====================================
Best Model
Train MSE: 0.010432261329197197
Random number seed: 1757530319
Breaks: [1.33971739 3.00878892 7.51557945]
Slopes: [ 1.12909401 -1.00764137 1.49937372 0.30578179]
Initial slope: -0.07847355995515726
BIC: -1777.209172976474
$
The code and data are available on GitHub.