All, similar to my post last year: here's a detailed explanation of how to achieve **top 100 (in fact 90-ish) with an extremely simple policy-gradient based approach**. @mathias @scottlegrand @julskast @harikmenon1 and others have at some time expressed interest - i hope others find it useful too.

This sort of technique is interesting because it is neither an artisanal handcoded bot, nor is it a supervised AI trained to emulate good human bots- it is entirely ex-nihilo reinfocement learning (ie, trained via unsupervised self-play)

It is not my *best* bot but is certainly so far my most *elegant* - the core code is less than 10 lines.

I personally like ex-nihilo RL techniques because they produce "alien" non-human play, give great return for effort, can go on self-improving , and are the only viable strategy in many real-world problems where there are no precedents to emulate.

For example, if one had to design a rover for a never-before-explored environment, a trading algorithm for a totally new market/product, or a communications protocol for an interstellar probe to talk to aliens, a self-building RL bot is likely the only choice.

I hope by the end you will also see just how incredibly simple and non-mystical RL is.

The basic idea is to express ship behaviour as a *parametrized function* . Then, one tries to find good parameters. If the functional form is chosen well, with minimal parameters, not only will it be feasible to optimize but a strategy is likely to generalize well and not merely memorize the games it has experienced.

(btw if anyone knows how to inline tex or mathjax in here please let me know)

- Imagine you are a tiny captain of a tiny ship. As a general heuristic, you want to move toward free slots, (whether on unowned, friendly or enemy planets), toward enemy ships, and
*away from*friendly ships. We do this by inducing a

**vector field**on the game arena, such that the field gives the desired direction of movement. Put another way, we induce a**scalar potential function**on the game arena such that if each ship was a ball bearing , it would roll in a good direction. Ie, the negative gradient of this scalar function is the vector field we wantIntuitively, we want "hills' centered at friendly ships, and "valleys" centered at free slots and enemy ships.

We want some way to control the height/depth , and radius of these hills or valleys. A natural way of inducing these is via radial basis functions, specifically gaussians. For a ship s, and an object x, the contribution*of x*to the scalar function*at s*is*Wx***Exp*(-*Gx**d(x,s)*^2) ) where*d(s,x)*is the euclidean distance between s and x. If*Wx*is positive, this is a hill centered at x & vice versa.We want to be able to tune affinity to objects separately, so for each class of object (free slot on

*u*nowned planet,*f*riendly planet,*e*nemy planet,*e*nemy*s*hip,*f*riendly*s*hip) , define kernel widths*Gu*,*Gf*,*Ge*,*Ges*,*Gfs*. The higher a*G*, the narrower the corresponding gaussian is . Then, the total scalar function seen at a ship location s is

*Sum*(*over all objects x*)*Wx***Exp*(-*Gx**d(x,s)*^2) )

For simplicity, lets keep*Wfs*= 1 and all other weights = -1 , ie friendly ships induce hills, everything else induces round valleys.below are 2 pics from frame 40 of an example game, with good values of the kernel widths. showing the scalar function, game arena and vector field (which is simply the negative gradient of the sum of all the hills and valleys ). A ship at any point would tend to move like a ball rolling downhill . You can see that ships are drawn to capturable slots, and also drawn via the interesting central region into battle with the enemy .

the surface would slightly change at every turn, as game objects appear or change their dispositions. But it is continuous for any given turn, and hence ships close together will tend to move in similar ways

the actual algorithm is

if ship can dock to nearest planet

dock to nearest planet

else

try to move in direction induced by surface

That's it. Five parameters- you dont even need conflict resolution code , as friendly ships repel each other. this will reliably get you to around rank 90 ish . I can't paste the precise parameters here (mods- or can I ?) but it should be easy to estimate them from the pictures.

**Other considerations and extensions**

- The gradient is very fast to calculate and you know exactly which direction to move in, no need to try 360 directions
To actually find good values , i simply run a few hundred games over and over and use LBFGS to try and find good values of the parameters. As the bot improves i run it against older versions of itself, plus a simple control that greedily colonizes the nearest free docking slots

The blended sum of potential functions contains a lot of useful info - about how many friendly and enemy ships are in the vicinity, how many docked and undocked ships, how far away the action is, etc etc. It seems that most of what you need is encoded in a) what is out there b) how far away it is. No need to memorize sequences or predict velocities or ...

A refinement that gives an additional boost of approx 20 places (ie to 70ish) was to make the friendly ship potential more complex, with a tunable minimum. That makes friendly ships drift toward each other (but not too close). You can see that it likes putting ships a touch above 1 from each other- of course, the actual positons depend on what else is going on .

Similar refinements on the potential functions yield behaviours similar to rabbit and defection

In this example I have deliberately not used fancy techniques or deep nets or blockchain or [buzzword du jour] ... because experience / occams razor show that the vast majority of practical machine learning is just some form of constrained algorithmic optimization. It's all just applied math anyway. The parametrization is the most important conceptual leap.

In a couple of days I may post a sequence of games showing the evolution of such a bot. Because there are so few parameters, you dont need to train for hundreds of thousands of games: if you find a set of 5 that works well for say 72 games, you're pretty sure those 5 params generalise well.

A merry xmas and happy holidays in advance to all of you !

-Arj