Concrete info: top 100 with RL / policy gradients


All, similar to my post last year: here's a detailed explanation of how to achieve top 100 (in fact 90-ish) with an extremely simple policy-gradient based approach. @mathias @scottlegrand @julskast @harikmenon1 and others have at some time expressed interest - I hope others find it useful too.

This sort of technique is interesting because it is neither an artisanal handcoded bot, nor a supervised AI trained to emulate good human bots - it is entirely ex-nihilo reinforcement learning (i.e., trained via unsupervised self-play).

It is not my best bot but is certainly so far my most elegant - the core code is less than 10 lines.

I personally like ex-nihilo RL techniques because they produce "alien" non-human play, give great return for effort, can go on self-improving, and are the only viable strategy in many real-world problems where there are no precedents to emulate.

For example, if one had to design a rover for a never-before-explored environment, a trading algorithm for a totally new market/product, or a communications protocol for an interstellar probe to talk to aliens, a self-building RL bot is likely the only choice.

I hope by the end you will also see just how incredibly simple and non-mystical RL is.

The basic idea is to express ship behaviour as a parametrized function. Then, one tries to find good parameters. If the functional form is chosen well, with minimal parameters, not only will it be feasible to optimize, but the strategy is likely to generalize well and not merely memorize the games it has experienced.

(btw if anyone knows how to inline tex or mathjax in here please let me know)

  • Imagine you are a tiny captain of a tiny ship. As a general heuristic, you want to move toward free slots, (whether on unowned, friendly or enemy planets), toward enemy ships, and away from friendly ships.
  • We do this by inducing a vector field on the game arena, such that the field gives the desired direction of movement. Put another way, we induce a scalar potential function on the game arena such that if each ship were a ball bearing, it would roll in a good direction. That is, the negative gradient of this scalar function is the vector field we want.

  • Intuitively, we want "hills" centered at friendly ships, and "valleys" centered at free slots and enemy ships.
    We want some way to control the height/depth and radius of these hills or valleys. A natural way of inducing these is via radial basis functions, specifically Gaussians. For a ship s and an object x, the contribution of x to the scalar function at s is Wx * exp(-Gx * d(x,s)^2), where d(x,s) is the Euclidean distance between s and x. If Wx is positive, this is a hill centered at x, and vice versa.

  • We want to be able to tune affinity to objects separately, so for each class of object (free slot on unowned planet, friendly planet, enemy planet, enemy ship, friendly ship), define kernel widths Gu, Gf, Ge, Ges, Gfs. The higher a G, the narrower the corresponding Gaussian. Then, the total scalar function seen at a ship location s is
    Sum (over all objects x) Wx * exp(-Gx * d(x,s)^2)
    For simplicity, let's keep Wfs = 1 and all other weights = -1, i.e., friendly ships induce hills, everything else induces round valleys.
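For concreteness, here is a minimal Python sketch of this blended potential and its analytic negative gradient. The class names, weights and widths below are illustrative placeholders (not the tuned parameters), and the object list format is made up for the example.

```python
import math

# Illustrative weights and kernel widths per object class - NOT the tuned
# values; friendly ships are hills (W > 0), everything else is a valley.
WEIGHTS = {"unowned_slot": -1.0, "friendly_planet": -1.0, "enemy_planet": -1.0,
           "enemy_ship": -1.0, "friendly_ship": 1.0}
WIDTHS = {"unowned_slot": 0.004, "friendly_planet": 0.004, "enemy_planet": 0.004,
          "enemy_ship": 0.01, "friendly_ship": 0.05}

def potential(sx, sy, objects):
    """Scalar potential at ship position (sx, sy).
    objects: list of (x, y, kind) tuples."""
    total = 0.0
    for x, y, kind in objects:
        d2 = (sx - x) ** 2 + (sy - y) ** 2
        total += WEIGHTS[kind] * math.exp(-WIDTHS[kind] * d2)
    return total

def desired_direction(sx, sy, objects):
    """Negative analytic gradient of the potential: the downhill direction.
    d/ds [W * exp(-G * d^2)] = -2 * G * W * (s - x) * exp(-G * d^2)."""
    gx = gy = 0.0
    for x, y, kind in objects:
        d2 = (sx - x) ** 2 + (sy - y) ** 2
        k = -2.0 * WIDTHS[kind] * WEIGHTS[kind] * math.exp(-WIDTHS[kind] * d2)
        gx += k * (sx - x)
        gy += k * (sy - y)
    # rolling downhill means moving along the NEGATIVE gradient
    return -gx, -gy
```

Because the gradient is analytic, no numerical differentiation of the surface is needed per ship.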

  • Below are 2 pics from frame 40 of an example game, with good values of the kernel widths, showing the scalar function, game arena and vector field (which is simply the negative gradient of the sum of all the hills and valleys). A ship at any point would tend to move like a ball rolling downhill. You can see that ships are drawn to capturable slots, and also drawn via the interesting central region into battle with the enemy.

  • The surface changes slightly every turn, as game objects appear or change their dispositions. But it is continuous for any given turn, and hence ships close together will tend to move in similar ways.

  • The actual algorithm is
    if ship can dock to nearest planet:
        dock to nearest planet
    else:
        try to move in direction induced by surface

That's it. Five parameters - you don't even need conflict resolution code, as friendly ships repel each other. This will reliably get you to around rank 90-ish. I can't paste the precise parameters here (mods - or can I?) but it should be easy to estimate them from the pictures.

Other considerations and extensions

  • The gradient is very fast to calculate, and you know exactly which direction to move in - no need to try 360 directions.
  • To actually find good values, I simply run a few hundred games over and over and use L-BFGS to find good values of the parameters. As the bot improves, I run it against older versions of itself, plus a simple control that greedily colonizes the nearest free docking slots.
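The outer tuning loop might look like the following sketch. The author used L-BFGS over actual game results; here a toy `win_rate` surrogate and a crude finite-difference gradient ascent stand in for the game harness and the off-the-shelf optimizer, so only the shape of the loop is the point.

```python
# Toy stand-in for "run a few hundred games and count wins"; it peaks at a
# made-up set of kernel widths so the loop has something to find.
def win_rate(params):
    target = [0.01, 0.02, 0.03, 0.04, 0.05]
    return 1.0 - sum((p - t) ** 2 for p, t in zip(params, target))

def tune(params, steps=200, eps=1e-4, lr=0.05):
    """Maximize win_rate by finite-difference gradient ascent.
    A real run would hand win_rate and params to an L-BFGS optimizer."""
    for _ in range(steps):
        base = win_rate(params)
        grad = []
        for i in range(len(params)):
            bumped = list(params)
            bumped[i] += eps            # bump one parameter at a time
            grad.append((win_rate(bumped) - base) / eps)
        params = [p + lr * g for p, g in zip(params, grad)]
    return params
```

With only 5 parameters, each gradient estimate costs just 6 evaluations of the reward, which is why a few hundred games suffice.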

  • The blended sum of potential functions contains a lot of useful info - about how many friendly and enemy ships are in the vicinity, how many docked and undocked ships, how far away the action is, etc. It seems that most of what you need is encoded in (a) what is out there and (b) how far away it is. No need to memorize sequences or predict velocities or ...

  • A refinement that gives an additional boost of approx 20 places (i.e., to 70-ish) was to make the friendly-ship potential more complex, with a tunable minimum. That makes friendly ships drift toward each other (but not too close). You can see that it likes putting ships a touch above 1 from each other - of course, the actual positions depend on what else is going on.
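To illustrate the tunable minimum: a difference of two Gaussians (a narrow repulsive hill minus a wide attractive valley) has exactly this shape, and the author confirms later in the thread that a difference of two Gaussians was used. The particular weights and widths below are illustrative guesses that happen to put the minimum near d ≈ 1.7, not the actual tuned values.

```python
import math

def friendly_potential(d, w_near=2.0, g_near=1.0, w_far=1.0, g_far=0.2):
    """Potential felt at distance d from a friendly ship: a narrow repulsive
    hill minus a wide attractive valley, giving a preferred separation."""
    return w_near * math.exp(-g_near * d * d) - w_far * math.exp(-g_far * d * d)

# Locate the preferred separation (the valley floor) by a coarse scan.
best_d = min((d * 0.01 for d in range(1, 1000)), key=friendly_potential)
```

Ships repel strongly below `best_d` and attract weakly above it, which is what produces the loose flock-like spacing.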

  • Similar refinements to the potential functions yield behaviours similar to rabbiting and defection.

  • In this example I have deliberately not used fancy techniques or deep nets or blockchain or [buzzword du jour] ... because experience / Occam's razor show that the vast majority of practical machine learning is just some form of constrained algorithmic optimization. It's all just applied math anyway. The parametrization is the most important conceptual leap.

In a couple of days I may post a sequence of games showing the evolution of such a bot. Because there are so few parameters, you don't need to train for hundreds of thousands of games: if you find a set of 5 that works well for, say, 72 games, you can be pretty sure those 5 params generalise well.

A merry xmas and happy holidays in advance to all of you !


Thank you so much!

Yesterday I was reading up on flocking behavior in a project that has suspiciously similar thrust commands to Halite-2's ships: (and code: )

Flocking might be an interesting refinement to think about with friendly ship behavior. Or it might not be useful. My thinking was that enemies are likely only computing the distance to the nearest enemy, so they don't see whether your ships are in a flock or not. I couldn't find a good replay so I just made a fake diagram: (Note that I'm talking about other bots -- your approach means that a group of enemy ships could appear as more of a hill to avoid on the map!)

Again, whether or not "flocking" ships close together as they move from one destination to another needs some testing, to see if it regularly fools enemies.

It is also worth calling out here that the flocking article has an Optimizations section that helped me (and the optimizations page is actually what I first found through Google: )

Happy holidays!


Fantastic writeup @arjunvis! Do I sense a sort of Lennard-Jones potential for the friendly ships (with a minimum epsilon to keep it from ever hitting infinity)? I really love the simplicity of this and the algorithm for Halite I you posted. I can't help but think that your model would make a fantastic front-end to the autoencoder-based RL described here:

My bots got to 220 or so with each ship just targeting the nearest dockable/attackable thing using the canned collision code. Fixing that code so it avoided most collisions and found a way to achieve most moves got me to 90 or so. Bug fixes from watching it play got me from there to 60, and adding in early attacks to wipe out the nearest player got me to where I am now (~40).

Where I think there's room for improvement is in coordinating behavior at close range. The top ~20 bots seem to form squadrons that consistently beat my squadrons. I'm not clear yet what exactly they're doing differently. The only thing I have noticed is that it looks like when a ship takes damage it tries to flee in the direction it came from for one turn and that seems to help. Also, my code fails when all my ships have the same nearest neighbor at the expense of defending a planet under attack. Your vector field could help address this. Finally, I clearly need to put in some sort of Brave Sir Robin(tm) strategy when all hope of victory is lost in a 4-player game.

@mathias Here's the article I used to improve collision detection: it's similar to the ones you posted:

Finally, as we get into the final stretch on Halite, I have noticed a dramatic slowdown in the convergence of the rating for submitted bots. Until a few days ago, a bot's rating would have been based on 800+ games within 24 hours. In contrast, my latest bot (which may or may not be a regression) has only 298 games over the past 48 hours. I attribute this to the sudden surge in contestants from ~2600 to ~3700 in the past few days. If this persists, the ability to do offline simulation will become more and more important.


This is what I was hinting at, for bots that simply list enemies by distance and get min distance from that list, above. The vector field approach above would help you to see a "squadron." I also suspect, but can't confirm, that using small neighborhoods (within 14-by-14 units? You'd want to experiment) to calculate targets (either planets or enemy ships) will help your ships to arrive at the same time and not be a "line of ships at the same target" as in:

This replay has some of that, in the upper right corner during early play, and both bots seem to implement it, so it takes many more rounds for one of them to die than the "charge straight for the enemy" attack:

In fact this really weird replay 4-player game seems to be all fight-and-flee and no planet docking:

(plus two defectors in the upper right)
That replay and another one featuring FakePsyho made me wonder if there was a bug in the engine where their bots couldn't dock and so continued to flee and fight. But since it appears in other replays, and FakePsyho does have games where its bot docks, I suspect it is a special early-game strategy for when no enemies have docked -- and it may only work when the other players only implement the "go to nearest unowned planet and dock" strategy? Because when everyone chooses to attack early game instead of dock, you get the weird deadlock like in this replay.

Sorry to get off on a tangent. Speculating about strategy on top of something like @arjunvis's vector field approach seems like a valuable area for discussion, though.


Thanks @arjunvis. This is amazing!



thanks for sharing!

I am very interested in this approach and tried reimplementing it. My current bot is ~600th and more or less simple, so I wanted to try your approach and see if I can beat myself.

1) How do you cope with your ships not being in a "steep" region? Especially at the start you're usually far away from everything except your own ships, and if I just use the gradient to move around I find my ships to be very slow.

2) You mentioned that navigating can now be done more easily and quickly by just following the gradient. If I manually adjust the direction to "max speed", which would solve 1), I will now crash full speed into planets and other objects, since I just follow a gradient and don't necessarily hit the 'closest point to' target.

3) In general, if in your picture your ships were to conquer the outside planets, isn't there a chance that they will then simply be pushed against the corner of the map? Or if an opponent straight up attacks you but you have a lot of ships docked at that planet, wouldn't the planet "shield" the attackers, so that your ships would be repelled and float away from the attacker?

Did it ever happen to you that your ships get trapped in a valley in between your planets?

4) You mention there being 5 parameters to tune - do you really only tune the Gx values, and never change the 'attraction' Wx values in order to prefer attacking or conquering planets etc.? And will the Gx themselves then be different for planets depending on how many ships they can dock, for each of mine/empty/enemy - is that correct?

My current version is moving and doing things, but it is plagued by the above illnesses and just gets crushed by my current bot. Would appreciate any pointers.



Thanks for taking the time for the wonderful writeup @arjunvis! I hope it didn't take too much time out of your holiday season :slight_smile:


Heh, I call the early raiding "Marauding", and marauding has all sorts of complications associated with it. If you maraud in a 2-player game, you can ensure your victory rate is at least 50% by just making a decent a priori prediction of whether you can get to the other guy's ships before they can start producing reinforcements. But then you need a fallback for when you choose to produce and the other guy chooses to maraud. My favorite countermove came from @cdurbin, wherein we both decided to maraud, but right before we engaged in combat he switched to fleeing and wiped me out by my guys hitting a bug in collision detection where they slammed into the map boundaries (bug fixed BTW, thanks @cdurbin!). And without that bug fix, I couldn't implement runner scumming/defecting properly, so he did me a solid kicking my butt.

What's much more complicated is marauding your nearest neighbor in a 4-player game, a trick I learned from getting my butt kicked early and often by @awesomelemonade. This one's still a work in progress for me though and the drastically slowed down convergence of new bot ratings has me sticking to offline testing for now.

Finally, I am now trying to detect when someone is marauding me (fleet of 3 undocked ships closing on my planets) and whether I can build ships in time to kill him. I might move on to picking planets that give me the maximum amount of time to construct reinforcements, but this then allows the other 2 players to get a leg up on building their first ships.


Quick checkin - glad to see this thread has a few replies :slight_smile:
responses to everyone in roughly posting order -

@mathias Yup, Boids are a beautiful model (and pretty solid, having been tested for 30 yrs - I remember hearing about them in the 80s). Starling swarms in the evening in European cities follow a very similar pattern. A friend told me similar local-global effects were used in the LOTR crowd scenes, where for example an orc getting agitated would spill over, agitating a bunch of orcs, before the commotion gradually dies out. Effectively, boids use a 1/r^2 radial basis function, but they also have the concept of a "squad leader" which is nifty - might be worth playing with. Good suggestion, thanks!

@scottlegrand great find! I had no idea about Lennard-Jones (I did math, not chem) but it's fascinating that a similar shape appears for essentially the same problem: defining a potential sink for 2-body interactions. I used a difference-of-2-gaussians, and that's why my potential doesn't go to infinity. Neat - TIL Lennard-Jones, thanks!

@hsperr Apologies - I said "move in the direction induced by the surface" but I should have also specified "as fast as you can". (Generally, that's a good pattern - move as fast as you can - most bots do, so I took it as given.) I also assume users have a good navigation function with collision detection already (the starter kits have one, for example). When I said no need for conflict resolution, I meant amongst one's own ships.

So, the more detailed algorithm is

    if ship can dock with nearest planet:
        dock with nearest planet
    else:
        move in direction of surface as fast as you can, up to speed 7.0, subject to not colliding with a planet or docked ship

(You can generally handle this with something like speed = min(7.0, distance to nearest planet or docked ship in the direction I want to travel).)
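A minimal sketch of that rule, where `clear_distance` stands in for a hypothetical navigation helper that reports how far the ship can travel along the chosen heading before hitting a planet or docked ship:

```python
import math

MAX_SPEED = 7.0  # Halite 2's per-turn thrust cap

def thrust_command(dx, dy, clear_distance):
    """Turn the downhill direction (dx, dy) into a (speed, angle) thrust,
    capping speed by how much clear space lies along that heading."""
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    speed = min(MAX_SPEED, clear_distance)
    return speed, angle
```

Only the direction of (dx, dy) matters; the gradient magnitude is discarded in favour of "as fast as safely possible".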

That will sort out the issues in your questions 1, 2, 3. As the gradient is never truly zero (or it is extremely unlikely to be totally zero - you'd need to be trapped in a perfectly symmetrical region), I never get trapped. Also (try it), if you have ships on the edge of the map they will repel other ships from the edge, back towards the interesting interior. Eventually you do want ships to escape to the edge of the map, and refinements to the potential functions will give you rabbit/defect behaviour.

Your q 4 deserves a deeper dive

  • Yes! You can get to 90-ish (given the field as of Dec 22-ish) by only tuning the Gx's, with Wx = 1 for friendly ships and -1 otherwise. But that leaves the question: how to tune efficiently? Given what you said about your bots, I think I know what may help:

  • Important: if you train by matching your param-bot against another bot, care must be taken (just like in GAN training) that the two adversaries are not totally mismatched.

  • E.g., let's say you train your param-bot on 72 games vs an adversary bot. If the adversary is too weak, you will already be winning 72 out of 72 and won't be able to improve your score no matter how you locally bump the 5 params. If your adversary is too strong, same problem: you will win 0 out of 72 for a large neighbourhood of your params.

  • So, you want to tune your adversary so it wins about 50% of the games, and then tune your params so your param-bot goes from losing 50% to doing better, say losing 10%. Then repeat. An easy way to do this is to cripple or randomize your existing adversary bot until it wins around 50% of test games.
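The balancing step might be sketched like this, where `play_game` is a hypothetical harness that plays one game against the adversary at a given handicap level and returns True if the param-bot wins (larger handicap = weaker adversary):

```python
def balance_adversary(play_game, handicap=0.0, games=72,
                      target=0.5, tol=0.1, step=0.05):
    """Nudge the adversary's handicap until the param-bot wins ~target
    of the test games, so local parameter bumps can change the score."""
    for _ in range(50):                 # bounded number of calibration rounds
        rate = sum(play_game(handicap) for _ in range(games)) / games
        if abs(rate - target) < tol:
            break                        # roughly balanced - stop
        # winning too rarely -> weaken the adversary further, and vice versa
        handicap += step if rate < target else -step
    return handicap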

Optimization is a bit of a craft (art + science), and balancing adversaries is analogous to maximizing the derivative of the function to be minimized - so you always have somewhere to move to.

Hope that makes sense, feel free to pm me directly for more, or pm me on the thread if it is of general usefulness




Happy to! Just took a while to find a free hour or so.
I should play up how difficult and tedious it was :wink: but I can't in all honesty: Halite is just fun!
Plus a holiday halite snap will convince you to not shed tears for me ...


@arjunvis thank you so much for this detailed write up, and sorry to take up your glorious holiday but I hope you can answer a few questions:

I'm just learning policy gradients and don't quite understand how it applies in the current implementation?

I'm probably doing this wrong, but I generated the vector field based on the scalar potential so each point in the arena has dx and dy. So the agent belonging to a point just thrusts (unless docking) in the direction based on dx, dy.

However, doesn't this make the policy deterministic? My naive understanding of PG is that the policy would be more of a distribution over dx, dy which gets shifted around as we play more and more games. In the current approach it seems we instead just tune the Gx's. It seems like the "exploration" benefit of PG is lost...

This leads to my 2nd question, which I think you addressed but maybe can shed a little more color on: when we use something like SGD for tuning weights in a neural net (say the loss is softmax cross-entropy), my understanding is convergence is guaranteed (even if it's to a local optimum).

Here, what would be the gradient to tune the Gx's? Or maybe there's another systematic way, since you mentioned convergence in just a few hundred games.

  • edit: it just occurred to me that Gx's could be drawn from a distribution to make the policy stochastic, but is that a reasonable approach?


@arjunvis I was wondering how your gradient calculation was done based on the scalar value function. I was attempting to implement a similar system and am trying to use the gradient calculated from the scalar value function where the variables x and y are present in the exponent for the distance calculation. Is that similar to the approach you used? One issue that I'm running into is that the ships tend to run towards the edge of the grid and then destroy themselves by falling off.


@wenbochang sure - short but i hope helpful answers below :smile:

  • A policy can absolutely be deterministic. The texts (notably, IIRC, Sutton and Barto) call a policy p a map from state to (distribution over actions), but in context that is merely a function from state -> action. That is, a policy simply reads the state of the universe and then tells you what to do. So any function P from the space S of states to the space A of actions is a policy. TLDR: deterministic or stochastic, doesn't matter.
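A toy illustration of that point - both functions below are policies in this sense, one deterministic and one stochastic (the state fields and action names here are made up for the example):

```python
import random

def deterministic_policy(state):
    """A policy is just a map state -> action; nothing stochastic required."""
    return "attack" if state["enemy_dist"] < 20 else "dock"

def stochastic_policy(state, eps=0.1, rng=random):
    """Epsilon-greedy wrapper: mostly deterministic, occasionally explores."""
    if rng.random() < eps:
        return rng.choice(["attack", "dock"])
    return deterministic_policy(state)
```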

  • We have two gradients, which maybe makes it a bit confusing (I don't think @wenbochang has this confusion, but stating for completeness): one gradient is the slope of the scalar potential function. That simply tells you which way to move. The second, totally different gradient is the gradient of the reward function wrt the parameters.

  • With a bit of math you can prove that for a fixed set of games, a simple reward function like "proportion of games lost" is a sum of piecewise-flat functions that tends to a continuous (and clearly, bounded between 0 and 1) function. By the extreme value theorem, certainly for any nontrivial compact set in parameter space, the function must have a maximum and a minimum and attain both of those at least once.

  • But you don't even need that sort of math. Ask yourself: why wouldn't there be an extreme value of the reward function over any bounded parameter set? (Note - this doesn't require the function to be continuous. Why?)

  • Most of my thought indeed went into the objective function. Now, as we know, there ain't no such thing as a free lunch (although there are cheap ones aplenty if one knows where to look). By which I mean: if you used deep nets, you would expend thought and compute on the network architecture. There isn't any magic method (as far as I know) that gives you a closed-form optimal solution in the same way that noughts-and-crosses or iterated prisoner's dilemma have optimal solutions. So you need to fall back on a plausible reward function. A very solid start is simply "proportion of games won" (or proportion lost, if you are minimizing).

  • Do take care in your test set of games to follow the same size and num-player distribution as the halite engine itself; from line 194 of main.cpp in the halite env codebase it is
    std::vector mapSizeChoices =
    { 80, 80, 88, 88, 96, 96, 96, 104, 104, 104, 104, 112, 112, 112, 120, 120, 128, 128 };
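Sketched in Python: the repeats in the engine's list encode the relative frequency of each map size, so sampling test-game sizes with matching frequencies is just a uniform choice from it.

```python
import random

# Same weighting as the engine's mapSizeChoices list above.
MAP_SIZE_CHOICES = [80, 80, 88, 88, 96, 96, 96, 104, 104, 104, 104,
                    112, 112, 112, 120, 120, 128, 128]

def sample_map_size(rng=random):
    """Draw a test-game map size with the engine's own frequencies."""
    return rng.choice(MAP_SIZE_CHOICES)
```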

  • So we don't know what the magic perfect reward function is. Indeed, designing one seems to be combinatorially hard. But, as you design this reward function, you will have a secondary reward function that tells you "how good is my base reward function at giving me decent halite params?" Since the base reward function could be built up from stacks of subfunctions like number of games won, how fast you kill opponents, avg_frame_response_time etc., it can look dauntingly complicated to design.

  • But! If you make sure the secondary reward function is submodular and monotonic, you can then get quite far along the way to a global optimum by simply greedily optimizing. The math is beyond the scope of this post, but have a look at Jeremy Kun's page (which is excellent for many other reasons). This will allow you to (semi-automatically) design a good reward metric.

  • So finally, say you fix a set of 72 test games, half 2-player, half 4-player, and your (well-designed) reward function is a well-behaved quasi-continuous function of 5 params. You can calculate the 5x5 Hessian of this function by running these 72 games 15 times, and you get the 5x1 gradient for free. With all this, you should be able to get to a good extremum within 5-10 iterations!
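One possible tally of the runs (a sketch, not necessarily the author's exact scheme): with n = 5 parameters, one base run, n single-bump runs, n(n-1)/2 pair-bump runs and n double-bump runs give a forward-difference gradient and a full symmetric Hessian. `reward` below is a hypothetical function that replays the fixed game set with the given parameters and returns, e.g., the proportion of games won.

```python
def grad_and_hessian(reward, params, eps=1e-2):
    """Finite-difference gradient and Hessian of reward at params."""
    n = len(params)
    f0 = reward(params)

    def bumped(*idxs):
        # replay the game set with eps added to each listed parameter
        p = list(params)
        for i in idxs:
            p[i] += eps
        return reward(p)

    singles = [bumped(i) for i in range(n)]
    grad = [(singles[i] - f0) / eps for i in range(n)]     # forward difference
    hess = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # diagonal: 3-point second difference along one axis
        hess[i][i] = (bumped(i, i) - 2.0 * singles[i] + f0) / (eps * eps)
        for j in range(i + 1, n):
            # off-diagonal: 4-point mixed difference, symmetric by construction
            hess[i][j] = hess[j][i] = (
                (bumped(i, j) - singles[i] - singles[j] + f0) / (eps * eps))
    return grad, hess
```

With the gradient and Hessian in hand, a Newton or trust-region step on the 5 parameters replaces many blind evaluations.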


@udpatil Absolutely, but pragmatically you must make sure game rules are adhered to - i.e., you check for collisions while moving, you don't supply negative angles, you don't fall off the edge of the map.

However - separately - you should not be running towards the edge unless you have a MASSIVE concentration of friendly ships (hills) in the interior.

From what you say, I would check that the signs of your individual basis functions are positive (hills) only for friendly ships, and negative (valleys) for everything else.

cheers and happy new year everyone in advance !


@arjunvis So yesterday my greedy Voronoi player touched platinum. All of the improvements I've made since describing exactly what it does earlier in the thread are either parts of a pre-game analysis that sets a few Booleans about what it thinks the other player(s) will likely do or in-game heuristics to flee from getting hurt and then head for the hills when the tide has turned against it.

But now that I'm playing against the top 2 tiers , I'm starting to see all their magic tricks used successfully against me. They seem to form implicit squadrons that cycle damage, they do a better job of distracting me (I suspect they're exploiting the sensitivity to outliers that Voronoi regions create), and some of them seem to follow a set plan as to which planets to attack and in which order. So from this point onward, I'm going to try to fit a slightly larger ML model than you are using to see if I can reproduce my best work (supervised initially with effectively infinite training data) and then set it on the merry road to self-play.

What I noticed playing against your best bot, is that my bot beats it when it falls into one of my heuristics, but your model does a better job of tactical planning as long as we're outside the realm of those heuristics. So I'm hoping to combine both approaches and see where that goes. Because beyond tuning my parameter values and/or a few more magic tricks, I'm all out of ideas here.


@arjunvis thanks again for all your time and insight, got a lotta work ahead of me :slight_smile:


@scottlegrand nice one! I think you cracked the first page for a bit! I do believe human ingenuity will win, so keep at those artisanal algos.

I am definitely puzzled by nontransitivity at the moment - still on hols, submitted a small tweak that I was sure would improve my bot (it beat my old best consistently) and instead I fell 15 places. Weird - I should just wait till I'm properly back. Happy new year everyone!


Traditional Holiday Halite pic. Happy 2018 all!


Hope you enjoyed that beach, @arjunvis! :smiley:
As a tangent, I was thinking about getting a Surface Laptop but it seems like Linux support is still hit or miss. My next machine might be a X1 Carbon.. we'll see..


Thanks for sharing such a detailed writeup

Just started learning ML this holiday! I think I'm going to try this approach and see how it goes :stuck_out_tongue:

How are you generating your vector field graphs? Tried my hand at it, and while my ships move in approximately the right direction in Halite, the graph generated by plotting points certainly doesn't...