Thursday 28 April 2016

Understanding the mining example

My friend Samuel (@Samuel__GP) suggested to me, in a comment, that I try to explain how simple the underlying idea is compared with the rich and efficient behaviours it produces in the rocket. It is not easy, but I will try...

So first have a look at this video:




The goal of the player is to pick up rocks with its hook and then take them to the deploy area (the small circle filled with a grid). Once the rock is there, the hook releases it and you cannot trap the rock again until it leaves the big circle. At that moment, the rocket will try to catch it again, and so on.


Something important to keep in mind: the hook is attached to the rocket with a rubber band. A rubber band is an oscillator, and in this case the oscillator's movements are forced on one end, so the hook trajectories follow a Lorenz-like chaotic attractor... it is a really hard task to plan in advance!

Pay attention to the variety of strategies the rocket will use to fulfil its goal: it will pull on the rope, bounce the rock against the walls or even its own body, or stretch the rubber band around obstacles to sling the hook straight to the rock:

You may think a lot of work was needed to plan the movements before starting the recording, a lot of code to control risk and so on, but that is not the case at all.

99.9% of all the code is just "base code": the simulation of the system is about 80% (I coded it from scratch... old-school way) and 19.9% is the general fractal AI, the same code I would use for any other case. So there is nothing in all this code that relates specifically to our particular "mining rocket" case.

In fact, with this 99.9% of "base code" the AI could already make the system (the rocket in this case, but you could swap in any other simulable system at hand) survive indefinitely, flying around gracefully and avoiding dangerous bottlenecks, with nothing else to care about except keeping on existing. This is the "common sense" part of the algorithm, and, unlike in some humans, you cannot switch it off.

The 0.1% of the code that I had to implement to get all those rich behaviours to emerge was just defining "how good" a given situation is, as a non-negative real number. Zero is reserved for the worst of cases: the rocket is dead (or, in a more general case, you are out of the potential function's domain), while numbers above zero mean "I would like to be in that state this much", so you can compare and decide whether one future outcome is better or worse than another.

I call this number the "state potential" because it behaves like a scalar physical field, like the electric or gravitational potential; you could call it "utility" or "gain" in your particular case.

The actual potential used works like this:

- Decide where (a 2D position) you want the hook to go: if the hook is empty, the target point is the position of the nearest rock around; if a rock is already on the hook, the target point is the deploy area (the small circle).

- Compare your initial distance to the target with the distance from your future position.

- If you get 20% nearer to the target, the potential of this future state is 0.2.

- If you end up farther away than initially, the potential is set to a small positive number, say 0.000001.

That is all. Add this simple potential formula to the general fractal AI that is already keeping the rocket alive, and the AI will "automagically" find all those strategies and change its behaviour to follow them.
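Just as an illustration, a minimal sketch of this potential in Python (all the names here are invented for the example; the real code lives inside the simulator):

    import math

    def mining_potential(state):
        # Illustrative only: "state" is a hypothetical snapshot of one simulated future.
        if state.rocket_is_dead:
            return 0.0                            # worst case: out of the game
        # Target of the hook: nearest rock if the hook is empty, deploy area if loaded.
        target = state.deploy_area_pos if state.hook_loaded else state.nearest_rock_pos
        d0 = math.dist(state.initial_hook_pos, target)   # distance when the futures started
        d1 = math.dist(state.hook_pos, target)           # distance in this future
        if d1 >= d0:
            return 1e-6                           # farther away: barely desirable
        return (d0 - d1) / d0                     # e.g. 20% nearer -> potential 0.2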

Now I will show you a second video where you can watch the "debug traces" showing how the futures evolved under the command of the AI.

To understand what you see, keep an eye on the colors of the traces: gray traces correspond to the rocket position in each future; the farther into the future, the darker. They are not important today, as we only care about the hook position, but they help you detect when the situation gets dangerous: in those moments the gray paths seem to "collapse" (there is a "risk meter" in the upper-left corner, in %).

Green lines are the ones to watch: they represent the future positions of the hook. When the hook changes state (from empty to holding a rock or, inversely, when the rock is deployed) the traces turn red, so it is good when the traces change color, as it means the rocket is doing its job.

Sometimes the red traces become blue: this is even better, as it means that, in this future, the AI is able to trap a rock (green to red transition) and then take it into the deploy area (red to blue). A hat trick.






In the previous videos, where the rockets were basically collecting energy drops to survive and trying to avoid damaging the hull, the potential formula was even simpler:

Potential = Health level * Energy level.
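In code, that potential would be as trivial as this (again just a sketch; the names are mine):

    def survival_potential(health, energy):
        # Both levels are non-negative, so a dead or drained rocket scores 0,
        # and the best futures are the ones that keep both levels high.
        return health * energy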

This small change in the potential makes the rocket behave like a bee collecting nectar or like an asteroid miner.

So that is all: a single potential function makes the whole difference and makes the behaviour totally mutate.

If you get the right potential formula, the fractal will make the rocket perform whatever you need of it.

2 comments:

  1. Good post, Sergio. It came out quite clear ;-).

    By the way, have you thought about using your basic fractal algorithm as a fast, efficient rollout to train neural networks via RL (Reinforcement Learning)?

    The basic idea would be:

    1) For each current state xt, perform a rollout (of n seconds, or until a terminal state is reached) using your fractal algorithm, so that an average valuation is obtained for each possible action at of the current state xt.

    2) Use that fractal output after n seconds to estimate the "real" reward (i.e. a very good approximation) for each possible action at in xt. The fractal algorithm therefore gives us a valuation vector yi.

    3) Then that yi (as I said, a vector of size equal to the number of possible actions in xt) is used to perform gradient descent on the weights of the neural network in the traditional way: min(L) = min((Oi - yi)^2). Oi would be the readout of the last layer of the neural network after a Softmax, with an output size therefore equal to that of yi, i.e. equal to the number of actions.

    4) Afterwards, the agent moves using the neural network in a greedy way, and step 1) is repeated in the new state xt+1 that is obtained. In this way, the neural network will gradually learn in a manner pseudo-supervised by your fractal algorithm.

    5) After M epochs (or iterations), the neural network will have learned (if it has enough layers and nodes) to predict the relative value of each action in xt in a way similar to the fractal algorithm, but without the need to iterate into the future.

    In fact, I sincerely believe that something similar to this is what biological evolution has achieved (by trial and error) with the animal nervous system.

    Anyway, I think the idea is good. I will try to make a small example in Python of what I am describing, and if something decent comes out I will let you know :-).
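    Something roughly along these lines, just as a toy sketch (fractal_rollout() is only a placeholder standing in for your algorithm, and the "network" is a single softmax layer to keep it short):

        import numpy as np

        n_features, n_actions, lr = 8, 4, 0.1
        W = np.zeros((n_features, n_actions))        # single-layer "network" weights

        def fractal_rollout(x):
            # Placeholder: should return the average valuation of each action
            # after an n-second rollout of the fractal algorithm from state x.
            return np.random.rand(n_actions)

        def softmax(z):
            e = np.exp(z - z.max())
            return e / e.sum()

        x = np.random.rand(n_features)               # current state xt
        for step in range(1000):
            y = fractal_rollout(x)                   # 1-2) target valuation vector yi
            y = y / y.sum()
            o = softmax(x @ W)                       # 3) network readout Oi
            dz = o * ((o - y) - np.dot(o - y, o))    # gradient of the (Oi - yi)^2 loss through the softmax (constants folded into lr)
            W -= lr * np.outer(x, dz)
            action = int(np.argmax(o))               # 4) move greedily with the network
            x = np.random.rand(n_features)           # stand-in for the resulting state xt+1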

    Best regards, Sergio!!

    Replies
    1. There is something along those lines in the oven, but it is not as you describe; it is more "fractal". For the moment we don't have time to test that idea, we are fully busy with other, more "commercial" things. I will keep you posted.

      If you code your idea in Python, maybe the fractal can also be adapted to it; there was a part of your comment I did not understand, so who knows...
