Native Compilation

How would we go about compiling Scheme to native code? Generating assembler and maintaining a native code generator is kind of difficult if we want a system that can compile to many different targets. What we would like is to take a Scheme file and output byte vectors representing the compiled code. The idea I have is to generate a temporary .c file, compile it, and then extract the code for inclusion in a .go file. That is possible, but for the future; currently I would concentrate on generating a .so file and simply including it.

The crux is to get these functions to cooperate with VM Scheme code and C code at the same time. Also, what do we do about tail calls? The solution that would make this work with gcc today is to compile a Scheme function to a C function where the stack is managed as a vector passed to the C function. When there is a new call we always return and use a continuation, something like

ret_t f(SCM *fp1, SCM *sp) { ... }

fp1 is the frame pointer owned by the new function, and sp is the stack pointer. ret_t is the struct:

struct ret_t
{
    int   kind;
    SCM  *fp;
    SCM  *sp;
};

kind can be RETURN in case there is a return value, CALL in case there is a new call, TAIL-CALL if we will do a tail call, and INTERRUPTS when we return in order to let the caller handle interrupts. In case of a return, the range [fp ... sp] will contain the return values. In case it's a 'CALL' or 'TAIL-CALL' the range will be the frame of the new call. Currently the meta information in a frame is a link to the previous frame, an instruction pointer and the program object that contains the code and closure variables. We need to fill in this information, but also be able to tell the kind of the frame, e.g. whether it is a VM frame or a compiled Scheme function. There is actually room to encode this in the low three bits of the frame-pointer link, which is aligned. The drawback of this is that the link can then no longer be followed by e.g. bdw-gc.
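
A minimal sketch of how such a kind tag could be packed into the low bits of the aligned frame link; the names here (scm_word, frame_kind, tag_link and friends) are made up for illustration and are not part of Guile:

/* Sketch: store a 3-bit frame kind in the low bits of the frame link.
   Frame pointers are at least 8-byte aligned, so the low three bits
   are free.  All names are illustrative, not Guile's API.  */
typedef unsigned long scm_word;

enum frame_kind { FRAME_VM = 0, FRAME_NATIVE = 1 /* , ... */ };

#define FRAME_KIND_MASK 0x7UL

static inline scm_word tag_link (void *prev_fp, enum frame_kind kind)
{
  return (scm_word) prev_fp | (scm_word) kind;
}

static inline void *link_frame (scm_word link)
{
  /* Note: once the tag bits are set, a conservative collector such as
     bdw-gc no longer recognises this word as a pointer.  */
  return (void *) (link & ~FRAME_KIND_MASK);
}

static inline enum frame_kind link_kind (scm_word link)
{
  return (enum frame_kind) (link & FRAME_KIND_MASK);
}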

How would we compile the functions? The basic method would be to use named gotos (GCC's labels as values), e.g.

ret_t f(SCM *fp, SCM *sp)
{
    if (fp[GOTO]) goto *(void *) fp[GOTO];

    ...
    fp[GOTO] = (SCM) &&next;
    return ret;
  next:
    ...
}

where 'GOTO' is the same slot in 'fp' as the 'IP' pointer. Also, when working with loops we need to add code to handle interrupts, e.g. in case we loop and jump back. It would be too costly to do this at each potential interrupt point; what we need is a counter, and at, say, every few hundred backjumps we return an INTERRUPT object and then continue. What could this look like? Well, we could just use the following code to sum 1 + 2 + ... + 1000000000 in about 0.7s:

#include <stdio.h>

int counter = 0;

void **f(long *vec, void **g)
{
  if (g) goto *g;               /* resume where we left off */
  vec[0] = 1;                   /* s */
  vec[1] = 0;                   /* i */
 next:
  {
    long i = vec[1];
    long s = vec[0];

    for (; i < 1000000000; i++)
    {
      s += i;
      if (++counter & 0x100)    /* roughly every 256 iterations ...  */
      {
        counter = 0;
        vec[0]  = s;            /* ... spill the state ...           */
        vec[1]  = i;
        return (void **) &&next; /* ... and yield to the caller      */
      }
    }
  }

  return (void **) 0;
}

int main()
{
  long vec[2];
  void **g = (void **) 0;
  do {
    g = f(vec, g);              /* trampoline until f is done */
  } while (g);

  printf("Res %ld\n", vec[0]);
  return 0;
}

Defining 'f' instead as below leads to code that is about 5x slower:

void **f(long *vec, void **g)
{
  if (g) goto *g;
  vec[0] = 1;
  vec[1] = 0;
 next:
  /* here the loop state lives in vec[] across the label, not in locals */
  for (; vec[1] < 1000000000; vec[1]++)
  {
    vec[0] += vec[1];
    if (++counter & 0x100)
    {
      counter = 0;
      return (void **) &&next;
    }
  }

  return (void **) 0;
}

And without jumps this takes just as much time as the first version (without the counter GCC will just compute the closed form n*(n+1)/2 and exit immediately :-)

void **f(long *vec, void **g)
{
  vec[0] = 1;
  vec[1] = 0;

  for (; vec[1] < 1000000000; vec[1]++)
  {
    vec[0] += vec[1];
    if (++counter > 1500000000) /* never triggers, but keeps the counter live */
      return 0;
  }

  return (void **) 0;
}

We see that GCC is able to keep the vector values in registers to speed up the loop, but the presence of a label destroys this optimisation. The reason is that the loop contains a label, and the compiler cannot prove that the named goto never jumps to it from elsewhere. This should have implications for the speed of VM code, which typically is dispatched with a named goto. After some investigation I noted that this was with -O2; compiling with -O3 results in about 1 billion loop iterations per second, which is about 3x faster than with -O2. Note that such a loop would be about 10x faster than doing it on guile-2.2.
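
For reference, this is the kind of named-goto dispatch loop meant here, a toy threaded interpreter rather than Guile's actual VM:

/* Toy threaded dispatch with named gotos (labels as values), the style
   typically used in VM inner loops.  Not Guile's actual VM code.  */
long run (const unsigned char *code)
{
  static void *dispatch[] = { &&op_inc, &&op_add, &&op_halt };
  long acc = 0, x = 0;

  goto *dispatch[*code];

 op_inc:  x++;       goto *dispatch[*++code];
 op_add:  acc += x;  goto *dispatch[*++code];
 op_halt: return acc;
}

Running it on the byte sequence {0, 1, 2} increments x once, adds it to acc and halts, returning 1; acc and x are live across every label, much like vec[0] and vec[1] above.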

Now let's move on to the next step in complication: compiling Prolog code natively. Here we run into new constraints. We have backtracking, which basically defines a point where the stack needs to be cut back to a stored value. These backtracking objects are a triple ([fp], sp, goto) in the case of native code and a thunk in the case of a VM op. Also, a good Prolog GC will move frames out to the heap and compress the stack, so we can end up with links going back and forth between stack and heap. In order to be able to move the memory, all references to the frame must go through an indirection; hence the Prolog frame will be of another kind and this indirection needs to be stored as frame metadata.

Another reason for moving the stack to the heap is that we have saved a state. We simply keep a token in the frame, and when the frame is unwound that token is compared to the current one; if a continuation has been stored in between, the Prolog engine knows that the frame should be copied to the heap. There are two kinds of continuations. One is similar to a process or thread, i.e. it stalls and continues. If it is of the process kind we do not need to copy the frame when re-entering it, which means less overhead and less copying.

So the different frame kinds are GUILE-VM, GUILE-NATIVE, GUILE-PROLOG-STACK, GUILE-PROLOG-HEAP-1 and GUILE-PROLOG-HEAP-N; that's five, and we have three bits at our disposal. So the frame looks like

[[FR]/Type,IP/GOTO,[ThisFrame],TOKEN,F,ARG1,...,ARGN,Local1,...,Localn]

It's a pity that we need an extra slot of metadata. It means that when calling a Prolog native function we need to shift [F, ARG1, ..., ARGN] to the left, which is expensive. Anyway, with this we will allocate less from the heap and copy far fewer variables than is done now in guile-prolog, where we solve many of these issues with the help of closures. That is not terrible, but it does cost a significant amount of performance.
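
As a rough sketch (with hypothetical names, not the structures guile-prolog actually uses), the backtrack triple and the unwind-time token check could look like:

/* Sketch of a native-code backtrack point, the ([fp], sp, goto) triple,
   plus the token check done when a Prolog frame unwinds.  All names
   here are hypothetical.  */
typedef unsigned long SCM;    /* stand-in for Guile's SCM */

struct choice_point
{
  SCM  *fp;   /* frame to restore; may have been moved to the heap  */
  SCM  *sp;   /* stack pointer to cut the stack back to             */
  void *go;   /* label address to resume at when backtracking       */
};

/* Bumped every time a continuation is captured.  */
static unsigned long current_token;

/* On unwind: if a continuation was captured after this frame stored
   its token, the frame must be copied to the heap before the stack
   below it is reused.  */
static int must_copy_to_heap (unsigned long frame_token)
{
  return frame_token != current_token;
}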

Happy Hacking
