Printing Doubles

Printing reals in guile head has the following characteristics.

> (define x (make-vector 10000000 0.8888888888888888888888))
> (define p (open-file "/dev/null" "w"))
> time (begin (write x p) 1)

15.812557s real time, 15.808950s run time.  0.000000s spent in GC.

Now with my heavily optimized SCM/C write library found at

C_CODE SCM_CODE

We find,

> (use-modules (ice-9 write))
> (define x (make-vector 10000000 0.8888888888888888888888))
> (define p (open-file "/dev/null" "w"))

; 0.996400s real time, 0.995872s run time.  0.000000s spent in GC.

That's a fine improvement, we are talking about 15X increase in speed. Profiling shows that we are spending about 0.6s in the routine actually performing the actions down in C. But let's see for another case, take the same exercise but with 0.1 in stead. Then we get,

> (define x (make-vector 10000000 0.1))
> (define p (open-file "/dev/null" "w"))
> time (begin (write x p) 1)

5.157910s real time, 5.156215s run time.  0.000000s spent in GC.

And the new version,

> (use-modules (ice-9 write))
> (define x (make-vector 10000000 0.8888888888888888888888))
> (define p (open-file "/dev/null" "w"))
> time (begin (write x p) 1)

; 0.996400s real time, 0.995872s run time.  0.000000s spent in GC.

0.558187s real time, 0.557870s run time.  0.000000s spent in GC.

This is a 10X fold difference in speed. But more amazingly, if you investigate the actual time spend in the C-routine we have a time of 0.15s spend in the actual meat of the calculation and for this case the overhead in the guile's write function if it used the new version of it, would probably perform more in line with 0.15-0.20s than the scheme based version.

So what's behind the magic. Well the number one thing to note are that trying to squeeze as much precition as we can get we can simply a lot by focusing an getting 16 digits of precision in stead for a 17. More complicated methods could probably be used, but my personal view is that if one needs precision one should use a hex based write out, because for those the algorithm is much simpler and also the speed and accuracy as well as compaction of printouts i stop notch. Beats me why we do not have those options for writing/reading doubles. Anyhow A 15X increase is asking if it is guile that is awkward here and uses something that is slow in order to e.g. gain precision? I do not know, but let's try to see what speed sprintf has, 10 Millions of sprintf's to a byte buffer takes ... 3s, that's a 5X difference from the C based timings, and for the short version it is 1.93s, that's a whopping 10X difference between printf's performance and the implementation. So clearly we need to explain that algorithm.

So let's start with what we can get in just 1 ns. That's the representation of a real \(x\) as \(x = 1.x_1\ldots \cdot 2^m = y2^m\). To go from here we would like to find \(n\) such that

$$ 2^m = z 10^n $$

Because then we can calculate

$$ x = y2^m = yz10^n=(yz)10^n $$

Where the range for \(\text{range}(yz)=\text{range}(y)\text{range}(z)=[1,2]\cdot[1,10]=[1,20]\) e.g not perfect but close. Anyhow we pre calulate \(d=10^{-n+16}\) and if we multiply the in value with \(d\) we get,

$$ xd = (yz)10^n10^{-n+16} = (yz)10^{16}. $$

If we now take the integer part of this in a type cast we end up with a number \(v\) that has 16-17 number of digits. Now we loose some precition in the multiplication here that is fast as it is a double multiplication and we are left with som more uncertainties that leads to dropping 17 digits of precition as a goal. Anyhow the size can be checked and any re adjusting of the exponent can be done on the real number hance the final \(v\) will always be of a length of 17 digits (but the last one is only used to round up).

The masking of values from an integer can be done according to first find a guess and than adjust to the right value. This is my best take on how to do this,

#define RUN(i)
{
    //if v == 0 than we just jump out to the final postprocess
    if (!v) goto out;

    // M[i] = 100 ... to the right number of digits and is a constant
    // so there will not be any array lookup as the compiler knows it stuff
    q = (v - M[i])

    //If the digit is 0 then we just jump out of all checks
    if (q < 0)
    {
       ch = 0;
       goto set_i;
    }

    // N[i] is the index of the msb bit in the M[i]'s bit representation
    // we found that this expression is lead to close match and is never
    // larger than the real digit and most often a good match
    ch = (q >> (N[i]+1)) + 1;
    w  = M[i]*ch;

    if ((v - w) > M[i]))
    {
       ch++;
       w += M[i]

       if ((v - w) > M[i])
       {
         // and so on a coulple of times
         ...
       }
    }

  set_i:
    // We found the digit store it
    pt[n+17-i] = '0' + ch;

    // remove the digit form the value
    v -= w;

    // indicate for how many digits we have processed
    // as we need this in the post processing.
    m++;    
}

...

//We use the macro as

{
  RUN(17);
  ...
  RUN(1);
 out:

  //Post processing code comes here
  ...
}

links

social