Symbol trees

This is old stuff and maybe not that exciting. But still I like the subject and its been interesting to contemplate how to best design an library that handles symbols e.g. to construct a Trie.

Now symbols in common lisp has four logical values associed; an id, code, value and a property list. When interning symbols one can instead of the string consider using the identity instead that can be an address to the memory area or a position in an array of structs containing the other data. Effectively this means that the code, value and propertylist is just a pointer dereference away.

But consider the task of looking up a symbol identity from a string. The general approach is to allow for unicode characters. Also typically the number of characters used is quite small. A couple of hundreds of them is usually enough.

So what you do is to encode the strings in a tree. If everything was ascii you could have a table of pointers the size 128 for each node and quicly dive down the pointer to the nect node untill you found the unique string and have your identity.

But this is an expensive data structure. Another approach is to have a bit vector representing the ASCII values in a linked list or someting similar to python's list type. This means that you onely need a 2 uint64_t essentially or even a uin128_t. The bitvector can also be larger.

Now modern cpu's can quickly count the number of set bits. So the algorithm would then be to let the ASCII value be the bit vlaue and check if that bit is set or not.

If the bit is set, one then mask out the part of the bitvector below and including that bit and count the number of set bits. That will create the index in the list part and then you can lookup the next node in the tree.

If not, then you need to do the same, but now the index will be an insertion point in the list and you insert a new node into the list part of the node struct.

But not everything will fit this scheme as some characters are not ascii but a more or less random unicode value, that needs up to the order of 32 bits in general. However, we can introduce an inderaction as usually the number of characters used are quite low in many languages.

Hence for each new character that one finds, one associate the bit value to be the next bit in the bitvector and from the uint32_t unicode value we can binary search an ordered list of the current seen unicade values and from that calcualte the corresponding bit value. This can be a bit more expensive but not terrible and a simple insertion search is most likely enough and should not post a complexity problem. With the found bit index we can than continue as with ascii values only.

But we could even do better than this. We could optimize for ascii values. And let the first 128 bit's in the bitvector represent ASCII values. The simplest design however is simply, before the binary search, check if the unicode value is the same as the index in the sorted array. And then initiaty the system by feeding the ascii codes in order to the setup. This means that the common case where there there is mostly ascii letters the lookup speed is in priciple only array lookups. Swedish is a typical example of this with the alphabet mostly located in in the ASCII area.

If you want to do a system that have the same properties but for your language in stead of ascii one can note that usually your alphabet is located in a range \([i,j]\) in the unicode space and it is trivial to make a very fast mathematical transform back of the unicode space by placing that intervall first.

Another possible transformation is to calculate a hash value of the string and compress it byte by byte in such a tree and make sure that the probability of a clash is minimal. This automatically gives you a quite balanced tree, although.

Other possibilities is to change the order of the letter in the word.

Note that one can use a Patricia Trie and make sure that all nodes are a branch.

Consider the case of a binary tree. We need then in a branching node two addresses and a character value per branch and a type of the the node like, is it a leaf or is it a branch. This means essentially 3*8 bytes per node and the number of nodes is typically something like O(log2(N)), with N the number of distingt strings. The worst case is however terrible order O(N). Doing a hash before the insertion makes the worst case very ulikely.

This is a dense section, I will expand and explain mare later.

Consider the case where we use 64 bits instead for a bitvector, then each binary node will have 48 and we loose memory. But however if the first node fills the whole bit vector, then that one will use essentially 28 bytes per branch which is better than the tree and also there will be a quite fast fan out, at least in the top of the tree. We can also mix the two approaches. finally English can allow a bit vector with a extra bit for big or small letters and allow for some common punctuation letters like " , . : ; ! ? and compress everything in the first 64 bít together with the letter code. Finally we could note that the number of words usually is less than 16000 and use a 2 byte index into a node list in stead of pointers. This means that basically the once can fit a binary branch in 1 + 2*2 bytes at best if we allow for some inneficiend handling that is associated with packed data. The best bitwise memory overheas seam to be in the order of say 1 + 4 + 2. And hence both variants if we keep the smallest granularity of 4 bytes. But with much much higher fan out potential.

The next step is to code examples and study performance.