Integer implementation is one of the biggest challenges in any system. Integers are used everywhere in software development: they are the building blocks for arithmetic operations, loops, constants, counters, and so on. Choosing the right integer implementation is critical because it shapes the overall behavior and performance of the system. This article describes how Huemul implements integers, and which aspects were taken into consideration.

The Smalltalk paradigm states that everything is an object, including SmallIntegers. SmallInteger is a class whose instances represent integer numbers from -1,073,741,824 to 1,073,741,823. In a normal Smalltalk image there are lots of instances of SmallInteger; it is in fact the class with the most instances in the system. The Huemul 0.5 image contains 1,034,480 objects, and 54.84% of them (567,336 objects) are SmallIntegers.

Every object in Huemul has a 3 word header, each word being 32 bits. The header contains very important information, such as the object's class, its size in bytes, its hash, etc. This information could be encoded in several ways to make it occupy less space, but Huemul's optimization prioritizes speed over size. With this approach, accessing the information in the header is very easy, and therefore fast.

This is what a Huemul object looks like:

    Class
    Size in bytes
    Information
    First instance variable or unnamed element
    Second instance variable or unnamed element
    ...
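
For illustration, the header just described could be declared along the following lines in C. This is only a sketch: the field names, and the STClass type, are assumptions made for this article, not Huemul's actual declarations.

typedef struct STObject {
    struct STClass *class;        /* word 1: the object's class */
    unsigned int    sizeInBytes;  /* word 2: the object's size in bytes */
    unsigned int    info;         /* word 3: hash and other information */
    unsigned int    slots[];      /* instance variables or unnamed elements follow the header */
} STObject;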

Supposing we used this mechanism to hold a SmallInteger, for example the number 12,345,678, we would need to fill the structure as shown here:

    SmallInteger
    4
    Information
    12,345,678

The result is that every instance of SmallInteger would occupy 4 words of 32 bits each. As you can see, 567,336 instances of SmallInteger at 4 words (16 bytes) each add up to 9,077,376 bytes, about 8.6 Mbytes of memory spent just on SmallIntegers.

One of the advantages of this storage mechanism is that we could (and should) keep just one instance of SmallInteger for each number in use, the same as is done for Symbols. Small numbers like 0, 1, -1, 2, etc. are used a lot; by having only one instance of each SmallInteger, some memory is saved.

Another advantage of this approach is that we can use every bit of the word occupied by the number itself: with 32 bits we could have SmallIntegers ranging from -2,147,483,648 to 2,147,483,647. This is important not only because we can use SmallIntegers instead of the large integer classes more often, but also because we have complete compatibility with integers from other programming languages.

Last of all, this mechanism provides a uniform way of accessing all objects, regardless of the kind of object we are dealing with. This is good because of the uniformity that the language itself promotes, and also because it is easier to implement in the virtual machine: we don't have to choose how an object is accessed depending on its class.

The solution to the size problem of SmallIntegers is to encode this kind of object specially, as shown in the Blue Book. The encoding of SmallIntegers defined in the Blue Book uses the least significant bit of the word as a flag that tells whether the word is a SmallInteger or a pointer. If it is a SmallInteger, the flag is 1; if it is a pointer, the flag is 0.

    a small integer    | 1
    an object pointer  | 0

Whenever an instance of a class is created, a pointer to the new object is returned to the caller. With this approach, instead of creating an object whose class is SmallInteger, a word is returned with the format specified above.

For example, the number 20 in decimal is 10100 in binary. To get its internal representation, the number is shifted one place to the left, obtaining 101000, and then it is “tagged” with a 1. The final representation of the number 20 as a 1 tagged integer is 101001, or 41 in decimal.
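
A tiny, self-contained sketch of that round trip (assuming 32 bit words) is:

#include <stdio.h>

int main( void ) {
    unsigned int tagged   = (20u << 1) | 1;      /* 10100 -> 101000 -> 101001 = 41 */
    int          untagged = (int)(tagged >> 1);  /* shift the tag away: back to 20 */
    printf( "%u %d\n", tagged, untagged );       /* prints: 41 20 */
    return 0;
}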

With this special storage for instances of SmallInteger we use just one word instead of four, which is a great gain. Besides, we don't need to allocate an object for the instance at all, because we store it in the same memory location where the normal object pointer would have been.

The first problem we meet by changing how SmallIntegers are kept in memory is that most virtual machine operations now have to check whether an argument is a SmallInteger or a normal object. One of the basic operations to perform on an object is to get its class. With the uniform method, we just retrieve the first slot of the header. But with tagged integers, we must first test the type of the word: if its last bit is a 1, it is a SmallInteger; if it is not, we retrieve the first slot of the header as usual. This is a performance penalty.
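
A sketch of that test, reusing the STObject layout sketched earlier and assuming 32 bit words (SmallIntegerClass is an invented global, not Huemul's actual name), could read:

extern struct STClass *SmallIntegerClass;    /* assumed global for this sketch */

static inline struct STClass *getClass( unsigned int object ) {
    if( (object & 1) != 0 )                  /* last bit set: a 1 tagged SmallInteger */
        return SmallIntegerClass;
    return ((STObject *)object)->class;      /* normal object: read the first header slot */
}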

One thing that has to be carefully controlled is the alignment of objects in memory. With tagged integers, no object can be misaligned: every object in the system must start at an even memory address. If this is not guaranteed, we cannot distinguish between a SmallInteger and a normal object pointer.

Integer arithmetic is used heavily in any system, so any penalty in this area slows the whole system down noticeably. A Smalltalk system cannot leave any kind of arithmetic error unattended: every arithmetic operation is checked against rules such as overflow, underflow, division by zero, etc. So Smalltalk integer arithmetic has to be a little slower than the unchecked arithmetic performed by other languages like C.

With the uniform object memory approach, an arithmetic function consists of:

 

  1. Check that object 1 and object 2 are SmallIntegers

  2. Get number instance variable from both objects

  3. Perform 32 bit integer arithmetic operation

  4. Do error checking, choose the class of the result, create an object of the correct class

  5. Store and convert result

 

On the other hand, with 1 tagged integers it becomes:

 

  1. Check that object 1 and object 2 are SmallIntegers

  2. Detag both SmallIntegers

  3. Perform 32 bit integer arithmetic operation

  4. Do error checking (the result must fit in 31 bits)

  5. Tag result

 

Let's compare those functions. The first step is to check whether both arguments are SmallIntegers. We could do something like:

 

if( getClass(object1) == SmallInteger && getClass(object2) == SmallInteger )

...continue

That would be enough in the first case, but with 1 tagged integers we can do something like:

 

if( ( object1 & object2 & 1 ) != 0 )

...continue

 

Note that the and operation between the arguments is a bitwise and, not a logical and. The trick is that the only way to continue is for both objects to be SmallIntegers, that is, both with the rightmost bit set to 1.

In step 2 we have to obtain the number from the objects. In the first case:

 

int number1 = object1->number;

 

In the second case we have to detag the object. The procedure consists of the following line of C code:

 

int number1 = (int)specialObject1>>1;

 

Step 3 is the same for both methods: the normal arithmetic operation between the two numbers. For example, for an addition we would do:

 

int result = number1 + number2;

 

Step 4 is more or less the same in both cases, but in the first case we can infer which class of object we need in case the operation overflows, and with that information we create the correct object for the result. With tagged integers we don't do that; we just return with an error.

The last step is to store the result. In the first case, we would do something like this for SmallInteger results:

 

newObject->number = result;

return newObject;

 

With tagged integers, we should do something like:

 

return (ObjectType) (result<<1) | 1;

 

Tagging the number consists of shifting it one place to the left and flagging the rightmost bit with a 1, using a bitwise or operation.

The first case loses time because it has to allocate the result; the second case loses time because it has to tag and detag as part of the operation. This detag-arithmetic-tag sequence can be optimized depending on the operation. For example, for additions we can use this trick:

 

  1. Clear rightmost bit of one of the arguments

  2. Perform a 32 bit addition

 

The result of the operation is the correct sum, already tagged; a much faster technique than either of the previous ones. This trick only works for addition and subtraction. For other operations similar tricks are used, but they are more complex.
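
Putting the pieces together, a generic 1 tagged addition and the optimized variant might look roughly like the sketch below. The function names and the error value 0 are placeholders for illustration; the overflow test is the 31 bit range check from step 4.

/* Generic 1 tagged addition: check, detag, add, check overflow, retag. */
unsigned int addTagged1( unsigned int object1, unsigned int object2 ) {
    if( (object1 & object2 & 1) == 0 )       /* step 1: both must be SmallIntegers */
        return 0;                            /* placeholder error value */
    int number1 = (int)object1 >> 1;         /* step 2: detag both arguments */
    int number2 = (int)object2 >> 1;
    int result  = number1 + number2;         /* step 3: 32 bit addition */
    if( result > 0x3FFFFFFF || result < -0x40000000 )
        return 0;                            /* step 4: result does not fit in 31 bits */
    return ((unsigned int)result << 1) | 1;  /* step 5: tag the result */
}

/* Optimized trick for addition: clear the tag of one argument, then add.
 * (2x) + (2y + 1) = 2(x + y) + 1, which is already the tagged sum. */
unsigned int addTagged1Fast( unsigned int object1, unsigned int object2 ) {
    return (object1 & ~1u) + object2;        /* overflow checking omitted in this sketch */
}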

Huemul implements tagged integers using a different approach that helps simplify their arithmetic and also makes it easier to inline those algorithms in the compiled methods themselves. This approach is called 0 tagged integers. As opposed to the 1 tagged integers explained so far, 0 tagged integers invert the use of the flag in the rightmost bit of the word: a 0 flag marks SmallIntegers and a 1 flag marks object pointers.

    Small Integer      | 0
    Object pointer     | 1

SmallInteger operations are simpler with 0 tagged integers, because the tag operation itself is simpler than the one used for 1 tagged integers:

    1 tagged:  (ObjectType) (result<<1) | 1;
    0 tagged:  (ObjectType) (result<<1);

This means two instructions in the first case and just one in the second. It may not seem like a big advantage, but remember that SmallIntegers are used a lot, and lots of operations use them.

Detagging is the same in both cases; it is just:

 

(int)specialObject1>>1;
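
Collected in one place, the 0 tagged test, tag and detag operations could be written as the following small helpers; the names are illustrative and not taken from Huemul's sources.

static inline int isSmallInteger0Tag( unsigned int word ) {
    return (word & 1) == 0;                  /* rightmost bit clear: a 0 tagged SmallInteger */
}

static inline unsigned int tagSmallInteger0( int value ) {
    return (unsigned int)value << 1;         /* just a shift, no or needed */
}

static inline int detagSmallInteger0( unsigned int word ) {
    return (int)word >> 1;                   /* arithmetic shift restores the sign */
}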

 

But the improvement is not only in the tagging mechanism; arithmetic operations are also easier to implement. This is how the SmallInteger addition primitive is implemented in Huemul, in x86 assembler code:

 

movl $1,%%edx

movl %[rcvr],%[result]

addl %[arg],%[result]

cmovol %%edx,%[result]

 

It uses the x86 instruction cmov, a conditional move: put this value here, but only if the condition is met. This is what is happening in the code.

The first instruction loads the number 1 into the %edx register. The number 1 is used in Huemul as the error constant. In 0 tagged notation the number 1 is an object pointer (an address plus 1) because it is an odd number, and the object it would point to is at address 0. We will never have an object at address 0, which makes it a good candidate for the error constant. In the virtual machine this constant is defined with the symbol “generalError”.

The second instruction loads the receiver into the result register (a previous step, not shown in the code, has already checked that both the receiver and the only argument are SmallIntegers). The next instruction does the addition, a 32 bit signed addition, and keeps the result in the result register. Because both operands are 0 tagged (each is its value shifted one place to the left), adding them directly already yields the correctly tagged sum, so no detag or tag step is needed. The last instruction says: if the addition signaled an overflow, put the error flag we previously loaded into %edx in place of the result.

If we had not used the conditional move instruction, we would have used a normal compare and jump pair. The problem with compare and jump is that CPUs try to predict whether the jump will be taken. If the prediction is right, no CPU cycles are wasted; if it is wrong, a jump misprediction occurs and a lot of CPU cycles are lost. With the conditional move, no matter what happens with the addition (whether it overflows or not), no jump is made, so there is no chance of misprediction, saving lots of CPU cycles.
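
The %% and %[name] markers suggest GCC extended inline asm. As a rough sketch only (the wrapper name, the CObjectPointer32 stand-in, the constraints and the assumption of a 32 bit build are mine, not Huemul's), the primitive could be wrapped like this:

typedef unsigned int CObjectPointer32;   /* stand-in for a 32 bit object word */

static inline CObjectPointer32 addSmallIntegers( CObjectPointer32 rcvr, CObjectPointer32 arg ) {
    CObjectPointer32 result;
    asm( "movl   $1, %%edx\n\t"            /* generalError: an odd word whose "object" is at address 0 */
         "movl   %[rcvr], %[result]\n\t"   /* copy the receiver into the result register */
         "addl   %[arg], %[result]\n\t"    /* 0 tagged operands add directly: 2x + 2y = 2(x+y) */
         "cmovol %%edx, %[result]"         /* on overflow, replace the result with generalError */
         : [result] "=&r" (result)
         : [rcvr] "r" (rcvr), [arg] "r" (arg)
         : "edx", "cc" );
    return result;
}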

 

With 0 tagged integers we cannot use an object pointer directly, because it points to the byte after the address where the object actually is. For example, say the lone instance of the True class, the true object, is stored at memory location 0xA123B450. Every pointer in the system that refers to the true object should hold that address, but since object pointers are 1 tagged, every such pointer holds the address of the object plus 1; in the example, 0xA123B451. That does not mean the object is at that address: it is still aligned to the address ending in 0. Whenever we want to access something in the object, we have to detag the pointer. This is not really a problem, because for the CPU it is just as fast to:

 

load the contents of pointer X

 

as to:

 

load the contents of pointer X - 1.

 

because loading the contents of pointer X is in fact loading the contents of pointer X + 0.
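
As a small illustrative sketch (the struct and function names are invented), the subtraction simply becomes part of the constant displacement of the load, for example movl -1(%eax),%edx instead of movl 0(%eax),%edx:

#include <stdint.h>

struct Header { void *class; };   /* just the first header word, for illustration */

static inline void *classThroughTaggedPointer( uintptr_t taggedPointer ) {
    /* the tagged pointer is the real address + 1; the compiler folds the -1
       into the load's constant displacement, so the access costs the same */
    return ((struct Header *)(taggedPointer - 1))->class;
}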

The next piece of code is extracted from Huemul's virtual machine:

 

CObjectPointer primitiveExternalAddressToString( CObjectPointer extAddress ) {
    STArray *externalAddress = ( STArray * )detagObject(extAddress);
    char *cString = (char *)externalAddress->array[0];
    return(tagObject((STObject * )newString(cString)));
}

Every time Huemul has to convert an image object to a normal C pointer it issues a detagObject(); conversely, when a C pointer has to be converted to an image object, a tagObject() is issued.

The code for tagObject is:

 

static inline
CObjectPointer tagObject( STObject *object ) {
    return (CObjectPointer)(((unsigned int)object)+1);
};



And the code for detagObject:

static inline
STObject *detagObject( CObjectPointer taggedObject ) {
    return (STObject *)(((unsigned int)taggedObject)-1);
};

It should be noted that when the final optimized version of Huemul is compiled, this code is inlined at the call site, so there is no speed overhead in using these functions.

 

When Huemul switched from 1 tagged integers to 0 tagged integers, the overall performance gain was about 10%. This figure is not very precise, but it is representative enough to show that the effort of converting the image footprint, the virtual machine and the embedded compiler was worth it.

 

 

I would like to give special thanks to Klaus D. Witzel for his advice and support while developing the 0 tagged integer solution for Huemul.