About long, int and the php unpack function – or “where is the sign”? [UPDATED]

UPDATE:
Thanks to Ilia for reviewing and committing the patch.

I recently noticed bug #38770 in the PHP bug tracker. It reports that pack behaves differently on different platforms.
The problem is caused by a byte-wise conversion from int to long (in C).

A programmer would expect the following code to pack a [b]32-bit int[/b].

print_r(unpack("N", pack("N", -3000)));

So let's take a closer look at what PHP does.
The binary representation of 3000 as a 32-bit int, written with the most significant byte on the left (x86 and x86_64 are little-endian machines):

bits:  31-24    23-16    15-8     7-0
value: 00000000 00000000 00001011 10111000

Now if we want the negative value -3000, we have to take the [b]two's complement[/b], as that is how machines store negative values (German readers can find an explanation here). -3000 in binary:

bits:  31-24    23-16    15-8     7-0
value: 11111111 11111111 11110100 01001000
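The two's complement step can be checked with a few lines of C. This is only a minimal sketch: the helper name `twos_complement_negate` is made up for illustration and is not part of the PHP source, and `uint32_t` stands in for a 32-bit int.

```c
#include <stdint.h>

/* Hypothetical helper (not from the PHP source): two's complement
 * negation of a 32-bit pattern, i.e. invert every bit, then add one. */
uint32_t twos_complement_negate(uint32_t value)
{
    return ~value + 1u;
}
```

Passing 3000 (0x00000BB8) should give 0xFFFFF448, which is the bit pattern shown above for -3000.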

So [b]pack[/b] does a little-endian to big-endian conversion, which is done byte-wise.

bits:  31-24    23-16    15-8     7-0
value: 01001000 11110100 11111111 11111111
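A byte-wise conversion like this can be sketched in C as follows. The function name `swap_bytes32` is an assumption for illustration; PHP's pack uses its own byte maps internally, so this only shows the effect on the bit pattern, not the actual implementation.

```c
#include <stdint.h>

/* Hypothetical helper: reverse the order of the four bytes of a
 * 32-bit value, which is what a little/big endian conversion does. */
uint32_t swap_bytes32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |  /* lowest byte becomes highest */
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);   /* highest byte becomes lowest */
}
```

Applied to 0xFFFFF448 (-3000), this yields 0x48F4FFFF, matching the swapped pattern above. Applying it twice restores the original value.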

Now the pack function returns that representation.
The problem is the unpack function.
It does the same conversion, but unpack works on a long: [b]it uses a long value to store the result[/b] of the byte-wise conversion. But a long is [b]8 bytes on x86_64[/b] and [b]4 bytes on x86[/b].

The long value, filled with the int obtained from pack after it has been converted back to little endian:
bits:  63-56    55-48    47-40    39-32    31-24    23-16    15-8     7-0
value: 00000000 00000000 00000000 00000000 11111111 11111111 11110100 01001000

But the correct representation of -3000 as a long (8 bytes) on x86_64 would be:

bits:  63-56    55-48    47-40    39-32    31-24    23-16    15-8     7-0
value: 11111111 11111111 11111111 11111111 11111111 11111111 11110100 01001000

On x86 it would be:

bits:  31-24    23-16    15-8     7-0
value: 11111111 11111111 11110100 01001000

So we see that by unpacking the int and using [b]a long to store[/b] the result, the sign information is lost on x86_64 machines. Since a long is 4 bytes on x86, the same size as an int, the sign is preserved there.
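The effect can be reproduced outside PHP. The sketch below is not the PHP code: both function names are invented for illustration, and the fixed-width types `int64_t` and `int32_t` stand in for an 8-byte and a 4-byte long so that the result does not depend on the machine it is compiled on.

```c
#include <stdint.h>

/* Hypothetical stand-in for unpack on x86_64: the four big-endian bytes
 * are collected in a 64-bit value, so the upper 32 bits stay zero. */
int64_t unpack_be32_into_64(const unsigned char *bytes)
{
    int64_t v = 0;
    for (int i = 0; i < 4; i++) {
        v = (v << 8) | bytes[i];
    }
    return v;
}

/* Hypothetical stand-in for unpack on x86: the result is exactly
 * 32 bits wide, so the top bit is still the sign bit. */
int32_t unpack_be32_into_32(const unsigned char *bytes)
{
    uint32_t v = 0;
    for (int i = 0; i < 4; i++) {
        v = (v << 8) | bytes[i];
    }
    return (int32_t)v;
}
```

For the bytes FF FF F4 48, the 64-bit variant returns 4294964296 (that is 2^32 - 3000), while the 32-bit variant returns -3000.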

To solve this problem we could convert from signed to unsigned in pack, but that might change pack's behaviour and break BC. Instead, I decided to check, for the pack options that handle ints (N, I, V), whether the int is signed (by testing the most significant bit of its most significant byte) and, if so, to fill all bits above the lower 32 with 1 (bits 33 to 64 on 64-bit machines) to preserve the sign.

// check if long is bigger than int

// the top bit of the int's most significant byte is the sign bit
issigned = input[inputpos + (machine_little_endian ? (sizeof(int) - 1) : 0)] & 0x80;
if (sizeof(long) > sizeof(int) && issigned) {
    v = ~INT_MAX; // sets bytes 8 to 5 to 1 (plus the int's sign bit, which is set anyway); the lower bits will hold our int
} else {
    v = 0;
}

Finally, we get the result of the unpack as a long holding the int in its lower 4 bytes:

// the bitwise OR only fills the lower 4 bytes of v, so the upper 4 bytes,
// already set to 1 by the code above, are preserved
v |= php_unpack(&input[inputpos], sizeof(int), (type=='i')?issigned:0, int_map);

So this is what happens in binary if long is bigger than int on a machine:

v = ~INT_MAX gives (bit 31 is set as well, which is harmless because a negative int has it set anyway):
bits:  63-56    55-48    47-40    39-32    31-24    23-16    15-8     7-0
value: 11111111 11111111 11111111 11111111 10000000 00000000 00000000 00000000

php_unpack(&input[inputpos], sizeof(int), (type=='i')?issigned:0, int_map);
will return:
bits:  63-56    55-48    47-40    39-32    31-24    23-16    15-8     7-0
value: 00000000 00000000 00000000 00000000 11111111 11111111 11110100 01001000

The OR operator combines both values:
v |= php_unpack(&input[inputpos], sizeof(int), (type=='i')?issigned:0, int_map);

bits:  63-56    55-48    47-40    39-32    31-24    23-16    15-8     7-0
value: 11111111 11111111 11111111 11111111 11111111 11111111 11110100 01001000
which is the binary long representation of -3000.
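The same mask-and-OR idea can be tried in isolation. This is a rough sketch under stated assumptions, not the patch itself: the function name `sign_extend_32_to_64` is hypothetical, `int64_t` stands in for an 8-byte long, and `INT32_MAX` plays the role of `INT_MAX`.

```c
#include <stdint.h>

/* Hypothetical helper: widen a raw 32-bit pattern to 64 bits and keep
 * its sign by pre-filling the upper bits with 1 before the OR. */
int64_t sign_extend_32_to_64(uint32_t raw)
{
    int64_t v = 0;

    if (raw & 0x80000000u) {          /* sign bit of the 32-bit value */
        v = ~(int64_t)INT32_MAX;      /* 0xFFFFFFFF80000000 */
    }
    v |= (int64_t)raw;                /* fill in the lower 32 bits */
    return v;
}
```

With the pattern 0xFFFFF448 this returns -3000, and a non-negative input such as 3000 passes through unchanged because the mask is never applied.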

If a programmer wants to pack an int, they get the int value back without it being influenced by the conversion to long.

The php patch can be found at:
http://sqlbackup.net/data/pack.c.diff
