Derick Rethans

Adverse ABIs

After last week's release candidate I started receiving bug reports that Xdebug almost instantly crashed PHP. Upon further investigation, it turned out to be a problem only on macOS, which is good, as it does not affect everybody, but bad, as I don't have a macOS machine to reproduce it on.

I did some screen sharing with Frédéric Marand to dig into what the problem was, but we couldn't really get to the bottom of it. Everything looked fine until Xdebug called the php_addslashes C function (the internal implementation of addslashes). This matched the exact location that David Vuorio showed in his stack trace in the bug tracker.

Still tearing my hair out, I contacted Nikita Popov to see if he knew whether there was anything special about php_addslashes. His reply:

I have some suspicions...

And his suspicion turned out to be right.

PHP 7.3 has received many optimisations. One of them is the use of specific CPU instructions to speed up certain tasks. In this case, the PCMPESTRM instruction is used to find whether a string has characters that need to be escaped.

At build time, PHP's ./configure script checks whether the CPU it is built on supports these newer CPU instructions. If so, it uses some header magic to select a specialised version of php_addslashes. This works fine if both PHP and the extension that uses php_addslashes are compiled on the same machine. However, if PHP is compiled on a machine without support, and the extension (Xdebug) is compiled on a machine with support, then the ABI (calling convention) of the php_addslashes C function differs between the two. A crash then occurs when the extension calls PHP's php_addslashes function "in the wrong way". And that is exactly what happened here: the PHP binary that came from Homebrew uses a different ABI than the Xdebug binary that users compiled themselves. Modifying the installed headers made the problem go away.

Nikita soon fixed this in the PHP 7.3 branch with a hack to make sure that the ABI stays the same among PHP 7.3 versions, and with a proper fix for PHP 7.4 and later to prevent this situation from occurring in the future. Xdebug works around it for PHP 7.3.0 and PHP 7.3.1: instead of relying on PHP's implementation of php_addslashes, it uses its own (xdebug_add_slashes). It also uses its own implementations of php_base64_encode and php_base64_decode, which exhibit a similar issue. As Xdebug's implementations are slightly modified to work better for Xdebug, there is a small performance improvement too.
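The proper fix for this class of problem is to move the choice of implementation from compile time to run time, so the public symbol keeps a single ABI no matter which machine an extension was built on. Here is a minimal sketch of that pattern; all names are hypothetical, and this is not PHP's actual code:

```c
#include <stddef.h>

/* One public symbol with one stable ABI; the fast path is chosen at
 * run time, so binaries built on different machines stay compatible. */

static size_t count_specials_scalar(const char *s)
{
    size_t n = 0;
    for (; *s; s++) {
        if (*s == '\'' || *s == '"' || *s == '\\') {
            n++;
        }
    }
    return n;
}

/* A vectorised variant (e.g. using PCMPESTRM intrinsics) would live
 * here; for the sketch it just reuses the scalar code. */
static size_t count_specials_simd(const char *s)
{
    return count_specials_scalar(s);
}

static size_t (*count_specials_impl)(const char *) = NULL;

/* The only exported entry point: the same signature everywhere. */
size_t count_specials(const char *s)
{
    if (count_specials_impl == NULL) {
        /* Stand-in for real CPU feature detection, such as
         * __builtin_cpu_supports("sse4.2") on GCC and Clang. */
        int cpu_has_sse42 = 0;
        count_specials_impl = cpu_has_sse42 ? count_specials_simd
                                            : count_specials_scalar;
    }
    return count_specials_impl(s);
}
```

Because callers only ever see count_specials with one fixed signature, it no longer matters whether the interpreter and the extension were configured on machines with different CPU capabilities.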

Time for Release Candidate 2!

02/19/2019 03:15 am   derickrethans.nl

Memory Madness

I'm currently preparing Xdebug for its 2.7.0rc1 release, the first (and hopefully last) release candidate. There is one issue left, and it is the one I have been working on over the last few days.

The crash popped up when running the Zeta Components test suite for the Document component, and it manifests itself as a segmentation fault:

derick@gargleblaster:~/dev/zetacomponents-document$ php  ~/phpunit-5.7.22.phar
PHPUnit 5.7.22 by Sebastian Bergmann and contributors.

.............................................................   61 / 1198 (  5%)
...............................S.............................  122 / 1198 ( 10%)
..............................................E......Segmentation fault

When running it from within gdb, we see:

.............................................................   61 / 1198 (  5%)
...............................S.............................  122 / 1198 ( 10%)
..............................................E..............  183 / 1198 ( 15%)
...............S.....S.....SS................................  244 / 1198 ( 20%)
...................................E.......F.................  305 / 1198 ( 25%)
..........E..................F..E............................  366 / 1198 ( 30%)
.............................................................  427 / 1198 ( 35%)
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff17a96fb in zval_get_type (pz=0x42) at /usr/local/php/7.3.0/include/php/Zend/zend_types.h:411
411             return pz->u1.v.type;
(gdb) bt
#0  0x00007ffff17a96fb in zval_get_type (pz=0x42) at /usr/local/php/7.3.0/include/php/Zend/zend_types.h:411
#1  0x00007ffff17ad2dc in xdebug_var_export (struc=0x7fffffff7c68, str=0x555558f36490, level=1, debug_zval=0, options=0x5555590c3cc0) at /home/derick/dev/php/xdebug-xdebug/xdebug_var.c:974
#2  0x00007ffff17ad812 in xdebug_get_zval_value (val=0x555559180120, debug_zval=0, options=0x5555590c3cc0) at /home/derick/dev/php/xdebug-xdebug/xdebug_var.c:1059
#3  0x00007ffff17a1f11 in add_single_value (str=0x7fffffff7dd0, zv=0x555559180120, html=0, collecton_level=4) at /home/derick/dev/php/xdebug-xdebug/xdebug_stack.c:371
#4  0x00007ffff17a23dd in xdebug_append_printable_stack (str=0x7fffffff7dd0, html=0) at /home/derick/dev/php/xdebug-xdebug/xdebug_stack.c:451
#5  0x00007ffff1782089 in xdebug_throw_exception_hook (exception=0x7fffe96ae200) at /home/derick/dev/php/xdebug-xdebug/xdebug.c:1586

But these errors can sometimes be misleading. As you can see, the error now happens a bit later during the test run.

In these cases, I usually use Valgrind instead, as it gives a warning as soon as a memory issue occurs, such as a use-after-free situation, or when reading uninitialised memory.

Running the tests with Valgrind, however, gave very strange results:

derick@gargleblaster:~/dev/zetacomponents-document$ valgrind php ~/phpunit-5.7.22.phar --filter ezcDocumentConverterDocbookToRstTests::testLoadXmlDocumentFromFile
==38610== Memcheck, a memory error detector
==38610== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==38610== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==38610== Command: php /home/derick/phpunit-5.7.22.phar --filter ezcDocumentConverterDocbookToRstTests::testLoadXmlDocumentFromFile
==38610== Conditional jump or move depends on uninitialised value(s)
==38610==    at 0x4035D2A: ???
==38610==    by 0x13E352D7: ???
==38610==    by 0x13E352D7: ???
==38610==    by 0x13E352F6: ???
==38610==    by 0xB102CCF: ???
==38610==    by 0x13E352D7: ???
==38610== Conditional jump or move depends on uninitialised value(s)
==38610==    at 0x4035D2A: ???
==38610==    by 0x13EE3BA7: ???
==38610==    by 0x13EE3BA7: ???
==38610==    by 0x13EE3BC6: ???
==38610==    by 0xB102CCF: ???
==38610==    by 0x13EE3BA7: ???

Normally, these ??? items contain the names of C functions, but in this case they are all unknown. That often means that Valgrind does not have the right debugging symbols available. After checking, I found that I did have the debugging symbols installed. Another odd point is that usually at least some recognisable name is present in stack traces for uninitialised value(s).

I first thought I had found a bug in Valgrind, or that perhaps the packaging was not done right for my Linux distribution. I know somebody who works on Valgrind, so I asked on IRC whether he knew what was up. Beyond the debugging symbols, he couldn't provide additional suggestions either. But Nicolás Álvarez, who was also hanging out in the channel, asked whether PHP has a JIT engine. I answered that it doesn't (yet), and then the light bulb went on: although PHP itself does not have a JIT, the PCRE regular expression engine that PHP uses does.

After disabling PCRE's JIT with -dpcre.jit=0, the Valgrind output is now what I expect—with just the memory error that I thought was happening:

derick@gargleblaster:~/dev/zetacomponents-document$ valgrind php -dpcre.jit=0 ~/phpunit-5.7.22.phar --filter ezcDocumentConverterDocbookToRstTests::testLoadXmlDocumentFromFile
==38648== Memcheck, a memory error detector
==38648== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==38648== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==38648== Command: php -dpcre.jit=0 /home/derick/phpunit-5.7.22.phar --filter ezcDocumentConverterDocbookToRstTests::testLoadXmlDocumentFromFile
PHPUnit 5.7.22 by Sebastian Bergmann and contributors.

.....==38648== Invalid read of size 4
==38648==    at 0xB63E37E: xdebug_zend_hash_is_recursive (xdebug_compat.c:260)
==38648==    by 0xB65F1CF: xdebug_var_export (xdebug_var.c:965)
==38648==    by 0xB65F811: xdebug_get_zval_value (xdebug_var.c:1059)
==38648==    by 0xB653F10: add_single_value (xdebug_stack.c:371)
==38648==    by 0xB6543DC: xdebug_append_printable_stack (xdebug_stack.c:451)
==38648==    by 0xB634088: xdebug_throw_exception_hook (xdebug.c:1586)
==38648==    by 0x9BF803: zend_throw_exception_internal (zend_exceptions.c:166)
==38648==    by 0x9C3413: zend_throw_exception_object (zend_exceptions.c:1046)
==38648==  Address 0x1426f774 is 4 bytes inside a block of size 56 free'd
==38648==    at 0x4C2CDDB: free (vg_replace_malloc.c:530)
==38648==    by 0x95E07A: _efree (zend_alloc.c:2508)
==38648==    by 0x9A95C3: zend_array_destroy (zend_hash.c:1513)
==38648==    by 0x99043A: zend_array_destroy_wrapper (zend_variables.c:90)
==38648==    by 0x9902C5: rc_dtor_func (zend_variables.c:65)
==38648==    by 0x9ED835: zend_assign_to_variable (zend_execute.h:108)
==38648==    by 0xA54307: ZEND_ASSIGN_SPEC_CV_VAR_RETVAL_UNUSED_HANDLER (zend_vm_execute.h:45824)
==38648==    by 0x9FA722: ZEND_USER_OPCODE_SPEC_HANDLER (zend_vm_execute.h:1829)
==38648==  Block was alloc'd at
==38648==    at 0x4C2BBAF: malloc (vg_replace_malloc.c:299)
==38648==    by 0x95ECFF: __zend_malloc (zend_alloc.c:2904)
==38648==    by 0x95DFD3: _emalloc (zend_alloc.c:2494)
==38648==    by 0x9AAAAA: zend_array_dup (zend_hash.c:1923)
==38648==    by 0xA5664A: ZEND_ASSIGN_DIM_SPEC_CV_UNUSED_OP_DATA_CV_HANDLER (zend_vm_execute.h:47041)
==38648==    by 0x9FA722: ZEND_USER_OPCODE_SPEC_HANDLER (zend_vm_execute.h:1829)
==38648==    by 0xA5F7D9: execute_ex (zend_vm_execute.h:55557)
==38648==    by 0xB634E90: xdebug_execute_ex (xdebug.c:1876)

And with the insight this gave me, I can now attempt to find and fix the bug, and finally release Xdebug 2.7.0rc1.

01/31/2019 07:11 am   derickrethans.nl

The Xdebug Experience

Xdebug development is currently in its 17th year. I started working on it when PHP 4.2.0 had just been released. The first CVS commit added the (in)famous "maximum nesting level" functionality, to prevent PHP from crashing when your script would infinitely recurse. I had forgotten this, but apparently Xdebug has some origins in VL-SRM, a naive attempt to create an application server running PHP bananas (think of: Java beans). The VL prefix still lives on as part of my PHP internals debugging extension VLD.

Where Are We Now?

As with any legacy project, Xdebug has technical debt. Some of it I managed to address over the years by dropping support for older PHP versions. There is a fair amount of sub-optimal code, as well as algorithmic and design issues, which impact ease of configuration, usability, and performance.

Configuration Confusion

On the configuration side, take for example the 65 configuration settings. There are a few pointless ones (xdebug.extended_info, xdebug.remote_handler), a few that duplicate functionality (xdebug.trace_output_dir and xdebug.profiler_output_dir), and a few that should never be used together (xdebug.remote_enable and xdebug.profiler_enable).

Unproductive Usability

If we look at usability, there are situations where breakpoints don't break. Some of these I can likely address, whereas others I cannot, due to the way PHP works internally (see also: The Mystery of the Missing Breakpoints). Having OPcache in the mix does not help either, as it can create another set of problems where visible code has been optimised out.

To solve a third group of issues with breakpoints, the DBGp "breakpoint resolved" notification needs to be implemented in both Xdebug and IDEs.

Another issue is that both Xdebug and PHP-FPM use port 9000 as their default, although Xdebug's use of port 9000 precedes PHP-FPM's by about five years (2004 vs 2009).

And lastly, (potential) users often cite that Xdebug is "hard" to set up. After talking to some of these users, though, it often becomes clear that the problem is not so much on the Xdebug side of things, but rather with other tools: their IDE, Docker, SELinux, firewalls, etc. Improving this is a matter of education (and better error messages).

Poor Performance

On the performance side, there are some places with major issues, and some others with minor issues. Code coverage isn't particularly fast, and that needs an in-depth investigation. I have several ideas on where improvements in code coverage performance can be made—nonetheless, I would continue to put correctness over speed.

Even with Xdebug just loaded and enabled, there is too much of a performance impact. This is because Xdebug usually has every feature ready to go, even though you don't necessarily want to use all of these different features. For example, it is ludicrous to use single step debugging with the profiler turned on.

Although it's often possible to reduce the impact of feature sets by setting configuration options correctly, a better way to put Xdebug in a specific mode would be welcome.

What Needs to Be Done?

In order to address many of these problems, Xdebug must undergo a massive refactoring. That refactoring must primarily focus on solving the three categories of issues mentioned above, although improvements in code layout are also desirable.

As part of the refactoring process, the following primary tasks need to be completed, in preferably the listed order:

  • Finalize PHP 7.3 support (there is a nasty bug where Xdebug and a specific OPcache optimisation setting conflict).

  • Code needs to be reorganised, as the current location of source files and functions is detrimental to improvements.

  • The number of configuration options needs to be reduced.

  • Modes need to be introduced, so that it is easier to turn (internal) code that supports different features on and off. This alone will very likely improve general performance.

  • Specific code paths and features need to be optimised. This includes primarily the basic features (stack traces, var_dump() improvements, etc.), as well as Code Coverage.

What is Needed to Get This Done?

I would love to be able to continue to work on Xdebug; to keep it in sync with changes in PHP itself, to implement the ideas from this article to improve the Xdebug Experience, and to produce educational material such as improved documentation, tutorials, and videos. (Dare I say a book?)

Therefore I am currently looking for ways to be funded for doing all the required work. It will take a lot of time to get Xdebug 3 out in tip-top shape: possibly six months, if not more. Although I have plenty of time now that I've left MongoDB, the bills and mortgage do need to get paid too. I can think of a few options:

  • Get more users to sign up to my Patreon account. There is a group of loyal users and companies that contribute towards the upkeep of the server, but in order to make this sustainable, the patron count needs to increase about 30-fold. I struggle with setting up rewards to nudge people to support me.

  • A fundraiser (through a crowdfunding site) for specific tasks and/or features from the plan.

  • Some functionality that would only be available under a commercial license. One of the ideas here is to add a recording feature that records the full execution of a script, which then later can be replayed back through an interactive single step debugging session in an IDE.

  • Work with IDE manufacturers to implement some of their requested features (such as the aforementioned "breakpoint resolved" notification), and to come up with new features to make the debugging experience better.

I am interested to hear whether you have a specific preference, or perhaps some additional suggestions for me to consider. I would absolutely want to give Xdebug all the love it still deserves after 17 years. Let me know what you think! Either in a comment, or by email.

cheers, Derick

01/22/2019 03:23 am   derickrethans.nl

Moving On From MongoDB

For the last seven years, I have worked with, and for, MongoDB. I have seen it grow from a small start-up, then called 10gen, with fewer than a hundred people, into a NASDAQ-listed company with over a thousand employees. But now, after all these years, it's time for me to move on.

I've learned a lot, and I have worked with very good people. In these seven years I rewrote the PHP driver (twice), wrote an HHVM driver (remember HHVM?), and implemented time zone support in the MongoDB server. I will be leaving the PHP driver for MongoDB in Jeremy's and other colleagues' capable hands.

I have come to greatly admire both the engineers at MongoDB, as well as the technology that we all created. I am a big fan of MongoDB, and will continue to speak and write about it.

What's next for me? That I'll explain in my next post.

01/16/2019 10:23 am   derickrethans.nl

The Confused C-Driver

Over the past few months I have been adding the MongoDB driver for PHP to Evergreen, MongoDB's CI system. Although we already test on Travis and AppVeyor, adding the driver to Evergreen also allows us to test on more esoteric platforms, such as ARM64, Power8, and zSeries.

While adding the zSeries architecture to our test matrix, we noticed that one specific group of tests was failing. In these tests we would do an insert with a write concern of {w: 0} followed by an insert with {w: 1}.

A write concern is used to enforce specific guarantees on how many MongoDB servers a write needs to be replicated to, before the primary responds to the client that the write has been acknowledged. A write concern of {w: 0} means that the client does not care about the result of the write operation, and therefore does not need a reply from the server. These are also referred to as unacknowledged writes.

The tests failed in such a way that the result of the second insert (with {w: 1}) seemed empty.

Figuring out what was going wrong was not particularly easy in this case. Single-stepping through a lot of connection and socket handling code with a fairly complicated protocol takes time. After several hours it was still unclear what was going wrong.

I tried to find out whether the data was sent correctly over the wire. As we only have a single shared zSeries development server without a GUI, I could not use Wireshark. Moreover, I could not use tcpdump either, as I didn't have any sudo rights:

$ tcpdump -i lo -nnXSs 0 'port 27017'
tcpdump: lo: You don't have permission to capture on that device
(socket: Operation not permitted)

After some time I figured out that you can also use strace for finding out what goes over the wire, albeit in a slightly annoying format. I used the following invocation:

strace -f -e trace=network -s 10000 -p 22506

The -f makes strace follow forked processes, the -e trace=network shows the system calls of all network related operations, the -s 10000 makes sure we see 10 000 bytes of data in strings, and the 22506 value for the -p argument is the process ID.

On the zSeries platform, the strace dump looks like the following. The first group is the insert with {w: 0}, and the second group the insert with {w: 1}:

recvmsg(36, {msg_name(0)=NULL, msg_iov(1)=[{"\252\0\0\0\2\0\0\0\0\0\0\0\335\7\0\0", 16}], msg_controllen=0, msg_flags=0}, 0) = 16
recvmsg(36, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\2\0h\0\0\0\2insert\0#\0\0\0server_server_executeBulkWrite_002\0\10ordered\0\1\2$db\0\7\0\0\0phongo\0\3writeConcern\0\f\0\0\0\20w\0\0\0\0\0\0\0\1,\0\0\0documents\0\36\0\0\0\20wc\0\0\0\0\0\7_id\0\\\31\32\27\r\0200G%yF\322\0", 154}], msg_controllen=0, msg_flags=0}, 0) = 154
sendmsg(36, {msg_name(0)=NULL, msg_iov(1)=[{"&\0\0\0v\0\0\0\2\0\0\0\335\7\0\0\0\0\0\0\0\21\0\0\0\1ok\0\0\0\0\0\0\0\360?\0", 38}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 38

recvmsg(36, {msg_name(0)=NULL, msg_iov(1)=[{"\316\0\0\0\4\0\0\0\0\0\0\0\335\7\0\0", 16}], msg_controllen=0, msg_flags=0}, 0) = 16
recvmsg(36, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\0\0\214\0\0\0\2insert\0#\0\0\0server_server_executeBulkWrite_002\0\10ordered\0\1\2$db\0\7\0\0\0phongo\0\3lsid\0\36\0\0\0\5id\0\20\0\0\0\4\337{QQi\312A\245\213\270\210\376&+\247\260\0\3writeConcern\0\f\0\0\0\20w\0\1\0\0\0\0\0\1,\0\0\0documents\0\36\0\0\0\20wc\0\1\0\0\0\7_id\0\\\31\32\27\r\0200G%yF\323\0", 190}], msg_controllen=0, msg_flags=0}, 0) = 190
sendmsg(36, {msg_name(0)=NULL, msg_iov(1)=[{"-\0\0\0x\0\0\0\4\0\0\0\335\7\0\0\0\0\0\0\0\30\0\0\0\20n\0\1\0\0\0\1ok\0\0\0\0\0\0\0\360?\0", 45}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 45

The first line is the header, with the last four bytes (\335\7\0\0) being the opcode 2013 (OP_MSG). After the opcode, on the second line follow the flagBits: \0\0\0\2. The flagBits field is \0\0\0\0 for the second insert.
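For reference, that 16-byte header can be decoded with a few lines of C. This sketch (struct and function names are mine, not libmongoc's) parses the four little-endian int32 fields — messageLength, requestID, responseTo, and opCode — from the first recvmsg above:

```c
#include <stdint.h>

/* The MongoDB wire-protocol message header: four little-endian int32s. */
typedef struct {
    uint32_t message_length;
    uint32_t request_id;
    uint32_t response_to;
    uint32_t op_code;
} msg_header;

/* Read four bytes as a little-endian 32-bit value, independent of the
 * host's own byte order. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static msg_header parse_header(const uint8_t raw[16])
{
    msg_header h;
    h.message_length = read_le32(raw);
    h.request_id     = read_le32(raw + 4);
    h.response_to    = read_le32(raw + 8);
    h.op_code        = read_le32(raw + 12);
    return h;
}
```

Feeding it the bytes from the trace (\252 octal is 170, \335\7 is 0x07DD) yields a message length of 170 and opcode 2013, OP_MSG.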

On my local Linux machine, the strace dump looks a little more complicated as the data is split into multiple IOV packets, but similar data is present. The trace is also made with the client and server reversed, so sendmsg and recvmsg are also swapped:

sendmsg(4, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\252\0\0\0", iov_len=4}, {iov_base="\2\0\0\0", iov_len=4}, {iov_base="\0\0\0\0", iov_len=4}, {iov_base="\335\7\0\0", iov_len=4},
           {iov_base="\2\0\0\0", iov_len=4}, {iov_base="\0", iov_len=1}, {iov_base="h\0\0\0\2insert\0#\0\0\0server_server_executeBulkWrite_002\0\10ordered\0\1\2$db\0\7\0\0\0phongo\0\3writeConcern\0\f\0\0\0\20w\0\0\0\0\0\0\0", iov_len=104}, {iov_base="\1", iov_len=1}, {iov_base=",\0\0\0", iov_len=4}, {iov_base="documents\0", iov_len=10}, {iov_base="\36\0\0\0\20wc\0\0\0\0\0\7_id\0\\\31\32=\343\231,\25\301\\\203r\0", iov_len=30}], msg_iovlen=11, msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 170

sendmsg(4, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\316\0\0\0", iov_len=4}, {iov_base="\4\0\0\0", iov_len=4}, {iov_base="\0\0\0\0", iov_len=4}, {iov_base="\335\7\0\0", iov_len=4},
           {iov_base="\0\0\0\0", iov_len=4}, {iov_base="\0", iov_len=1}, {iov_base="\214\0\0\0\2insert\0#\0\0\0server_server_executeBulkWrite_002\0\10ordered\0\1\2$db\0\7\0\0\0phongo\0\3lsid\0\36\0\0\0\5id\0\20\0\0\0\4 \314rM\25\362L\7\203\36O\2157\344\201V\0\3writeConcern\0\f\0\0\0\20w\0\1\0\0\0\0\0", iov_len=140}, {iov_base="\1", iov_len=1}, {iov_base=",\0\0\0", iov_len=4}, {iov_base="documents\0", iov_len=10}, {iov_base="\36\0\0\0\20wc\0\1\0\0\0\7_id\0\\\31\32=\343\231,\25\301\\\203s\0", iov_len=30}], msg_iovlen=11, msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 206
recvfrom(4, "-\0\0\0", 4, 0, NULL, NULL) = 4
recvfrom(4, "i\0\0\0\4\0\0\0\335\7\0\0\0\0\0\0\0\30\0\0\0\20n\0\1\0\0\0\1ok\0\0\0\0\0\0\0\360?\0", 41, 0, NULL, NULL) = 41

The first big difference is that locally, there is no recvfrom for the first sendmsg, but there is one for the second insert. The other difference is that the flagBits (the second line in each sendmsg) is \2\0\0\0 for the first insert. This is different in the zSeries trace, where the value is \0\0\0\2.

The latest version of the MongoDB wire protocol uses OP_MSG for all operations, which is a typical request/response API. This means that generally every request is expected to generate a reply from the server. For unacknowledged writes, the client does not need a reply from the server, and the way to tell the server that is by setting the moreToCome flag in the OP_MSG packet. It's bit 1 in the 32 bit wide bit field.

The wire protocol requires the use of the little-endian byte order for numbers. libmongoc, which implements the connection aspects of the PHP driver, accomplishes this by running a "swap" in case the driver is run on a big-endian system such as zSeries.

It turned out that although this swap happened for all normal integers (such as the OP_MSG opcode 2013), it did not happen for the flagBits. This meant that instead of setting the moreToCome flag (bit 1), we set bit 25, which does absolutely nothing. Because of that, the server helpfully sent a reply (the first sendmsg) in the zSeries dump. Because the driver did not expect such a packet, it did not read it from the socket either.

This meant that when the driver read information from the socket in response to the second insert (the one with a write concern of {w: 1}) it read the response of the first insert. And then it got confused.

In the end, the fix was as easy as making sure that the flagBits field was also correctly swapped between big- and little-endian.
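The principle behind the fix is to serialise flagBits explicitly as little-endian instead of copying the host's native representation. A small illustration (the helper names are mine, not libmongoc's):

```c
#include <stdint.h>

#define OP_MSG_MORE_TO_COME UINT32_C(0x00000002) /* bit 1 */

/* Serialise a 32-bit value as little-endian, regardless of the host:
 * byte 0 always holds the least significant byte. */
static void u32_to_le(uint32_t v, uint8_t out[4])
{
    out[0] = (uint8_t)(v);
    out[1] = (uint8_t)(v >> 8);
    out[2] = (uint8_t)(v >> 16);
    out[3] = (uint8_t)(v >> 24);
}

/* Interpret four wire bytes as the little-endian value they encode. */
static uint32_t u32_from_le(const uint8_t b[4])
{
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}
```

On big-endian zSeries, the buggy path effectively emitted the native bytes 00 00 00 02, which a little-endian reader decodes as 0x02000000 — bit 25 — so the server never saw moreToCome.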

Confusion solved!

12/21/2018 03:21 am   derickrethans.nl

The Mystery of the Missing Breakpoints

Occasionally I see people mention that Xdebug does not stop at certain breakpoints. This tends to relate to multi-line if conditions, or if/else constructs without braces ({ and }). Take for example an artificial code sample along these lines (the values are arbitrary; the shape matches the opcode dump below):

 1  <?php
 2  $a = true;
 3  $b = true;
 4  $c = false;
 5  $d = true;
 6
 7  if (
 8      $a &&
 9      $b &&
10      ($c || $d)
11  )
12  {
13      echo "It's true!";
14  }

If you set a breakpoint at either line 7, 11, or 12, you'll find that these are ignored. But why is that?

In order to investigate, we employ vld, a tool I wrote that can show PHP's internal bytecode (oparrays containing opcodes); it reads very much like assembly code. For the above snippet, the vld dump looks as follows (after redacting useless information):

function name:  (null)
number of ops:  19
compiled vars:  !0 = $a, !1 = $b, !2 = $c, !3 = $d
line     #* E I O op             return  operands
   2     0  E >   EXT_STMT
         1        ASSIGN                 !0, 
   3     2        EXT_STMT
         3        ASSIGN                 !1, 
   4     4        EXT_STMT
         5        ASSIGN                 !2, 
   5     6        EXT_STMT
         7        ASSIGN                 !3, 
   8     8        EXT_STMT
         9      > JMPZ_EX        ~8      !0, ->11
   9    10    >   BOOL           ~8      !1
        11    > > JMPZ_EX        ~8      ~8, ->15
  10    12    > > JMPNZ_EX       ~9      !2, ->14
        13    >   BOOL           ~9      !3
        14    >   BOOL           ~8      ~9
        15    > > JMPZ                   ~8, ->18
  13    16    >   EXT_STMT
        17        ECHO                   'It%27s+true%21'
  15    18    > > RETURN                 1

The first column is the line number that PHP associates with each opcode, and you can see that there is no opcode for lines 7, 11, and 12. Additionally, Xdebug can only break on the EXT_STMT opcode, which you can see is only present for lines 8 and 13 in the logic section of the script. If a breakpoint is set on a line without an EXT_STMT opcode, Xdebug will not be able to interrupt the script on that line.

Sometimes it can be confusing where PHP thinks there is a line of code, especially since OPcache optimises more and more things out. In our example, it could really leave only line 13 around, as the rest is static. It doesn't quite do that yet, however.

It is certainly frustrating that Xdebug cannot always stop where you want it to, but IDEs do have a way to interrogate where Xdebug could stop, through a private DBGp command: xcmd_get_executable_lines. This can only be done when the script, or rather a specific function, is already running.

The DBGp protocol has provisions for signalling to IDEs whether it has resolved a breakpoint when a function is entered. Xdebug does not currently implement this functionality, but a ticket for it is scheduled for implementation (likely for Xdebug 3.0).

12/19/2018 03:43 am   derickrethans.nl

Downstream — Trip 8

On the previous trip I walked from Teddington Lock to the Golden Jubilee Bridges. On this walk, I am finishing the series by going to the Thames Barrier. Getting to the start this time was easy peasy, with just a Tube ride down to Embankment.

Day 10 — Golden Jubilee Bridges to Thames Barrier

When I exited Embankment station, the rain had just stopped, and it felt very cold. The weather forecast indicated that it would get sunny, but I was very sceptical. I crossed the Golden Jubilee Bridges slowly, taking my time to shoot some photos and video of the London Eye. I considered making a timelapse, but realised I had done so earlier for London in Fives, when the weather was much better.

The moment I made it to the south bank of the Thames, the Sun came out, and I was happy that I had brought my sunglasses. Not only was the Sun low, it also reflected a lot off the puddles on the ground.

This part of the Thames and the Thames Path is right in the middle of London, and hence usually very busy with both commuters and tourists. Once the Sun came out, it quickly became slow going, with lots of people zigzagging around the puddles.

The tide was out, which meant that there were quite a few people out on the river bed as well, seeking their fortunes. There are still many bridges here, and definitely many landmarks and tourist attractions.

The best known are probably the Tower of London and Tower Bridge. The Victorian Tower Bridge (1894) is significantly younger than the Norman Tower (1078), built after the Normans invaded in 1066. The contrast between the older Tower and Tower Bridge and the tall buildings of The City and City Hall is starker still.

Beyond Tower Bridge, and the big yacht Kismet, the Thames Path becomes its quiet self again, with people going about their weekend morning business. The walk was now a lot more pleasant as I didn't have to dodge around people all the time. With the tide so low, many of the former docks now showed their muddy bottom, which made for some great photos.

After coming around the corner at Rotherhithe, the Thames Path loops nearly 270° around The City, where dark clouds and rainbows loomed over the North bank of the Thames, all the way to Greenwich. At Greenwich, the path goes along the Cutty Sark, a 19th century tea clipper which is now a museum. After crossing the square at the former Royal Naval College, the scenery quickly turns from "lovely residential area" into "industrial", which persists nearly all the way to the Thames Barrier.

One interesting landmark is the Millennium Dome. You can now walk across its roof, which is precisely what some people were doing when I came past. The Emirates Air Line cable car passed high overhead between the two banks of the Thames as I was making my way further downstream.

After just over 300 kilometres of walking in 56 hours, burning 25 000 calories, I made it to the Thames Barrier, London's flood defence barrier. The barrier marks the end of the official Thames Path, although it is possible to extend the walk to the river Darent, where it connects to the London LOOP that I walked a few years ago.

With the Thames Path done, it's time to look for the next adventure—I've got my eye on Hadrian's Wall ;-).

Photos from my Adventure on the Thames Path are available on Flickr, and all videos on Vimeo. You can also see all the photos on a map.

11/13/2018 03:12 am   derickrethans.nl

Downstream — Trip 7

On the previous trip I walked from Staines to Teddington Lock. The start of this walk was easy to reach by a direct train from Waterloo, but it does involve a 1 km walk in Teddington. Do make sure you get out at the right side of the station, otherwise that walk is 2 km ☺.

Day 9 — Teddington Lock to the Golden Jubilee Bridges

I didn't spend a lot of time at Teddington Lock, as I had done so a week earlier. I did notice a few signs saying "Caution! Tidal" as a reminder that the Thames downstream from Teddington Lock is tidal. At the moment it was low tide, although at Teddington that is still a little hard to see due to the presence of Richmond Lock. At Teddington Lock you can choose whether to follow the North or the South Bank to the Thames Barrier. As I know that the North Bank's path has a lot of barriers and diversions, especially further East in the City, I chose to walk the Thames Path along the South Bank.

On most of today's walk I was accompanied by rowers on the Thames—out for practice, teaching, and races. There must be a rowing club every other kilometre or so. On the way into Richmond, the Petersham Hotel loomed above the town.

Just beyond Richmond the Thames Path passes along the Old Deer Park, where I encountered a Meridian Line. Before Greenwich became the Prime Meridian in 1884, the line at Richmond, which passes through the King's Observatory, fulfilled this role for a few centuries.

Not far from the first Meridian Line is Richmond Lock and Footbridge. This is the last lock on the Thames, but it is not a full lock. The lock is accompanied by sluice gates, which are open when the tide is at its highest—around two hours on either side of high tide. During that time boats can bypass the lock gates. When the sluice gates are closed, boats have to use the lock to make their way downstream or upstream. When I walked by, the Thames was nearly at its lowest point, and hence the sluice gates were closed. I didn't see any boats wanting to go through the lock either, though.

Beyond the Old Deer Park, the Thames Path passes between the Thames and Kew Gardens. You can unfortunately not see much of it, due to the dense vegetation and a wall; you can only really get a glimpse of Kew Palace. Royal Botanic Gardens, Kew, as the gardens are also called, is well worth visiting—especially in Spring.

Now that I had passed the last lock at Richmond, there are no more locks all the way to the North Sea. There are, however, plenty of bridges crossing the Thames, carrying road, rail, and tube traffic. It became very clear that it was low tide, as vast areas of the riverbed were visible, with plenty of people hunting for treasures.

I had lunch in Putney, at The Duke's Head, where I also had a local brew from nearby Wimbledon brewery. I did pass by the Stag Brewery as well, but that only makes Budweiser, or so their sign says.

After lunch the small deviations from the Thames Path started, as there is a lot of new building work going on in Wandsworth and Battersea. The path wasn't always signed very well, or the signs were outdated due to either new developments, or completed ones with the redirections still in place. Only at the former Battersea Power Station was the deviation a little annoying, as it included walking along a busy road.

After going around the SIS building, I was back on familiar territory. I have walked along the Thames east from Lambeth Bridge to Tower Bridge many times. This is also the section that becomes really busy on nice warm autumn days, so my progress slowed a little.

A few more bridges later, the Sun started to get closer to setting. I always like this moment in the day, as the light becomes extra yellow, which, in my opinion, makes for much better photos. You also get to catch the Sun's rays on buildings, such as Lambeth Palace.

Across the river from Lambeth Palace is the Palace of Westminster, where the UK's parliament sits. One of its towers is the Elizabeth Tower, which houses the bell Big Ben. The latter is currently out of action and wrapped in scaffolding, as the bell and its tower are being refurbished.

I finished my walk just beyond the London Eye at the Golden Jubilee Bridges. From there it was a short walk to Waterloo station, where I took the tube home. It did, however, take nearly an hour to get from the bridge to the station, but that was only because I sat down at the Waterloo Tap for a restorative beverage.

Only one more walk to go!

Photos from my Adventure on the Thames Path are available on Flickr, and all videos on Vimeo. You can also see all the photos on a map.

10/30/2018 04:17 am   derickrethans.nl   Mirror   Link  

Downstream — Trip 6

On the previous trip I walked from Bourne End to Staines. Staines is pretty easy to reach by a direct train from London Waterloo. Normally I would walk on a Saturday, but this time I walked on a Sunday because the weather on Saturday was rubbish, and I don't like walking in the rain.

Day 8 — Staines to Teddington Lock

Luckily, the weather had all cleared up on Sunday, and I started in Staines on another beautiful morning. It had been cold when I left home in the morning, and I had actually brought a coat. Instead of the coat, I should have brought my sunglasses, as the Sun warmed me up quickly. The coat came off not long after I set off walking in earnest, as I was getting sweaty.

It was quite quiet in Staines, with only a few people walking their dogs, running, and cycling. There were a lot of posh houses along the Thames on this stretch.

After walking past a newly built house called "Downstream" (coincidence?) I arrived at a point where I could either take a detour along the River Wey Navigation, or take the ferry to the other bank of the Thames and continue my journey there. The Nauticalia Ferry from Shepperton to Weybridge runs every 15 minutes, and has been in operation for about 500 years, with a short break in the eighties. The choice was easy, and after buying my £2.50 ticket I didn't have to wait long for it to set off and ferry me, and a couple with a dog, across.

Just before I arrived in Walton-on-Thames I "ran into" a bunch of people running a marathon. There was even a sign congratulating a runner with their 100th marathon. I think I'll stick to walking…

For lunch I stopped at The Swan in Walton, and sat in its garden to enjoy my fish and chips, and a half. It was still a little early, and I spent a little too much time relaxing and sitting in the Sun.

A little further on, the Thames passes close by a few massive reservoirs, which feed into the water supply of Greater London. Slightly annoyingly, their border is a massive stone wall — and I even had to cross some tank barriers.

There was another tiny ferry at Hampton, which I ignored. Instead I crossed the Thames near Hampton Court Palace, where I walked into the forecourt to mooch around and take some photos. I visited the palace with my parents a few years ago, but it is so massive that you really need a full day to see it all. I certainly didn't have the time for that, and continued along the Thames Path. Between Hampton Court Palace and Kingston there is a quite wide path and walking area, The Barge Walk, with lots of flower banks at its start.

At Kingston upon Thames I crossed the Thames for the last time on today's walk. I continued through Canbury Gardens and along a little section of raised road. Just after coming back onto a footpath, my way was blocked by a massive blown-over tree. There was no way to climb over it, so I made my way up the steep bank towards the road, moved 20 metres downstream, and then slid back down onto the footpath. I think the tree had blown over recently, as its brilliant yellow leaves hadn't fallen off yet.

I finished my walk at Teddington Lock, but not before wandering around the lock's island for a while. There are three lock gates at Teddington, instead of the usual one (or two). I had also always thought that Teddington Lock was the last (or first) lock on the Thames, but I had failed to realise that there is one further downstream at Richmond as well. However, the one at Teddington is the last full lock. Beyond Teddington Lock the river Thames is tidal, and the lock at Richmond only operates at low tide—more about that in the next instalment.

After wandering around the lock, I crossed a few bridges, and made my way to Teddington station for my train back into Waterloo. Curiously, as Teddington is situated on the Kingston Loop Line, you can catch the train to Waterloo from either of its two platforms.

Photos from my Adventure on the Thames Path are available on Flickr, and all videos on Vimeo. You can also see all the photos on a map.

10/23/2018 03:15 am   derickrethans.nl   Mirror   Link  

Using the Right Debugging Tools

A while ago, we updated the MongoDB PHP driver's embedded C library to a new version. The C library handles most of the connection management and other low level tasks that the PHP driver needs to successfully talk to MongoDB deployments, especially in replicated environments where servers might disappear for maintenance or hardware failures.

After upgrading the C library from 1.12 to 1.13 we noticed that one of the PHP driver's tests was failing:

derick@singlemalt:~/dev/php/derickr-mongo-php-driver $ make test TESTS=tests/atlas.phpt
Atlas Connectivity Tests [tests/atlas.phpt]

When running the test manually, we get:

derick@singlemalt:~/dev/php/mongodb-mongo-php-driver $ php tests/atlas.phpt
Atlas Connectivity Tests
skip Atlas tests not wanted
mongo-php-driver/src/libmongoc/src/libmongoc/src/mongoc/mongoc-cluster.c:1852 mongoc_cluster_fetch_stream_single():
        precondition failed: scanner_node && !scanner_node->retired

That was not good news.

The atlas.phpt test tests whether the PHP driver (through the C driver) can connect to a set of different deployments of Atlas, MongoDB's Database as a Service platform. The test makes sure we can talk to an Atlas replica set, a sharded cluster, a free tier replica set, as well as TLS 1.1 and TLS 1.2 specific deployments. The test started failing when connecting to the second provided URI (the sharded cluster).

At first I thought this was caused by the upgrade from version 1.12 to 1.13 of the C driver, but that didn't end up being the case. Let's see how we got to finding and fixing this bug.

First of all, I started GDB to see where it was failing:

derick@singlemalt:~/dev/php/mongodb-mongo-php-driver $ gdb --args php tests/atlas.phpt
GNU gdb (Debian 8.1-4+b1) 8.1
Reading symbols from php...done.
(gdb) run
Starting program: /usr/local/php/7.3dev/bin/php tests/atlas.phpt
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Atlas Connectivity Tests
mongo-php-driver/src/libmongoc/src/libmongoc/src/mongoc/mongoc-cluster.c:1852 mongoc_cluster_fetch_stream_single():
        precondition failed: scanner_node && !scanner_node->retired

Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

This shows that the C driver bailed out due to a specific assertion on line 1852 of mongoc-cluster.c. This assertion reads:

BSON_ASSERT (scanner_node && !scanner_node->retired);

Which didn't really say a lot. The next thing to try was to get a backtrace in GDB with the bt command. This revealed:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff41472f1 in __GI_abort () at abort.c:79
#2  0x00007ffff32cdcf1 in mongoc_cluster_fetch_stream_single (
        cluster=0x555556ba8e68, server_id=2, reconnect_ok=true, error=0x555556b95858)
        at mongo-php-driver/src/libmongoc/src/libmongoc/src/mongoc/mongoc-cluster.c:1852
#3  0x00007ffff32cd9df in _mongoc_cluster_stream_for_server (
        cluster=0x555556ba8e68, server_id=2, reconnect_ok=true, cs=0x0, reply=0x7fffffff9ee0, error=0x555556b95858)
        at mongo-php-driver/src/libmongoc/src/libmongoc/src/mongoc/mongoc-cluster.c:1762
#4  0x00007ffff32cdbe6 in mongoc_cluster_stream_for_server (
        cluster=0x555556ba8e68, server_id=2, reconnect_ok=true, cs=0x0, reply=0x7fffffff9ee0, error=0x555556b95858)
        at mongo-php-driver/src/libmongoc/src/libmongoc/src/mongoc/mongoc-cluster.c:1826
#5  0x00007ffff32ddb0a in _mongoc_cursor_fetch_stream (cursor=0x555556b95700)
        at mongo-php-driver/src/libmongoc/src/libmongoc/src/mongoc/mongoc-cursor.c:647
#6  0x00007ffff32e12a0 in _prime (cursor=0x555556b95700)
        at mongo-php-driver/src/libmongoc/src/libmongoc/src/mongoc/mongoc-cursor-find.c:40
#7  0x00007ffff32df22c in _call_transition (cursor=0x555556b95700)
        at mongo-php-driver/src/libmongoc/src/libmongoc/src/mongoc/mongoc-cursor.c:1146
#8  0x00007ffff32df54b in mongoc_cursor_next (cursor=0x555556b95700, bson=0x7fffffffa038)
        at mongo-php-driver/src/libmongoc/src/libmongoc/src/mongoc/mongoc-cursor.c:1214
#9  0x00007ffff332511c in phongo_cursor_advance_and_check_for_error (cursor=0x555556b95700)
        at mongo-php-driver/php_phongo.c:742
#10 0x00007ffff33253d9 in phongo_execute_query (
        client=0x555556ba8e60, namespace=0x7fffeaeb5d58 "test.test", zquery=0x7ffff38201c0, options=0x0,
        server_id=1, return_value=0x7ffff38200f0, return_value_used=1)
        at mongo-php-driver/php_phongo.c:810
#11 0x00007ffff3342cc3 in zim_Manager_executeQuery (execute_data=0x7ffff3820160, return_value=0x7ffff38200f0)
        at mongo-php-driver/src/MongoDB/Manager.c:492
#12 0x0000555555e41d16 in execute_internal (execute_data=0x7ffff3820160, return_value=0x7ffff38200f0)
        at /home/derick/dev/php/php-src.git/Zend/zend_execute.c:2328

At first glance, I couldn't really see anything wrong with this backtrace, and was still puzzled as to why it would abort. I decided to go for a lunchtime walk and have a look at it again afterwards. I always find that walks are good for clearing my mind.

After the walk, and a cuppa tea, I looked at the backtrace again, and noticed the following curiosity:

#4  0x00007ffff32cdbe6 in mongoc_cluster_stream_for_server (
        cluster=0x555556ba8e68, server_id=2, reconnect_ok=true, cs=0x0, reply=0x7fffffff9ee0, error=0x555556b95858)
        at mongo-php-driver/src/libmongoc/src/libmongoc/src/mongoc/mongoc-cluster.c:1826


#10 0x00007ffff33253d9 in phongo_execute_query (
        client=0x555556ba8e60, namespace=0x7fffeaeb5d58 "test.test", zquery=0x7ffff38201c0, options=0x0,
        server_id=1, return_value=0x7ffff38200f0, return_value_used=1)
        at mongo-php-driver/php_phongo.c:810

In frame #10 the server_id variable is 1, whereas in frame #4 later on, the server_id variable is 2. These should have been the same.

The server ID is determined by the C driver when selecting a server to send a read or write operation to, and refers to a specific server's connection ID. The PHP driver uses this server ID when executing the query through the phongo_execute_query function, which calls the C driver's mongoc_collection_find_with_opts. The latter accepts a bson_t value as its third argument, with options to use while executing the query. These options include the pre-selected server ID, so that the C driver does not attempt to reselect a server:

cursor = mongoc_collection_find_with_opts(collection, query->filter, query->opts,
        phongo_read_preference_from_zval(zreadPreference TSRMLS_CC));

I decided to investigate which options the PHP driver was sending to mongoc_collection_find_with_opts. A while ago I developed a GDB helper function, about which I wrote in pretty-printing BSON. I sourced this helper within my GDB instance, and switched to frame #10 to inspect the value of the query options:

(gdb) source ~/dev/php/mongodb-mongo-php-driver/src/libmongoc/.gdbinit
(gdb) frame 10

The function call uses the options from the query struct query->opts, so I used the printbson helper function to display its contents:

(gdb) printbson query->opts

Which showed:

$11 = "!\000\000\000\020serverId\000\002\000\000\000\020serverId\000\001", '\000' 
INLINE (len=33)
        'serverId' : NumberInt("2"),
        'serverId' : NumberInt("1")

There are not supposed to be two conflicting serverId elements. Unlike PHP's arrays, bson_t values can have the same key appear multiple times. Although the C driver had selected server ID 1 for this query, server 2 was used, because its serverId element came first in the options struct. But why were there two values in the first place?

If you look at the PHP test, you see the following:

$query = new MongoDB\Driver\Query([]);

foreach ($uris as $uri) {
        $manager = new MongoDB\Driver\Manager($uri);
        try {
                var_dump($manager->executeQuery('test.test', $query));
        } catch(Exception $e) {
                echo $e->getMessage(), "\n";
        }
}

From this it follows that we create the Query object, assign it to $query, and then use the same variable for each iteration. Somehow we were not resetting the query options back to their defaults before reusing them, resulting in a duplicate serverId field. Once we had figured out the problem, creating the fix was easy enough: make sure we use a clean query->opts struct before we pass it to the mongoc_collection_find_with_opts function.

Debugging this issue was made a lot easier by having the right debugging tools, and this case shows that spending time earlier in the year on writing the GDB helper function printbson paid off. With this bug fixed, we could release a new patch version of the MongoDB Driver for PHP.

Happy hacking!

10/09/2018 04:26 am   derickrethans.nl   Mirror   Link  

Downstream — Trip 5

On the previous trip I walked from Reading to Bourne End on the Thames Path, and on this trip I will progress to Staines. Getting to Bourne End wasn't as easy as getting to Reading, and a fair bit slower with a train to Maidenhead and then the branch line towards Marlow.

Day 7 — Bourne End to Staines

I set off from Bourne End (without going to the lovely craft beer place) fairly early in the morning, on a bright and sunny day. It was clearly regatta season, as became apparent when I made my way to Cookham, where their yearly races were on. Not only were they preparing for the watery fun, there was also a good display of vintage cars being set up. I think I was too early to see it in full flow. Unfortunately, it did mean diverting from the river for a while, first because of the regatta, and then because there is private land in the way.

When I got back to the river after about a mile, I could just about make out the outlines and hill of Cliveden House across the river. I had visited this National Trust property with family.

Just before reaching Maidenhead I came upon Boulter's Lock, an industrial-looking lock that replaced earlier 16th and 18th century locks in the same place. It was quite busy around the lock, with apparently one or two tour buses having stopped to let their passengers have a look at it. I spent some time wandering around and relaxing, before continuing my walk.

Not far from Maidenhead lies Bray, famous for its large number of Michelin-starred restaurants. There is the one-star Hind's Head, where I had eaten before, the three-star Fat Duck, and the three-star Waterside Inn, where I once attended a wedding. You can't see much of them from the other side of the Thames though, so I just had to make do with my memories of the tasty food.

Not far from Bray is Dorney Lake, which was used in the 2012 Olympics as a rowing venue under the name "Eton Dorney". You can't see much of it from the North side, but once you've walked along it to the bottom end you can actually go up and have a look.

Immediately after rounding a bend in the river Thames, you see Windsor Castle, the Queen's "country home", towering over Windsor and the river. My plan was to have lunch in Windsor, but just after getting into Eton I noticed a nice looking pub, the Waterman's Arms. Their menu looked good, and they had a few nice draughts on, so I decided to have lunch there instead. As a bonus, I noticed that the pub's ceiling had a painting of the whole length of the Thames on it!

After lunch, and perhaps half a pint too much, I continued my walk. On the bridge that connects Eton to Windsor, a guitar player was playing joyful tunes, to which I listened when taking some photos and video.

After walking for a while between a railway and the river, there were plenty of signs saying that "no drones" could be flown in and near The Home Park. I bet the Queen doesn't like people snooping on her activities much. Beyond lovely Datchet, I was directed through a field of nettles to be "closer to the river", but I don't think that was worth it. The 200 metre stretch along the road would probably have been easier walking. The only real scenery was planes leaving Heathrow Airport anyway. The noise of planes overhead continued for the rest of the walk into Staines, or as it is now called, "Staines-upon-Thames". Please note that Staines does not rhyme with Thames.

Just before Staines is Runnymede. Runnymede is an important site in British history, as this is where the Magna Carta was signed, all the way back in 1215. The Magna Carta is a charter of rights, and the first to establish that the King is not above the law. Modifications were made in the following decades, and not many of its provisions remain in force, but it was an important step on the journey to a constitutional monarchy.

The site itself is an open expanse of grassland, with a piece of art — "The Jurors", by Hew Locke. There is also a memorial created by the American Bar Association. Unfortunately I only found out about this after I had come home, and hence did not visit it.

Just beyond Runnymede the Thames Path passes under the M25, which means I have walked all the way from the source into Greater London. From there it was not much further to the end of my walk at St. Peter's Church in Staines.

Photos from my Adventure on the Thames Path are available on Flickr, and all videos on Vimeo. You can also see all the photos on a map.

10/02/2018 03:08 am   derickrethans.nl   Mirror   Link  

Downstream — Trip 4

On the previous trip I walked from Shillingford to Reading on the Thames Path, and on this trip I was planning to walk to Marlow. For once, getting to the start was easy by a single and fast train ride into Reading.

Day 6 — Reading to Bourne End

With the Reading Festival over, there was little to see of the hoardings and temporary holding pits when I started my walk towards Marlow. The first interesting thing along the path is Caversham Lock, which was quite busy on this slightly hazy morning.

A little further along, I crossed, by means of a footbridge, the River Kennet. Through the Kennet and Avon Canal, it connects the Thames at Reading with the River Avon at Bristol, with the canal section running between Bath and Newbury. When recording the video section on this, I messed up, so I had to re-record the narration.

Continuing along the Thames Path I noticed that there are a significant number of posh schools located along this stretch of the river, which sometimes means you need to make a little detour. Another of these detours away from the river is at Shiplake, where the Thames Path brings you straight to the quaint village, with its Phone Box Book Exchange, and its lovely railway station.

My plan was to have lunch at Henley, at one of the pubs along the river. Just when I got to Henley, however, it started to rain, and all the pubs were absolutely packed, with lots of people wearing colourful clothes. I had no idea what was going on, and I was hungry, so I found a little cafe, Cafe Copia, near the town hall to have lunch and wait for the rain to pass.

After lunch I crossed the bridge and found the Thames Path clogged with these same people in colourful clothes. Due to the crowds it was quite slow going for a while, until I got to a massive festival site just North of Remenham. There was a reason for the funny dress sense after all: the site was hosting the Rewind Festival, a big 80s festival.

After half an hour the festival site was no longer visible or audible, the tranquillity of the walk returned, and I could once again listen to the podcast series Caliphate from the New York Times, in the middle of the private deer park of Culham Court.

At Hurley it became busy along the Thames again, as they were having their annual Regatta. As I was quite early, there was no racing going on yet, but as so often with these river-side festivals, they make you go around them, away from the Thames Path.

Near Bisham, you can spot the former grounds of Bisham Abbey across the river. It is now a sports centre, but when I passed by there was a wedding reception in progress.

As I was now approaching Marlow, I asked one of my coworkers, who is local, for a good recommendation for a post-walk beer. He came back saying that Marlow is primarily food-orientated, and proceeded to recommend a pub in Bourne End, another three and a half miles of walking. I decided to do that, as it was still reasonably early, and I was going to miss the train at Marlow anyway. Just past the town, the Thames Path was diverted again, as repaving work was being done on a section — yay for shiny new paths!

The extra miles to Bourne End were worth it in the end, as I settled down at KEG Craft Beer with a half of a lovely sour. I also picked up a train beer for my trip back into London.

Photos from my Adventure on the Thames Path are available on Flickr, and all videos on Vimeo. You can also see all the photos on a map.

09/25/2018 03:08 am   derickrethans.nl   Mirror   Link  

Downstream — Trip 3

On the previous trip I walked from Oxford to Shillingford on the Thames Path, and on this trip I was planning to walk to Pangbourne. Getting to Shillingford requires a bit of planning as the buses aren't terribly frequent, so I got up early (07:00 on a Saturday!) for a 08:50 bus from Reading to Shillingford to continue downstream on my adventure.

Day 5 — Shillingford to Reading

Setting off from Shillingford I quickly came upon the river Thames again, where I shared the path with a (very slow) running group—I was keeping up while walking! The first activity on the river itself came at Benson Marina, where early birds were getting their canoes in the water, avoiding a range of motorised pleasure craft.

Wallingford wasn't very far away, and it is a quite picturesque old market town. The Thames Path doesn't go through the town itself, but I did catch some of it from the bus on the way to the start of this walk, and when returning from Shillingford on my previous walk.

Five kilometres beyond Wallingford the Thames Path dives under the Moulsford Railway Bridge, before the path diverges from the river around a school. It was an unpleasant section, up a hill and then along a busy and narrow road, before the path moves back towards the river, where it rejoins the tow path through a pub's garden. I hope that in the future this diversion will no longer be needed.

This walk featured six locks, from Benson Lock near the start, to Caversham Lock near Reading. I noticed that although the locks themselves are not getting much larger, the accompanying weirs most definitely do. I spent a little time at Goring Lock just before lunch.

I walked into the small village of Goring-on-Thames to have lunch in a "locals" pub where I had my first first-hand "Brexit Experience". I think I'll leave that story for a separate post though.

After lunch I walked back to the river, and noticed that the landscape was very different from most of the other landscapes along the Thames so far. Instead of being flat and wide open, the landscape near Goring suddenly turned very hilly around me. I did some googling and found that this specific feature is called the Goring Gap. Apparently, about a hundred thousand years ago the Thames didn't flow through Oxford and London, but instead went through St. Albans and Ipswich. During the last ice age the route got diverted because there was a glacier in the way. The Goring Gap is the new route that the Thames cut through the hills.

The feature became even more apparent when the Thames Path diverted slightly from the river and had me going right up a hill. Near the top I spoke with some other hikers, who assured me this was the only location where the Thames Path would do that.

My original plan was to finish at Pangbourne, but it was still early and the weather was nice, so I decided to continue to Reading, with its much easier rail connections.

At Purley the path left the Thames for a while, as due to "access rights" the path can't follow the river closely. Annoyingly, that meant another bit of uphill. Although it wasn't a particularly long stretch, it was quite steep. When I got back to the river, a rain (drizzle) shower started, which kept showing its head all the way to my finish in Reading.

The last three kilometres go past the normally quiet fields of Little John's Farm, but with the Reading Festival only a week away, it was a hive of activity as the festival was being set up.

Just before Reading I crossed under Caversham Bridge, the last bridge on this part of the walk. Before boarding my train in Reading, I had a well-deserved pint at The Three Guineas, right in front of the station. Another 34 km done!

Photos from my Adventure on the Thames Path are available on Flickr, and all videos on Vimeo. You can also see all the photos on a map.

09/04/2018 03:41 am   derickrethans.nl   Mirror   Link  

Downstream — Trip 2

It had been a few weeks since the previous walk; it had been quite hot, and therefore I waited a little before continuing along the Thames Path. Today was a lot cooler to start with, with some scattered showers predicted. As this was another weekend day I suspected rail engineering works—and indeed, there were no trains from Paddington to Oxford. The rail replacement bus which took me from Didcot Parkway towards the start in Oxford was luckily well organised, and I only ended up being about 15 minutes later than I had originally hoped for.

Day 4 — Oxford to Shillingford

It was fairly windy, and a little chilly, when I left Oxford along the river. There were lots of rowers out on the Thames, which didn't come as a big surprise, as that's what Oxford's stretch of the Thames, the Isis, is known for. Not only were there lots of rowing clubs out, but also plenty of amateurs.

After a while it started to become darker and darker, and I started to wonder whether I was going to stay dry. I encountered a pair of hikers, also doing the Thames Path, having a break near one of the boat houses. After having a chat, and avoiding a tree branch being blown out of a tree, I decided to continue, mentioning that I would chance getting rained upon. Unsurprisingly, 5 minutes later it was pelting down. I found some shelter under a few trees to put on my rain coat, but by the time I managed to put it on it was already dry again. The additional humidity wasn't that great though.

Just before Abingdon, and its lock, there is a complicated section of the Thames Path that isn't very well signed. OpenStreetMap has the Thames Path (near) perfectly mapped, so it was easy enough for me to find the route, but I did assist somebody going in the opposite direction and explained where she had to go.

In Abingdon I had lunch at the Nag's Head, which is situated on its own island, Nag's Head Island. I opted for a traditional fish & chips, and had a pint of the local brew, called Nag's Island. Just when I was finishing lunch it started to drizzle a little bit, but it had already stopped by the time I had finished my visit to the little coder's room.

Unfortunately the drizzle did come back a few times over the next hours, while I was getting closer to the big towers of the Didcot power stations, which were looming on the horizon. Just before Culham I stopped to have a look at the Swift Ditch, an old short cut for the river Thames bypassing Abingdon. It is no longer in use, and very overgrown, but it was the fastest route for nearly a thousand years.

During the next bit of the walk there were a great many locks to go past and over, until I came upon the river Thame. From the source at Thames Head to where the river Thames meets the river Thame, the Thames is also called the Isis. In Oxford especially they make a big deal out of that. On some Ordnance Survey maps you will still see "River Thames or Isis" for the whole stretch from the source to the Thame.

About 20 minutes later I arrived at my finish for today at Shillingford, where I took the bus into Reading, via a replacement bus in Wallingford, and via Henley-on-Thames, as the schedule worked better for that. Why wait 30 minutes for a fast route that gets me into Reading 3 minutes earlier, when it was very clearly going to rain soon? In Reading, the trains into London were unsurprisingly delayed, as they so often are.

Photos from my Adventure on the Thames Path are available on Flickr, and all videos on Vimeo.

08/21/2018 04:23 am   derickrethans.nl   Mirror   Link  

Downstream — Trip 1

As you might have picked up on, I am walking the whole length of the Thames Path currently. All 184 miles from the source to the Thames Barrier.

I started making videos about each day of walking, which I will be posting here, but the videos don't tell the whole story, so I have decided to write about it as well. This is the story of my first trip, which I did over 3 days at the end of June. I had taken the Monday off so that I could do the first 3 stages in one go. That's mostly necessary because there is no, or only really crappy, public transport along this first third of the Thames Path. But both the source and Oxford are reasonably easy to reach by train.

Day 1 — Source to Cricklade

The first day started by getting the train to Kemble, which is about a mile away from the source in the Cotswolds. After being delayed in Paddington for an hour, the change at Swindon wasn't great either, as the smaller train into Kemble was very full because the previous London to Swindon train had been cancelled. Getting from Kemble station to the source was a little bit tricky, as it wasn't overly well signposted. And once you get into the fields near the source, you basically have to have a compass to follow it, as there is no hint of the river yet at all.

When I made it to the source, I rested for a bit before starting my walk along "the river" towards Cricklade. The first bit was really just walking in a field, to a little past where I got onto the path from Kemble, where there was suddenly a little stream, with really clear water. From there on, the river gradually became wider and wider. Most of this first day was easy to walk, with mostly paths, and even though I thought I'd see a lot of mud, there was basically none. It having been so warm in the last few weeks really helped.

I was also quite excited when I spotted the first wildlife. At first I thought it was a snake, but it turned out to be a fake snake, a slow-worm. We have them in the Netherlands too. Not long after that, I also spotted a single pheasant.

Walking through the Cotswolds Water Park with its many lakes was quite variable. Some were quiet and had waterbirds, others were loud with people attempting to waterski.

Just before Cricklade I walked through the North Meadow, with its abundance of wild plants. Apparently it's one of the very few real meadows left in the UK.

Once I made it to Cricklade, I checked into the White Hart Hotel, had a shower, and promptly mapped the whole town on OpenStreetMap. Then it was time for a pint, and dinner. While checking in my pint on Untappd I noticed that there was a nearby venue, The Red Lion, which also features its own brews. I spent most of the rest of the evening there trying out their brews in the nicely cool garden, reading a book (and Twitter!). It was a good end to the day after a 22km walk.

Day 2 — Cricklade to Newbridge

I got up really early (7:30am!!) to have breakfast and to head out as early as I could. I had to cover a lot of ground today to make it to Newbridge, where I had booked a night in an inn. I knew it was a long way, and unfortunately I had developed a blister on my heel. Luckily I had packed blister plasters, so this didn't end up being too big a deal. Sunday morning in Cricklade was really quiet, with only a few people walking their dogs. It is not a big town, and soon I found myself in the countryside.

Just before Lechlade, my Thames Path in the Country guide said that there was an annoying section along the A361, quite a bit away from the river. But since the guide was written, the Thames Path had been rerouted onto a path much closer to the river, after access had been secured. This new section had lots of brand new gates, and would surely have been a big improvement if it weren't for a few fields of stinging nettles to get through. Walking in shorts was not a clever plan there.

I came upon the first swans, boats, and of course, the first lock (St. John's Lock). I spent a little time here, as locks have always fascinated me. But as I had a long walk ahead of me today, I didn't linger too long.

I had lunch at Ye Olde Swan. I thought their web site had said they brewed their own beer, but that turned out not to be true. I still had lunch with a lovely pint on their outdoor, Thames-overlooking terrace.

After lunch, many of the "paths" turned into fields of plants that I had to battle through for a while, and many had nettles. Where the path was a bit easier, there were lots of butterflies and dragonflies. I probably brought a little too little water, but found out that most locks have a water point. That was particularly useful on this hot section later in the day.

At about 19:00, I made it to the end of the section at Newbridge. I would be staying at The Rose Revived.

Day 3 — Newbridge to Oxford

After a good night's rest I left rather early again. Not so much because I had a long way to go (only 22km!), but mostly so that I could take it very slowly. The previous day's walk had definitely taken its toll.

The Thames Path went straight through the pub's garden, so I didn't have far to go to reach the path. Once I got out of the garden, there were very many rabbits hopping around, only to be disturbed by me, one fellow hiker, and a very adventurous farmer who found it necessary to mow the hay at 08:30 in the morning, on a Sunday.

After having to go around a caravan park bordering the river, and through fields of sheep, I made good progress. With the river a bit wider, there was quite a lot more boat traffic on the Thames, and even some people swimming in it. I guess the Thames is quite a bit cleaner before cities like Oxford, Reading, and London dump their waste in it.

After about half way, I ran into a sign saying that one of the bridges over one of the tributaries was closed and that I should take a detour instead. That detour was very poorly signed, but with some help from the well mapped paths on OpenStreetMap, I found myself a new route. About a third of the way through the detour, on the top of a hill, I ran into another hiker, who informed me that the bridge wasn't actually closed. Which stinks, as it meant that I didn't actually walk the whole Thames Path, even though it was about the same distance.

After passing the Godstow Abbey Ruins, near Oxford, the riverside became busy with people (and geese!) who were enjoying the river on this hot day. I was happy to be done with the walking around 15:00, and refreshed myself with a pint and lunch before heading home by train to London.

Photos from my Adventure on the Thames Path are available on Flickr, and all videos on Vimeo.

08/10/2018 05:12 am   derickrethans.nl   Mirror   Link  

Analemma
Last week I listened to an episode of The Sceptics' Guide to the Universe where the word of the week was "analemma". An analemma is a diagram showing the position of the Sun in the sky over the course of a year, as viewed at a fixed time of day from the same location on Earth. I once tried to make such a diagram when I was still living in Norway from a series of photos, but the weather wasn't consistent enough to make that work.

But as I am currently starting to update the Guide to Date and Time Programming for a second edition, I was wondering whether I could create an analemma from existing PHP functions. Unfortunately, PHP only provides functionality to calculate when the Sun is at its highest point, through date_sun_info():

<?php
$sunInfo = date_sun_info(
        (new DateTimeImmutable())->getTimestamp(), // Unix timestamp
        51.53,                                     // latitude
        -0.19                                      // longitude
);

$zenith = new DateTimeImmutable( "@{$sunInfo['transit']}" );
echo $zenith->format( DateTimeImmutable::ISO8601 ), "\n";

Which on February 26th, was at 2018-02-26T12:13:38+0000 in London.

Then I remembered that a few years ago I wrote Where is the Sun?. In it, I featured a new hobby library, "astro", that I was working on. This library implements a few astronomical calculations. I wrote a little PHP extension around it too: php-solarsystem. Neither the library nor the extension has really been released.

The php-solarsystem extension implements just one function: earth_sunpos(), which fortunately does exactly what I needed for drawing an analemma: it gives you the position of the Sun in the sky for a specific location on Earth at a specific time.

With this function, all I had to do was calculate the position of the Sun in the sky at the same time of day for a whole year. With the DatePeriod class in PHP, I can easily create an iterator that does just that:

<?php
$dateStart = new DateTimeImmutable( "2018-01-01 09:00", new DateTimeZone( "UTC" ) );
$dateEnd = $dateStart->modify( "+1 year 1 day" );
$dateInterval = new DateInterval( "P1D" );

foreach ( new DatePeriod( $dateStart, $dateInterval, $dateEnd ) as $date )
{
    /* calculation per iteration, shown below */
}

We don't really want Daylight Saving Time to get in the way, so we set the time zone to just UTC, which works fine for London, for which we'll draw the analemma.

We start at the start of the year (2018-01-01 09:00) and iterate for a year and a day (+1 year 1 day) so we can create a closed loop. Each iteration increases the returned DateTimeImmutable by exactly one day (P1D).

After defining the latitude and longitude of London, all we need to do is use the earth_sunpos() function to calculate the azimuth and altitude inside the loop. Azimuth is the direction in which the Sun lies, with 180° being due South, and altitude is the height of the Sun above the horizon.

$lat = 51.53;
$lon = -0.09;

foreach ( new DatePeriod( $dateStart, $dateInterval, $dateEnd ) as $date )
{
        $ts = $date->format( 'U' );
        $position = earth_sunpos( $ts, $lat, $lon );
        echo $ts, ",";
        echo $position['azimuth'], ",";
        echo $position['altitude'], "\n";
}

The script outputs the calculation as a "CSV", which we should redirect to a file:

php tests/analemma.php > /tmp/analemma.csv

To plot we use the following gnuplot script:

set style line 1 lt 1 lw 2 pt 0 ps 0 linecolor rgb "orange"
set style line 2 lt 1 lw 1 pt 0 ps 0 linecolor rgb "grey"

set datafile separator comma
set xrange [100:150]
set yrange [0:50]

set grid linestyle 2
set terminal png size 640,640 enhanced font "Helvetica,12"
set output '/tmp/analemma.png'

plot "/tmp/analemma.csv" using 2:3 title "London @ 9 am" with linespoints linestyle 1

With this script, we can then draw the analemma:

gnuplot /tmp/analemma.plot

The result:


Analemma (Plot) — Derick Rethans

03/06/2018 03:39 am   derickrethans.nl   Mirror   Link   @10

Pretty Printing BSON

In Wireshark and MongoDB 3.6, I explained that Wireshark is amazing for debugging actual network communications. But sometimes it is necessary to debug things before they get sent out onto the wire. The majority of the driver's communication with the server is through BSON documents with minimal overhead of wire protocol messages. BSON documents are represented in the C Driver by bson_t data structures. The bson_t structure wraps all of the different data types from the BSON Specification. It is analogous to PHP's zval structure, although its implementation is a little more complicated.

A bson_t structure can be allocated on the stack or heap, just like a zval structure. A zval structure represents a single data type and single value. A bson_t structure represents a buffer of bytes constituting one or more values in the form of a BSON document. This buffer is exactly what the MongoDB server expects to be transmitted over a network connection. As many BSON documents are small, the bson_t structure can function in two modes, determined by a flag: inline, or allocated. In inline mode it only has space for 120 bytes of BSON data, but no memory has to be allocated on the heap. This mode can significantly speed up its creation, especially if it is allocated on the stack (by using bson_t value, instead of bson_t *value = bson_new()). It makes sense to have this mode, as many common interactions with the server fall under this 120-byte limit.

For PHP's zval, the PHP developers have developed a helper function, printzv, that can be loaded into the GDB debugger. This helper function unpacks all the intricacies of the zval structure (e.g. arrays, objects) and displays them on the GDB console. When working on some code for the MongoDB Driver for PHP, I was looking for something similar for the bson_t structure only to find that no such thing existed yet. With the bson_t structure being more complicated (two modes, data as a binary stream of data), it would be just as useful as PHP's printzv GDB helper. You can guess already that, of course, I felt the need to just write one myself.

GDB supports extensions written in Python, but that functionality is sometimes disabled. It also has its own scripting language that you can use on its command line, or by loading your own files with the source command. You can define functions in the language, but the functions can't return values. There are also no classes or scoping, which means all variables are global. With the data stored in the bson_t struct as a stream of binary data, I ended up writing a GDB implementation of a streamed BSON decoder, with a lot of handicaps.

The new printbson function accepts a bson_t * value, and then determines whether its mode is inline or allocated. Depending on the allocation type, printbson then delegates to a "private" __printbson function with the right parameters describing where the binary stream is stored.

__printbson prints the length of the top-level BSON document and then calls the __printelements function. This function reads data from the stream until all key/value pairs have been consumed, advancing its internal read pointer as it goes. It can detect that all elements have been read, as each BSON document ends with a null byte character (\0).

If a value contains a nested BSON document, such as the document or array types, it recursively calls __printelements, and also does some housekeeping to make sure the following output is nicely indented.

Each element begins with a single byte indicating the field type, followed by the field name as a null-terminated string, and then a value. After the type and name are consumed, __printelements defers to a specialised print function for each type. As an example, for an ObjectID field, it has:

if $type == 0x07
    __printObjectID $data
end

The __printObjectID function is then responsible for reading and displaying the value of the ObjectID. In this case, the value is 12 bytes, which we'd like to display as a hexadecimal string:

define __printObjectID
    set $value = ((uint8_t*) $arg0)
    set $i = 0
    printf "ObjectID(\""
    while $i < 12
        printf "%02X", $value[$i]
        set $i = $i + 1
    end
    printf "\")"
    set $data = $data + 12
end
It first assigns a value of a correctly cast type (uint8_t*) to the $value variable, and initialises the loop variable $i. It then uses a while loop to iterate over the 12 bytes; GDB does not have a for construct. At the end of each display function, the $data pointer is advanced by the number of bytes that the value reader consumed.

For types that use a null-terminated C-string, an additional loop advances $data until a \0 character is found. For example, the Regex data type is represented by two C-strings:

define __printRegex
    printf "Regex(\"%s\", \"", (char*) $data

    # skip through C String
    while $data[0] != '\0'
        set $data = $data + 1
    end
    set $data = $data + 1

    printf "%s\")", (char*) $data

    # skip through C String
    while $data[0] != '\0'
        set $data = $data + 1
    end
    set $data = $data + 1
end

We start by printing the type name prefix and first string (pattern) using printf and then advance our data pointer with a while loop. Then, the second string (modifiers) is printed with printf and we advance again, leaving the $data pointer at the next key/value pair (or our document's trailing null byte if the regex type was the last element).

After implementing all the different data types, I made a PR against the MongoDB C driver, where the BSON library resides. It has now been merged. In order to make use of the .gdbinit file, you can include it in your GDB session with source /path/to/.gdbinit.

With the file loaded, and bson_doc being a bson_t * variable in the local scope, you can run printbson bson_doc, and receive something like the following semi-JSON formatted output:

(gdb) printbson bson_doc
ALLOC [0x555556cd7310 + 0] (len=475)
{
    'bool' : true,
    'int32' : NumberInt("42"),
    'int64' : NumberLong("3000000042"),
    'string' : "Stŕìñg",
    'objectId' : ObjectID("5A1442F3122D331C3C6757E1"),
    'utcDateTime' : UTCDateTime(1511277299031),
    'arrayOfInts' : [
        '0' : NumberInt("1"),
        '1' : NumberInt("2"),
        '2' : NumberInt("3"),
        '3' : NumberInt("5"),
        '4' : NumberInt("8"),
        '5' : NumberInt("13"),
        '6' : NumberInt("21"),
        '7' : NumberInt("34")
    ],
    'embeddedDocument' : {
        'arrayOfStrings' : [
            '0' : "one",
            '1' : "two",
            '2' : "three"
        ],
        'double' : 2.718280,
        'notherDoc' : {
            'true' : NumberInt("1"),
            'false' : false
        }
    },
    'binary' : Binary("02", "3031343532333637"),
    'regex' : Regex("@[a-z]+@", "im"),
    'null' : null,
    'js' : JavaScript("print foo"),
    'jsws' : JavaScript("print foo") with scope: {
        'f' : NumberInt("42"),
        'a' : [
            '0' : 3.141593,
            '1' : 2.718282
        ]
    },
    'timestamp' : Timestamp(4294967295, 4294967295),
    'double' : 3.141593
}

In the future, I might add information about the length of strings, or convert the predefined subtypes of the Binary data type to their common names. Happy hacking!

02/13/2018 03:47 am   derickrethans.nl   Mirror   Link   @12

Wireshark and SSL

This is a follow up post to Wireshark and MongoDB 3.6, in which I explained how I added support for MongoDB's OP_MSG and OP_COMPRESSED message formats to Wireshark.

In the conclusion of that first article, I alluded to the complications with inspecting SSL traffic in Wireshark, which I hope to cover in this post. It is common to enable SSL when talking to MongoDB, especially if the server communicates over a public network. When a connection is encrypted with SSL, it is impossible to dissect the MongoDB Wire Protocol data that is exchanged between client and server—unless a trick is employed to first decrypt that data.

Fortunately, Wireshark allows dissection and analysis of encrypted connections in two different ways. Firstly, you can configure Wireshark with the private keys used to encrypt the connection, and secondly, you can provide Wireshark with pre-master keys obtained from a client process that uses OpenSSL.

The first option, providing Wireshark with the private keys, is by far the easiest. You can go to Edit → Preferences → Protocols → SSL and add the private key to the RSA keys list:

When you start using Wireshark with SSL encryption, it is also wise to configure an SSL debug file in the same screen. I have set it here to /tmp/ssl-debug.txt.

Months ago, I had added my private key to the RSA keys list, but when I tried it now for this post, Wireshark failed to decrypt my SSL traffic to MongoDB. I was a little confused, as it had worked in the past. Since I had my SSL debug file, I at least had some chance of figuring out why this no longer worked. After a quick look I noticed the following in the debug file:

   session uses Diffie-Hellman key exchange
   (cipher suite 0xC030 TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384)
   and cannot be decrypted using a RSA private key file.

After some searching, I found out that if the session uses Diffie-Hellman for key exchange, Wireshark can not use the RSA private key, and needs different information. On an earlier run, I must have used a different version of either the encryption library (OpenSSL) or MongoDB, which did not use Diffie-Hellman.

This brings me to the second way of providing Wireshark with the information it needs to decrypt SSL encrypted connections: the pre-master key. This key is created during the connection set-up, and therefore you need to read data structures from within the OpenSSL library. You can do that manually with GDB, but it is also possible to inject a special library that hooks into OpenSSL symbols to read the data for you, and store them in a file with a format that Wireshark understands. You can find the source code for the library here.
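The file that the library writes uses the NSS key log format that Wireshark understands: one line per TLS session, tying the session's client random to its master secret. The hex values below are made-up placeholders, shown only to illustrate the shape of a line:

```
CLIENT_RANDOM 52e3b4f1... 9f01ab37...
```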

Once you've obtained the source code, you can compile it with:

cc sslkeylog.c -shared -o libsslkeylog.so -fPIC -ldl

The compiled key logging library can be loaded in the process to override the existing OpenSSL symbols with:

SSLKEYLOGFILE=/tmp/premaster.txt LD_PRELOAD=./libsslkeylog.so \
    ./mongo --ssl \
    --sslPEMKeyFile=/tmp/ssl/ssl/client.pem --sslCAFile=/tmp/ssl/ssl/ca.pem

The OpenSSL LD_PRELOAD trick should also work with the PHP driver for MongoDB as long as it uses OpenSSL. You can verify which SSL library the PHP driver uses by looking at phpinfo() output. For Java programs, there is an agent you can use instead.

With the key logging library and its generated file with pre-master keys in place, and Wireshark configured to read the keys from this file through the (Pre)-Master-Secret log filename setting, we can now decrypt SSL-encrypted connections between MongoDB client and server:

There was one caveat: a small patch to Wireshark is needed for it to realise that MongoDB's connections can be SSL encrypted on the default port (27017). I created a patch with the following one-liner:

      dissector_add_uint_with_preference("tcp.port", TCP_PORT_MONGO, mongo_handle);
+     ssl_dissector_add(TCP_PORT_MONGO, mongo_handle);

This patch, and the two patches mentioned in the previous post, have been merged into Wireshark's master branch and will be included in the upcoming 2.6 release. Until that is released, you will have to compile Wireshark yourself, or use a nightly build.

02/06/2018 03:54 am   derickrethans.nl   Mirror   Link   @14

Xdebug 2.6

I have just released Xdebug 2.6. Xdebug 2.6 adds support for PHP 7.2 (and drops support for PHP 5), and adds a whole bunch of new features. This article describes these new features.

Garbage Collection Statistics

PHP has a built-in garbage collector, which makes it possible for PHP to free up memory that would normally be lost due to interdependent references. I wrote about the garbage collector in previous articles.

Xdebug 2.6 provides insight into the runs of PHP's built-in Garbage Collector. There are two new functions: xdebug_get_gc_run_count(), which returns how often the Garbage Collector has run, and xdebug_get_gc_total_collected_roots(), which returns the number of variable roots that the garbage collection has collected.

There is also a new set of settings: xdebug.gc_stats_enable, xdebug.gc_stats_output_dir, and xdebug.gc_stats_output_name. When statistics collection is enabled by setting xdebug.gc_stats_enable to true, Xdebug will write a file to the configured output directory, with a name configured through xdebug.gc_stats_output_name. Just like xdebug.trace_output_name, the latter supports different format specifiers to add additional information to the file names.

Instead of recording the Garbage Collection runs for the whole script, you can also selectively record this information by using xdebug_start_gcstats() and xdebug_stop_gcstats().
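Putting the settings above together, a minimal configuration could look like this in php.ini (the output directory and file name are just examples; %p stands for the process ID, one of the specifiers also used by xdebug.trace_output_name):

```ini
xdebug.gc_stats_enable = 1
xdebug.gc_stats_output_dir = /tmp
xdebug.gc_stats_output_name = gcstats.%p
```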

When PHP's garbage collector runs, Xdebug will write information about each run into a file. The format of the file is:

Garbage Collection Report
version: 1
creator: xdebug 2.6.0 (PHP 7.2.0)

Collected | Efficiency% | Duration | Memory Before | Memory After | Reduction% | Function
    10000 |    100.00 % |  0.00 ms |       5539880 |       579880 |    79.53 % | bar
    10000 |    100.00 % |  0.00 ms |       5540040 |       580040 |    79.53 % | Garbage::produce
     4001 |     40.01 % |  0.00 ms |       2563048 |       578968 |    77.41 % | gc_collect_cycles

For each run, it writes how many roots were collected, and what percentage of them ended up being freed. The duration of the Garbage Collector's run, and the memory usage before and after, are also recorded, as well as how much reduction in memory usage this run achieved. The last column shows the active function or method name when the garbage collection algorithm was run, or gc_collect_cycles() if it was run manually.

Profiler Enhancements

Xdebug's profiler now also collects information about memory usage. This can assist in tracking down which parts of your application allocate a lot of memory, and perhaps why some memory is not freed up.

Caveat: As described above in Garbage Collection Statistics, PHP has a Garbage Collector built in, which can trigger at seemingly random times. This will distort the memory information that is recorded in the profiler's output files. In order to get better results for memory profiling, you might want to consider disabling PHP's internal garbage collector, for example by calling gc_disable() at the start of your script.

Additionally, Xdebug will now add an X-Xdebug-Profile-Filename HTTP header to requests for which the profiler is active. This header holds the name of the file that contains the profiling information for that request.

Remote Debugging Improvements

A new protocol feature, extended_properties, has been introduced that IDEs can opt into. When this feature is enabled, Xdebug will send variable names as Base64 encoded data to allow for characters that can not be represented safely in XML.

Another new protocol feature, notifications, has been introduced that IDEs can opt into. When this feature is enabled, Xdebug will send any Notice, Warning, or Error as an out-of-band notification over the debugging protocol to the IDE which can then display this information.

A new setting, xdebug.remote_timeout, has been added to configure how long Xdebug should wait for an IDE to acknowledge an incoming debugging connection. The default value, 200 ms, should in most cases be enough, but can be increased if you have a particularly high latency on your network and Xdebug fails to make a connection due to the low timeout.

A new function, xdebug_is_debugger_active(), can be used to find out whether an IDE is currently attached to Xdebug through the DBGp protocol.

Xdebug now supports debugging through Unix domain sockets. You can specify Unix domain socket "hosts" with unix:///path/to/sock, with thanks to Sara Golemon.
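A minimal php.ini sketch of such a set-up (the socket path is hypothetical, and this assumes the unix:// address is given as the remote host, with the IDE listening on that socket):

```ini
xdebug.remote_enable = 1
xdebug.remote_host = unix:///var/run/xdebug.sock
```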

Xdebug now enables FD_CLOEXEC on its debugging sockets to prevent them from being leaked to forked processes, thanks to Chris Wright.

Smaller Improvements

A new setting, xdebug.filename_format, has been added to configure how Xdebug renders file names in HTML-like stack traces. Just like xdebug.trace_output_name, it accepts a set of format specifiers that can be used to include certain aspects of a path. With a full path of /var/www/vendor/mail/transport/mta.php, the specifiers that Xdebug 2.6 adds can represent:

- the file name (mta.php)
- the directory and file name (transport/mta.php)
- two directory segments and the file name (mail/transport/mta.php)
- the full path (/var/www/vendor/mail/transport/mta.php)
- the platform-specific slash (/ on Linux and OSX, \ on Windows)

Xdebug now adds the values of superglobals to the error log as well. Previously, it would only add this information to on-screen stack traces. In order for Xdebug to show this information, you need to configure through xdebug.dump_globals and xdebug.dump.* which superglobal keys you want to see.

The %s format specifier is now available to be used with the xdebug.trace_output_name setting. Previously, it was only available for use with the xdebug.profiler_output_name setting.

Trace files generated with xdebug.collect_assignments now also contain assign-by-ref (=&) assignments.

01/30/2018 03:11 am   derickrethans.nl   Mirror   Link   @28

Wireshark and MongoDB 3.6

While working on the MongoDB Driver for PHP, we sometimes run into problems where we send the wrong data to the server, tripping it up, and not getting the results we want.

Drivers talk to MongoDB with a binary protocol, which we refer to as the Wire Protocol. It is a documented binary format, mostly wrapping around BSON structures.

Different versions of MongoDB may respond with varying message formats, and likewise prefer (or require) that drivers use newer formats to issue commands and queries. An additional problem is that sometimes mongoS, our query router for sharding, wants instructions in different formats, and sends results back in a slightly different format than mongoD.

Over the past years, I have been using a network traffic analyser called Wireshark to figure out what actually gets sent over the network connections.

Wireshark runs on various platforms, and on Debian platforms can easily be installed by running apt install wireshark. Starting Wireshark presents you with the following screen:

This screen lists all the network interfaces that are available on the system, and it allows you to select one (Loopback: lo in my case) and start packet collection by clicking on the shark fin.

With Wireshark collecting data, I then connected with the MongoDB shell to the server to see what was sent over a network connection. As no network protocol is the same as any other, Wireshark contains lots of different "dissectors" to analyse each specific protocol. There are a lot of dissectors, for nearly every protocol that you can think of—including HTTP, MySQL, NTP, TCP, and of course MongoDB.

After we connected with the MongoDB shell to the server, we can have a look at what Wireshark collected.

In this first screenshot we see that the MongoDB client's first packet is a Request : Query packet. In this first packet it sends an isMaster command to the server to figure out its version, and some other parameters. The dissector has unpacked the BSON Document with length 201 to be able to show us the various elements that make up the request. For example, it sends along which driver is being used to make the request (MongoDB Internal Client) in the client.driver.name element.

The result includes parameters such as the maximum size of a data packet (maxMessageSizeBytes), the localTime, as well as the wire version (not pictured). The latter is used to determine which wire protocol versions the client and server both understand.

So far we have seen two types of packets, Request and Response, but others exist as well. MongoDB 3.6 adds support for two extra types of packets: Extensible Message Format (OP_MSG) and Compressed (OP_COMPRESSED) packets. The former replaces all earlier ways of sending queries and commands (and receiving their replies) with a single new packet type that can be used in both directions. The latter can be used to compress the data, and wraps an Extensible Message Format packet.

Unfortunately, Wireshark's MongoDB dissector did not support these new packets yet. Instead of waiting for them to be added by the maintainers, I set out to add the missing functionality myself. It is after all closely related to my day job ☺.

As I had never contributed to Wireshark before, I read up on their development guidelines and quickly found that they have a highly automated way of doing code reviews, with rigorous practices in place. This is a welcome difference from most other open source projects to which I've contributed.

As I expected, my first patch adding OP_MSG support needed a few revisions before it was merged.

The result of this first patch is that packets of the OP_MSG type can now also be visualised:

An Extensible Message Format packet consists of one or more Sections, which can be either a Body or a Document Sequence. A Body section is required, with zero or more Document Sequence sections following it. In this example, all of the Message Flags are empty (unset).

In this case the reply contains a cursor description, including its initial batch (firstBatch) of three documents totalling 6392 bytes. Drivers can use these cursor descriptions to iterate over a whole result set.

Once there is a lot of data to transport, it can be beneficial to compress the data stream. MongoDB 3.4 already included some preliminary work for this, but with MongoDB 3.6 compression is now enabled by default. It can use either the snappy compressor or the zlib compressor. Compressed data is sent with a different packet type (OP_COMPRESSED) which, as you can probably guess, was also not yet supported.

My second patch adds OP_COMPRESSED support to Wireshark. Adding support for zlib was easy, as Wireshark already had helper functions available. Supporting snappy required a bit more manual work.

Compressed packets replace the original packet type (OpCode) with one (2012) to denote it is a compressed packet. The header that describes the parameters of the compressed data (Uncompressed Size, and Compressor) also includes the original OpCode that is represented in the compressed data (Original Opcode).
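As an illustration of that header layout, here is a minimal Python sketch (not the Wireshark dissector). The field offsets follow the wire protocol as described above; only the zlib and noop compressors are handled here, since snappy is not in the Python standard library.

```python
import struct
import zlib

OP_COMPRESSED = 2012
COMPRESSORS = {0: "noop", 1: "snappy", 2: "zlib"}  # compressor ids

def parse_op_compressed(message: bytes):
    """Decode an OP_COMPRESSED message and inflate its payload."""
    # Standard header: messageLength, requestID, responseTo, opCode
    msg_len, request_id, response_to, op_code = \
        struct.unpack_from("<iiii", message, 0)
    assert op_code == OP_COMPRESSED
    # OP_COMPRESSED header: originalOpcode, uncompressedSize, compressorId
    original_opcode, uncompressed_size, compressor_id = \
        struct.unpack_from("<iiB", message, 16)
    body = message[25:msg_len]
    if COMPRESSORS.get(compressor_id) == "zlib":
        data = zlib.decompress(body)
    elif COMPRESSORS.get(compressor_id) == "noop":
        data = body
    else:
        raise NotImplementedError("snappy needs a third-party module")
    assert len(data) == uncompressed_size
    return original_opcode, data
```

Note how the Original Opcode travels in the compressed packet's own header, so the dissector knows what the decompressed bytes represent.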

When the dissector decompresses a data packet, it adds a new tab containing the Decompressed Data so that you can view the raw bytes of this uncompressed data once you dive into the data structures (e.g. Section, BodyDocument).

With support for OP_MSG and OP_COMPRESSED added, figuring out if we send something wrong to a MongoDB 3.6 server becomes a lot easier. Things might still be complicated once we throw SSL into the mix, but that will have to wait until another blog post.

01/23/2018 03:36 am   derickrethans.nl   Mirror   Link   @24

Weird Off-by-One Errors

A while ago, a user of the PHP Library for MongoDB reported a bug stating:

There is a special find query that fails with the message: Client Error: bad object in message: bson length doesn't match what we found in object with unknown _id

One can change the characters but not the length of strings and obtain the same exception.

Over the past week I had a look at this, after it was filed as PHPC-1067 for the MongoDB PHP driver, once we found that it actually causes a crash. When running the script from the report, I indeed found a Segmentation Fault (read: crash), where I noticed that during the course of a few function calls an unrelated C variable changed. The change was in a bson_t* type, which the driver uses for variables that go in and out of the MongoDB C client library. This data type has two implementations: an inline version, used for most small BSON documents; and an allocated version that can be used for building BSON structures, and which is typically used for larger documents.

In the driver's function, the data types are declared as:

bson_t  bdocument = BSON_INITIALIZER, boptions = BSON_INITIALIZER;
bson_t *bson_out = NULL;

With the affected code being:

php_phongo_zval_to_bson(zdocument, bson_flags, &bdocument, &bson_out TSRMLS_CC)

zdocument is the zval structure that PHP uses for storing variables, in this case, the document sent into the method. bdocument receives the created BSON document, boptions are options to go with the insertion, and bson_out receives the _id field that is either present in the original zdocument document, or that was generated in case no _id field was present. MongoDB documents are required to have this _id field and bson_out is used to be able to return the generated _id field so that you can do follow up queries with it.

However, after php_phongo_zval_to_bson was run, parts of the structure of boptions had changed in such a way that an internal flag now showed it was an allocated bson_t* instead of the inline version it really was. This meant that when freeing it later in the code through bson_destroy(&boptions), a memory free was attempted on non-allocated memory, which then of course caused a crash in the form of the Segmentation Fault. The weird thing is that the change happened in a variable that should not have been touched by the call to php_phongo_zval_to_bson.

With the reason of the crash found, it was now time to find the cause of the crash. Which proved to be a little trickier.

Normally, with any sort of memory issue, the first thing I do is wield Valgrind, as it usually tells me where things go wrong. I wrote about Valgrind before. But in this case, it didn't show anything besides the process trying to free non-allocated memory:

==9002== Invalid read of size 8
==9002==    at 0xC7CBDA6: bson_destroy (bson.c:2285)
==9002==    by 0xC862155: zim_BulkWrite_insert (BulkWrite.c:246)
==9002==    by 0x9D64E2: execute_internal (zend_execute.c:2077)
==9002==    by 0xB8690CC: xdebug_execute_internal (xdebug.c:2020)
==9002==  Address 0x0 is not stack'd, malloc'd or (recently) free'd

And we already knew that. After looking at, and stepping through, the code with GDB for way too long, I had a revelation while relaxing over the weekend: bdocument and boptions were not actually allocated variables, but just variables on the stack. And Valgrind's memcheck tool has no way to detect problems with those.

Having these variables allocated on the stack is a performance improvement, as there is no overhead of memory allocation, but while trying to find a bug, we don't care much about speed. So I converted these few lines of code to use an allocated version, to see if I could get some more useful information out of Valgrind:

/* Before: stack-allocated */
bson_t  bdocument = BSON_INITIALIZER, boptions = BSON_INITIALIZER;

php_phongo_zval_to_bson(zdocument, bson_flags, &bdocument, &bson_out TSRMLS_CC);

/* After: heap-allocated */
bson_t *bdocument, *boptions;
bdocument = bson_new();
boptions = bson_new();

php_phongo_zval_to_bson(zdocument, bson_flags, bdocument, &bson_out TSRMLS_CC);

With that changed, the Valgrind run now showed:

Invalid write of size 1
   at 0xC7C72FA: _bson_append_va (bson.c:348)
   by 0xC7C74DB: _bson_append (bson.c:404)
   by 0xC7C9DA1: bson_append_regex (bson.c:1575)
   by 0xC85122D: php_phongo_bson_append_object (bson-encode.c:234)
   by 0xC851C57: php_phongo_bson_append (bson-encode.c:382)
   by 0xC85230E: php_phongo_zval_to_bson (bson-encode.c:540)
   by 0xC862047: zim_BulkWrite_insert (BulkWrite.c:229)
 Address 0xd8c9540 is 0 bytes after a block of size 128 alloc'd
   at 0x4C2CBEF: malloc (vg_replace_malloc.c:299)
   by 0x940C00: __zend_malloc (zend_alloc.c:2829)
   by 0xC84DBCF: php_phongo_malloc (php_phongo.c:2485)
   by 0xC7DD388: bson_malloc (bson-memory.c:68)
   by 0xC7CB170: bson_new (bson.c:2015)
   by 0xC861F94: zim_BulkWrite_insert (BulkWrite.c:216)

It says invalid write of size 1 and 0 bytes after a block of size 128 alloc'd. This quickly illustrated that it was trying to do a write outside of its allocated memory area. Adding elements to the BSON document, in this case a regular expression through bson_append_regex, should normally grow the allocated memory when it runs out of already allocated space. Knowing this, combined with the phrasing One can change the characters but not the length of strings and obtain the same exception. from the original bug report, hinted towards an improper calculation of the amount of new data being written to the allocated memory buffer.

I dove into the bson_append_regex code and found the spot where it builds up all the elements to add. I have annotated it a little:

r =  _bson_append (bson,
        5,          // Number of data elements to add

        // number of bytes to add
        (1 + key_length + 1 + regex_len + options_sorted->len),

        1,          // length of first element (BSON type, int8)
        &type,      // the BSON type
        key_length, // the length of the field name
        key,        // the field name (not 0-terminated)
        1,          // the length of the ending 0 byte
        &gZero,     // the null 0 byte
        regex_len,  // the length of the regular expression (including 0 byte)
        regex,      // the regular expression with 0 byte
        options_sorted->len + 1, // the length of the sorted options, with 0 byte
        options_sorted->str);    // the sorted options, with 0 byte

When comparing the "bytes" to add to the sum of the 5 "length" fields, I noticed that although for options_sorted->len an extra byte was added, that was not done in the calculated size on the third line.

This code was recently changed as part of CDRIVER-2118. Unfortunately, a bug sneaked in, miscalculating the length of the now-sorted options. After adding the extra byte back in, the bug disappeared:

-     (1 + key_length + 1 + regex_len + options_sorted->len),
+     (1 + key_length + 1 + regex_len + options_sorted->len + 1),
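The mismatch is easy to replay in a few lines of Python. The byte strings below are made-up stand-ins for the C variables, but the sums mirror the call above:

```python
# Stand-in values for the C variables; the lengths are what matter.
key = b"field"            # field name, appended without its trailing 0 byte
regex = b"^a.*b$\x00"     # regular expression, appended WITH its 0 byte
options = b"ims"          # sorted options; their 0 byte is appended separately

key_length, regex_len, options_len = len(key), len(regex), len(options)

actually_appended = (
    1                    # BSON type byte
    + key_length         # field name
    + 1                  # terminating 0 byte for the field name
    + regex_len          # regex, 0 byte included
    + (options_len + 1)  # options plus their 0 byte
)
declared_before_fix = 1 + key_length + 1 + regex_len + options_len
declared_after_fix = 1 + key_length + 1 + regex_len + options_len + 1

print(actually_appended - declared_before_fix)  # -> 1: declared one byte short
print(actually_appended - declared_after_fix)   # -> 0: sizes now agree
```

One byte more is written than was declared, which is exactly the invalid write of size 1 that Valgrind reported.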

I filed this bug as CDRIVER-2455, and made a Pull Request to fix the bug.

Additional note: I now have found Valgrind's SGCheck tool which should assist in finding stack related memory errors. Unfortunately, this tool currently seems to be inoperative on my platform.

01/08/2018 03:21 am   derickrethans.nl   Mirror   Link   @48

PHP 7.2's "switch" optimisations

PHP 7.2 is around the corner soon, and comes with many optimisations. Many new optimisations are implemented in opcache, but some others are implemented in PHP itself. One optimisation that falls in the latter category is an optimisation of the switch/case construct.

Before PHP 7.2, PHP considers each case statement in order. Let's say we have the following (trivial) code:

switch ($cc) {
        case "de": echo "DEUTSCH"; break;
        case "en": echo "English"; break;
        case "nl": echo "Nederlands"; break;
        case "no": echo "Norsk"; break;
        default: echo "unknown";
}

The PHP interpreter approaches the execution of this switch/case structure as if it were written like:

if ($cc == "de") { echo "DEUTSCH"; }
else if ($cc == "en") { echo "English"; }
else if ($cc == "nl") { echo "Nederlands"; }
else if ($cc == "no") { echo "Norsk"; }
else { echo "unknown"; }

Yup, it really compares the $cc variable to each case statement one by one. Which means that if your most commonly used case is further down the list, you end up wasting valuable CPU cycles.

With vld we can represent this in a graph. For each case statement, the graph shows that the interpreter can either go left or right, just as it was executing multiple if statements in sequence:

PHP 7.2 introduces an optimisation that converts this sequence of if statements into a jump table if all the case statements are either integers or strings. If we look at the pseudo code (PHP does not support variable labels with goto), we get something like:

$_table = [ "de" => 1, "en" => 2, "nl" => 3, "no" => 4 ];
if (gettype($cc) == 'string') {
        if (array_key_exists($cc, $_table)) {
                goto "jmp_{$_table[$cc]}";
        } else {
                goto jmp_default;
        }
} else {
        /* do original if/else if/else sequence */
}
jmp_1: echo "DEUTSCH"; goto end;
jmp_2: echo "English"; goto end;
jmp_3: echo "Nederlands"; goto end;
jmp_4: echo "Norsk"; goto end;
jmp_default: echo "unknown";
end:

This example shows that as long as $cc is a string, it will use the jump table $_table to jump (with goto) to the immediate code representing the case statement. If $cc is not a string, it will have to do the usual if/else if/else dance. In this case, it will eventually end up on the default case.
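The same idea can be sketched in Python, where a dict plays the role of the jump table. This is an illustration of the concept only, not of PHP's actual opcodes:

```python
def language_name(cc):
    # The "jump table": a single hash lookup replaces the linear,
    # case-by-case comparison of the unoptimised switch.
    table = {"de": "DEUTSCH", "en": "English", "nl": "Nederlands", "no": "Norsk"}
    if isinstance(cc, str):  # the jump table only applies to strings
        return table.get(cc, "unknown")
    # Fall back to sequential comparisons for non-string values,
    # mirroring the if/else if/else dance.
    for key, name in table.items():
        if cc == key:
            return name
    return "unknown"

print(language_name("nl"))  # -> Nederlands
print(language_name(42))    # -> unknown
```

Whatever the position of the matching case, the string path costs one lookup instead of up to n comparisons.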

Looking at the graph generated by vld for this optimised construct, we see this new structure as well:

Especially with big switch/case structures this is a welcome optimisation!


Xdebug implements code coverage gathering functionality. This can be used with PHP_CodeCoverage to see which parts of your code, which branches, and which paths are covered while executing your tests. In order for Xdebug to figure out the branches and paths, it has to follow PHP's internal opcode structures to see where jump instructions happen. I had to add support for this new switch/case optimisation to make this work again. This has now been done, and is part of Xdebug's master branch. This branch will soon become the first beta release of Xdebug 2.6 which will provide support for PHP 7.2.

11/01/2017 05:33 am   derickrethans.nl   Mirror   Link   @344

New Date/Time Support in MongoDB

In the past few months I have been working on adding time zone support to MongoDB's Aggregation Framework. This support brings in the timelib library that is also used in PHP and HHVM to do time zone calculations.

Time Zone Support for Date Extraction Operators

MongoDB's Aggregation Framework already has several operators to extract date/time information from a Date value. For example, $month would retrieve the month from a Date value. Considering we have the document:

{ _id: 'derickr', dateOfBirth: ISODate("1978-12-22T08:15:00Z") }

The following PHP code would extract the month 12 from the dateOfBirth field:


$cursor = $collection->aggregate([
    [ '$project' => [
        'birthMonth' => [ '$month' => '$dateOfBirth' ]
    ] ]
]);

foreach ($cursor as $person) {
    echo $person->birthMonth;
}

However, it was not possible to extract this information in the local time zone. So although my birth-time is given at 08:15:00, in reality it was 09:15 in the Netherlands.

The new functionality adds the timezone argument to the date extraction functions. Currently the operators only take a single value:

{ "$month" : ISODateValue }

Because it is not possible to add another argument to this, the syntax has been extended to:

{ "$month" : { "date" : ISODateValue } }

And then with the time zone argument added, it becomes:

{ "$month" : {
    "date" : ISODateValue,
    "timezone": timeZoneIdentifier
} }

For example:

{ $hour: {
    date: ISODate("1978-12-22T08:15:00Z"),
    timezone: "Europe/Amsterdam"
} }

Would return 9 (one hour later than UTC).

The timezone argument is optional, and must evaluate to either an Olson Time Zone Identifier such as Europe/London or America/New_York, or a UTC offset string in one of the forms +03, -0530, or +04:45. If you specify a timezone argument, the date that you provided will be interpreted as if it were in that time zone.
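For comparison, the same semantics can be sketched in Python with the standard zoneinfo module. This is an illustration of the behaviour only, not of MongoDB internals:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# A Date value is stored in UTC; extraction with a timezone argument
# amounts to converting first, then reading the component.
dob = datetime(1978, 12, 22, 8, 15, tzinfo=timezone.utc)

print(dob.month)                                          # -> 12
print(dob.hour)                                           # -> 8, the UTC hour
print(dob.astimezone(ZoneInfo("Europe/Amsterdam")).hour)  # -> 9, the local hour
```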

In PHP, this example extracts the hour, and expresses the value in the Europe/Amsterdam time zone:


$cursor = $collection->aggregate([
    [ '$project' => [
        'birthHour' => [
            '$hour' => [
                "date" => '$dateOfBirth',
                "timezone" => "Europe/Amsterdam",
            ],
        ],
    ] ],
]);

foreach ($cursor as $person) {
    echo $person->birthHour, "\n";
}

The $dateToParts Operator

Extracting a single value from a date can easily be done with one of the Date Extraction operators, but it becomes cumbersome to do this for all the fields one by one, especially when considering the application of time zones as well:

{ "$project" : {
    "year": { "$year" : { "date" : '$dob', "timezone" : "Europe/Amsterdam" } },
    "month": { "$month" : { "date" : '$dob', "timezone" : "Europe/Amsterdam" } },
    "day": { "$dayOfMonth" : { "date" : '$dob', "timezone" : "Europe/Amsterdam" } },
    "hour": { "$hour" : { "date" : '$dob', "timezone" : "Europe/Amsterdam" } },
    "minute": { "$minute" : { "date" : '$dob', "timezone" : "Europe/Amsterdam" } },
} }

The new $dateToParts operator condenses multiple single-value date extraction operators into a single one. Its syntax is:

{ "$project" : {
    "parts" : {
        "$dateToParts" : {
            "date" : ISODateValue,
            "timezone" : timeZoneIdentifier,
            "iso8601" : boolean
        }
    }
} }

The timezone argument is optional, and is interpreted in the same way as the timezone argument in the Date Extraction functions as explained above.

The result of the operator is a sub-document with the broken down parts, expressed in the (optionally) given time zone:

"parts" : {
    "year" : 1978, "month" : 12, "day" : 22,
    "hour" : 9, "minute" : 15, "second" : 0, "millisecond" : 0
}

$dateToParts also supports a third boolean argument, iso8601. If set to true, instead of year, month, and day, it returns the ISO 8601 isoYear, isoWeekYear, and isoDayOfWeek fields representing an ISO Week Date. With the same date, the example is represented as:

"parts" : {
    "isoYear" : 1978, "isoWeekYear" : 51, "isoDayOfWeek" : 5,
    "hour" : 9, "minute" : 15, "second" : 0, "millisecond" : 0
}
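The ISO week date values can be double-checked independently with Python's isocalendar(), which implements the same ISO 8601 rules (just an illustration):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# 1978-12-22 08:15 UTC, viewed in the Europe/Amsterdam time zone.
local = datetime(1978, 12, 22, 8, 15, tzinfo=timezone.utc) \
            .astimezone(ZoneInfo("Europe/Amsterdam"))

print(tuple(local.isocalendar()))  # -> (1978, 51, 5): week 51, a Friday
```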



$cursor = $collection->aggregate([
    [ '$project' => [
        'parts' => [
            '$dateToParts' => [
                "date" => '$dateOfBirth',
                "timezone" => "Europe/Amsterdam",
            ],
        ],
    ] ],
]);

foreach ($cursor as $person) {
    var_dump( $person->parts );
}

Which outputs, with formatting:

class MongoDB\Model\BSONDocument#5 (1) {
  private $storage =>
  array(7) {
    'year' => int(1978)
    'month' => int(12)
    'day' => int(22)
    'hour' => int(9)
    'minute' => int(15)
    'second' => int(0)
    'millisecond' => int(0)
  }
}

The $dateFromParts Operator

The new $dateFromParts operator does the opposite of the $dateToParts operator and constructs a new Date value from its constituent parts, with the possibility of interpreting the given values in a different time zone.

Its syntax is either:

{ "$project" : {
    "date" : {
        "$dateFromParts": {
            "year" : yearExpression,
            "month" : monthExpression,
            "day" : dayExpression,
            "hour" : hourExpression,
            "minute" : minuteExpression,
            "second" : secondExpression,
            "millisecond" : millisecondExpression,
            "timezone" : timezoneExpression
        }
    }
} }


or:

{ "$project" : {
    "date" : {
        "$dateFromParts": {
            "isoYear" : isoYearExpression,
            "isoWeekYear" : isoWeekYearExpression,
            "isoDayOfWeek" : isoDayOfWeekExpression,
            "hour" : hourExpression,
            "minute" : minuteExpression,
            "second" : secondExpression,
            "millisecond" : millisecondExpression,
            "timezone" : timezoneExpression
        }
    }
} }

Each argument's expression needs to evaluate to a number. This means the source can be either double, NumberInt, NumberLong, or Decimal. Decimal and double values are only supported if they convert to a NumberLong without any data loss.

Every argument is optional, except for year or isoYear, depending on which variant is used. If month, day, isoWeekYear, or isoDayOfWeek are not given, they default to 1. The hour, minute, second and millisecond values default to 0 if not present.

The timezone argument is interpreted in the same way as the timezone argument in the Date Extraction functions as explained above.

In PHP, an example looks like:


$cursor = $collection->aggregate([
    [ '$project' => [
        'date' => [
            '$dateFromParts' => [
                "year" => 1978, "month" => 12, "day" => 22,
                "hour" => 9, "minute" => 15, "second" => 0,
                "millisecond" => 0,
                "timezone" => "Europe/Amsterdam",
            ],
        ],
    ] ],
]);

foreach ($cursor as $person) {
    var_dump( $person->date->toDateTime() );
}

Which outputs:

class DateTime#12 (3) {
  public $date => string(26) "1978-12-22 08:15:00.000000"
  public $timezone_type => int(1)
  public $timezone => string(6) "+00:00"
}
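The round trip can be mirrored in Python (an illustration of the semantics only): interpret the parts in the given zone, then normalise to UTC, which is how the resulting Date value is stored.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# The parts from the example, interpreted in Europe/Amsterdam (CET, +01:00).
parts = datetime(1978, 12, 22, 9, 15, 0, tzinfo=ZoneInfo("Europe/Amsterdam"))

print(parts.astimezone(timezone.utc))  # -> 1978-12-22 08:15:00+00:00
```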

Changes to the $dateToString Operator

The $dateToString operator is extended with the timezone argument. Its full new syntax is now:

{ $dateToString: {
    format: formatString,
    date: dateExpression,
    timezone: timeZoneIdentifier
} }

The timezone argument is optional. If present, it formats the string according to the given time zone, otherwise it uses UTC.

The $dateToString format arguments have also been expanded. With the addition of the timezone argument came the %z and %Z format specifiers:


%z

The +hhmm or -hhmm numeric time zone as a string (that is, the hour and minute offset from UTC). Example: +0445, -0500

%Z

The minutes offset from UTC as a number. Example (following the +0445 and -0500 from %z): +285, -300

Once SERVER-29627 gets merged, the following new format specifiers will also be available:


  • The abbreviated English name of the day of the week.

  • The abbreviated English name of the month.

  • The day of the month as a decimal number, but unlike %d, pre-padded with a space instead of a 0.

An example of this in PHP:


$cursor = $collection->aggregate([
    [ '$project' => [
        'date' => [
            '$dateToString' => [
                'date' => '$dateOfBirth',
                'format' => '%Y-%m-%d %H:%M:%S %z',
                'timezone' => 'Australia/Sydney',
            ],
        ],
    ] ],
]);

foreach ($cursor as $person) {
    echo $person->date;
}

Which outputs:

1978-12-22 19:15:00 +1100
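As an aside, the same output can be reproduced with Python's strftime, whose %z behaves comparably here (illustration only):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

dob = datetime(1978, 12, 22, 8, 15, tzinfo=timezone.utc)
# In December, Sydney observes daylight saving time (AEDT, UTC+11).
sydney = dob.astimezone(ZoneInfo("Australia/Sydney"))

print(sydney.strftime("%Y-%m-%d %H:%M:%S %z"))  # -> 1978-12-22 19:15:00 +1100
```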

The $dateFromString Operator

Analogous to PHP's DateTimeImmutable constructor, this operator can be used to create a Date value out of a string. It has the following syntax:

{ "$dateFromString": {
    "dateString": dateString,
    "timezone": timeZoneIdentifier
} }

The dateString could be anything like:

  • 2017-08-04T17:02:51Z

  • August 4, 2017 17:10:27.812+0100

In fact, it will accept everything that PHP's DateTimeImmutable constructor accepts, as under the hood it uses the same library. MongoDB does enforce, though, that it is an absolute date/time string.

The timezone argument is optional, and is interpreted in the same way as the timezone argument in the Date Extraction functions as explained above.

For example:

{ $dateFromString: {
    dateString: "2017-08-04T17:06:41.113",
    timezone: "Europe/London"
} }

Would mean 17:06 local time in London, or 16:06 in UTC (as London right now is at UTC+1).

It is not allowed to specify a time zone through the dateString (such as the ending Z or +0400) and also specify a time zone through the timezone argument. In that case, an exception is thrown.

In PHP, this looks like:


$cursor = $collection->aggregate([
    [ '$project' => [
        'date' => [
            '$dateFromString' => [
                "dateString" => 'August 8th, 2017. 14:14:40',
                "timezone" => "Europe/Amsterdam",
            ],
        ],
    ] ],
]);

foreach ($cursor as $person) {
    var_dump( $person->date->toDateTime() );
}

Which outputs:

class DateTime#12 (3) {
  public $date =>
  string(26) "2017-08-08 12:14:40.000000"
  public $timezone_type =>
  int(1)
  public $timezone =>
  string(6) "+00:00"
}

As you can see, the time zone information is lost when the data is transferred between MongoDB and PHP as the BSON DateTime data type does not carry this information.


The time zone support is currently only available in a development release of MongoDB, and should be considered experimental. It is likely that some of it will still change.

08/15/2017 04:22 am   derickrethans.nl   Mirror   Link   @479

Detecting Problems With -fsanitize

In the past few months I have been working on adding time zone support to MongoDB's Aggregation Framework. This support brings in the timelib library that is also used in PHP and HHVM to do time zone calculations. One of the stages in our workflow, before we commit code to master, is to put our patches up onto our continuous integration platform Evergreen, where tests are run against multiple platforms. You expect the usual Ubuntu, RHEL, and Windows platforms, but we also run on more esoteric platforms like s390. We also define a few special platforms that run tests in an environment where the code has been compiled in a special mode to test for undefined behaviour in the C language, and for memory leaks.

One of the issues that this found (quite quickly) in timelib, was:

src/third_party/timelib-2017.05beta3/parse_tz.c:152:16: runtime error:
  left shift of 128 by 24 places cannot be represented in type 'int'

Which referred to the following code:

buffer[i] = timelib_conv_int(buffer[i]);

timelib_conv_int is a macro defined as:

#define timelib_conv_int(l) ((l & 0x000000ff) << 24) + \
        ((l & 0x0000ff00) << 8) + ((l & 0x00ff0000) >> 8) + \
        ((l & 0xff000000) >> 24)

The sanitiser stumbled over some of the data in the Olson database, where we attempted to shift the unsigned value 128 left by 24 positions into a signed integer, which of course cannot represent this value. Thanks to the sanitiser, this problem is now fixed in beta4.
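To see why this is undefined behaviour for a 32-bit signed integer, here is a small Python illustration. Python integers do not overflow, so we can inspect the offending value directly; the conv_int function below is a hypothetical stand-in, not timelib's actual fix.

```python
import struct

# In a 32-bit signed int, (128 & 0xff) << 24 produces 0x80000000, which
# is one past INT32_MAX -- hence the sanitiser's complaint.
print((128 & 0x000000ff) << 24)  # -> 2147483648
print(2**31 - 1)                 # -> 2147483647 (INT32_MAX)

# A well-defined byte swap can go through explicit (un)packing instead:
def conv_int(value: int) -> int:
    """Reverse the byte order of a 32-bit integer, like the macro intends."""
    return struct.unpack("<i", struct.pack(">i", value))[0]

print(conv_int(0x00000080))  # -> -2147483648 (0x80000000 read back as signed)
```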

As part of the fix, I investigated how our tests managed to figure out that there was undefined behaviour. It appeared that GCC and clang have a specific flag to enable this debugging tool. It is as simple as adding -fsanitize=undefined to your build flags, which is what timelib in its standalone Makefile now includes.

One of the things that is tricky is that when writing C++, I tend to use many C-isms, as that is what I have been working in for so long. And the opposite is true too: C++-isms are sometimes used when dealing with the timelib library, which is written in C. One of these issues created a memory leak (fixed through this patch), as a structure was not allocated on the heap, but on the stack. This structure (timelib_time*) sometimes contains an extra string (for the time zone abbreviation) that needs to be freed if set.

This memory leak was also discovered by a sanitiser flag, but this time it was -fsanitize=address. This flag adds code to the compiled binary to test for memory leaks, overflows, and so on, not too dissimilar to what Valgrind (I wrote about that before) provides. After adding this flag to the default build rules for timelib, it quickly found a few other memory leaks in some test and example files, which I then addressed.

So there we have it, two new GCC and Clang flags that I did not know about. I now always compile timelib, as well as my local MongoDB test builds with these two flags. It certainly slows down execution, but that's a cheap price to pay to prevent bugs from making it into production.

07/21/2017 03:44 am   derickrethans.nl   Mirror   Link   @843

HHVM and MongoDB

At the start of 2015 we began work on an HHVM driver for MongoDB, as part of our project to renew our PHP driver. Back then, HHVM was in its ascendancy and outperforming PHP 5.6 two to one. With such a huge performance difference it was reasonable to assume that many users would be switching over.

Around the start of 2016 I wrote a series of blog posts when we released the brand new drivers for PHP and HHVM. However, by then, PHP 7.0 had been released. PHP 7's performance is on par with HHVM making it less useful for users to move from PHP to HHVM. HHVM still offers a more strongly typed PHP syntax through Hack, and some other features, but its main attraction, speed, was mostly gone.

Writing an extension for HHVM is very different from doing so for PHP. PHP extensions, of which there are plenty, are written in C. HHVM extensions are written in C++, with very few third-party extensions in existence. At the same time, although PHP's APIs are not particularly well documented, there is a large group of people to ask for help. PHP also has a clearly defined internal API, which is stable across minor versions. HHVM does not have this, and its APIs kept changing often, although it wasn't always clear whether we were using an internal API.

Writing an HHVM extension is a very different experience for extension developers compared to PHP extensions. There is even less documentation, and virtually no third party extensions to look at for "inspiration". At the same time, it was much harder to get help from the developers, and much harder to debug as HHVM is many times more complex than PHP.

With PHP 7 released, we saw very little use of the HHVM driver for MongoDB. Some months ago I did a twitter poll, where very few people were indicating that they were using HHVM—and even if they were, they would likely not choose to switch to HHVM given the current climate.

Some of the feedback on the poll was not very reassuring either:

With few users, frequent API breaks, and curious bugs we came to the conclusion that supporting the HHVM driver for MongoDB no longer makes good use of our engineering time. With Doctrine and Symfony 4 also no longer supporting HHVM, we have decided to discontinue the MongoDB driver for HHVM.

If anyone is interested in assuming ownership of the HHVM driver, please drop me a line and we can discuss the process in more detail.

05/25/2017 10:19 am   derickrethans.nl   Mirror   Link   @1281

15 years of Xdebug

This article was going to be about some upcoming features in the 2.6 release. Or rather, I was hoping to announce at least a beta release of Xdebug 2.6. Unfortunately, I couldn't find enough time to work on all the issues that I wanted, although I've made a little progress.

What I can write about, is a little mystery. About 3 weeks ago, I got a mysterious invitation to meet up with James Titcumb, right outside my (and James's!) favourite whisky shop in London. And that I must bring some carrying capacity:

At that time I had put together that, with Xdebug's 15th anniversary looming and me having an Amazon wishlist full of whisky, James would be kind enough to buy me something less "standard".

I however had not quite expected what actually happened. During the day of meeting, I saw some tweets going on about a little (secret) fundraiser. Apparently I wasn't to know, but it is difficult to keep things a secret I suppose. In any case, because I had thought it'd have something to do with meeting James later that day, I didn't really read much of it, as it would likely spoil a surprise.

And what a surprise it was!

So I show up at 5, and there is James with his phone, trying to figure out how Periscope works. We go in and the manager has a story to tell about 8 quite amazing whiskies, which I then get to take home. By amazing, I mean amazing, special, and rare whiskies from closed-down distilleries, and a few more approachable ones. I quickly realised that they are not, erm, cheap either:

  • Balvenie 21 Portwood

  • Dalmore King Alexander

  • Dallas Dhu 27

  • Springbank 25 - 2017 edition

  • Caperdonich 20 - (Demolished Distillery) 2016 release

  • Glenlochy 1979 - (Demolished Distillery) Rare Old Label 2016

  • St.Magdalane 26 - (Demolished Distillery) Rare Archive bottle from 2006

  • Banff 1974 18 year old 40% - (Demolished Distillery) Rare Archive bottle from 1992 Gordon & Macphails

I have tried the Balvenie Portwood 21 and Dalmore King Alexander before, but certainly not the others!

So yeah, after borrowing a suitcase I managed to get these great bottles home, and while doing so, James explained what had happened. PHP's "godfather", Cal Evans, had originally intended to raise money to buy the most expensive bottle from his local whisk(e)y shop - at 4699 USD. James (luckily) managed to convince him that in the whisky world, price doesn't always equal quality. There is a bit of a limit at perhaps £125 for "normal" bottles, but of course quite a bit more for "rare" whiskies. They managed to raise slightly more than that, through the generous donations of many people and companies that find Xdebug useful. I saw the list, and there were many lovely messages in there as well, a few of which I am including here:

"XDebug is hugely important for our team. Thank you for all you have done!"


"Thanks Derick! Such a critical part of day-to-day PHP dev life =)"

"Xdebug has helped me solving numerous bugs, thanks!"

"Because every good developer knows that great code begins with great debugging tools!"

"xdebug helps me to not drink... ashnazg"

"Thank you for your efforts in the community!"

"Xdebug has made my development life immeasurably easier over the last 15 years. Here's to the next 15, and a massive thank you! :-)"

"For one of the most used and usefully tools in a professionals day to day live beside a good editor."

"It's people like you that make this such an amazing community! Thank you for giving yourself so freely for so long!"

"I can barely remember what coding was like before discovering XDebug. Thank you Derick for making our lives easier!"

Thanks for these lovely messages, and thanks for donating to my whisky fund Aaron Saray, Accent Interactive, Adam Culp, Adam Kammeyer, Adrián Cárdenas, Alain Schlesser, Alex Ross, Alexander Marinov, Andreas Heigl, Andrew Caya, Andrew Millington, Antonis Pavlakis, Barry Hughes, Bart Reunes, Ben Ramsey, Bernhard Breytenbach, Bill Condo, Boone Gorges, Boyan Djumakov, Boyan Yordanov, Chris Brookins, Chris Hartjes, Chris Sherry, Chris Spruck, Chuck Burgess, Code4Hire Kft, Cristiano Diniz da Silva, Damien Seguy, Daniel Abernathy, Dave Hall, David Alger, David Lundgren, Diana Espino, Doug Johnson, Dougal Campbell, Enrico Zimuel, Eric Hogue, Ferenc Kovács, Fran Novo, Frank de Jonge, Frederic Dewinne, Freek Van der Herten, Gilbert Pellegrom, Goran Mitrovic, Gordon Forsythe, Guillaume Rossolini, Iain Poulson, Ian H, Ian Littman, J.T. Grimes, Jake Smith, Jakub Gadkowski, James LaChance, James Titcumb, Jeff Carouth, Jeff Kolesnikowicz, Jeff Rupert, Jeremy Emery, Jeremy Lindblom, Jeroen Boersma, Jeroen de Jong, JetBrains sro, Joey Fowler, Josh Butts, Josh Holmes, Joshua Thijssen, Juliette Reinders Folmer, Kara Ferguson, Kathryn Reeve, Ken Sherman, Kevin Schroeder, Lance Cleveland, Laura Folco, Liam Wiltshire, Lucas van Lierop, Luke Stokes, Mark Baker, Matt Trask, Matthew Weier O'Phinney, Max Griffin, Merlijn Tishauser, Michael Babker, Michael Butler, Michael Dyrynda, Michael Moussa, Michael Pearson, Michael Stowe, Michael Williams, Mihail Irintchev, Milan Popovic, Modern Tribe, Nate Ritter, Navarr Barnier, Nikolay Ignatov, Nils Preuss, Noah Heck, Omni Adams, Paul McGrane, Paul Sohier, Paul Yasi, Peter Breuls, Pádraic Brady, Rafael Dohms, Rich Sage, Richard Bairwell, Richard Hagen, Rob Allen, Robert Basic, Robert Landers, Rodrigo Capilé, Russell Barnhart, Ryan Weaver and Leanna Pelham, Samantha Quiñones, Sammy Powers, Sandy Smith, Scott Arciszewski, Sebastian Feldmann, Shaun Hare, SitePoint PHP Channel, Stefan Koopmanschap, Stephan Hochdörfer, Steve Grunwell, Steven Wade, 
Svetlozar Stoyanov, Team Enrise, Tim Stamp, Tom Cruickshank, Tom De Wit, Toni Vega, Ulf Wandschneider, Willem-Jan Zijderveld, Wim Godden, Youri Thielen, Zeke Farwell, and the anonymous donors!*

What's next?

I'll be publishing the tasting notes for the whiskies on https://dram.io — one of the few places where I actually use Xdebug myself. I might not open all of them (yet) though.

And on the Xdebug front, there are plenty of bugs to fix, features to add for Xdebug 2.6, and undoubtedly Dmitry will be "breaking" some things in PHP 7.2 that I need to support in Xdebug as well.


05/09/2017 08:00 am   derickrethans.nl   Mirror   Link   @1501

Good Bye PHP 5

A few days ago I merged a patch into Xdebug that removes support for PHP 5 in Xdebug's master branch on GitHub. Maintaining PHP 5 and PHP 7 support in one code base is not particularly easy, and even more complicated for something like Xdebug, with its deep interactions with PHP's internals.

As PHP 5.6's active support ended on December 31st, I felt that Xdebug no longer needed to support PHP 5 either. Dropping it saves more than 5000 lines of code:

Many people were quite positive about that:

Others were less keen:

Removing PHP 5 support from Xdebug's master branch does not mean that Xdebug suddenly stops working for PHP 5 installations. Xdebug 2.5, which was recently released, supports PHP 5.5 and 5.6, and is not going to go away.

Right now, Xdebug will no longer receive new features in the branch that also supports PHP 5. New features will only go into master (to become Xdebug 2.6). However, Xdebug 2.5 continues to receive bug fixes until Xdebug 2.6 comes out.

Once Xdebug 2.6 comes out, the Xdebug 2.5 branch will no longer get bug fixes, and hence support for PHP 5 goes away. That still does not mean that you can no longer use Xdebug with PHP 5. The releases of the 2.5 branch will still be available.

On the positive side, not having to implement lots of code twice also means that new features can be added faster, as less work is required. Xdebug 2.6 already has some new features lined up.

01/11/2017 03:44 am   derickrethans.nl   Mirror   Link   @2878

Natural Language Sorting with MongoDB 3.4

Arranging English words in order is simple—most of the time. You simply arrange them in alphabetical order. Sorting a set of German words, or French words with all of their accents, or Chinese with its different characters, is a lot harder than it looks. Sorting rules are specified through locales, which determine how accents are sorted, in which order characters appear, and how to do case-insensitive sorting. A good set of those sorting rules is available through CLDR, and there is a neat example to play with all kinds of sorting at ICU's demo site. If you want to know how the algorithms work, have a look at the Unicode Consortium's report on the Unicode Collation Algorithm.

Years ago I wrote about collation and MongoDB. There is an old issue in MongoDB's JIRA tracker, SERVER-1920, to implement collation so that sorting and indexing could work depending on the different sorting orders as described for each language (locale).

Support for these collations has finally landed in MongoDB 3.4, and in this article we are going to have a look at how they work.

How Unicode Collation Works

Many computer languages have their own implementation of the Unicode Collation Algorithm, often implemented through ICU. PHP has an ICU based implementation as part of the intl extension, in the form of the Collator class.

The Collator class encapsulates the Unicode Collation Algorithm to allow you to sort an array of text yourself. It also allows you to visualise the "sort key" to see how the algorithm works:

Take for example the following array of words:

$dictionary = [
    'boffey', 'bøhm', 'brown',
];

Which we can turn into sort keys, and sort using the en locale (English):

$collator = new Collator( 'en' );
foreach ( $dictionary as $word )
{
    $sortKey = $collator->getSortKey( $word );
    $dictionaryWithKey[ bin2hex( $sortKey ) ] = $word;
}

ksort( $dictionaryWithKey );
print_r( $dictionaryWithKey );

Which outputs:

    [2b4533333159010a010a] => boffey
    [2b453741014496060109] => bøhm
    [2b4b45554301090109] => brown

If we would do this according to the nb (Norwegian) locale, the output would have brown and bøhm reversed:

    [2b4533333159010a010a] => boffey
    [2b4b45554301090109] => brown
    [2b5c6703374101080108] => bøhm

The sort key for bøhm has now changed, so that its numerical value makes it sort after brown instead of before it. In Norwegian, the ø is a distinct letter that sorts after z.
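
The two orderings are not PHP-specific; they come straight from the CLDR rules. As a quick sketch in JavaScript (the language of the mongo shell used later on), Intl.Collator reproduces both sorts — assuming a runtime with full ICU locale data, such as a recent Node.js:

```javascript
// Sort the same words under the English and the Norwegian locale rules.
const words = [ 'boffey', 'bøhm', 'brown' ];

const english = [ ...words ].sort( new Intl.Collator( 'en' ).compare );
const norwegian = [ ...words ].sort( new Intl.Collator( 'nb' ).compare );

console.log( english.join( ' ' ) );   // boffey bøhm brown
console.log( norwegian.join( ' ' ) ); // boffey brown bøhm
```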

MongoDB 3.4

Before the release of MongoDB 3.4, it was not possible to do a locale based search. As case-insensitivity is just another property of a locale, that was not supported either. Many users worked around this by storing a lower case version of the value in a separate field just to do a case-insensitive search. But this has now changed with the implementation of SERVER-1920.

In MongoDB 3.4 you may attach a default locale to a collection:

db.createCollection( 'dictionary', { collation: { locale: 'nb' } } );

The default locale is used for any query that does not specify a different locale. Compare the default (nb) locale:

> db.dictionary.find().sort( { word: 1 } );
{ "_id" : ObjectId("5846d65210d52027a50725f0"), "word" : "boffey" }
{ "_id" : ObjectId("5846d65210d52027a50725f1"), "word" : "brown" }
{ "_id" : ObjectId("5846d65210d52027a50725f2"), "word" : "bøhm" }

With the English (en) locale:

> db.dictionary.find().collation( { locale: 'en'} ).sort( { word: 1 } );
{ "_id" : ObjectId("5846d65210d52027a50725f0"), "word" : "boffey" }
{ "_id" : ObjectId("5846d65210d52027a50725f2"), "word" : "bøhm" }
{ "_id" : ObjectId("5846d65210d52027a50725f1"), "word" : "brown" }

The default locale of a collection is also inherited by an index when you create one:

db.dictionary.createIndex( { word: 1 } );

    {
        "v" : 2,
        "key" : { "word" : 1 },
        "name" : "word_1",
        "ns" : "demo.dictionary",
        "collation" : {
            "locale" : "nb",
            "caseLevel" : false,
            "caseFirst" : "off",
            "strength" : 3,
            "numericOrdering" : false,
            "alternate" : "non-ignorable",
            "maxVariable" : "punct",
            "normalization" : false,
            "backwards" : false,
            "version" : "57.1"
        }
    }

From PHP

All the examples below are using the PHP driver for MongoDB (1.2.0) and the accompanying library (1.1.0). These are the minimum versions to work with locales.

To use the MongoDB PHP Library, you need to use Composer to install it, and include the Composer-generated autoloader to make the library available to the script. In short, that is:

composer require mongodb/mongodb=^1.1.0

And at the start of your script:

require 'vendor/autoload.php';
In this first example, we are going to drop the collection dictionary from the demo database, and create a collection with the default collation en. We also create an index on the word field and insert a couple of words.

First the set-up and assigning of the database handle ($demo):

$client = new \MongoDB\Client();
$demo = $client->demo;

Then we drop the dictionary collection:

$demo->dropCollection( 'dictionary' );

We create a new collection dictionary and set the default collation for this collection to the en locale:

$demo->createCollection( 'dictionary', [
    'collation' => [ 'locale' => 'en' ],
] );
$dictionary = $demo->dictionary;

We create the index, and we also give the index the name dictionary_en. MongoDB supports multiple indexes with the same field pattern, as long as they have different names and different collations (e.g. locale, or locale options):

$dictionary->createIndex(
    [ 'word' => 1 ],
    [ 'name' => 'dictionary_en' ]
);

And then we insert some words:

$dictionary->insertMany( [
    [ 'word' => 'beer' ],
    [ 'word' => 'Beer' ],
    [ 'word' => 'côte' ],
    [ 'word' => 'coté' ],
    [ 'word' => 'høme' ],
    [ 'word' => 'id_12' ],
    [ 'word' => 'id_4' ],
    [ 'word' => 'Home' ],
] );

When doing a query, you can specify the locale for that operation. Only one locale can be used for a single operation, which means that MongoDB uses the same locale for the find and the sort parts of a query. We do intend to support more granular use of collations on different parts of an operation. This is tracked in SERVER-25954.

Only the Base Character

Locales have many variants and options. The strength option defines the number of levels that are used to compare characters. At strength=1, only base characters are compared. This means that with the en locale: beer == Beer, coté == côte, and Home == høme.
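
These equivalences are ICU behaviour rather than something MongoDB-specific. As a rough sketch in JavaScript (again assuming a runtime with full ICU data), Intl.Collator's sensitivity: 'base' option corresponds approximately to strength=1:

```javascript
// sensitivity: 'base' compares base characters only,
// roughly equivalent to MongoDB's strength: 1.
const base = new Intl.Collator( 'en', { sensitivity: 'base' } );

console.log( base.compare( 'beer', 'Beer' ) ); // 0: case is ignored
console.log( base.compare( 'coté', 'côte' ) ); // 0: accents are ignored
console.log( base.compare( 'Home', 'høme' ) ); // 0: both are ignored
```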

You can specify the strength with each query. First we use the en locale and strength 1. This is equivalent to a case-insensitive match:

// Match on base character only
$cursor = $dictionary->find(
    [ 'word' => 'beer' ],
    [ 'collation' => [ 'locale' => 'en', 'strength' => 1 ] ]
);

Which outputs:

Match on base character only:
beer Beer

Strength 1 also ignores accents on characters, such as in:

// Match on base character only, ignoring accents
$cursor = $dictionary->find(
    [ 'word' => 'home' ],
    [ 'collation' => [ 'locale' => 'en', 'strength' => 1 ] ]
);

Which outputs:

Match on base character only, ignoring accents:
høme Home

As strength, or any of the other options we will see later, changes the sort key for a string, an index in MongoDB will only be used if it was created with the exact same collation options as the ones used in the query.

Because we only have an index on word with the default en locale, none of the other examples use an index while matching or sorting. If you want to make an indexed lookup for the en/strength=1 example, you need to create an index with:

$dictionary->createIndex(
    [ 'word' => 1 ],
    [
        'name' => 'word_en_strength1',
        'collation' => [
            'locale' => 'en',
            'strength' => 1,
        ],
    ]
);
Sorting Accents

Strength 2 takes into account accents on letters while matching and sorting. If we run the match on home in the English locale with strength 2, we get:

// Match on base character with accents
$cursor = $dictionary->find(
    [ 'word' => 'home' ],
    [ 'collation' => [ 'locale' => 'en', 'strength' => 2 ] ]
);

Which outputs:

Match on base character with accents:
Home

The word høme is no longer included. However, the case of characters is still not considered:

// Match on base character with accents (and not case sensitive)
$cursor = $dictionary->find(
    [ 'word' => 'beer' ],
    [ 'collation' => [ 'locale' => 'en', 'strength' => 2 ] ]
);

Which outputs:

Match on base character with accents (and not case sensitive):
beer Beer
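
In the Intl.Collator sketch from before (assuming full ICU data), strength 2 corresponds approximately to sensitivity: 'accent' — accents become significant while case remains ignored:

```javascript
// sensitivity: 'accent' distinguishes accents but still ignores case,
// roughly equivalent to MongoDB's strength: 2.
const accent = new Intl.Collator( 'en', { sensitivity: 'accent' } );

console.log( accent.compare( 'home', 'høme' ) !== 0 ); // true: accents now matter
console.log( accent.compare( 'beer', 'Beer' ) === 0 ); // true: case is still ignored
```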

Again, more fun can be had while sorting with accents, because languages do things differently. If we take the words côte and coté, we see a difference in sorting between the fr (French) and fr_CA (Canadian French) locales:

// Sorting accents in French (France)
$cursor = $dictionary->find(
    [ 'word' => new \MongoDB\BSON\Regex( '^c' ) ],
    [
        'collation' => [ 'locale' => 'fr', 'strength' => 2 ],
        'sort' => [ 'word' => 1 ],
    ]
);

// Sorting accents in Canadian French
$cursor = $dictionary->find(
    [ 'word' => new \MongoDB\BSON\Regex( '^c' ) ],
    [
        'collation' => [ 'locale' => 'fr_CA', 'strength' => 2 ],
        'sort' => [ 'word' => 1 ],
    ]
);

Which outputs:

Sorting accents in French (France):
coté côte

Sorting accents in Canadian French:
côte coté
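
The fr versus fr_CA difference is, once more, an ICU rule rather than a MongoDB one. A sketch with JavaScript's Intl.Collator (assuming full ICU data) shows the same two orderings:

```javascript
// In Canadian French, secondary (accent) differences are
// compared from the end of the string towards the front.
const words = [ 'côte', 'coté' ];

const france = [ ...words ].sort( new Intl.Collator( 'fr' ).compare );
const canada = [ ...words ].sort( new Intl.Collator( 'fr-CA' ).compare );

console.log( france.join( ' ' ) ); // coté côte
console.log( canada.join( ' ' ) ); // côte coté
```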

In Canadian French, accents sort from the back of the word to the front. This is called Backward Secondary Sorting, and it is an option you can set on any locale-based query. Some language locales have different default values for these options. To make Canadian French sort the "wrong" way, we can specify the additional backwards option:

// Sorting accents in Canadian French, the 'wrong' way
$cursor = $dictionary->find(
    [ 'word' => new \MongoDB\BSON\Regex( '^c' ) ],
    [
        'collation' => [ 'locale' => 'fr_CA', 'strength' => 2, 'backwards' => false ],
        'sort' => [ 'word' => 1 ],
    ]
);

Which outputs:

Sorting accents in Canadian French, the 'wrong' way:
coté côte

12/16/2016 05:51 am   derickrethans.nl   Mirror   Link   @2839

Not Finding the Symbols

Yesterday we released the new version of the MongoDB Driver for PHP, to coincide with the release of MongoDB 3.4. Not long after that, we received an issue through GitHub titled "Undefined Symbol php_json_serializable_ce in Unknown on Line 0".

TL;DR: Load the JSON extension before the MongoDB extension.

The newly released version of the driver has support for PHP's json_encode() through the JsonSerializable interface, to convert some of our internal BSON types (think MongoDB\BSON\Binary and MongoDB\BSON\UTCDateTime) directly to JSON. For this it uses functionality in PHP's JSON extension, and with that the php_json_serializable_ce symbol that this extension defines.

We run our test suite on many different distributions, but (nearly) always with our own compiled PHP binaries, as we need to support so many versions of PHP (5.4-5.6, 7.0, and now 7.1), in various configurations (ZTS or not; 32-bit or 64-bit). It hence came as quite a surprise that a self-compiled extension would not load for one of our users.

When compiling PHP from source, the JSON extension becomes part of the binary by default. This means that the JSON extension, and the symbols it implements, are always available. Linux distributions often split each extension out into its own package or shared object. Debian has php5-json (on which php5-cli depends), while Fedora has php-json. In order to make use of the JSON extension, you therefore need to install a separate package that provides the shared object (json.so) and a configuration file. Fedora installs the 20-json.ini file in /etc/php.d/. Debian installs the 20-json.ini file in /etc/php5/mods-available with a symlink to /etc/php5/cli/conf.d/20-json.ini. In both cases, they include the required extension=json.so line that instructs PHP to load the shared object and make its symbols (and PHP functions) available.

A normal PHP binary uses the dlopen system call to load a shared object, with the RTLD_LAZY flag. This flag means that symbols (such as php_json_serializable_ce) are only resolved lazily, when they are first used. This is important, because PHP extensions, and the shared objects they live in, can depend on each other. The MongoDB extension depends on date, spl, and json. After PHP has loaded all the shared extensions, it registers the classes and functions contained in them, in an order that satisfies this dependency graph. PHP makes sure that the classes and functions in the JSON extension are registered before the MongoDB extension, so that when the latter uses the php_json_serializable_ce symbol to declare that the MongoDB\BSON\UTCDateTime class implements the JsonSerializable interface, the symbol is already available.

Distributions often want to harden their provided packages with additional security features. For that, they compile binaries with additional features and flags.

Debian patches PHP to replace the RTLD_LAZY flag with RTLD_NOW. Instead of resolving symbols when they are first used, this signals to the dlopen system call to resolve the symbols when the shared object is loaded. This means that if the MongoDB extension is loaded before the JSON extension, the symbols are not available yet, and the linker throws the "Undefined Symbol php_json_serializable_ce in Unknown on Line 0" error from our bug report. This problem is not exclusive to PHP; TCL has similar issues, for example.

With Fedora, the same issue is present, but it shows up in a slightly different way. Instead of patching PHP to replace RTLD_LAZY with RTLD_NOW, Fedora uses linker flags ("-Wl,-z,relro,-z,now") to force binaries to resolve symbols process-wide as soon as they are loaded. This Built with BIND_NOW security feature goes hand in hand with Built with RELRO. Why these features are enabled on Fedora is well described on their wiki. Previously, this exposed an issue with an internal PHP API for creating a DateTime object.

But where does this leave us? The solution is fairly simple: you need to make sure that the JSON extension's shared object is loaded before the MongoDB extension's shared object. PECL's pecl install suggests adding the extension=mongodb.so line to the end of php.ini. Instead, on Debian, it is much better to put the extension=mongodb.so line in a separate mongodb.ini file under /etc/php5/mods-available, with symlinks to /etc/php5/cli/conf.d/99-mongodb.ini and /etc/php5/apache2/conf.d/99-mongodb.ini:

cat << EOF > /etc/php5/mods-available/mongodb.ini
; priority=99
extension=mongodb.so
EOF
php5enmod mongodb

On Fedora, you should add the extension=mongodb.so line to the new file /etc/php.d/50-mongodb.ini:

echo "extension=mongodb.so" > /etc/php.d/50-mongodb.ini

Alternatively, you can install the distribution's package for the MongoDB extension. Fedora currently has the updated 1.2.0 release for Rawhide (Fedora 26). Debian, however, does not yet provide a package for the latest release, although an older version (1.1.7) is available in Debian unstable. At the time of writing, Ubuntu only provides older versions for Xenial and Yakkety.

11/30/2016 08:55 am   derickrethans.nl   Mirror   Link   @2832