“DLL Hell”, and avoiding an NP-complete problem

Needing two incompatible version of the same library  can be an evil problem, and a recent paper pointed out that solving it was NP-complete.

“DLL hell” has been a problem in Windows and many of the free and commercial Unixes. Interestingly, it was recognized and fixed in Multics, but then reproduced in both Unix and Windows, and finally re-fixed in SunOS 5. More recently, it was fixed in part in Linux glibc.

This is the story of how.

Introduction

Last week, I had the pleasure of reading about Kristoffer Grönlund’s 2017 linux.conf.au talk, Package managers all the way down , about the recurring problems of package managers for both operating systems and language families.

dll_hell

DLL hell is back, and hitting multi-language projects like Hawk as well as language-specific tools like npm.

Russ Cox investigated the problem, and wrote well-known paper that opens with “Dependency hell is NP-complete. But maybe we can climb out.”

The problem is that a program that use libraries can end up needing two different versions of a subordinate library. For example, I might write a main program that uses glibc n+1, but from it call a library that requires glibc n.

If your system can’t load both, you can’t link.

History

In principle, you can design a linker that can load both versions: the Rust language and the and the Nix package manager have addressed that approach, at the expense of managing a large and ambiguous symbol space.

A elegant idea, however, is to version-number the individual interfaces, and keep both old and new interfaces in the libraries. When you update a library, you get new interfaces and bug-fixes, but you also gets the backwards-compatibility of having the “old” interface in the same file.  Of course, it isn’t always a complete copy of the old versions: often it’s a “downdater” that calls the new one with different parameters, or an “updater” that wraps the old code in a function with additional capabilities.

The idea of having only one copy of a library on the system and keeping old versions of calls came originally from Multics: Paul Stachour wrote about their hardware and software versioning in You Don’t Know Jack About Software Maintenance. It that era, they did continuous delivery, something which required backwards compatibility: they just called it continuous maintenance.

On Multics, all maintained versions of an interface exist in the library, and it is the interfaces that have versions. If someone talks about version 7 of a library, they means a library that includes the interfaces that were added for version 7.

On Unix, that degree of complexity was undesirable: the early Unixes did not link dynamically and the number of libraries was small, so it was hard to have the problem.

Systems with lots of shared libraries and products depending on them, however, made us aware of it once more. Windows gave us the name “DLL hell”, and we promptly recreated it in  SunOS 4, as did the other Unix vendors.  For SunOS 5, my team under David J Brown then recreated the fix, and later discovered that the glibc team had done so too. David described both in a Usenix talk, Library Interface Versioning in Solaris and Linux.

We needed to do so because we had a “compatibility guarantee”. If your program ran and passed the “appcert” script, then we warranted it would run on all newer Solaris versions. If it didn’t, that was our fault, and ours to fix. Failing to fix DLL hell would have cost us significant money!

In Linux, Linus make a similar guarantee, to not break running programs. It too is hard, but manageable inside a mostly-static kernel of moderate size. It’s harder in the application world.

How it Works

Assume my main-program and a library both call memcpy. My program makes sure that the source and target don’t overlap, so it can use the default memcpy.  However, the library does do overlappped copies, and blithely assumed that memcpy did pure left-to-right copying. That’s no longer true, so the library is linked to  a specific memcpy, memcpy@GLIBC_2.2.5, which does make that guarantee. That interface is “downdated” to do strict left-to-right copying by being turned into a call to memmove instead of to the newest memcpy.

To ensure this, the compiler and linker need to be sensitive to the versions needed, and insert the equivalent of “.symver memcpy, memcpy@GLIBC_2.2.5” into the linkage process, using the rules described in David J. Brown’s paper to choose a version old enough to guarantee the needed behavior.

Understanding the implications

Paul Stachour’s paper describes how versioning is used in an across-the-net API, similar to modern web development, to achieve continuous delivery.

David J. Brown’s paper addresses managing many variations over time, and allowing some degree of new-runs-on-old, to avoid forcing developers to keep old compilers around when doing bug-fixes to older versions of a program.

To me, the value during development is huge.  We typically had a number of services in development or use, each with an API and several clients. We would only be working on one of the clients at a time, and wanted to avoid “flag days”, where the team had to stop and change the api of the server and all their clients before we could continue. Those were horrible.

Instead, our process became

  • put the new version of the API on the server, and add a downdater
  • when we next updated each  client, change the client to use it,
  • when all clients use the new version, remove the downdater.

The only oddity you’d notice was that some versions didn’t ever ship, but would be consumed internally.  You’d see versions like 4.1.3 followed by 4.1.9 if you used one of our team’s libraries.

That was not a solution to every linking problem: indeed, we still had the “normal” problem that each time we got  a new release of a library we call directly,  then we found any breaking changes the first time we compiled. But that’s the same read-the-release-notes problem we had before.  therefore we never need to worry about  “vendoring” or “dependency management”in our development.

Conclusions

 

Today, we’re once again faced with the incompatible versions problem, in both language- and os-based package managers.

This time, however, we have the opportunity to see two working implementations of a solution. From those, we have the opportunity take the best of their ideas and properly solve this recurring problem.

4 thoughts on ““DLL Hell”, and avoiding an NP-complete problem

  1. All to many people resist change, not realizing that it’s necessary. Or they try to defer it until something bad happens (;-))

    The problems of managing bugs deserves a post in its own right… on it’s way!

    Like

  2. The example of memcpy with overlapped arguments is interesting.
    If I remember correctly, it has been “undefined behaviour” (i.e. a bug) to overlap arguments since the function was introduced. It seems a bit unreasonable to require libraries to maintain identical behaviour of buggy clients. After all, those clients are not following the “contract”.
    And yet some buggy programs worked for years and then failed when the library changed its implementation to not be strictly left to right copying. Even one of mine. So it is a real problem.
    As a programmer, I really don’t want the libraries used by my programs to be frozen. I want bugs in libraries fixed and those fixes applied to my fielded programs. I even want enhancements added automatically if they are purely within the library and don’t retract any part of its documented interface to the program. One such enhancement might be improved efficiency.
    I get the impression that sysadmins are so wary of change these days that they do want frozen programs. So major vs minor vs micro library version bumps are the same: accept no substitutes. That means that any actual bug fix (think security fix), even if it only affects a library, can only be addressed by reissuing all the programs that use the library. Now that’s brittle.
    As a programmer, I’d like the library to check (within reason) if the program is violating the contract and give a clear unambiguous response on violation. (Heresy: I’ve also found that if that response isn’t a crash, bugs get ignored. So I like crashes.) C isn’t very good at responding to certain program errors (overlapping arguments, dangling pointers, array bound violations, etc.) and this has caused many, perhaps most, CVEs. Has that really been worth the runtime efficiency gain?
    Adding new checks to a library (or anything else in the tool chain) is a Good Thing but it might well cause old program’s bugs to become manifest in an inconvenient ways. If you cannot get that program fixed, you have the wrong relationship with your supplier.
    The fundamental problem is managing change. That’s hard. Especially with distributed responsibility. But “no change” is a long-term mistake. Somehow responsibility, control, and architecture need to be congruent. Each participant’s view must reflect this. Perhaps mine does not.

    Liked by 1 person

  3. In a related discussion on lwn (https://lwn.net/Articles/730779/) swilmet wrote about two concurrent approaches to the NP-complete problem:

    Interesting.

    In GNOME and GTK+, two different major versions of a library are completely parallel-installable, see:
    https://developer.gnome.org/programming-guidelines/stable

    For example GTK+ 2 is installed in parallel to GTK+ 3.

    For the glibc, I think it could work too: glibc 2 and 3 could be installed in parallel, and higher-level libraries/components/apps are ported to glibc 3 layer by layer, from bottom to top. But it requires that every higher-level library is also parallel-installable, and that they bump their major version when porting to glibc 3. But it would probably take many years to port all the Linux userland libraries to glibc 3, we would realize that some low-level stuff are not that well maintained and that nobody works on those modules anymore.

    But with containers, an application can bundle older versions of libraries and use an older runtime, if the work is not yet done to port the code to the new versions. So even without parallel-installability nor the mechanism to have several versions of the same symbol in a library (like currently implemented in glibc), two different containers can have two different (and incompatible) versions of the glibc or any other library.

    With containers the NP-complete problem doesn’t exist because the upstream developers have already chosen the right runtime and bundled the right versions of additional libraries.

    Like

Leave a comment