Postfix: trivial-rewrite killed by signal 11

on September 21, 2014 in Linux

I was setting up a small VPS as a backup e-mail server for the two already in place. What was supposed to be a 15-minute task, particularly as it was being installed using a proven Puppet recipe, turned into a diagnostic nightmare lasting hours. Looking back, it really shouldn’t have taken that long to diagnose, but alas, Google led me astray.

See, everything was installed the same way as on the other servers. Postfix started up fine, but as soon as it performed a lookup in an LDAP directory, the following error occurred:

Sep 21 00:34:02 server postfix/master[23426]: warning: process /usr/lib/postfix/trivial-rewrite pid 23460 killed by signal 11
Sep 21 00:34:03 server postfix/qmgr[23431]: warning: problem talking to service rewrite: Success
Sep 21 00:34:03 server postfix/master[23426]: warning: process /usr/lib/postfix/trivial-rewrite pid 23461 killed by signal 11
Sep 21 00:34:03 server postfix/master[23426]: warning: /usr/lib/postfix/trivial-rewrite: bad command startup -- throttling

I have become accustomed to looking things up on Google first, to see if someone else has already figured out the cause of an issue. Almost all of the results referenced issues from back in 2005-2008, where /dev/(u?)random was missing from Postfix’s chroot directory (/var/spool/postfix/). But it was there like it was supposed to be, with all the correct permissions.
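For anyone who wants to rule out the same cause, here is a minimal sketch of that check. The function name check_chroot_random is made up for illustration, and it assumes the devices live under dev/ inside the chroot directory you pass in:

```shell
# Report which random devices are missing from a chroot directory.
# Returns 0 when both exist as character devices, 1 otherwise.
# (Hypothetical helper, not part of Postfix.)
check_chroot_random() {
    chroot_dir=$1
    missing=0
    for dev in random urandom; do
        if [ -c "$chroot_dir/dev/$dev" ]; then
            # Show owner and permissions so they can be eyeballed too
            ls -l "$chroot_dir/dev/$dev"
        else
            echo "missing: $chroot_dir/dev/$dev"
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_chroot_random /var/spool/postfix
```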

With TLS disabled in the LDAP mappings for Postfix, there weren’t any errors. But the policies in place require that all communication be done with TLS or SSL, even over a private network (which in this case wasn’t private: it is a backup server in Sweden). The regular LDAP utilities didn’t have any trouble communicating with the LDAP directory over TLS/SSL, however, which led me to ignore TLS and its associated libraries as a possible cause.

Additionally, the articles I found via Google mainly referred to signal 6 errors, not signal 11, which is a segmentation fault. Given that, I went on to strace trivial-rewrite. The Postfix documentation explains well how to set up an automatic trace (instead of “truss” I use “strace”). From that I could see the following relevant part:

Sep 21 01:10:48 server logger: socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 8
Sep 21 01:10:48 server logger: connect(8, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = 0
Sep 21 01:10:48 server logger: sendto(8, "\2\0\0\0\17\0\0\0\10\0\0\0postfix\0", 20, MSG_NOSIGNAL, NULL, 0) = 20
Sep 21 01:10:48 server logger: poll([{fd=8, events=POLLIN|POLLERR|POLLHUP}], 1, 5000) = 1 ([{fd=8, revents=POLLIN|POLLHUP}])
Sep 21 01:10:48 server logger: read(8, "\2\0\0\0\1\0\0\0\2\0\0\0", 12) = 12
Sep 21 01:10:48 server logger: read(8, "m\0\0\0p\0\0\0", 8) = 8
Sep 21 01:10:48 server logger: close(8) = 0
Sep 21 01:10:48 server logger: setgroups(2, [109, 112]) = 0
Sep 21 01:10:48 server logger: chroot("/var/spool/postfix") = 0
Sep 21 01:10:48 server logger: chdir("/") = 0
Sep 21 01:10:48 server logger: setuid(104) = 0
Sep 21 01:10:48 server logger: getrlimit(RLIMIT_NOFILE, {rlim_cur=4*1024, rlim_max=4*1024}) = 0
Sep 21 01:10:48 server logger: setrlimit(RLIMIT_NOFILE, {rlim_cur=4*1024, rlim_max=4*1024}) = 0
Sep 21 01:10:48 server logger: epoll_create(10) = 8
Sep 21 01:10:48 server logger: fcntl(8, F_GETFD) = 0
Sep 21 01:10:48 server postfix/master[27591]: warning: process /usr/lib/postfix/trivial-rewrite pid 28079 killed by signal 11
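For reference, a trace like the one above can be captured with Postfix’s debugger_command hook. What follows is a sketch along the lines of the example in the Postfix debugging documentation, with “truss” swapped for “strace”. The master.cf column values shown are assumed defaults and may differ on your system; the only change that matters there is the added -D:

```
/etc/postfix/main.cf:
    debugger_command =
         PATH=/bin:/usr/bin:/usr/local/bin;
         (strace -p $process_id 2>&1 | logger -p mail.info) & sleep 5

/etc/postfix/master.cf:
    rewrite   unix  -       -       -       -       -       trivial-rewrite -D
```

After a “postfix reload”, the trace output ends up in the mail log via logger, which is why the lines above are prefixed with “server logger:”. Remember to remove the -D again afterwards.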

So, trivial-rewrite opens a Unix socket on /var/run/nscd/socket, does a bit of back-and-forth, then switches to the chroot, and right after a subsequent fcntl call it goes kaplooey. I disabled the chroot to see if the same issue would still occur, and it didn’t. But for the sake of security, I didn’t want it to run outside of a chroot. After all, the other servers had no issues running it in a chroot, so why should this server? Moreover, it seemed to move the issue down to other components of Postfix (“flush” and “smtpd”, among others).

It turns out I was ignoring things I shouldn’t have. Taking a quick peek in the chroot’s lib dir (/var/spool/postfix/lib/x86_64-linux-gnu/), I noticed that the libraries in use were version 2.19, i.e.:

libnss_compat-2.19.so
libnss_compat.so.2 -> libnss_compat-2.19.so

That differed from the other servers, which were using version 2.13 instead. These libraries come from the libc6 package, and in this case from Debian’s unstable “sid”:

# apt-cache madison libc6
 libc6 | 2.19-11 | http://ftp.debian.org/debian/ sid/main amd64 Packages
 libc6 | 2.13-38+deb7u4 | http://security.debian.org/ wheezy/updates/main amd64 Packages
 libc6 | 2.13-38+deb7u2 | http://ftp.debian.org/debian/ wheezy/main amd64 Packages
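Comparing the chroot’s libraries between servers by eye gets tedious, so here is a small sketch that prints the version suffixes found in a lib directory. The helper name chroot_lib_versions is hypothetical, and it assumes the libraries follow the name-VERSION.so pattern seen above (libnss_compat-2.19.so):

```shell
# Print the distinct version suffixes of versioned libraries in a directory,
# e.g. libnss_compat-2.19.so -> 2.19. Symlinks such as libnss_compat.so.2
# do not match the pattern and are skipped. (Hypothetical helper.)
chroot_lib_versions() {
    dir=$1
    for f in "$dir"/lib*-[0-9]*.so; do
        [ -e "$f" ] || continue      # glob matched nothing
        name=${f##*/}                # strip the directory part
        ver=${name##*-}              # keep what follows the last dash: 2.19.so
        echo "${ver%.so}"            # drop the .so suffix: 2.19
    done | sort -u
}

# Usage, run on each server and compare the output:
#   chroot_lib_versions /var/spool/postfix/lib/x86_64-linux-gnu
```

More than one line of output on a single server would suggest the chroot holds a mix of library versions.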

Using sid isn’t advisable, especially on production systems; there’s a reason it is called “unstable”. However, during the installation of the various components using Puppet, which also enables the sid repository, something had a dependency on the newer libc6 library.

The problem is that libc6 is used by 99.9999% of the things you install (or seemingly so). Reverting it to an older version required carefully looking at what else would be removed and would then have to be re-installed (for example, “upstart” and “mountall” would be removed, which are imperative for the server to work, unless you never reboot it again).
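One way to see what a downgrade would remove before committing to it is apt-get’s simulation mode (-s), which changes nothing on the system. The sketch below filters that output down to the packages marked for removal. The helper name list_removals is made up, and it assumes removal lines in the simulated output start with “Remv”:

```shell
# Print the names of packages a simulated apt-get run would remove.
# Reads `apt-get -s` output on stdin and keeps only the "Remv" lines.
# (Hypothetical helper.)
list_removals() {
    awk '$1 == "Remv" { print $2 }'
}

# Usage (simulation only):
#   apt-get install -s libc6=2.13-38+deb7u4 | list_removals
```

Saving that list somewhere makes it easier to re-install the same packages afterwards.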

Luckily this was a small server with just a single purpose, so there weren’t that many things installed to begin with. With that in mind, I reverted to the older version of libc6 using:

apt-get install libc6=2.13-38+deb7u4

I then re-installed the other important components it had removed as part of that process. For good measure, Postfix was fully purged and re-installed as well. Lo and behold, Postfix worked like a charm!

Image by Spider.Dog.

 
