[GIT PULL] Fix idmapd segfault on musl libc systems.

Chad Joan <chadjoan@xxxxxxxxx> · Tue, 18 Aug 2020 17:25:36 -0400

Hello,

The changes (described below) are available in a series of commits on
my fork of the nfs-utils repository at
https://github.com/chadjoan/nfs-utils/commits/bug344-passwd

Background:

I reported a bug (#344) to the linux-nfs bugzilla earlier this month
(https://bugzilla.linux-nfs.org/show_bug.cgi?id=344) where the idmapd
service would fail to start on one of my machines and printed a
"general protection fault" message in my system log:

Aug  3 07:13:21 l-andianov rpc.idmapd[29843]: Setting log level to 7
Aug  3 07:13:21 l-andianov rpc.idmapd[29843]: libnfsidmap: using
domain: ozymandias
Aug  3 07:13:21 l-andianov rpc.idmapd[29843]: libnfsidmap: Realms
list: 'OZYMANDIAS'
Aug  3 07:13:21 l-andianov rpc.idmapd[29843]: libnfsidmap: processing
'Method' list
Aug  3 07:13:21 l-andianov kernel: traps: rpc.idmapd[29843] general
protection fault ip:7fb4bc75c0cd sp:7ffe56527540 error:0 in
libc.so[7fb4bc748000+5f000]
Aug  3 07:13:21 l-andianov rpc.idmapd[29843]: libnfsidmap: loaded
plugin /usr/lib/libnfsidmap/static.so for method static
Aug  3 07:13:21 l-andianov /etc/init.d/rpc.idmapd[29819]: ERROR:
rpc.idmapd failed to start
Aug  3 07:13:21 l-andianov /etc/init.d/nfs[29818]: ERROR: cannot start
nfs as rpc.idmapd would not start

Long-story-short: I investigated, found out it was because musl libc
returns `-1` from `sysconf(_SC_GETPW_R_SIZE_MAX)` calls (which is
allowable by POSIX standards), and idmapd's code wasn't written to
handle this possibility, so a buffer overrun/overflow happened.

Resolution:

I have written commits that solve this problem wherever it might
occur, so anywhere `sysconf` was used to size memory buffers before
calling `get**nam_r` or `get***id_r`. It may also fix other bugs or
inconsistencies that were found in the affected code, as well as
improving the comprehensiveness of error detection.

Highlights:

- Memory for `get**nam_r` and `get***id_r` functions is no longer
allocated based on the return value of sysconf(...), as it can return
misleading values or -1 in this use-case.
- Introduces passwd_query.c module that handles buffer allocation and
adjustment for those functions, as well as other edge-cases. This does
increase complexity, but it allows some memory management and
exception handling complexity to be shifted away from feature and
retrieval logic.
- The passwd_query.c module was (manually) tested in isolation to get
more coverage than what I would be able to obtain by running idmapd.
There is a program for that (pwgrp_test.c) in this repository:
https://github.com/chadjoan/passwd_query
- Calls to `get**nam_r` and `get***id_r` are now very likely to use
stack allocation instead of `malloc`, though this logic will fall back
to `malloc` if the demands on buffer size are large (ex: absurdly long
comments in passwd file or very long group member lists).
- Furthermore, if `get**nam_r` and `get***id_r` still need more
buffer, `passwd_query.c` will use a `realloc`-loop until the buffer is
large enough. This results in an interface that does the same thing
but never returns ERANGE - the error does not have to be handled
because it does not happen.
- Any code that called the `get**nam_r` and `get***id_r` will now log
error messages for every possible error code that those functions
return. In the event of a non-standard error, it will attempt to get
error text from `strerror`, and information about what was being
looked-up or mapped will still be printed.
- In the `gums.c` plugin, `translate_to_uid` and `translate_to_gid`
did not have the same exception handling logic. For example,
`translate_to_gid` had a `realloc`-loop, while `translate_to_uid` did
not. Since this change has them all use `passwd_query.c`
functionality, they should be more robust and equally robust.
- In some places the code calling `get**nam_r` and `get***id_r` would
loop after EINTR error codes (retry on interrupt), while other places
would not. These differences were not justified with comments, so I
gave the EINTR retry looping behavior to everything that queries
`passwd` or `group` information. I did this by convention instead of
putting the functionality in `passwd_query.c`, so it should be pretty
easy to back it off selectively if there are places where this
shouldn't be done.
- There was a potential memory leak in `regex.c`: `regex_getpwnam`
would not free the `localname` string if it completed successfully
(the abortive cases were fine, though). That can no longer happen, and
it is written in a way that makes it clear that the `free()` will
always happen if-and-only-if `malloc()` happened. It will also attempt
to use a small-ish fixed-size stack allocated buffer whenever the
string fits.
- Since the code related to `get**nam_r` and `get***id_r` in `nss.c`
was already being refactored to some extent, I made it trivial to see
if `free()` was called if something was allocated. The `malloc()` call
was hidden in other functions in this case, so this probably won't
look as good as what happened in `regex.c`, but it should still help.
As far as I can tell, this code didn't leak memory beforehand; so this
particular refactoring just makes the code more self-documenting. I
like to make things clear if I have any doubt.
- This was done in a series of separate commits, each of which should
leave the codebase in a working state.

Process, testing, and commit layout:

The changes to `static.c` and `nfsidmap.c` were done alongside the
addition of `passwd_query.c`, and are effectively the starting point
for this changeset. This is the stuff I could test easily on my own
machine. I could not run the `regex.c` and `nss.c` plugins, but I did
make sure they compiled, and each of those was given their own
commits. The `gums.c` plugin was something I didn't have dependencies
installed for, so compilation was difficult. I still managed to obtain
some confidence with it by commenting out all of the code I didn't
change and then compiling what I did change, and that worked. I also
gave `gums.c` its own commit.

I normally like to write unittests for my code so that the
(regression) testing can be performed automatically by others, and
give you greater confidence in the maintainability of the code. In
this case, I think that this would require some kind of mock libc
implementation. I couldn't find that and I don't have one. So for this
changeset I fell back on testing my code by hand whenever I made
changes. I hope this will suffice.

Have a great day!
- Chad