Qualys Security Advisory

System Down: A systemd-journald exploit


========================================================================
Contents
========================================================================

Summary
CVE-2018-16864
- Analysis
- Exploitation
CVE-2018-16865
- Analysis
- Exploitation
CVE-2018-16866
- Analysis
- Exploitation
Combined Exploitation of CVE-2018-16865 and CVE-2018-16866
- amd64 Exploitation
- i386 Exploitation
Acknowledgments
Timeline

    Conversion, software version 7.0
        -- System of a Down, "Toxicity"


========================================================================
Summary
========================================================================

We discovered three vulnerabilities in systemd-journald
(https://en.wikipedia.org/wiki/Systemd):

- CVE-2018-16864 and CVE-2018-16865, two memory corruptions
  (attacker-controlled alloca()s);

- CVE-2018-16866, an information leak (an out-of-bounds read).

CVE-2018-16864 was introduced in April 2013 (systemd v203) and became
exploitable in February 2016 (systemd v230). We developed a proof of
concept for CVE-2018-16864 that gains eip control on i386.

CVE-2018-16865 was introduced in December 2011 (systemd v38) and became
exploitable in April 2013 (systemd v201). CVE-2018-16866 was introduced
in June 2015 (systemd v221) and was inadvertently fixed in August 2018.

We developed an exploit for CVE-2018-16865 and CVE-2018-16866 that
obtains a local root shell in 10 minutes on i386 and 70 minutes on
amd64, on average. We will publish our exploit in the near future.

To the best of our knowledge, all systemd-based Linux distributions are
vulnerable, but SUSE Linux Enterprise 15, openSUSE Leap 15.0, and Fedora
28 and 29 are not exploitable because their user space is compiled with
GCC's -fstack-clash-protection.

This confirms https://grsecurity.net/an_ancient_kernel_hole_is_not_closed.php:
"It should be clear that kernel-only attempts to solve [the Stack Clash]
will necessarily always be incomplete, as the real issue lies in the
lack of stack probing."


========================================================================
CVE-2018-16864
========================================================================

------------------------------------------------------------------------
Analysis
------------------------------------------------------------------------

    The waves all keep on crashing by
        -- System of a Down, "Suggestions"

We accidentally discovered CVE-2018-16864 while working on the exploit
for Mutagen Astronomy (CVE-2018-14634); if we pass several megabytes of
command-line arguments to a program that calls syslog(), then journald
crashes:

systemd-journal[472]: segfault at 7ffe9a077420 ip 00007f45f6174877 sp 00007ffe9a0773f0 error 6 in systemd-journald[7f45f6169000+3f000]

(gdb) disassemble 0x7f45f6174877 - 0x7f45f6169000
Dump of assembler code for function dispatch_message_real.4064:
...
   0x000000000000b82c <+988>:   callq  0x2bd10 <get_process_cmdline.constprop.96>
   0x000000000000b831 <+993>:   test   %eax,%eax
   0x000000000000b833 <+995>:   js     0xb8ea <dispatch_message_real.4064+1178>
   0x000000000000b839 <+1001>:  mov    -0x218(%rbp),%rbx
   0x000000000000b840 <+1008>:  test   %rbx,%rbx
   0x000000000000b843 <+1011>:  je     0xd31b <dispatch_message_real.4064+7883>
   0x000000000000b849 <+1017>:  mov    %rbx,%rdi
   0x000000000000b84c <+1020>:  callq  0x5360 <strlen@plt>
   0x000000000000b851 <+1025>:  add    $0xa,%eax
   0x000000000000b854 <+1028>:  cltq
   0x000000000000b856 <+1030>:  add    $0x1e,%rax
   0x000000000000b85a <+1034>:  and    $0xfffffffffffffff0,%rax
   0x000000000000b85e <+1038>:  sub    %rax,%rsp
   0x000000000000b861 <+1041>:  movabs $0x454e494c444d435f,%rax
   0x000000000000b86b <+1051>:  lea    0x37(%rsp),%r15
   0x000000000000b870 <+1056>:  and    $0xfffffffffffffff0,%r15
   0x000000000000b874 <+1060>:  test   %rbx,%rbx
   0x000000000000b877 <+1063>:  mov    %rax,(%r15)
   0x000000000000b87a <+1066>:  mov    $0x3d,%eax
   0x000000000000b87f <+1071>:  mov    %ax,0x8(%r15)
   0x000000000000b884 <+1076>:  lea    0x9(%r15),%rax
   0x000000000000b888 <+1080>:  je     0xb895 <dispatch_message_real.4064+1093>
   0x000000000000b88a <+1082>:  mov    %rbx,%rsi
   0x000000000000b88d <+1085>:  mov    %rax,%rdi
   0x000000000000b890 <+1088>:  callq  0x5370 <stpcpy@plt>

 538 static void dispatch_message_real(
 ...
 604         r = get_process_cmdline(ucred->pid, 0, false, &t);
 605         if (r >= 0) {
 606                 x = strjoina("_CMDLINE=", t);

 919 #define strjoina(a, ...)                                                \
 920         ({                                                              \
 921                 const char *_appendees_[] = { a, __VA_ARGS__ };         \
 922                 char *_d_, *_p_;                                        \
 923                 int _len_ = 0;                                          \
 924                 unsigned _i_;                                           \
 925                 for (_i_ = 0; _i_ < ELEMENTSOF(_appendees_) && _appendees_[_i_]; _i_++) \
 926                         _len_ += strlen(_appendees_[_i_]);              \
 927                 _p_ = _d_ = alloca(_len_ + 1);                          \
 928                 for (_i_ = 0; _i_ < ELEMENTSOF(_appendees_) && _appendees_[_i_]; _i_++) \
 929                         _p_ = stpcpy(_p_, _appendees_[_i_]);            \
 930                 *_p_ = 0;                                               \
 931                 _d_;                                                    \
 932         })

This vulnerability, an attacker-controlled alloca()
(https://wiki.sei.cmu.edu/confluence/display/c/MEM05-C.+Avoid+large+stack+allocations)
at instruction 0xb85e and line 927, was introduced in systemd v203:

commit ae018d9bc900d6355dea4af05119b49c67945184
Date:   Mon Apr 22 23:10:13 2013 -0300
...
         r = get_process_cmdline(ucred->pid, 0, false, &t);
         if (r >= 0) {
-                cmdline = strappend("_CMDLINE=", t);
+                cmdline = strappenda("_CMDLINE=", t);

(strappenda() was renamed strjoina() in systemd v219) and became
exploitable in systemd v230:

commit ac2e41f5103ce2c679089c4f8fb6be61d7caec07
Date:   Fri Feb 12 04:59:57 2016 -0800
...
    This adds a wait flag to journal_file_set_offline(), when false the
    offline is performed asynchronously in a separate thread.

------------------------------------------------------------------------
Exploitation
------------------------------------------------------------------------

    ... it's the race
    Can you break out?
        -- System of a Down, "36"

CVE-2018-16864 is similar to a Stack Clash vulnerability
(https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt), but:

- Steps 1 (Clash the stack with another memory region) and 2 (Run the
  stack pointer to the start of the stack) are not needed, because the
  attacker-controlled alloca() can be very large (several megabytes of
  command-line arguments); only Steps 3 (Jump over the stack guard page,
  into another memory region) and 4 (Smash the stack, or another memory
  region) are needed.

- In Step 4 (Smash), the alloca() is fully written to (the vulnerability
  is essentially a stpcpy(alloca(strlen(cmdline) + 1), cmdline)), and
  the stpcpy() (a "wild copy") will therefore always crash into a
  read-only or unmapped memory region:

  https://googleprojectzero.blogspot.com/2015/03/taming-wild-copy-parallel-thread.html
  https://cansecwest.com/slides/2015/Taming%20wild%20copies%20-%20Chris%20evans.pdf

We tried to asynchronously interrupt this stpcpy() before it crashes,
with a signal or a timer, but we failed because journald uses signalfd()
and timerfd_create() to handle these events synchronously.

We eventually gained control of eip (i386's instruction pointer) by
jumping into and smashing the stack of a concurrent thread (a "Parallel
Thread Corruption"):

- First, we send a large, high-priority message (LOG_CRIT or higher) to
  journald, from a process whose cmdline is small; this message forces a
  large write() (between 1MB and 2MB) to /var/log/journal/ and forces
  the creation of a short-lived thread that fsync()s the journal (the
  stack of this thread is allocated in the mmap region).

- Next, we create several processes (between 32 and 64) that write() and
  fsync() large files (between 1MB and 8MB) to /var/tmp/ (for example);
  these processes stall journald's fsync() thread and will allow us to
  win a tight race: exploit the "wild copy" before it crashes.

- Last, we send a small, low-priority message to journald, from a
  process whose cmdline is very large (roughly 128MB, the distance
  between the main stack and the mmap region); this message forces a
  very large alloca() that jumps from journald's main stack into the
  stack of the fsync() thread, and smashes a saved eip before fsync()
  returns from kernel space.
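
For contrast with the stpcpy(alloca(strlen(cmdline) + 1), cmdline)
pattern at the root of this vulnerability, the following is a minimal
sketch of a heap-based join with an explicit size cap, in the spirit of
the strappend() call that was replaced in systemd v203. It is not
systemd's code or its fix: join_field_heap() and FIELD_VALUE_MAX are
hypothetical names, and the 64KB cap is an arbitrary assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cap on the attacker-influenced value; not a systemd constant. */
#define FIELD_VALUE_MAX ((size_t)64 * 1024)

/* Returns a malloc()ed copy of prefix followed by value, or NULL if value
 * is longer than FIELD_VALUE_MAX or the allocation fails. The size of the
 * value never moves the stack pointer. The caller free()s the result. */
char *join_field_heap(const char *prefix, const char *value) {
        assert(prefix != NULL && value != NULL);

        size_t plen = strlen(prefix);
        size_t vlen = strlen(value);
        if (vlen > FIELD_VALUE_MAX)
                return NULL;

        char *out = malloc(plen + vlen + 1);
        if (out == NULL)
                return NULL;

        memcpy(out, prefix, plen);
        memcpy(out + plen, value, vlen + 1); /* copies the '\0' terminator too */
        return out;
}
```

A caller would use it as join_field_heap("_CMDLINE=", t) and treat a
NULL result as "skip this field"; an oversized cmdline then costs a
failed allocation or a dropped field, not a jump over the guard page.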

On a Debian stable (9.5), our proof of concept wins this race and gains
eip control after a dozen tries (systemd automatically restarts journald
after each crash):

systemd-journal[2195]: segfault at 41414141 ip 41414141 sp b5f3d22c error 14

Despite this initial success, we abandoned the exploitation of
CVE-2018-16864: while working on our proof of concept, we discovered two
different vulnerabilities (CVE-2018-16865, another attacker-controlled
alloca(), and CVE-2018-16866, an information leak) that are reliably
exploitable on both i386 and amd64.


========================================================================
CVE-2018-16865
========================================================================

------------------------------------------------------------------------
Analysis
------------------------------------------------------------------------

    Can you feel their haunting presence?
        -- System of a Down, "Holy Mountains"

Surprised by the heavy usage of alloca() in journald, we searched for
another attacker-controlled alloca() and found CVE-2018-16865:

1963 int journal_file_append_entry(JournalFile *f, const dual_timestamp *ts, const struct iovec iovec[], unsigned n_iovec, uint64_t *seqnum, Object **ret, uint64_t *offset) {
....
1986         items = alloca(sizeof(EntryItem) * MAX(1u, n_iovec));
1987
1988         for (i = 0; i < n_iovec; i++) {
1989                 uint64_t p;
1990                 Object *o;
1991
1992                 r = journal_file_append_data(f, iovec[i].iov_base, iovec[i].iov_len, &o, &p);
1993                 if (r < 0)
1994                         return r;
1995
1996                 xor_hash ^= le64toh(o->data.hash);
1997                 items[i].object_offset = htole64(p);
1998                 items[i].hash = o->data.hash;
1999         }

This vulnerability was introduced in systemd v38:

commit cf244689e9d1ab50082c9ddd0f3c4d1eb982badc
Date:   Thu Dec 29 15:00:57 2011 +0100
...
-        items = new(EntryItem, n_iovec);
-        if (!items)
-                return -ENOMEM;
+        items = alloca(sizeof(EntryItem) * n_iovec);

and became exploitable in systemd v201:

commit c4aa09b06f835c91cea9e021df4c3605cff2318d
Date:   Mon Apr 8 20:32:03 2013 +0200
...
-#define ENTRY_SIZE_MAX (1024*1024*64)
-#define DATA_SIZE_MAX (1024*1024*64)
...
+#define ENTRY_SIZE_MAX (1024*1024*768)
+#define DATA_SIZE_MAX (1024*1024*768)

If we send a large "native" message to /run/systemd/journal/socket:
since the maximum size of a "native" entry is 768MB, and the minimum
length of a "native" item is 3 ("A=\n"), and the size of an EntryItem
structure is 16 (a 64-bit offset and a 64-bit hash), the maximum size of
the attacker-controlled alloca() in journal_file_append_entry() is
768MB / 3 * 16 = 4GB, large enough to jump from journald's main stack
into the mmap region, even on amd64.

On amd64, as described in the "64-bit exploitation" of our Stack Clash
advisory, the randomized distance between the main stack and the mmap
region is shorter than 4GB with a probability of (approximately):

    SUM(d = 0; d < 4GB; d++) d / (16GB * 1TB) ~= 1 / 2048

------------------------------------------------------------------------
Exploitation
------------------------------------------------------------------------

    Jump (pogo, pogo, pogo, pogo, pogo, pogo, pogo)
        -- System of a Down, "Bounce"

CVE-2018-16865 is basically a simplified Stack Clash vulnerability:

- Steps 1 (Clash) and 2 (Run) of the Stack Clash are not needed, since
  the largest attacker-controlled alloca() is 4GB; only Steps 3 (Jump)
  and 4 (Smash) are needed.

- In Step 4 (Smash), the alloca() is not necessarily fully written to:
  if the size of an item is larger than 128MB (DEFAULT_MAX_SIZE_UPPER),
  then journal_file_append_data() returns an error that breaks the "for"
  loop in journal_file_append_entry() (at lines 1992-1994) and avoids a
  crash into a read-only or unmapped memory region.
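
The 4GB bound and the 1 / 2048 estimate from the Analysis can be checked
with a few lines of arithmetic. This is only a sketch of the two
formulas quoted in this advisory: max_alloca_bytes() and
jumpable_probability() are hypothetical helper names, and the sum over d
is evaluated in closed form as limit * (limit - 1) / 2.

```c
#include <assert.h>
#include <stdint.h>

#define MIB (UINT64_C(1) << 20)
#define GIB (UINT64_C(1) << 30)
#define TIB (UINT64_C(1) << 40)

/* Largest alloca() in bytes: an entry of entry_max bytes split into items
 * of min_item_len bytes, each costing item_size bytes on the stack. */
uint64_t max_alloca_bytes(uint64_t entry_max, uint64_t min_item_len,
                          uint64_t item_size) {
        assert(min_item_len > 0);
        return entry_max / min_item_len * item_size;
}

/* SUM(d = 0; d < limit; d++) d / (range_a * range_b), with the sum
 * replaced by its closed form limit * (limit - 1) / 2. */
double jumpable_probability(uint64_t limit, uint64_t range_a,
                            uint64_t range_b) {
        double sum = (double)limit * (double)(limit - 1) / 2.0;
        return sum / ((double)range_a * (double)range_b);
}
```

With the values from the text, max_alloca_bytes(768 * MIB, 3, 16)
equals 4 * GIB, and jumpable_probability(4 * GIB, 16 * GIB, TIB) is
just under 1.0 / 2048.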

We eventually transformed this vulnerability into a crude
"write-what-where" (https://cwe.mitre.org/data/definitions/123.html):

- "write-where": We jump into and smash libc's read-write segment, and
  thereby overwrite a function pointer. Unfortunately this "write-where"
  is not surgical: the stack frames of the functions called from within
  the "for" loop (in journal_file_append_entry()) smash a few kilobytes
  below our target function pointer, and therefore overwrite vital libc
  variables that may crash or deadlock journald. Consequently, we must
  sometimes shift our alloca() jump slightly, to avoid overwriting such
  vital variables.

- "write-what": We want to overwrite our target function pointer with
  the address of another function or ROP chain, but unfortunately the
  stack frames of the functions called from within the "for" loop (in
  journal_file_append_entry()) do not contain any data that we control.
  However, the 64-bit "hash" values that are written to the alloca()ted
  "items" are produced by jenkins_hashlittle2(), a noncryptographic hash
  function: we can easily find a short string (a preimage) that hashes
  to a given value (the address that will overwrite our target function
  pointer) and is also a valid_user_field() (or journal_field_valid()).
  This "write-what" restricts our "write-where" to function pointers
  whose address modulo 16 is equal to 8 (the offset of "hash" in the
  EntryItem structure).

To complete our exploit, we need the address of journald's stack pointer
before the alloca() jump, and the address of our target function pointer
in libc's read-write segment -- we need an information leak.
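
The "modulo 16 equals 8" restriction follows directly from the layout of
the items array. The sketch below uses a stand-in structure,
EntryItemSketch, that matches the description in this advisory (a 64-bit
offset followed by a 64-bit hash); it is not copied from systemd's
headers. The helper hash_field_covers() is a hypothetical name, and it
assumes that the alloca()ted array itself is 16-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for EntryItem as described in the text: 16 bytes in total,
 * with the attacker-influenced hash in the upper 8 bytes. */
typedef struct {
        uint64_t object_offset;
        uint64_t hash;
} EntryItemSketch;

/* Returns 1 if a pointer of ptr_size bytes stored at address target can
 * be fully overwritten by the hash field of some items[i], assuming the
 * items array starts at a multiple of sizeof(EntryItemSketch). */
int hash_field_covers(uint64_t target, size_t ptr_size) {
        assert(ptr_size == 4 || ptr_size == 8);

        uint64_t off = target % sizeof(EntryItemSketch);
        return off >= offsetof(EntryItemSketch, hash) &&
               off % ptr_size == 0 &&
               off + ptr_size <= sizeof(EntryItemSketch);
}
```

For 8-byte pointers only offset 8 qualifies; for 4-byte pointers both
halves of the hash qualify, that is, offsets 8 and 12.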


========================================================================
CVE-2018-16866
========================================================================

------------------------------------------------------------------------
Analysis
------------------------------------------------------------------------

    When they speak, we can peek from the windows of their mouths
        -- System of a Down, "Know"

We discovered an out-of-bounds read in journald (CVE-2018-16866), and
transformed it into an information leak:

 31 #define WHITESPACE        " \t\n\r"
...
194 size_t syslog_parse_identifier(const char **buf, char **identifier, char **pid) {
195         const char *p;
...
197         size_t l, e;
...
203         p = *buf;
204
205         p += strspn(p, WHITESPACE);
206         l = strcspn(p, WHITESPACE);
207
208         if (l <= 0 ||
209             p[l-1] != ':')
210                 return 0;
211
212         e = l;
...
240         if (strchr(WHITESPACE, p[e]))
241                 e++;
242         *buf = p + e;
243         return e;
244 }

If we send a syslog message to journald (in *buf), and if the last
character of this message is a ':' (before the '\0' terminator), then:

- at line 240, p[e] is the '\0' terminator of our message;

- at line 240, strchr(WHITESPACE, p[e]) returns a pointer to the '\0'
  terminator of the WHITESPACE string (as mentioned in man strchr: "The
  terminating null byte is considered part of the string, so that if c
  is specified as '\0', these functions return a pointer to the
  terminator.");

- at line 241, e is incremented;

- at line 242, *buf points out-of-bounds, to the first character after
  the '\0' terminator of our message;

- later, the out-of-bounds string at *buf (supposedly the body of our
  syslog message) is written (leaked) to the journal.
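
The strchr() behavior at line 240 is easy to reproduce in isolation. The
following is a minimal sketch, not journald's code:
skip_separator_vulnerable() and skip_separator_guarded() are
hypothetical names. Only the WHITESPACE definition and the two
conditions come from this advisory; the guarded condition is the one
introduced by the second August 2018 commit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WHITESPACE " \t\n\r"

/* Mirrors lines 240-241: step over one separator at p[e]. Because
 * strchr() also matches '\0' (it returns a pointer to the terminator of
 * WHITESPACE), e can end up one past the terminator of p. */
size_t skip_separator_vulnerable(const char *p, size_t e) {
        assert(e <= strlen(p)); /* e may index the terminator, no further */
        if (strchr(WHITESPACE, p[e]))
                e++;
        return e;
}

/* Same step with an explicit check for the message terminator. */
size_t skip_separator_guarded(const char *p, size_t e) {
        assert(e <= strlen(p));
        if (p[e] != '\0' && strchr(WHITESPACE, p[e]))
                e++;
        return e;
}
```

For the 9-byte message "infoleak:" and e = 9, the first function
returns 10 (past the terminator) and the second returns 9; for a
message such as "id: body" with e = 3, both return 4.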

Consequently, we can read this out-of-bounds string:

- either directly from the journal (if journald's "Storage" is
  "persistent", or "auto" and /var/log/journal/ exists), because
  journald supports extended file ACLs (Access Control Lists):

  $ id
  uid=1000(john) gid=1000(john) groups=1000(john) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

  $ ls -l /var/log/journal/*/user-$UID.journal
  -rw-r-----+ 1 root systemd-journal 8388608 Nov 20 09:35 /var/log/journal/2562d1eced654f44a3d3a217d66b9ff3/user-1000.journal

  $ getfacl /var/log/journal/*/user-$UID.journal
  ...
  user:john:r--

  $ ./infoleak

  $ journalctl --all --user --lines=1 --identifier=infoleak | hexdump -C
  ...
  00000050  2e 20 2d 2d 0a 4e 6f 76  20 32 30 20 31 36 3a 30  |. --.Nov 20 16:0|
  00000060  30 3a 33 36 20 6c 6f 63  61 6c 68 6f 73 74 2e 6c  |0:36 localhost.l|
  00000070  6f 63 61 6c 64 6f 6d 61  69 6e 20 69 6e 66 6f 6c  |ocaldomain infol|
  00000080  65 61 6b 5b 33 35 34 38  5d 3a 20 78 fb 1e 78 54  |eak[3548]: x..xT|
  00000090  7f 0a                                             |..|

- or (if journald's "Storage" is "volatile", or "auto" and
  /var/log/journal/ does not exist) from a tty that we recorded to
  /var/run/utmp, because journald writes ("walls") emergency messages
  (LOG_EMERG) to the tty of every logged-in user; our exploit records a
  tty to /var/run/utmp via an ssh connection to localhost, but other
  methods exist (for example, utempter and gnome-pty-helper):

  $ ./infoleak
  ...
  00003510  0a 07 0d 0d 0a 42 72 6f  61 64 63 61 73 74 20 6d  |.....Broadcast m|
  00003520  65 73 73 61 67 65 20 66  72 6f 6d 20 73 79 73 74  |essage from syst|
  00003530  65 6d 64 2d 6a 6f 75 72  6e 61 6c 64 40 6c 6f 63  |emd-journald@loc|
  00003540  61 6c 68 6f 73 74 2e 6c  6f 63 61 6c 64 6f 6d 61  |alhost.localdoma|
  00003550  69 6e 20 28 54 75 65 20  32 30 31 38 2d 31 31 2d  |in (Tue 2018-11-|
  00003560  32 30 20 31 36 3a 32 35  3a 34 36 20 43 53 54 29  |20 16:25:46 CST)|
  00003570  3a 0d 0d 0a 0d 0d 0a 69  6e 66 6f 6c 65 61 6b 5b  |:......infoleak[|
  00003580  33 38 37 32 5d 3a 20 78  6b a2 e1 2f 7f 0d 0d 0a  |3872]: xk../....|

This vulnerability was introduced in systemd v221:

commit ec5ff4445cca6a1d786b8da36cf6fe0acc0b94c8
Date:   Wed Jun 10 22:33:44 2015 -0700
...
-        e += strspn(p + e, WHITESPACE);
+        if (strchr(WHITESPACE, p[e]))
+                e++;

and was inadvertently fixed in August 2018:

commit a6aadf4ae0bae185dc4c414d492a4a781c80ffe5
Date:   Wed Aug 8 15:06:36 2018 +0900
...
-        if (strchr(WHITESPACE, p[e]))
-                e++;
+        e += strspn(p + e, WHITESPACE);

commit 8595102d3ddde6d25c282f965573a6de34ab4421
Date:   Fri Aug 10 11:07:54 2018 +0900
...
-        e += strspn(p + e, WHITESPACE);
+        /* Single space is used as separator */
+        if (p[e] != '\0' && strchr(WHITESPACE, p[e]))
+                e++;

------------------------------------------------------------------------
Exploitation
------------------------------------------------------------------------

    For today we will take the body parts and put them on the wall
        -- System of a Down, "Dreaming"

To leak a stack address or an mmap address from journald:

- First, we send a large native message to /run/systemd/journal/socket;
  journald mmap()s our message, and malloc()ates a large array of iovec
  structures: most of these structures point into our mmap()ed message,
  but some of them point to the stack (in dispatch_message_real()).
  The contents of this iovec array (especially the mmap and stack
  pointers) are preserved in a heap hole after free() (after journald
  finishes processing our message).

- Next, we send a large syslog message to /run/systemd/journal/dev-log;
  to receive our large message (in server_process_datagram()), journald
  realloc()ates its server buffer into the heap hole that previously
  contained the iovec array (and still contains remains of mmap and
  stack pointers).

- Last, we send a large syslog message that exploits CVE-2018-16866;
  journald receives our large message in its server buffer (in the heap
  chunk that previously contained the iovec array), and if we carefully
  choose the size of our message and position its terminating ":" in
  front of a remaining mmap or stack pointer, then we can leak this
  pointer (it is mistakenly read out-of-bounds as the body of our
  message).

From this leaked stack pointer we easily deduce journald's stack pointer
before the alloca() jump, because the distance between the two depends
only on journald's executable.

From the leaked mmap address we can deduce libc's address, but chunks of
unknown sizes are mmap()ed between the two, and we must therefore adopt
different strategies based on our target architecture (i386 or amd64).


========================================================================
Combined Exploitation of CVE-2018-16865 and CVE-2018-16866
========================================================================

    Don't leave your seats now
    Popcorn everywhere ...
        -- System of a Down, "CUBErt"

------------------------------------------------------------------------
amd64 Exploitation
------------------------------------------------------------------------

- To deduce libc's address from the leaked mmap address of our native
  message, we arrange for this message to be mmap()ed into the 2MB hole
  between ld.so's read-execute and read-only segments: from this hole's
  address we deduce ld.so's address, and hence libc's address (with help
  from ldd's output).

- If the resulting stack-to-libc distance is jumpable (if it is shorter
  than 4GB), then we proceed with our "write-what-where"; otherwise, we
  restart journald (we crash it with an alloca() of RLIMIT_STACK -- 8MB
  by default) and try again. We have a good chance of obtaining a
  jumpable stack-to-libc distance (and hence a root shell) after 2048
  tries * 2 seconds ~= 68 minutes (by default, if journald crashes less
  than 5 times within 10 seconds, it is restarted automatically by
  systemd).

- For the "write-where" part of our "write-what-where", we overwrite
  libc's __free_hook function pointer, whose address modulo 16 is always
  equal to 8 (on every amd64 distribution that we exploited).

- For the "write-what" part of our "write-what-where", we overwrite
  __free_hook with the address of libc's system() function: whenever
  journald free()s data that we control, we achieve arbitrary command
  execution.

Last-minute note: on CentOS 7, the usual function pointers in libc's
read-write segment (__free_hook, __malloc_hook, etc) are not located at
multiples of 16 plus 8. To circumvent this problem:

- First, we overwrite the "_chain" pointer of stderr's FILE structure
  with the address of our own fake FILE structure (this "_chain" pointer
  is located at a multiple of 16 plus 8, in libc's read-write segment).

- Next, we corrupt one of malloc's internal variables (also in libc's
  read-write segment).

- Last, we force a call to malloc() or free(), which detects the
  corruption of its internal variable and calls abort(), which calls
  _IO_flush_all_lockp(), which follows stderr's overwritten "_chain"
  pointer to our fake FILE structure; we eventually achieve arbitrary
  command execution by calling libc's system() via one of the function
  pointers in our fake FILE structure.

------------------------------------------------------------------------
i386 Exploitation
------------------------------------------------------------------------

Our i386 exploit is very similar to the amd64 exploit, but:

- The stack-to-libc distance is always jumpable (it is roughly 128MB).

- There is no hole between ld.so's read-execute and read-only segments.
  However, libc's address is randomized in a narrow range of 1MB and is
  therefore brute forcible: we have a good chance of correctly guessing
  libc's address after 1MB / 4KB = 256 tries * 2 seconds ~= 8 minutes.

- For the "write-where" part of our "write-what-where", we overwrite
  libc's __malloc_hook function pointer (__free_hook was never located
  at a multiple of 16 plus 8 or 12 on the i386 distributions that we
  exploited, but __malloc_hook always is).

- For the "write-what" part of our "write-what-where", we overwrite
  __malloc_hook with the address of a "mov esp, 0x89fffa5d ; ret" gadget
  (or equivalent stack pivot): since our native message can be as large
  as 768MB, we can mmap() it at 0x89fffa5d, take control of the stack,
  and return into libc's execve().


========================================================================
Acknowledgments
========================================================================

We thank systemd's developers, Red Hat Product Security, and the members
of linux-distros@openwall.


========================================================================
Timeline
========================================================================

2018-11-26: Advisory sent to Red Hat Product Security (as recommended by
https://github.com/systemd/systemd/blob/master/docs/CONTRIBUTING.md#security-vulnerability-reports).

2018-12-26: Advisory and patches sent to linux-distros@openwall.

2019-01-09: Coordinated Release Date (6:00 PM UTC).