On Mon, Aug 05, 2024 at 06:44:44PM +0000, Brian Mak wrote:
> On Aug 5, 2024, at 10:25 AM, Kees Cook <kees@xxxxxxxxxx> wrote:
>
> > On Thu, Aug 01, 2024 at 05:58:06PM +0000, Brian Mak wrote:
> >> On Jul 31, 2024, at 7:52 PM, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
> >>> One practical concern with this approach is that I think the ELF
> >>> specification says that program headers should be written in memory
> >>> order. So a comment on your testing to see whether gdb, rr, or any of
> >>> the other debuggers that read core dumps care would be appreciated.
> >>
> >> I've already tested readelf and gdb on core dumps (truncated and whole)
> >> with this patch, and they are able to read/use these core dumps in both
> >> scenarios with a proper backtrace.
> >
> > Can you compare the "rr" selftest before/after the patch? They have been
> > the most sensitive to changes to ELF, ptrace, seccomp, etc, so I've
> > tried to double-check "user visible" changes with their tree. :)
>
> Hi Kees,
>
> Thanks for your reply!
>
> Can you please give me some more information on these selftests?
> What/where are they? I'm not too familiar with rr.

I start from here whenever I go through their tests:
https://github.com/rr-debugger/rr/wiki/Building-And-Installing#tests

> > And those VMAs weren't thread stacks?
>
> Admittedly, I did do all of this exploration months ago and only have
> my notes to go off of here, but no, they should not have been thread
> stacks, since I had pulled all of them in during a "first pass".

Okay, cool. I suspect you'd already explored that, but I wanted to be
sure we didn't have an "easy to explain" solution. ;)

> > It does also feel like part of the overall problem is that systemd
> > doesn't have a way to know the process is crashing, and then creates
> > the truncation problem. (i.e. we're trying to use the kernel to work
> > around a visibility issue in userspace.)
>
> Even if systemd had visibility into the fact that a crash is happening,
> there's not much systemd can do in some circumstances. In applications
> with strict time-to-recovery limits, the process needs to restart within
> a certain time limit. We run into an issue similar to the one I raised
> in my last reply on this thread: to keep the core dump intact and
> recover, we either need to start up a new process while the old one is
> core dumping, or wait until core dumping is complete to restart.
>
> If we start up a new process while the old one is core dumping, we risk
> system stability in applications with a large memory footprint, since we
> could run out of memory from the duplication of memory consumption. If
> we wait until core dumping is complete to restart, we're in the same
> scenario as before with the core being truncated, or we miss recovery
> time objectives by waiting too long.
>
> For this reason, I wouldn't say we're using the kernel to work around a
> visibility issue or that systemd is creating the truncation problem, but
> rather that the issue exists due to limitations in how we're truncating
> cores. That being said, there might be some use in this type of
> visibility for others with less strict recovery time objectives or
> applications with a smaller memory footprint.

Yeah, this is interesting. This effectively makes the coredumping
activity rather "critical path": the replacement process can't start
until the dump has finished... hmm. It feels like there should be a way
to move the dumping process aside, but with all the VMAs still live, I
can see how this might go weird.
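As an aside, Eric's "memory order" concern above can be checked
mechanically. Below is a minimal illustrative sketch, not from the patch
under discussion: it assumes a 64-bit core file in native endianness
(and that e_phentsize matches sizeof(Elf64_Phdr)), and it walks the
core's program headers, flagging any PT_LOAD segment whose p_vaddr is
lower than its predecessor's:

/*
 * Illustrative only: check whether a core file's PT_LOAD program
 * headers are sorted by virtual address ("memory order"). Assumes a
 * 64-bit core in native endianness; error handling is minimal.
 */
#include <elf.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <corefile>\n", argv[0]);
		return 1;
	}

	FILE *f = fopen(argv[1], "rb");
	if (!f) {
		perror("fopen");
		return 1;
	}

	Elf64_Ehdr ehdr;
	if (fread(&ehdr, sizeof(ehdr), 1, f) != 1 ||
	    memcmp(ehdr.e_ident, ELFMAG, SELFMAG) != 0) {
		fprintf(stderr, "not an ELF file?\n");
		return 1;
	}

	if (fseek(f, (long)ehdr.e_phoff, SEEK_SET) != 0) {
		perror("fseek");
		return 1;
	}

	Elf64_Addr prev = 0;
	int ordered = 1;

	for (int i = 0; i < ehdr.e_phnum; i++) {
		Elf64_Phdr phdr;

		if (fread(&phdr, sizeof(phdr), 1, f) != 1) {
			fprintf(stderr, "short read on phdr %d\n", i);
			return 1;
		}
		if (phdr.p_type != PT_LOAD)
			continue;
		if (phdr.p_vaddr < prev) {
			printf("phdr %d out of order: 0x%llx after 0x%llx\n",
			       i, (unsigned long long)phdr.p_vaddr,
			       (unsigned long long)prev);
			ordered = 0;
		}
		prev = phdr.p_vaddr;
	}

	puts(ordered ? "PT_LOAD segments are in memory order"
		     : "PT_LOAD segments are NOT in memory order");
	fclose(f);
	return 0;
}

Running it against cores generated before and after the patch would show
directly whether the PT_LOAD ordering actually changes.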
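On the visibility point: one partial signal that already exists is the
CoreDumping field in /proc/<pid>/status (added in Linux 4.15). The
sketch below is hypothetical supervisor logic, not anything systemd is
known to do today; it only illustrates the "wait until core dumping is
complete to restart" option Brian describes, with the polling interval
chosen arbitrarily:

/*
 * Hypothetical supervisor sketch: poll the CoreDumping field of
 * /proc/<pid>/status (Linux >= 4.15) until the dump finishes, then
 * report that a restart is safe. Illustrative only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if dumping, 0 if not, -1 if unknown (process gone, old kernel). */
static int core_dumping(pid_t pid)
{
	char path[64], line[256];
	int dumping = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/status", (int)pid);
	f = fopen(path, "r");
	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "CoreDumping:", 12)) {
			dumping = atoi(line + 12);
			break;
		}
	}
	fclose(f);
	return dumping;
}

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}

	pid_t pid = (pid_t)atoi(argv[1]);

	/* Sleep-poll until the dump completes or the process disappears. */
	while (core_dumping(pid) == 1)
		usleep(100 * 1000);

	puts("dump finished (or process gone); safe to restart");
	return 0;
}

Of course, as Brian notes, waiting like this is exactly what blows the
recovery-time budget for large processes, so this only helps the less
time-constrained cases.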
I'll think some more about this...

-- 
Kees Cook