Re: [RFC PATCH] binfmt_elf: Dump smaller VMAs first in ELF cores

"Eric W. Biederman" <ebiederm@xxxxxxxxxxxx> · Fri, 02 Aug 2024 11:16:13 -0500

Brian Mak <makb@xxxxxxxxxxx> writes:

> On Jul 31, 2024, at 7:52 PM, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
>
>> Brian Mak <makb@xxxxxxxxxxx> writes:
>> 
>>> Large cores may be truncated in some scenarios, such as daemons with stop
>>> timeouts that are not large enough or lack of disk space. This impacts
>>> debuggability with large core dumps since critical information necessary to
>>> form a usable backtrace, such as stacks and shared library information, is
>>> omitted. We can mitigate the impact of core dump truncation by dumping
>>> smaller VMAs first, which may be more likely to contain memory for stacks
>>> and shared library information, thus allowing a usable backtrace to be
>>> formed.
>> 
>> This sounds theoretical.  Do you happen to have a description of a
>> motivating case?  A situation that bit someone and resulted in a core
>> file that wasn't usable?
>> 
>> A concrete situation would help us imagine what possible caveats there
>> are with sorting vmas this way.
>> 
>> The most common case I am aware of is distributions setting the core
>> file size to 0 (ulimit -c 0).
>
> Hi Eric,
>
> Thanks for taking the time to reply. We have hit these scenarios before in
> practice where large cores are truncated, resulting in an unusable core.
>
> At Juniper, we have some daemons that can consume a lot of memory and,
> upon crashing, can produce core dumps of several GBs. While dumping, we've
> encountered these two scenarios resulting in an unusable core:
>
> 1. Disk space is low at the time of core dump, resulting in a truncated
> core once the disk is full.
>
> 2. A daemon has a TimeoutStopSec option configured in its systemd unit
> file, and the stop timeout can expire (triggering a SIGKILL) if the
> core dump is too large and is taking too long to write.
>
> In both scenarios, we see that the core file is already several GB, and
> still does not contain the information necessary to form a backtrace, thus
> creating the need for this change. In the second scenario, we are unable to
> increase the timeout option due to our recovery time objective
> requirements.
>
>> One practical concern with this approach is that I think the ELF
>> specification says that program headers should be written in memory
>> order.  So a comment on your testing to see if gdb or rr or any of
>> the other debuggers that read core dumps cares would be appreciated.
>
> I've already tested readelf and gdb on core dumps (truncated and whole)
> with this patch and both are able to read/use these core dumps in these
> scenarios with a proper backtrace.
>
>>> We implement this by sorting the VMAs by dump size and dumping in that
>>> order.
>> 
>> Since your concern is about stacks, and the kernel has information about
>> stacks it might be worth using that information explicitly when sorting
>> vmas, instead of just assuming stacks will be small.
>
> This was originally the approach that we explored, but ultimately moved
> away from. We need more than just stacks to form a proper backtrace. I
> didn't narrow down exactly what it was that we needed because the sorting
> solution seemed to be cleaner than trying to narrow down each of these
> pieces that we'd need. At the very least, we need information about shared
> libraries (.dynamic, etc.) and stacks, but my testing showed that we need a
> third piece sitting in an anonymous R/W VMA, which is the point at which I
> stopped exploring this path. I was having a difficult time narrowing down
> what this last piece was.
>
>> I expect the priorities would look something like JIT-generated
>> executable code segments, stacks, and then heap data.
>> 
>> I don't have enough information about what is causing your truncated core
>> dumps, so I can't guess what the actual problem you are fighting is,
>> so I could be wrong on priorities.
>> 
>> Though I do wonder if this might be a buggy interaction between
>> core dumps and something like signals, or io_uring.  If it is something
>> other than a shortage of storage space causing your truncated core
>> dumps I expect we should first debug why the coredumps are being
>> truncated rather than proceed directly to working around truncation.
>
> I don't really see any feasible workarounds that can be done for preventing
> truncation of these core dumps. Our truncated cores are also not the result
> of any bugs, but rather a limitation.

Thanks, that clarifies things.
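
For anyone following along, the ordering being discussed (dump smaller
VMAs first) can be sketched in userspace C roughly as below. This is only
an illustration and not the patch itself: struct vma_meta, its fields, and
sort_vmas_by_dump_size() are hypothetical stand-ins, and the tie-break on
start address is an assumption added here just to make the order
deterministic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for per-VMA dump metadata (not a kernel struct). */
struct vma_meta {
	unsigned long start;      /* first address of the mapping */
	unsigned long end;        /* one past the last address */
	unsigned long dump_size;  /* bytes of this mapping that would be dumped */
};

/* Order by dump_size ascending; break ties by start address (assumption). */
static int cmp_dump_size(const void *a, const void *b)
{
	const struct vma_meta *x = a;
	const struct vma_meta *y = b;

	if (x->dump_size != y->dump_size)
		return x->dump_size < y->dump_size ? -1 : 1;
	if (x->start != y->start)
		return x->start < y->start ? -1 : 1;
	return 0;
}

/* Sort the array in place so the smallest dumps come first. */
void sort_vmas_by_dump_size(struct vma_meta *vmas, size_t n)
{
	assert(vmas != NULL || n == 0);
	if (n > 1)
		qsort(vmas, n, sizeof(*vmas), cmp_dump_size);
}
```

With an array holding a multi-GB heap mapping, a stack, and a small
library data mapping, the two small entries sort ahead of the heap, so a
dump that is cut short would lose the tail of the heap rather than the
stack.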