Brian Mak <makb@xxxxxxxxxxx> writes:

> On Aug 2, 2024, at 9:16 AM, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
>
>> Brian Mak <makb@xxxxxxxxxxx> writes:
>>
>>> On Jul 31, 2024, at 7:52 PM, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
>>>
>>>> Brian Mak <makb@xxxxxxxxxxx> writes:
>>>>
>>>>> Large cores may be truncated in some scenarios, such as daemons with stop
>>>>> timeouts that are not large enough or lack of disk space. This impacts
>>>>> debuggability with large core dumps since critical information necessary
>>>>> to form a usable backtrace, such as stacks and shared library information,
>>>>> is omitted. We can mitigate the impact of core dump truncation by dumping
>>>>> smaller VMAs first, which may be more likely to contain memory for stacks
>>>>> and shared library information, thus allowing a usable backtrace to be
>>>>> formed.
>>>>
>>>> This sounds theoretical. Do you happen to have a description of a
>>>> motivating case? A situation that bit someone and resulted in a core
>>>> file that wasn't usable?
>>>>
>>>> A concrete situation would help us imagine what possible caveats there
>>>> are with sorting vmas this way.
>>>>
>>>> The most common case I am aware of is distributions setting the core
>>>> file size to 0 (ulimit -c 0).
>>>
>>> Hi Eric,
>>>
>>> Thanks for taking the time to reply. We have hit these scenarios before in
>>> practice where large cores are truncated, resulting in an unusable core.
>>>
>>> At Juniper, we have some daemons that can consume a lot of memory and,
>>> upon crashing, can produce core dumps of several GB. While dumping, we've
>>> encountered these two scenarios resulting in an unusable core:
>>>
>>> 1. Disk space is low at the time of core dump, resulting in a truncated
>>> core once the disk is full.
>>>
>>> 2. A daemon has a TimeoutStopSec option configured in its systemd unit
>>> file, where core dumping could time out (triggering a SIGKILL) if the
>>> core dump is too large and takes too long to write.
>>>
>>> In both scenarios, we see that the core file is already several GB, yet
>>> still does not contain the information necessary to form a backtrace, thus
>>> creating the need for this change. In the second scenario, we are unable to
>>> increase the timeout option due to our recovery time objective
>>> requirements.
>>>
>>>> One practical concern with this approach is that I think the ELF
>>>> specification says that program headers should be written in memory
>>>> order. So a comment on your testing to see if gdb or rr or any of
>>>> the other debuggers that read core dumps cares would be appreciated.
>>>
>>> I've already tested readelf and gdb on core dumps (truncated and whole)
>>> with this patch, and both are able to read/use these core dumps in these
>>> scenarios with a proper backtrace.
>>>
>>>>> We implement this by sorting the VMAs by dump size and dumping in that
>>>>> order.
>>>>
>>>> Since your concern is about stacks, and the kernel has information about
>>>> stacks, it might be worth using that information explicitly when sorting
>>>> vmas, instead of just assuming stacks will be small.
>>>
>>> This was originally the approach that we explored, but ultimately moved
>>> away from. We need more than just stacks to form a proper backtrace. I
>>> didn't narrow down exactly what it was that we needed because the sorting
>>> solution seemed cleaner than trying to narrow down each of the pieces that
>>> we'd need. At the very least, we need information about shared libraries
>>> (.dynamic, etc.) and stacks, but my testing showed that we need a third
>>> piece sitting in an anonymous R/W VMA, which is the point at which I
>>> stopped exploring this path. I was having a difficult time narrowing down
>>> what this last piece was.
>>>
>>>> I expect the priorities would look something like JIT-generated
>>>> executable code segments, stacks, and then heap data.
>>>>
>>>> I don't have enough information about what is causing your truncated core
>>>> dumps, so I can't guess what the actual problem is that you are fighting,
>>>> so I could be wrong on priorities.
>>>>
>>>> Though I do wonder if this might be a buggy interaction between
>>>> core dumps and something like signals, or io_uring. If it is something
>>>> other than a shortage of storage space causing your truncated core
>>>> dumps, I expect we should first debug why the coredumps are being
>>>> truncated rather than proceed directly to working around truncation.
>>>
>>> I don't really see any feasible workarounds that can be done to prevent
>>> truncation of these core dumps. Our truncated cores are also not the result
>>> of any bugs, but rather a limitation.
>>
>> Thanks, that clarifies things.
>>
>> From a quality of implementation standpoint I regret that at least some
>> pause during coredumping is inevitable. Ideally I would like to
>> minimize that pause, preserve the memory, and have a separate kernel
>> thread perform the coredumping work. That in principle would remove the
>> need for coredumps to stop when a SIGKILL is delivered and avoid the
>> issue with the systemd timeout. Plus it would allow systemd to respawn
>> the process before the coredump was complete. Getting there is in no
>> sense easy, and it would still leave the problem of not getting
>> the whole coredump when you are running out of disk space.
>
> This is actually another approach that we thought about, but decided
> against. The fact that it doesn't address running out of disk space is one
> thing, but also, if systemd were to respawn the process before the coredump
> completes, there would effectively be a doubling of that process's memory
> usage. For applications that take up a significant chunk of available
> memory on a system, effectively "duplicating" that memory consumption could
> result in a system running out of memory.
>
> With this approach, there are two options: have systemd restart the
> process immediately, or wait until core dumping is complete to restart. In
> the first option, we would be risking system stability for debuggability in
> applications with a large memory footprint. In the second option, we would
> be back to where we started (either terminate the core dumping or miss
> recovery time objectives).
>
>> The explanation of the vma sort is good. Sorting by size seems to make
>> a simple and very effective heuristic. It would be nice if that
>> explanation appeared in the change description.
>>
>> From a maintenance perspective it would be very nice to perform the vma
>> size sort unconditionally. Can you remove the knob making the size
>> sort conditional? If someone reports a regression we can add a knob
>> making the sort conditional.
>>
>> We are in the middle of the merge window right now, but I expect Kees
>> could take a simplified patch (aka no knob) after the merge window
>> closes and get it into linux-next, which should give plenty of time
>> to spot any regressions caused by sorting the vmas.
>
> Understood, will get a PATCH v2 sent out with the removal of the knob.
> I should note, though, that the conditional sorting actually gives a second
> benefit: in the event that available memory is low, allocating the
> sorted_vmas array may fail, allowing a fallback to the original behavior
> without any additional memory allocation. However, in the interest of
> maintainability, it may be better to just forgo that benefit.

The entire array of vmas is initially allocated in
fs/coredump.c:dump_vma_snapshot(), to avoid the need to hold the mmap_lock
while writing the coredump. Unless there is some technical reason I haven't
thought of, you might as well sort the vma array in place and avoid the copy
to a new array.

The logic seems to apply to all coredumps, so I would recommend placing the
sorting in dump_vma_snapshot(). Then we don't have to worry about the other
implementations of elf coredumps (ELF FDPIC) getting out of sync with the
primary coredump implementation.
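
Roughly something like this (untested, just a sketch of what I have in
mind; the comparator would then take struct core_vma_metadata elements
directly instead of going through an extra array of pointers, and
fs/coredump.c would need to include <linux/sort.h>):

static int cmp_vma_size(const void *vma_meta_lhs_ptr, const void *vma_meta_rhs_ptr)
{
        /* Sort ascending by dump_size so the small vmas (stacks, .dynamic,
         * and friends) are written before the large heap vmas.
         */
        const struct core_vma_metadata *lhs = vma_meta_lhs_ptr;
        const struct core_vma_metadata *rhs = vma_meta_rhs_ptr;

        if (lhs->dump_size < rhs->dump_size)
                return -1;
        if (lhs->dump_size > rhs->dump_size)
                return 1;
        return 0;
}

and then at the end of dump_vma_snapshot(), once cprm->vma_meta and
cprm->vma_count have been filled in:

        sort(cprm->vma_meta, cprm->vma_count, sizeof(*cprm->vma_meta),
             cmp_vma_size, NULL);

That sorts the snapshot in place, needs no extra allocation or fallback
path, and both the ELF and ELF FDPIC dumpers pick it up for free.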

Eric

> Thanks for looking this over!
>
> Best,
> Brian Mak
>
>> Eric
>>
>>
>>>
>>> Please let me know your thoughts!
>>>
>>> Best,
>>> Brian Mak
>>>
>>>> Eric
>>>>
>>>>> Signed-off-by: Brian Mak <makb@xxxxxxxxxxx>
>>>>> ---
>>>>>
>>>>> Hi all,
>>>>>
>>>>> My initial testing with a program that spawns several threads and allocates heap
>>>>> memory shows that this patch does indeed prioritize information such as stacks,
>>>>> which is crucial to forming a backtrace and debugging core dumps.
>>>>>
>>>>> Requesting for comments on the following:
>>>>>
>>>>> Are there cases where this might not necessarily prioritize dumping VMAs
>>>>> needed to obtain a usable backtrace?
>>>>>
>>>>> Thanks,
>>>>> Brian Mak
>>>>>
>>>>>  fs/binfmt_elf.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++--
>>>>>  1 file changed, 62 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
>>>>> index 19fa49cd9907..d45240b0748d 100644
>>>>> --- a/fs/binfmt_elf.c
>>>>> +++ b/fs/binfmt_elf.c
>>>>> @@ -13,6 +13,7 @@
>>>>>  #include <linux/module.h>
>>>>>  #include <linux/kernel.h>
>>>>>  #include <linux/fs.h>
>>>>> +#include <linux/debugfs.h>
>>>>>  #include <linux/log2.h>
>>>>>  #include <linux/mm.h>
>>>>>  #include <linux/mman.h>
>>>>> @@ -37,6 +38,7 @@
>>>>>  #include <linux/elf-randomize.h>
>>>>>  #include <linux/utsname.h>
>>>>>  #include <linux/coredump.h>
>>>>> +#include <linux/sort.h>
>>>>>  #include <linux/sched.h>
>>>>>  #include <linux/sched/coredump.h>
>>>>>  #include <linux/sched/task_stack.h>
>>>>> @@ -1990,6 +1992,22 @@ static void fill_extnum_info(struct elfhdr *elf, struct elf_shdr *shdr4extnum,
>>>>>  	shdr4extnum->sh_info = segs;
>>>>>  }
>>>>>
>>>>> +static int cmp_vma_size(const void *vma_meta_lhs_ptr, const void *vma_meta_rhs_ptr)
>>>>> +{
>>>>> +	const struct core_vma_metadata *vma_meta_lhs = *(const struct core_vma_metadata **)
>>>>> +		vma_meta_lhs_ptr;
>>>>> +	const struct core_vma_metadata *vma_meta_rhs = *(const struct core_vma_metadata **)
>>>>> +		vma_meta_rhs_ptr;
>>>>> +
>>>>> +	if (vma_meta_lhs->dump_size < vma_meta_rhs->dump_size)
>>>>> +		return -1;
>>>>> +	if (vma_meta_lhs->dump_size > vma_meta_rhs->dump_size)
>>>>> +		return 1;
>>>>> +	return 0;
>>>>> +}
>>>>> +
>>>>> +static bool sort_elf_core_vmas = true;
>>>>> +
>>>>>  /*
>>>>>   * Actual dumper
>>>>>   *
>>>>> @@ -2008,6 +2026,7 @@ static int elf_core_dump(struct coredump_params *cprm)
>>>>>  	struct elf_shdr *shdr4extnum = NULL;
>>>>>  	Elf_Half e_phnum;
>>>>>  	elf_addr_t e_shoff;
>>>>> +	struct core_vma_metadata **sorted_vmas = NULL;
>>>>>
>>>>>  	/*
>>>>>  	 * The number of segs are recored into ELF header as 16bit value.
>>>>> @@ -2071,11 +2090,27 @@ static int elf_core_dump(struct coredump_params *cprm)
>>>>>  	if (!dump_emit(cprm, phdr4note, sizeof(*phdr4note)))
>>>>>  		goto end_coredump;
>>>>>
>>>>> +	/* Allocate memory to sort VMAs and sort if needed. */
>>>>> +	if (sort_elf_core_vmas)
>>>>> +		sorted_vmas = kvmalloc_array(cprm->vma_count, sizeof(*sorted_vmas), GFP_KERNEL);
>>>>> +
>>>>> +	if (!ZERO_OR_NULL_PTR(sorted_vmas)) {
>>>>> +		for (i = 0; i < cprm->vma_count; i++)
>>>>> +			sorted_vmas[i] = cprm->vma_meta + i;
>>>>> +
>>>>> +		sort(sorted_vmas, cprm->vma_count, sizeof(*sorted_vmas), cmp_vma_size, NULL);
>>>>> +	}
>>>>> +
>>>>>  	/* Write program headers for segments dump */
>>>>>  	for (i = 0; i < cprm->vma_count; i++) {
>>>>> -		struct core_vma_metadata *meta = cprm->vma_meta + i;
>>>>> +		struct core_vma_metadata *meta;
>>>>>  		struct elf_phdr phdr;
>>>>>
>>>>> +		if (ZERO_OR_NULL_PTR(sorted_vmas))
>>>>> +			meta = cprm->vma_meta + i;
>>>>> +		else
>>>>> +			meta = sorted_vmas[i];
>>>>> +
>>>>>  		phdr.p_type = PT_LOAD;
>>>>>  		phdr.p_offset = offset;
>>>>>  		phdr.p_vaddr = meta->start;
>>>>> @@ -2111,7 +2146,12 @@ static int elf_core_dump(struct coredump_params *cprm)
>>>>>  	dump_skip_to(cprm, dataoff);
>>>>>
>>>>>  	for (i = 0; i < cprm->vma_count; i++) {
>>>>> -		struct core_vma_metadata *meta = cprm->vma_meta + i;
>>>>> +		struct core_vma_metadata *meta;
>>>>> +
>>>>> +		if (ZERO_OR_NULL_PTR(sorted_vmas))
>>>>> +			meta = cprm->vma_meta + i;
>>>>> +		else
>>>>> +			meta = sorted_vmas[i];
>>>>>
>>>>>  		if (!dump_user_range(cprm, meta->start, meta->dump_size))
>>>>>  			goto end_coredump;
>>>>> @@ -2128,10 +2168,26 @@ static int elf_core_dump(struct coredump_params *cprm)
>>>>>  end_coredump:
>>>>>  	free_note_info(&info);
>>>>>  	kfree(shdr4extnum);
>>>>> +	kvfree(sorted_vmas);
>>>>>  	kfree(phdr4note);
>>>>>  	return has_dumped;
>>>>>  }
>>>>>
>>>>> +#ifdef CONFIG_DEBUG_FS
>>>>> +
>>>>> +static struct dentry *elf_core_debugfs;
>>>>> +
>>>>> +static int __init init_elf_core_debugfs(void)
>>>>> +{
>>>>> +	elf_core_debugfs = debugfs_create_dir("elf_core", NULL);
>>>>> +	debugfs_create_bool("sort_elf_core_vmas", 0644, elf_core_debugfs, &sort_elf_core_vmas);
>>>>> +	return 0;
>>>>> +}
>>>>> +
>>>>> +fs_initcall(init_elf_core_debugfs);
>>>>> +
>>>>> +#endif /* CONFIG_DEBUG_FS */
>>>>> +
>>>>>  #endif /* CONFIG_ELF_CORE */
>>>>>
>>>>>  static int __init init_elf_binfmt(void)
>>>>> @@ -2144,6 +2200,10 @@ static void __exit exit_elf_binfmt(void)
>>>>>  {
>>>>>  	/* Remove the COFF and ELF loaders. */
>>>>>  	unregister_binfmt(&elf_format);
>>>>> +
>>>>> +#if defined(CONFIG_ELF_CORE) && defined(CONFIG_DEBUG_FS)
>>>>> +	debugfs_remove(elf_core_debugfs);
>>>>> +#endif
>>>>>  }
>>>>>
>>>>>  core_initcall(init_elf_binfmt);
>>>>>
>>>>> base-commit: 94ede2a3e9135764736221c080ac7c0ad993dc2d