Brian Mak <makb@xxxxxxxxxxx> writes:

> On Aug 2, 2024, at 9:16 AM, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
>
>> Brian Mak <makb@xxxxxxxxxxx> writes:
>>
>>> On Jul 31, 2024, at 7:52 PM, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
>>>
>>>> Brian Mak <makb@xxxxxxxxxxx> writes:
>>>>
>>>>> Large cores may be truncated in some scenarios, such as daemons with stop
>>>>> timeouts that are not large enough or lack of disk space. This impacts
>>>>> debuggability with large core dumps since critical information necessary
>>>>> to form a usable backtrace, such as stacks and shared library information,
>>>>> is omitted. We can mitigate the impact of core dump truncation by dumping
>>>>> smaller VMAs first, which may be more likely to contain memory for stacks
>>>>> and shared library information, thus allowing a usable backtrace to be
>>>>> formed.
>>>>
>>>> This sounds theoretical. Do you happen to have a description of a
>>>> motivating case? A situation that bit someone and resulted in a core
>>>> file that wasn't usable?
>>>>
>>>> A concrete situation would help us imagine what possible caveats there
>>>> are with sorting vmas this way.
>>>>
>>>> The most common case I am aware of is distributions setting the core
>>>> file size to 0 (ulimit -c 0).
>>>
>>> Hi Eric,
>>>
>>> Thanks for taking the time to reply. We have hit these scenarios before in
>>> practice where large cores are truncated, resulting in an unusable core.
>>>
>>> At Juniper, we have some daemons that can consume a lot of memory and,
>>> upon crashing, can produce core dumps of several GB. While dumping, we've
>>> encountered these two scenarios resulting in an unusable core:
>>>
>>> 1. Disk space is low at the time of core dump, resulting in a truncated
>>> core once the disk is full.
>>>
>>> 2. A daemon has a TimeoutStopSec option configured in its systemd unit
>>> file, where core dumping could time out (triggering a SIGKILL) if the
>>> core dump is too large and takes too long to write.
>>>
>>> In both scenarios, we see that the core file is already several GB, yet
>>> still does not contain the information necessary to form a backtrace, thus
>>> creating the need for this change. In the second scenario, we are unable to
>>> increase the timeout option due to our recovery time objective
>>> requirements.
>>>
>>>> One practical concern with this approach is that I think the ELF
>>>> specification says that program headers should be written in memory
>>>> order. So a comment on your testing to see if gdb or rr or any of
>>>> the other debuggers that read core dumps cares would be appreciated.
>>>
>>> I've already tested readelf and gdb on core dumps (truncated and whole)
>>> with this patch, and both are able to read/use these core dumps in these
>>> scenarios with a proper backtrace.
>>>
>>>>> We implement this by sorting the VMAs by dump size and dumping in that
>>>>> order.
>>>>
>>>> Since your concern is about stacks, and the kernel has information about
>>>> stacks, it might be worth using that information explicitly when sorting
>>>> vmas, instead of just assuming stacks will be small.
>>>
>>> This was originally the approach that we explored, but ultimately moved
>>> away from. We need more than just stacks to form a proper backtrace. I
>>> didn't narrow down exactly what it was that we needed because the sorting
>>> solution seemed cleaner than trying to narrow down each of the pieces that
>>> we'd need. At the very least, we need information about shared libraries
>>> (.dynamic, etc.) and stacks, but my testing showed that we need a third
>>> piece sitting in an anonymous R/W VMA, which is the point at which I
>>> stopped exploring this path. I was having a difficult time narrowing down
>>> what this last piece was.
>>>
>>>> I expect the priorities would look something like JIT-generated
>>>> executable code segments, stacks, and then heap data.
>>>>
>>>> I don't have enough information about what is causing your truncated core
>>>> dumps, so I can't guess what the actual problem is that you are fighting,
>>>> so I could be wrong on priorities.
>>>>
>>>> Though I do wonder if this might be a buggy interaction between
>>>> core dumps and something like signals, or io_uring. If it is something
>>>> other than a shortage of storage space causing your truncated core
>>>> dumps, I expect we should first debug why the coredumps are being
>>>> truncated rather than proceed directly to working around truncation.
>>>
>>> I don't really see any feasible workarounds that can be done to prevent
>>> truncation of these core dumps. Our truncated cores are also not the result
>>> of any bugs, but rather a limitation.
>>
>> Thanks, that clarifies things.
>>
>> From a quality of implementation standpoint I regret that at least some
>> pause during coredumping is inevitable. Ideally I would like to
>> minimize that pause, preserve the memory, and have a separate kernel
>> thread perform the coredumping work. That in principle would remove the
>> need for coredumps to stop when a SIGKILL is delivered and avoid the
>> issue with the systemd timeout. Plus it would allow systemd to respawn
>> the process before the coredump was complete. Getting there is in no
>> sense easy, and it would still leave the problem of not getting
>> the whole coredump when you are running out of disk space.
>
> This is actually another approach that we thought about, but decided
> against. The fact that it doesn't address running out of disk space is one
> thing, but also, if systemd were to respawn the process before the coredump
> completes, there would effectively be a doubling of that process's memory
> usage. For applications that take up a significant chunk of available
> memory on a system, effectively "duplicating" that memory consumption could
> result in a system running out of memory.
>
> With this approach, there are two options: have systemd restart the
> process immediately, or wait until core dumping is complete to restart. In
> the first option, we would be risking system stability for debuggability in
> applications with a large memory footprint. In the second option, we would
> be back to where we started (either terminate the core dumping or miss
> recovery time objectives).
>
>> The explanation of the vma sort is good. Sorting by size seems to make
>> a simple and very effective heuristic. It would be nice if that
>> explanation appeared in the change description.
>>
>> From a maintenance perspective it would be very nice to perform the vma
>> size sort unconditionally. Can you remove the knob making the size
>> sort conditional? If someone reports a regression we can add a knob
>> making the sort conditional.
>>
>> We are in the middle of the merge window right now, but I expect Kees
>> could take a simplified patch (aka no knob) after the merge window
>> closes and get it into linux-next, which should give plenty of time
>> to spot any regressions caused by sorting the vmas.
>
> Understood, will get a PATCH v2 sent out with the removal of the knob.
> I should note, though, that the conditional sorting actually gives a second
> benefit: in the event that available memory is low, allocating the
> sorted_vmas array may fail, allowing a fallback to the original behavior
> without any additional memory allocation. However, in the interest of
> maintainability, it may be better to just forgo that benefit.

The entire array of vmas is initially allocated in
fs/coredump.c:dump_vma_snapshot(), to avoid the need to hold the mmap_lock
while writing the coredump. Unless there is some technical reason I haven't
thought of, you might as well sort the vma array in place and avoid the copy
to a new array.

The logic seems to apply to all coredumps, so I would recommend placing the
sorting in dump_vma_snapshot(). Then we don't have to worry about the other
implementations of elf coredumps (ELF FDPIC) getting out of sync with the
primary coredump implementation.
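
Roughly something like this (untested, just a sketch of what I have in
mind; the comparator would then take struct core_vma_metadata elements
directly instead of going through an extra array of pointers, and
fs/coredump.c would need to include <linux/sort.h>):

static int cmp_vma_size(const void *vma_meta_lhs_ptr, const void *vma_meta_rhs_ptr)
{
        /* Sort ascending by dump_size so the small vmas (stacks, .dynamic,
         * and friends) are written before the large heap vmas.
         */
        const struct core_vma_metadata *lhs = vma_meta_lhs_ptr;
        const struct core_vma_metadata *rhs = vma_meta_rhs_ptr;

        if (lhs->dump_size < rhs->dump_size)
                return -1;
        if (lhs->dump_size > rhs->dump_size)
                return 1;
        return 0;
}

and then at the end of dump_vma_snapshot(), once cprm->vma_meta and
cprm->vma_count have been filled in:

        sort(cprm->vma_meta, cprm->vma_count, sizeof(*cprm->vma_meta),
             cmp_vma_size, NULL);

That sorts the snapshot in place, needs no extra allocation or fallback
path, and both the ELF and ELF FDPIC dumpers pick it up for free.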

Eric

> Thanks for looking this over!
>
> Best,
> Brian Mak
>
>> Eric
>>
>>
>>>
>>> Please let me know your thoughts!
>>>
>>> Best,
>>> Brian Mak
>>>
>>>> Eric
>>>>
>>>>> Signed-off-by: Brian Mak <makb@xxxxxxxxxxx>
>>>>> ---
>>>>>
>>>>> Hi all,
>>>>>
>>>>> My initial testing with a program that spawns several threads and allocates heap
>>>>> memory shows that this patch does indeed prioritize information such as stacks,
>>>>> which is crucial to forming a backtrace and debugging core dumps.
>>>>>
>>>>> Requesting for comments on the following:
>>>>>
>>>>> Are there cases where this might not necessarily prioritize dumping VMAs
>>>>> needed to obtain a usable backtrace?
>>>>>
>>>>> Thanks,
>>>>> Brian Mak
>>>>>
>>>>>  fs/binfmt_elf.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++--
>>>>>  1 file changed, 62 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
>>>>> index 19fa49cd9907..d45240b0748d 100644
>>>>> --- a/fs/binfmt_elf.c
>>>>> +++ b/fs/binfmt_elf.c
>>>>> @@ -13,6 +13,7 @@
>>>>>  #include <linux/module.h>
>>>>>  #include <linux/kernel.h>
>>>>>  #include <linux/fs.h>
>>>>> +#include <linux/debugfs.h>
>>>>>  #include <linux/log2.h>
>>>>>  #include <linux/mm.h>
>>>>>  #include <linux/mman.h>
>>>>> @@ -37,6 +38,7 @@
>>>>>  #include <linux/elf-randomize.h>
>>>>>  #include <linux/utsname.h>
>>>>>  #include <linux/coredump.h>
>>>>> +#include <linux/sort.h>
>>>>>  #include <linux/sched.h>
>>>>>  #include <linux/sched/coredump.h>
>>>>>  #include <linux/sched/task_stack.h>
>>>>> @@ -1990,6 +1992,22 @@ static void fill_extnum_info(struct elfhdr *elf, struct elf_shdr *shdr4extnum,
>>>>>  	shdr4extnum->sh_info = segs;
>>>>>  }
>>>>>
>>>>> +static int cmp_vma_size(const void *vma_meta_lhs_ptr, const void *vma_meta_rhs_ptr)
>>>>> +{
>>>>> +	const struct core_vma_metadata *vma_meta_lhs = *(const struct core_vma_metadata **)
>>>>> +		vma_meta_lhs_ptr;
>>>>> +	const struct core_vma_metadata *vma_meta_rhs = *(const struct core_vma_metadata **)
>>>>> +		vma_meta_rhs_ptr;
>>>>> +
>>>>> +	if (vma_meta_lhs->dump_size < vma_meta_rhs->dump_size)
>>>>> +		return -1;
>>>>> +	if (vma_meta_lhs->dump_size > vma_meta_rhs->dump_size)
>>>>> +		return 1;
>>>>> +	return 0;
>>>>> +}
>>>>> +
>>>>> +static bool sort_elf_core_vmas = true;
>>>>> +
>>>>>  /*
>>>>>   * Actual dumper
>>>>>   *
>>>>> @@ -2008,6 +2026,7 @@ static int elf_core_dump(struct coredump_params *cprm)
>>>>>  	struct elf_shdr *shdr4extnum = NULL;
>>>>>  	Elf_Half e_phnum;
>>>>>  	elf_addr_t e_shoff;
>>>>> +	struct core_vma_metadata **sorted_vmas = NULL;
>>>>>
>>>>>  	/*
>>>>>  	 * The number of segs are recored into ELF header as 16bit value.
>>>>> @@ -2071,11 +2090,27 @@ static int elf_core_dump(struct coredump_params *cprm)
>>>>>  	if (!dump_emit(cprm, phdr4note, sizeof(*phdr4note)))
>>>>>  		goto end_coredump;
>>>>>
>>>>> +	/* Allocate memory to sort VMAs and sort if needed. */
>>>>> +	if (sort_elf_core_vmas)
>>>>> +		sorted_vmas = kvmalloc_array(cprm->vma_count, sizeof(*sorted_vmas), GFP_KERNEL);
>>>>> +
>>>>> +	if (!ZERO_OR_NULL_PTR(sorted_vmas)) {
>>>>> +		for (i = 0; i < cprm->vma_count; i++)
>>>>> +			sorted_vmas[i] = cprm->vma_meta + i;
>>>>> +
>>>>> +		sort(sorted_vmas, cprm->vma_count, sizeof(*sorted_vmas), cmp_vma_size, NULL);
>>>>> +	}
>>>>> +
>>>>>  	/* Write program headers for segments dump */
>>>>>  	for (i = 0; i < cprm->vma_count; i++) {
>>>>> -		struct core_vma_metadata *meta = cprm->vma_meta + i;
>>>>> +		struct core_vma_metadata *meta;
>>>>>  		struct elf_phdr phdr;
>>>>>
>>>>> +		if (ZERO_OR_NULL_PTR(sorted_vmas))
>>>>> +			meta = cprm->vma_meta + i;
>>>>> +		else
>>>>> +			meta = sorted_vmas[i];
>>>>> +
>>>>>  		phdr.p_type = PT_LOAD;
>>>>>  		phdr.p_offset = offset;
>>>>>  		phdr.p_vaddr = meta->start;
>>>>> @@ -2111,7 +2146,12 @@ static int elf_core_dump(struct coredump_params *cprm)
>>>>>  	dump_skip_to(cprm, dataoff);
>>>>>
>>>>>  	for (i = 0; i < cprm->vma_count; i++) {
>>>>> -		struct core_vma_metadata *meta = cprm->vma_meta + i;
>>>>> +		struct core_vma_metadata *meta;
>>>>> +
>>>>> +		if (ZERO_OR_NULL_PTR(sorted_vmas))
>>>>> +			meta = cprm->vma_meta + i;
>>>>> +		else
>>>>> +			meta = sorted_vmas[i];
>>>>>
>>>>>  		if (!dump_user_range(cprm, meta->start, meta->dump_size))
>>>>>  			goto end_coredump;
>>>>> @@ -2128,10 +2168,26 @@ static int elf_core_dump(struct coredump_params *cprm)
>>>>>  end_coredump:
>>>>>  	free_note_info(&info);
>>>>>  	kfree(shdr4extnum);
>>>>> +	kvfree(sorted_vmas);
>>>>>  	kfree(phdr4note);
>>>>>  	return has_dumped;
>>>>>  }
>>>>>
>>>>> +#ifdef CONFIG_DEBUG_FS
>>>>> +
>>>>> +static struct dentry *elf_core_debugfs;
>>>>> +
>>>>> +static int __init init_elf_core_debugfs(void)
>>>>> +{
>>>>> +	elf_core_debugfs = debugfs_create_dir("elf_core", NULL);
>>>>> +	debugfs_create_bool("sort_elf_core_vmas", 0644, elf_core_debugfs, &sort_elf_core_vmas);
>>>>> +	return 0;
>>>>> +}
>>>>> +
>>>>> +fs_initcall(init_elf_core_debugfs);
>>>>> +
>>>>> +#endif /* CONFIG_DEBUG_FS */
>>>>> +
>>>>>  #endif /* CONFIG_ELF_CORE */
>>>>>
>>>>>  static int __init init_elf_binfmt(void)
>>>>> @@ -2144,6 +2200,10 @@ static void __exit exit_elf_binfmt(void)
>>>>>  {
>>>>>  	/* Remove the COFF and ELF loaders. */
>>>>>  	unregister_binfmt(&elf_format);
>>>>> +
>>>>> +#if defined(CONFIG_ELF_CORE) && defined(CONFIG_DEBUG_FS)
>>>>> +	debugfs_remove(elf_core_debugfs);
>>>>> +#endif
>>>>>  }
>>>>>
>>>>>  core_initcall(init_elf_binfmt);
>>>>>
>>>>> base-commit: 94ede2a3e9135764736221c080ac7c0ad993dc2d