[Crash-utility] Re: [PATCH] Fix a "Bus error" issue caused by 'crash --osrelease' or crash loading

Tao Liu <ltao@xxxxxxxxxx> · Fri, 5 Jul 2024 11:28:52 +1200

Hi Lianbo,

On Mon, Jul 1, 2024 at 9:15 PM lijiang <lijiang@xxxxxxxxxx> wrote:
>
> Hi, Tao
> On Thu, Jun 27, 2024 at 12:08 PM Tao Liu <ltao@xxxxxxxxxx> wrote:
>>
>> Hi Lianbo,
>>
>> Thanks for the patch.
>>
>> On Fri, Jun 14, 2024 at 3:00 PM Lianbo Jiang <lijiang@xxxxxxxxxx> wrote:
>> >
>> > Sometimes, in a production environment, there are still some vmcores
>> > that are incomplete, such as partial header or the data is corrupted.
>> > When crash tool attempts to parse such vmcores, it may fail as below:
>> >
>> >   $ ./crash --osrelease vmcore
>> >   Bus error (core dumped)
>> >
>> > or
>> >
>> >   $ crash vmlinux vmcore
>> >   ...
>> >   Bus error (core dumped)
>> >  $
>> >
>> > The gdb sees that crash tool reads out a null bitmap from the header in
>> > this vmcore, when executing memcpy(), emits a SIGBUS error as below:
>> >
>> >   $ gdb /home/lijiang/src/crash/crash /tmp/core.126301
>> >   Core was generated by `./crash --osrelease /home/lijiang/src/39317/vmcore'.
>> >   Program terminated with signal SIGBUS, Bus error.
>> >   #0  __memcpy_evex_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:831
>> >   831             LOAD_ONE_SET((%rsi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7))
>> >   (gdb) bt
>> >   #0  __memcpy_evex_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:831
>> >   #1  0x0000000000651096 in read_dump_header (file=0x7ffc59ddff5f "/home/lijiang/src/39317/vmcore") at diskdump.c:820
>> >   #2  0x0000000000651cf3 in is_diskdump (file=0x7ffc59ddff5f "/home/lijiang/src/39317/vmcore") at diskdump.c:1042
>> >   #3  0x0000000000502ac9 in get_osrelease (dumpfile=0x7ffc59ddff5f "/home/lijiang/src/39317/vmcore") at main.c:1938
>> >   #4  0x00000000004fb2e8 in main (argc=3, argv=0x7ffc59dde3a8) at main.c:271
>> >   (gdb) frame 1
>> >   #1  0x0000000000651096 in read_dump_header (file=0x7ffc59ddff5f "/home/lijiang/src/39317/vmcore") at diskdump.c:820
>> >   820                   memcpy(dd->dumpable_bitmap, dd->bitmap + bitmap_len/2,
>> >   (gdb) p dd->dumpable_bitmap
>> >   $1 = 0x7f8e89800010 ""
>> >   (gdb) p dd->bitmap
>> >   $2 = 0x7f8e87e09000 ""
>> >   (gdb) p dd->bitmap + bitmap_len/2
>> >   $3 = 0x7f8e88a17000 ""
>> >   (gdb) p *(char*)(dd->bitmap+bitmap_len/2)
>> >   $4 = 0 '\000'
>> >   (gdb) p bitmap_len/2
>> >   $5 = 12640256
>> >
>> > Let's add a sanity check for such cases to avoid causing a SIGBUS error.
>> >
>> I didn't really understand the reason why the SIGBUS error occurs. Is
>> it because bitmap_len/2 is too large(12M), which exceeded the
>> dd->bitmap memory range?
>
>
> I did not see any limitations on the bitmap memory range.
>
> Looks like an unaligned memory address was accessed, please see here:
>
> #0  __memcpy_evex_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:831
>   831             LOAD_ONE_SET((%rsi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7))
>
> Maybe the vmcore is corrupted, anyway I have never seen it before this issue.
>
>>
>> Why *(char*)(dd->bitmap+bitmap_len/2) == '\0' will be considered as an
>> error, do we expect a value like "0xff" here?
>
>
> No.
> If it is a null string, no need to copy it(or it may be an exceptional string because of corrupted data).
>
I still didn't understand the check here. Assume we have a bitmap like
the following. The above code only checks if the starting byte is 0.
What about if there are 1s after the leading 0? Do these be considered
as valid bitmap and should they be copied?
... 0 1 1 0 1 1 0 1 1 1 1
   ^
... len/2

I understand that you want to skip the unnecessary memcpy to bypass
the SIGBUS error by setting a check condition. I'm worrying if the
check may filter out those valid bitmaps.

Thanks,
Tao Liu

>>
>>
>> I guess what we want here is to handle SIGBUS errors nicely right? why
>> don't we add a SIGBUS handler and process it directly?
>
>
> This may prevent a core dump file from being generated.
> And users can easily report it with a core dump file, sometimes It's hard to reproduce for us.
>
> Thanks
> Lianbo
>
>>
>> Thanks,
>> Tao Liu
>>
>> > With the patch:
>> >   $ crash -s vmlinux vmcore
>> >   crash: vmcore: not a supported file format
>> >   ...
>> >   Enter "crash -h" for details.
>> >
>> >   $ crash --osrelease vmcore
>> >   unknown
>> >
>> > Reported-by: Buland Kumar Singh <bsingh@xxxxxxxxxx>
>> > Signed-off-by: Lianbo Jiang <lijiang@xxxxxxxxxx>
>> > ---
>> >  diskdump.c | 6 ++++--
>> >  1 file changed, 4 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/diskdump.c b/diskdump.c
>> > index 1f7118cacfc6..c31eab01aa05 100644
>> > --- a/diskdump.c
>> > +++ b/diskdump.c
>> > @@ -814,10 +814,12 @@ restart:
>> >                 madvise(dd->bitmap, bitmap_len, MADV_WILLNEED);
>> >         }
>> >
>> > -       if (dump_is_partial(header))
>> > +       if (dump_is_partial(header)) {
>> > +               if (*(char*)(dd->bitmap + bitmap_len/2) == '\0')
>> > +                       goto err;
>> >                 memcpy(dd->dumpable_bitmap, dd->bitmap + bitmap_len/2,
>> >                        bitmap_len/2);
>> > -       else
>> > +       } else
>> >                 memcpy(dd->dumpable_bitmap, dd->bitmap, bitmap_len);
>> >
>> >         dd->data_offset
>> > --
>> > 2.45.1
>> > --
>> > Crash-utility mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxxxxxx
>> > To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxxxxxx
>> > https://${domain_name}/admin/lists/devel.lists.crash-utility.osci.io/
>> > Contribution Guidelines: https://github.com/crash-utility/crash/wiki
>>
--
Crash-utility mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxxxxxx
https://${domain_name}/admin/lists/devel.lists.crash-utility.osci.io/
Contribution Guidelines: https://github.com/crash-utility/crash/wiki