On Fri, Jul 5, 2024 at 7:29 AM Tao Liu <ltao@xxxxxxxxxx> wrote:
Hi Lianbo,
On Mon, Jul 1, 2024 at 9:15 PM lijiang <lijiang@xxxxxxxxxx> wrote:
>
> Hi, Tao
> On Thu, Jun 27, 2024 at 12:08 PM Tao Liu <ltao@xxxxxxxxxx> wrote:
>>
>> Hi Lianbo,
>>
>> Thanks for the patch.
>>
>> On Fri, Jun 14, 2024 at 3:00 PM Lianbo Jiang <lijiang@xxxxxxxxxx> wrote:
>> >
>> > Sometimes, in a production environment, there are still some vmcores
>> > that are incomplete, such as partial header or the data is corrupted.
>> > When crash tool attempts to parse such vmcores, it may fail as below:
>> >
>> > $ ./crash --osrelease vmcore
>> > Bus error (core dumped)
>> >
>> > or
>> >
>> > $ crash vmlinux vmcore
>> > ...
>> > Bus error (core dumped)
>> > $
>> >
>> > The gdb sees that crash tool reads out a null bitmap from the header in
>> > this vmcore, when executing memcpy(), emits a SIGBUS error as below:
>> >
>> > $ gdb /home/lijiang/src/crash/crash /tmp/core.126301
>> > Core was generated by `./crash --osrelease /home/lijiang/src/39317/vmcore'.
>> > Program terminated with signal SIGBUS, Bus error.
>> > #0 __memcpy_evex_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:831
>> > 831 LOAD_ONE_SET((%rsi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7))
>> > (gdb) bt
>> > #0 __memcpy_evex_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:831
>> > #1 0x0000000000651096 in read_dump_header (file=0x7ffc59ddff5f "/home/lijiang/src/39317/vmcore") at diskdump.c:820
>> > #2 0x0000000000651cf3 in is_diskdump (file=0x7ffc59ddff5f "/home/lijiang/src/39317/vmcore") at diskdump.c:1042
>> > #3 0x0000000000502ac9 in get_osrelease (dumpfile=0x7ffc59ddff5f "/home/lijiang/src/39317/vmcore") at main.c:1938
>> > #4 0x00000000004fb2e8 in main (argc=3, argv=0x7ffc59dde3a8) at main.c:271
>> > (gdb) frame 1
>> > #1 0x0000000000651096 in read_dump_header (file=0x7ffc59ddff5f "/home/lijiang/src/39317/vmcore") at diskdump.c:820
>> > 820 memcpy(dd->dumpable_bitmap, dd->bitmap + bitmap_len/2,
>> > (gdb) p dd->dumpable_bitmap
>> > $1 = 0x7f8e89800010 ""
>> > (gdb) p dd->bitmap
>> > $2 = 0x7f8e87e09000 ""
>> > (gdb) p dd->bitmap + bitmap_len/2
>> > $3 = 0x7f8e88a17000 ""
>> > (gdb) p *(char*)(dd->bitmap+bitmap_len/2)
>> > $4 = 0 '\000'
>> > (gdb) p bitmap_len/2
>> > $5 = 12640256
>> >
>> > Let's add a sanity check for such cases to avoid causing a SIGBUS error.
>> >
>> I didn't really understand the reason why the SIGBUS error occurs. Is
>> it because bitmap_len/2 is too large(12M), which exceeded the
>> dd->bitmap memory range?
>
>
> I did not see any limitations on the bitmap memory range.
>
> Looks like an unaligned memory address was accessed, please see here:
>
> #0 __memcpy_evex_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:831
> 831 LOAD_ONE_SET((%rsi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7))
>
> Maybe the vmcore is corrupted, anyway I have never seen it before this issue.
>
>>
>> Why *(char*)(dd->bitmap+bitmap_len/2) == '\0' will be considered as an
>> error, do we expect a value like "0xff" here?
>
>
> No.
> If it is a null string, no need to copy it(or it may be an exceptional string because of corrupted data).
>
I still didn't understand the check here. Assume we have a bitmap like
the following. The above code only checks if the starting byte is 0.
No, as above mentioned, the *(char*)(dd->bitmap+bitmap_len/2) == '\0' is checking if it is a null string, not a 0 value. They are different.
What about if there are 1s after the leading 0? Do these be considered
as valid bitmap and should they be copied?
... 0 1 1 0 1 1 0 1 1 1 1
^
This is not a null string. The value of checking *(char*)(dd->bitmap+bitmap_len/2) == '\0' is 'false', not 'true'.
... len/2
I understand that you want to skip the unnecessary memcpy to bypass
As I mentioned in the last email, If it is a null string('\0'), no need to copy it. Let's differentiate between 0 and '\0', they are not the same.
Thanks
Lianbo
the SIGBUS error by setting a check condition. I'm worrying if the
check may filter out those valid bitmaps.
Thanks,
Tao Liu
>>
>>
>> I guess what we want here is to handle SIGBUS errors nicely right? why
>> don't we add a SIGBUS handler and process it directly?
>
>
> This may prevent a core dump file from being generated.
> And users can easily report it with a core dump file, sometimes It's hard to reproduce for us.
>
> Thanks
> Lianbo
>
>>
>> Thanks,
>> Tao Liu
>>
>> > With the patch:
>> > $ crash -s vmlinux vmcore
>> > crash: vmcore: not a supported file format
>> > ...
>> > Enter "crash -h" for details.
>> >
>> > $ crash --osrelease vmcore
>> > unknown
>> >
>> > Reported-by: Buland Kumar Singh <bsingh@xxxxxxxxxx>
>> > Signed-off-by: Lianbo Jiang <lijiang@xxxxxxxxxx>
>> > ---
>> > diskdump.c | 6 ++++--
>> > 1 file changed, 4 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/diskdump.c b/diskdump.c
>> > index 1f7118cacfc6..c31eab01aa05 100644
>> > --- a/diskdump.c
>> > +++ b/diskdump.c
>> > @@ -814,10 +814,12 @@ restart:
>> > madvise(dd->bitmap, bitmap_len, MADV_WILLNEED);
>> > }
>> >
>> > - if (dump_is_partial(header))
>> > + if (dump_is_partial(header)) {
>> > + if (*(char*)(dd->bitmap + bitmap_len/2) == '\0')
>> > + goto err;
>> > memcpy(dd->dumpable_bitmap, dd->bitmap + bitmap_len/2,
>> > bitmap_len/2);
>> > - else
>> > + } else
>> > memcpy(dd->dumpable_bitmap, dd->bitmap, bitmap_len);
>> >
>> > dd->data_offset
>> > --
>> > 2.45.1
>> > --
>> > Crash-utility mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxxxxxx
>> > To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxxxxxx
>> > https://${domain_name}/admin/lists/devel.lists.crash-utility.osci.io/
>> > Contribution Guidelines: https://github.com/crash-utility/crash/wiki
>>
-- Crash-utility mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxxxxxx To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxxxxxx https://${domain_name}/admin/lists/devel.lists.crash-utility.osci.io/ Contribution Guidelines: https://github.com/crash-utility/crash/wiki