[Crash-utility] Re: [PATCH] Fix a "Bus error" issue caused by 'crash --osrelease' or crash loading

lijiang <lijiang@xxxxxxxxxx> · Fri, 5 Jul 2024 16:26:26 +0800

On Fri, Jul 5, 2024 at 11:41 AM lijiang <lijiang@xxxxxxxxxx> wrote:
On Fri, Jul 5, 2024 at 9:25 AM lijiang <lijiang@xxxxxxxxxx> wrote:
On Fri, Jul 5, 2024 at 7:29 AM Tao Liu <ltao@xxxxxxxxxx> wrote:
Hi Lianbo,



On Mon, Jul 1, 2024 at 9:15 PM lijiang <lijiang@xxxxxxxxxx> wrote:

>

> Hi, Tao

> On Thu, Jun 27, 2024 at 12:08 PM Tao Liu <ltao@xxxxxxxxxx> wrote:

>>

>> Hi Lianbo,

>>

>> Thanks for the patch.

>>

>> On Fri, Jun 14, 2024 at 3:00 PM Lianbo Jiang <lijiang@xxxxxxxxxx> wrote:

>> >

>> > Sometimes, in a production environment, there are still some vmcores

>> > that are incomplete, such as partial header or the data is corrupted.

>> > When crash tool attempts to parse such vmcores, it may fail as below:

>> >

>> >   $ ./crash --osrelease vmcore

>> >   Bus error (core dumped)

>> >

>> > or

>> >

>> >   $ crash vmlinux vmcore

>> >   ...

>> >   Bus error (core dumped)

>> >  $

>> >

>> > The gdb sees that crash tool reads out a null bitmap from the header in

>> > this vmcore, when executing memcpy(), emits a SIGBUS error as below:

>> >

>> >   $ gdb /home/lijiang/src/crash/crash /tmp/core.126301

>> >   Core was generated by `./crash --osrelease /home/lijiang/src/39317/vmcore'.

>> >   Program terminated with signal SIGBUS, Bus error.

>> >   #0  __memcpy_evex_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:831

>> >   831             LOAD_ONE_SET((%rsi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7))

>> >   (gdb) bt

>> >   #0  __memcpy_evex_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:831

>> >   #1  0x0000000000651096 in read_dump_header (file=0x7ffc59ddff5f "/home/lijiang/src/39317/vmcore") at diskdump.c:820

>> >   #2  0x0000000000651cf3 in is_diskdump (file=0x7ffc59ddff5f "/home/lijiang/src/39317/vmcore") at diskdump.c:1042

>> >   #3  0x0000000000502ac9 in get_osrelease (dumpfile=0x7ffc59ddff5f "/home/lijiang/src/39317/vmcore") at main.c:1938

>> >   #4  0x00000000004fb2e8 in main (argc=3, argv=0x7ffc59dde3a8) at main.c:271

>> >   (gdb) frame 1

>> >   #1  0x0000000000651096 in read_dump_header (file=0x7ffc59ddff5f "/home/lijiang/src/39317/vmcore") at diskdump.c:820

>> >   820                   memcpy(dd->dumpable_bitmap, dd->bitmap + bitmap_len/2,

>> >   (gdb) p dd->dumpable_bitmap

>> >   $1 = 0x7f8e89800010 ""

>> >   (gdb) p dd->bitmap

>> >   $2 = 0x7f8e87e09000 ""

>> >   (gdb) p dd->bitmap + bitmap_len/2

>> >   $3 = 0x7f8e88a17000 ""

>> >   (gdb) p *(char*)(dd->bitmap+bitmap_len/2)

>> >   $4 = 0 '\000'

>> >   (gdb) p bitmap_len/2

>> >   $5 = 12640256

>> >

>> > Let's add a sanity check for such cases to avoid causing a SIGBUS error.

>> >

>> I didn't really understand the reason why the SIGBUS error occurs. Is

>> it because bitmap_len/2 is too large(12M), which exceeded the

>> dd->bitmap memory range?

>

>

> I did not see any limitations on the bitmap memory range.

>

> Looks like an unaligned memory address was accessed, please see here:

>

> #0  __memcpy_evex_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:831

>   831             LOAD_ONE_SET((%rsi), PAGE_SIZE, %VMM(4), %VMM(5), %VMM(6), %VMM(7))

>

> Maybe the vmcore is corrupted, anyway I have never seen it before this issue.

>

>>

>> Why *(char*)(dd->bitmap+bitmap_len/2) == '\0' will be considered as an

>> error, do we expect a value like "0xff" here?

>

>

> No.

> If it is a null string, no need to copy it(or it may be an exceptional string because of corrupted data).

>

I still didn't understand the check here. Assume we have a bitmap like

the following. The above code only checks if the starting byte is 0.

No, as above mentioned,  the *(char*)(dd->bitmap+bitmap_len/2) == '\0' is checking if it is a null string, not a 0 value. They are different.


What about if there are 1s after the leading 0? Do these be considered

as valid bitmap and should they be copied?

... 0 1 1 0 1 1 0 1 1 1 1

   ^

This is not a null string.  The value of checking *(char*)(dd->bitmap+bitmap_len/2) == '\0' is 'false', not 'true'.
 

... len/2



I understand that you want to skip the unnecessary memcpy to bypass
 
As I mentioned in the last email, If it is a null string('\0'), no need to copy it. Let's differentiate between 0 and '\0', they are not the same.

Thanks
Lianbo


the SIGBUS error by setting a check condition. I'm worrying if the

check may filter out those valid bitmaps.

Think it over again, looks like your concern is correct, Tao.

Let's see if there is a better way to fix the current issue.

If the header is partial, the original intention might be to copy the bitmap starting from dd->bitmap(instead of dd->bitmap + bitmap_len/2), and the copy length is bitmap_len/2, just like this:

diff --git a/diskdump.c b/diskdump.c
index 1f7118cacfc6..8b7ec6e19c96 100644
--- a/diskdump.c
+++ b/diskdump.c
@@ -815,8 +815,7 @@ restart:
        }
 
        if (dump_is_partial(header))
-               memcpy(dd->dumpable_bitmap, dd->bitmap + bitmap_len/2,
-                      bitmap_len/2);
+               memcpy(dd->dumpable_bitmap, dd->bitmap, bitmap_len/2);
        else
                memcpy(dd->dumpable_bitmap, dd->bitmap, bitmap_len);
 
But, I can not track the history of these changes according to the github log. Just a guess.

With the above patch, it can print the result as below:

$ ./crash --osrelease ../vmcore
4.18.0-513.24.1.el8_9.x86_64


Thanks
Lianbo


Thanks
Lianbo





Thanks,

Tao Liu



>>

>>

>> I guess what we want here is to handle SIGBUS errors nicely right? why

>> don't we add a SIGBUS handler and process it directly?

>

>

> This may prevent a core dump file from being generated.

> And users can easily report it with a core dump file, sometimes It's hard to reproduce for us.

>

> Thanks

> Lianbo

>

>>

>> Thanks,

>> Tao Liu

>>

>> > With the patch:

>> >   $ crash -s vmlinux vmcore

>> >   crash: vmcore: not a supported file format

>> >   ...

>> >   Enter "crash -h" for details.

>> >

>> >   $ crash --osrelease vmcore

>> >   unknown

>> >

>> > Reported-by: Buland Kumar Singh <bsingh@xxxxxxxxxx>

>> > Signed-off-by: Lianbo Jiang <lijiang@xxxxxxxxxx>

>> > ---

>> >  diskdump.c | 6 ++++--

>> >  1 file changed, 4 insertions(+), 2 deletions(-)

>> >

>> > diff --git a/diskdump.c b/diskdump.c

>> > index 1f7118cacfc6..c31eab01aa05 100644

>> > --- a/diskdump.c

>> > +++ b/diskdump.c

>> > @@ -814,10 +814,12 @@ restart:

>> >                 madvise(dd->bitmap, bitmap_len, MADV_WILLNEED);

>> >         }

>> >

>> > -       if (dump_is_partial(header))

>> > +       if (dump_is_partial(header)) {

>> > +               if (*(char*)(dd->bitmap + bitmap_len/2) == '\0')

>> > +                       goto err;

>> >                 memcpy(dd->dumpable_bitmap, dd->bitmap + bitmap_len/2,

>> >                        bitmap_len/2);

>> > -       else

>> > +       } else

>> >                 memcpy(dd->dumpable_bitmap, dd->bitmap, bitmap_len);

>> >

>> >         dd->data_offset

>> > --

>> > 2.45.1

>> > --

>> > Crash-utility mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxxxxxx

>> > To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxxxxxx

>> > https://${domain_name}/admin/lists/devel.lists.crash-utility.osci.io/

>> > Contribution Guidelines: https://github.com/crash-utility/crash/wiki

>>






--
Crash-utility mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxxxxxx
https://${domain_name}/admin/lists/devel.lists.crash-utility.osci.io/
Contribution Guidelines: https://github.com/crash-utility/crash/wiki