On Thu, Dec 7, 2023 at 7:58 AM Genes Lists <lists@xxxxxxxxxxxx> wrote:
>
> On 12/7/23 09:42, Guoqing Jiang wrote:
> > Hi,
> >
> > On 12/7/23 21:55, Genes Lists wrote:
> >> On 12/7/23 08:30, Bagas Sanjaya wrote:
> >>> On Thu, Dec 07, 2023 at 08:10:04AM -0500, Genes Lists wrote:
> >>>> I have not had chance to git bisect this but since it happened in
> >>>> stable I thought it was important to share sooner than later.
> >>>>
> >>>> One possibly relevant commit between 6.6.3 and 6.6.4 could be:
> >>>>
> >>>> commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
> >>>> Author: Song Liu <song@xxxxxxxxxx>
> >>>> Date: Fri Nov 17 15:56:30 2023 -0800
> >>>>
> >>>>     md: fix bi_status reporting in md_end_clone_io
> >>>>
> >>>> log attached shows page_fault_oops.
> >>>> Machine was up for 3 days before crash happened.
> >
> > Could you decode the oops (I can't find it in lore for some reason)
> > ([1])? And can it be reproduced reliably? If so, pls share the
> > reproduce step.
> >
> > [1]. https://lwn.net/Articles/592724/
> >
> > Thanks,
> > Guoqing
>
> - reproducing
>   An rsync runs 2 x / day. It copies to this server from another. The
>   copy is from a (large) top level directory. On the 3rd day after booting
>   6.6.4, the second of these rysnc's triggered the oops. I need to do
>   more testing to see if I can reliably reproduce. I have not seen this
>   oops on earlier stable kernels.
>
> - decoding oops with scripts/decode_stacktrace.sh had errors :
>   readelf: Error: Not an ELF file - it has the wrong magic bytes at
>   the start
>
>   It appears that the decode script doesn't handle compressed modules.
>   I changed the readelf line to decompress first. This fixes the above
>   script complaint and the result is attached.

I probably missed something, but I really don't think the commit
(2c975b0b8b11f1ffb1ed538609e2c89d8abf800e) could trigger this issue.

From the trace:

kernel: RIP: 0010:update_io_ticks+0x2c/0x60
  => 2a:* f0 48 0f b1 77 28 lock cmpxchg %rsi,0x28(%rdi) << trapped here.
[...]
kernel: Call Trace:
kernel:  <TASK>
kernel:  ? __die+0x23/0x70
kernel:  ? page_fault_oops+0x171/0x4e0
kernel:  ? exc_page_fault+0x175/0x180
kernel:  ? asm_exc_page_fault+0x26/0x30
kernel:  ? update_io_ticks+0x2c/0x60
kernel:  bdev_end_io_acct+0x63/0x160
kernel:  md_end_clone_io+0x75/0xa0 <<< change in md_end_clone_io

The commit only changes how we update bi_status. But bi_status is not
used or checked anywhere between md_end_clone_io and the trap
(lock cmpxchg). Did I miss something?

Given that the issue takes a very long time to reproduce, maybe we
already had the issue before 6.6.4?

Thanks,
Song
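
P.S. On the readelf "wrong magic bytes" error quoted above: the check that
fails there can be illustrated with a minimal sketch. It is not the change
made to scripts/decode_stacktrace.sh. The names detect_format and
load_elf_bytes are hypothetical, and it assumes the module file is either a
plain ELF or compressed with gzip, xz, or zstd. zstd is only detected, not
decompressed, since older Python standard libraries have no zstd decoder.

```python
import gzip
import lzma

# Leading magic bytes of the formats a module file may be stored in.
# (Assumption: only these four cases matter for this illustration.)
MAGIC = {
    b"\x7fELF": "elf",
    b"\x1f\x8b": "gzip",
    b"\xfd7zXZ\x00": "xz",
    b"\x28\xb5\x2f\xfd": "zstd",
}


def detect_format(header: bytes) -> str:
    """Return the format name matching the leading magic bytes."""
    for magic, name in MAGIC.items():
        if header.startswith(magic):
            return name
    return "unknown"


def load_elf_bytes(data: bytes) -> bytes:
    """Return ELF bytes, decompressing first when the file is compressed."""
    fmt = detect_format(data)
    if fmt == "elf":
        return data
    if fmt == "gzip":
        return gzip.decompress(data)
    if fmt == "xz":
        return lzma.decompress(data)
    # zstd and unknown formats are left to an external tool in this sketch.
    raise ValueError(f"cannot decompress format: {fmt}")
```

Passing a compressed module straight to readelf fails because the first
bytes are the compressor's magic rather than 0x7f 'E' 'L' 'F'; decompressing
first gives readelf the ELF header it expects.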