On Mon, Oct 30, 2017 at 7:08 PM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Mon, Oct 30, 2017 at 6:19 PM, Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
>>
>> 1. The faulty addresses are all near 0000000100000000, with one exception
>> of null (which is the most recent one)
>
> Well, they're at 8(%rax), except for that last case.
>
> And in every case (_including_ that last case), %rax has a very
> interesting pattern.. That's the (bad) buf->ops pointer that was
> loaded from the somehow corrupted "buf".
>
> The values in all cases are
>
>     00000000fffffffa
>     00000000fffffffd
>     00000000fffffff1
>     00000000fffffff7
>     00000000fffffff4
>     00000000fffffffa
>     00000000fffffffd
>     00000000fffffffd
>     00000000fffffffa
>     00000000ffffffe8
>     00000000fffffff1
>     00000000fffffff7
>
> which kind of looks like a 32-bit error value. So we have (n, val, (errno)):
>
>     1 -24 (EMFILE)
>     2 -15 (ENOTBLK)
>     1 -12 (ENOMEM)
>     2  -9 (EBADF)
>     3  -6 (ENXIO)
>     3  -3 (ESRCH)
>
> none of which makes any sense to me, but it's an interesting pattern
> nonetheless.

Yeah, good find!

>
>> 2. R12 register, which should map to the local vairable 'i', is always 0x8
>> at the time of crash.
>
> So _if_ this is some kind of use-after-free thing, and the allocation
> got re-used for something else, that might just be related to whatever
> ends up being the offset that is filled in with the (int) error
> number.
>
> Except the offset is that %r12*0x28+0x10, so we're talking a byte
> offset of 330 bytes into the allocation, and apparently the eight
> previous (0-7) iterations were fine.
>
> Which is really odd.
>
> I'm not seeing anything that makes sense. I'll have to think about this.
>
> I'm assuming you don't have slub debugging enabled, and no way to
> enable it and try to catch this?

We enable it at compile-time but not at run-time:

CONFIG_SLUB_DEBUG=y
CONFIG_SLUB=y
CONFIG_SLUB_CPU_PARTIAL=y
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set

I can try to manually add slub_debug to the boot parameters, but I still
have no idea how or when this bug can be triggered again.

Thanks!
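
For anyone who wants to re-check the arithmetic in the quoted analysis, here is a minimal Python sketch. It reinterprets the low 32 bits of each quoted %rax value as a signed int, counts how often each value occurs, and recomputes the %r12*0x28+0x10 offset for %r12 = 8. The helper and constant names (as_s32, errno_name, STRIDE, FIELD_OFFSET) are made up for illustration and do not come from the kernel, and the errno names are looked up from Python's errno table on whatever machine runs the script, so they are only meaningful on a Linux-like platform.

```python
import errno
from collections import Counter

# %rax values copied from the quoted oops reports.
RAX_VALUES = [
    0x00000000FFFFFFFA, 0x00000000FFFFFFFD, 0x00000000FFFFFFF1,
    0x00000000FFFFFFF7, 0x00000000FFFFFFF4, 0x00000000FFFFFFFA,
    0x00000000FFFFFFFD, 0x00000000FFFFFFFD, 0x00000000FFFFFFFA,
    0x00000000FFFFFFE8, 0x00000000FFFFFFF1, 0x00000000FFFFFFF7,
]


def as_s32(value):
    """Interpret the low 32 bits of value as a signed 32-bit integer."""
    low = value & 0xFFFFFFFF
    return low - (1 << 32) if low & 0x80000000 else low


def errno_name(code):
    """Map a negative error value to its symbolic name, or '?' if unknown.

    errno.errorcode reflects the platform running this script, which is an
    assumption here: it only matches the kernel's numbering on Linux-like
    systems.
    """
    return errno.errorcode.get(-code, "?")


# Count occurrences of each decoded value: (n, val, errno name).
counts = Counter(as_s32(v) for v in RAX_VALUES)
for code in sorted(counts):
    print(counts[code], code, errno_name(code))

# Offset from the quoted addressing expression %r12*0x28+0x10 with %r12 = 8.
# STRIDE and FIELD_OFFSET are illustrative names for the two constants.
R12 = 8
STRIDE = 0x28
FIELD_OFFSET = 0x10
byte_offset = R12 * STRIDE + FIELD_OFFSET
print(byte_offset, hex(byte_offset))
```

The counts agree with the (n, val) table in the quoted message, and the offset works out to 336 bytes (0x150), which the quoted text gives as roughly 330.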