On 1 November 2017 at 14:19, Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> On Mon, Oct 30, 2017 at 7:08 PM, Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>> On Mon, Oct 30, 2017 at 6:19 PM, Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
>>>
>>> 1. The faulty addresses are all near 0000000100000000, with one exception
>>> of null (which is the most recent one)
>>
>> Well, they're at 8(%rax), except for that last case.
>>
>> And in every case (_including_ that last case), %rax has a very
>> interesting pattern.. That's the (bad) buf->ops pointer that was
>> loaded from the somehow corrupted "buf".
>>
>> The values in all cases are
>>
>>   00000000fffffffa
>>   00000000fffffffd
>>   00000000fffffff1
>>   00000000fffffff7
>>   00000000fffffff4
>>   00000000fffffffa
>>   00000000fffffffd
>>   00000000fffffffd
>>   00000000fffffffa
>>   00000000ffffffe8
>>   00000000fffffff1
>>   00000000fffffff7
>>
>> which kind of looks like a 32-bit error value. So we have (n, val, (errno)):
>>
>>   1 -24 (EMFILE)
>>   2 -15 (ENOTBLK)
>>   1 -12 (ENOMEM)
>>   2 -9 (EBADF)
>>   3 -6 (ENXIO)
>>   3 -3 (ESRCH)
>>
>> none of which makes any sense to me, but it's an interesting pattern
>> nonetheless.
>
>
> Yeah, good find!
>
>
>>
>>> 2. R12 register, which should map to the local variable 'i', is always 0x8
>>> at the time of crash.
>>
>> So _if_ this is some kind of use-after-free thing, and the allocation
>> got re-used for something else, that might just be related to whatever
>> ends up being the offset that is filled in with the (int) error
>> number.
>>
>> Except the offset is that %r12*0x28+0x10, so we're talking a byte
>> offset of 336 bytes into the allocation, and apparently the eight
>> previous (0-7) iterations were fine.
>>
>> Which is really odd.
>>
>> I'm not seeing anything that makes sense. I'll have to think about this.
>>
>> I'm assuming you don't have slub debugging enabled, and no way to
>> enable it and try to catch this?
>
> We enable it at compile-time but not at run-time:
>
> CONFIG_SLUB_DEBUG=y
> CONFIG_SLUB=y
> CONFIG_SLUB_CPU_PARTIAL=y
> # CONFIG_SLUB_DEBUG_ON is not set
> # CONFIG_SLUB_STATS is not set
>
> I can try to manually add slub_debug in boot parameters, but still
> have no idea how and when I can trigger this bug again.
>
>
> Thanks!

This looks familiar... https://github.com/moby/moby/issues/34472

From the bug report:

"In particular, it looks like either docker-containerd or
docker-containerd-shim (the log is cut off) has a pipe open that is
causing a kernel BUG when attempting to kill the process. Fun times."
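
For anyone who wants to re-check the decoding quoted above, here is a
small Python sketch. It is not from the thread, and the function and
variable names are made up for illustration. It reinterprets the low 32
bits of each %rax value as a signed int, maps the result to a Linux
errno name (the table is hard-coded from the kernel's
asm-generic/errno-base.h numbering, only for the values that occur
here), and evaluates the %r12*0x28+0x10 expression for %r12 = 8. The
stride and field offset are taken from that expression as quoted,
nothing more.

```python
from collections import Counter

# %rax values copied from the oops reports quoted above.
RAX_VALUES = [
    0x00000000fffffffa, 0x00000000fffffffd, 0x00000000fffffff1,
    0x00000000fffffff7, 0x00000000fffffff4, 0x00000000fffffffa,
    0x00000000fffffffd, 0x00000000fffffffd, 0x00000000fffffffa,
    0x00000000ffffffe8, 0x00000000fffffff1, 0x00000000fffffff7,
]

# Linux errno numbers (asm-generic/errno-base.h), limited to the
# values that show up in this thread.
ERRNO_NAMES = {
    3: "ESRCH", 6: "ENXIO", 9: "EBADF",
    12: "ENOMEM", 15: "ENOTBLK", 24: "EMFILE",
}


def as_s32(value):
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    low = value & 0xFFFFFFFF
    return low - (1 << 32) if low & 0x80000000 else low


def summarize(values):
    """Return (count, signed value, errno name) tuples, most negative first."""
    counts = Counter(as_s32(v) for v in values)
    return [(n, val, ERRNO_NAMES.get(-val, "?"))
            for val, n in sorted(counts.items())]


def slot_offset(index, stride=0x28, field=0x10):
    """Byte offset for index*stride + field, as in %r12*0x28+0x10."""
    return index * stride + field


if __name__ == "__main__":
    for n, val, name in summarize(RAX_VALUES):
        print(n, val, f"({name})")
    print(slot_offset(8), hex(slot_offset(8)))
```

Running it prints the same six (n, val, errno) rows as in the quoted
table, followed by "336 0x150" for the offset with %r12 = 8.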