On Tue, Oct 31, 2017 at 09:43:03PM -0700, Cong Wang wrote:
> On Tue, Oct 31, 2017 at 8:05 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > On Tue, Oct 31, 2017 at 06:51:08PM -0700, Cong Wang wrote:
> >> >> Please let me know if I can provide any other information.
> >> >
> >> > How do you reproduce the problem?
> >>
> >> The warning is reported via ABRT email, we don't know what was
> >> happening at the time of crash.
> >
> > Which makes it even harder to track down. Perhaps you should
> > configure the box to crashdump on such a failure and then we
> > can do some post-failure forensic analysis...
>
> Yeah.
>
> We are trying to make kdump working, but even if kdump works
> we still can't turn on panic_on_warn since this is production
> machine.

Hmmm. Ok, maybe you could leave a trace of the xfs_iget* trace
points running and check the log tail for unusual events around the
time of the next crash. e.g. xfs_iget_reclaim_fail events. That
might point us to a potential interaction we can look at more
closely.

I'd also suggest slab poisoning as well, as that will catch other
lifecycle problems that could be causing list corruptions such as
use-after-free.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
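
As a rough illustration of the tracing suggestion above, something along these lines might work. It is only a sketch under assumptions: tracefs is taken to be mounted at /sys/kernel/debug/tracing (it may live elsewhere on a given box), the helper names enable_xfs_iget_events and find_events are made up for this example, and the filter simply assumes each trace line contains the event name followed by a colon.

```python
import glob
import os


def enable_xfs_iget_events(tracefs_root="/sys/kernel/debug/tracing"):
    """Write "1" to the enable file of every xfs_iget* event.

    tracefs_root is an assumption; adjust it to wherever tracefs is mounted.
    Returns the sorted list of event names that were enabled.
    """
    pattern = os.path.join(tracefs_root, "events", "xfs", "xfs_iget*", "enable")
    enabled = []
    for path in sorted(glob.glob(pattern)):
        with open(path, "w") as f:
            f.write("1")
        # The event name is the directory that holds the enable file.
        enabled.append(os.path.basename(os.path.dirname(path)))
    return enabled


def find_events(trace_lines, event="xfs_iget_reclaim_fail"):
    """Return the trace lines that mention the given event.

    Matching on "<event>:" keeps xfs_iget_reclaim from matching
    when looking for xfs_iget_reclaim_fail, and vice versa.
    """
    needle = event + ":"
    return [line.rstrip("\n") for line in trace_lines if needle in line]


if __name__ == "__main__":
    # Needs root and a kernel with the xfs tracepoints available.
    print(enable_xfs_iget_events())
```

After the next warning, the saved trace text can be passed line by line to find_events() to pull out the xfs_iget_reclaim_fail entries and compare their timestamps with the time of the report.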