Just looking at this and the backtrace:

> On Apr 12, 2023, at 09:14, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Tue, Apr 11, 2023 at 05:15:36PM -0700, Andrew Morton wrote:
>> On Tue, 11 Apr 2023 13:16:18 +0100 Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>>
>>> On Mon, Apr 10, 2023 at 09:45:02AM +0800, xiaosong.ma wrote:
>>>> perform the check in dump_mapping() to print warning info and avoid crash with invalid non-NULL page->mapping.
>>>> For example, a panic with following backtraces show dump_page will show wrong info and panic when the bad page
>>>> is non-NULL mapping and page->mapping is 0x80000000000.
>>>>
>>>> crash_arm64> bt
>>>> PID: 232 TASK: ffffff80e8c2c340 CPU: 0 COMMAND: "Binder:232_2"
>>>> #0 [ffffffc013e5b080] sysdump_panic_event$b2bce43a479f4f7762201bfee02d7889 at ffffffc0108d7c2c
>>>> #1 [ffffffc013e5b0c0] atomic_notifier_call_chain at ffffffc010300228
>>>> #2 [ffffffc013e5b2c0] panic at ffffffc0102c926c
>>>> #3 [ffffffc013e5b370] die at ffffffc010267670
>>>> #4 [ffffffc013e5b3a0] die_kernel_fault at ffffffc0102808a4
>>>> #5 [ffffffc013e5b3d0] __do_kernel_fault at ffffffc010280820
>>>> #6 [ffffffc013e5b410] do_bad_area at ffffffc01028059c
>>>> #7 [ffffffc013e5b440] do_translation_fault$4df5decbea5d08a63349aa36f07426b2 at ffffffc0111149c8
>>>> #8 [ffffffc013e5b470] do_mem_abort at ffffffc0100a4488
>>>> #9 [ffffffc013e5b5e0] el1_ia at ffffffc0100a6c00
>>>> #10 [ffffffc013e5b5f0] __dump_page at ffffffc0104beecc
>>>
>>> This doesn't show a crash in dump_mapping(), it shows a crash in
>>> __dump_page().
>>
>> um, yes.
>>
>> But if page->mapping is corrupted, where does __dump_page() dereference it?
>
> I don't see anywhere that it does, so I'm suspicious that we have the
> correct diagnosis here.

I agree; since dump_mapping() is an actual function rather than a macro
or inline, if a bad dereference were happening within dump_mapping() I
would think we SHOULD see the call to dump_mapping() on the stack unless
I'm missing something obvious here.
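
To make the "where would a wild mapping pointer actually get
dereferenced?" question concrete, here is a rough user-space sketch of
the probe-before-use pattern. It is NOT the kernel code: the struct
names, probe_read(), dump_mapping_sketch() and make_bad_pointer() are
all hypothetical stand-ins, and write(2) into a pipe (which fails with
EFAULT instead of raising SIGSEGV) is only an analogue of a no-fault
read, assuming a Linux-like system.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical stand-ins; only the two probed fields are modelled. */
struct inode_stub { unsigned long i_ino; };
struct mapping_stub {
	struct inode_stub *host;
	const void *a_ops;
};

/*
 * User-space analogue of a no-fault read: ask the kernel to copy from
 * src via write(2), which returns an error rather than delivering
 * SIGSEGV when src is unreadable. Returns 0 on success, -1 otherwise.
 */
static int probe_read(void *dst, const void *src, size_t len)
{
	int fds[2];
	int ret = -1;

	if (pipe(fds) != 0)
		return -1;
	if (write(fds[1], src, len) == (ssize_t)len &&
	    read(fds[0], dst, len) == (ssize_t)len)
		ret = 0;
	close(fds[0]);
	close(fds[1]);
	return ret;
}

/* Returns 0 if both fields were readable, -1 for a wild pointer. */
static int dump_mapping_sketch(const struct mapping_stub *mapping)
{
	struct inode_stub *host;
	const void *a_ops;

	/* &mapping->host is address arithmetic only; nothing is loaded. */
	if (probe_read(&host, &mapping->host, sizeof(host)) ||
	    probe_read(&a_ops, &mapping->a_ops, sizeof(a_ops)))
		return -1;	/* a real implementation would warn and bail */
	return 0;
}

/* Map one PROT_NONE page so the pointer is bad by construction. */
static void *make_bad_pointer(void)
{
	void *p = mmap(NULL, (size_t)sysconf(_SC_PAGESIZE), PROT_NONE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}
```

Passing the result of make_bad_pointer() to dump_mapping_sketch() comes
back with -1 rather than faulting, while a pointer to a valid
struct mapping_stub returns 0. The point is only that the bad pointer is
never loaded through directly, so a fault would have to come from
somewhere else.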

Instead I'd like to know which instruction the faulting address in
__dump_page() maps to for the kernel experiencing this.

>> The initial patch
>> (https://lkml.kernel.org/r/1680587425-4683-1-git-send-email-Xiaosong.Ma@xxxxxxxxxx)
>> prevented __dump_page() from calling dump_mapping() if page->mapping is
>> bad, and that presumably fixed things.
>
> Right, but doesn't the _existing_ get_kernel_nofault(host, &mapping->host)
> already prevent us from blindly dereferencing a bad mapping pointer?

I would think it would, but given the traceback, is the fault occurring
within dump_mapping(), or have we perhaps completed dump_mapping() and
some subtle corruption occurred such that the fault occurs on the return
to __dump_page()?

Certainly dump_mapping() looks to do the right thing to avoid using a
bad passed "mapping": it's not dereferenced anywhere without checks,
just used for pointer math to create an address for the calls to
get_kernel_nofault().

>> So confusion reigns. I think making dump_mapping() tolerant of a wild
>> mapping pointer makes sense, but I don't think we actually know why the
>> reporter's kernel crashed.
>
> In my mind dump_mapping() is already tolerant of a wild page->mapping
> pointer. I think the problem is something entirely different.

Again, I agree. As posited above, could it be that something occurs
within dump_mapping() such that the fault only occurs once the code
returns to __dump_page()? That would explain why the backtrace shows the
fault as occurring within __dump_page(), but at first glance the
mechanism by which this could happen eludes me.

The original patch doesn't mention whether any pr_warn() messages were
printed as a result of the call to dump_mapping(), and the suggested fix
would hide the issue whether the fault were occurring within
dump_mapping() or on the return from dump_mapping().

-- Bill