On Mon, 10 Jul 2023, Mark Brown wrote: > On Mon, Jul 10, 2023 at 06:18:27PM +0100, Lorenzo Stoakes wrote: > > On Mon, Jul 10, 2023 at 03:42:31PM +0100, Mark Brown wrote: > > > > We end up seeing NULL or otherwise bad pointer dereferences, the > > > specific error does vary a bit though it mostly appears to be in the > > > pinctrl code. A bisect (full log below) identified this patch as > > > introducing the failure, nothing is jumping out at me about the patch > > > and it's not affecting everything so I'd not be surprised if it's just > > > unconvering some bug in the platform support but I'm not super familiar > > > with the code. > > > Yeah seems likely. Do you have a .config you can share for this board? For > > a 64-bit device you'd expect that this change would probably be a nop. > > It's definitely happening with arm64 defconfig, possibly with other > configs but that's the main one. I'm sorry for dropping you in it, Mark, but I'm totally baffled. I've spent most of the day trying to come up with ideas, but failed. I've no doubt that you're seeing what you're seeing, but how it comes about is a mystery. Lorenzo is right that the change should be a no-op - compared with 6.4. But it's not quite a no-op in this series, because 04/32 0d940a9b270b ("mm/pgtable: allow pte_offset_map[_lock]() to fail") diverts the old pte_offset_map() macro off to a new function in mm/pgtable-generic.c; then this commit restores it back to being the pte_offset_kernel() macro. So the asm in vmalloc_to_page() is expected to change in this commit, but change back to what it would have been in 6.4. This feels like one of those bugs which depends on the code size in some way (a bit like those bugs we used to have, where a function was mistakenly marked __init, then in some configs its code landed on a page which got freed at startup - I'm not saying this is that at all, just saying it feels weird in that way). Yet your bisection converges convincingly, which I wouldn't expect in that case. I suppose I should ask you to try reverting this 0d1c81edc61e alone from 6.5-rc1: the consistency of your bisection implies that it will "fix" the issues, and it is a commit which we could drop. It makes me a little nervous, applying userspace-pagetable validation to kernel pagetables, so I don't want to drop it; and it would really be cargo- culting to drop it without understanding. But we could drop it. I guess it would be interesting to know whether vmalloc_to_page() is ever even called in your kernel, before it crashes on the pinctrl stuff. But putting in a printk to report on that may change everything. And I guess it would be interesting to know (from a DEBUG_INFO build of the crashing kernel) which line of dt_remember_or_free_map() it oopses on i.e. which pointer is NULL when it shouldn't be - or maybe you already worked that out. And what device (which ->dt_node_to_map) is involved. If one of the many dt_node_to_map's fails to initialize *map to NULL when it should, and has relied on it happening to be a NULL on the stack already... that might explain it. Another thing to try, would be the kernel at 0d940a9b270b^, just before pte_offset_map() grew a function call: there's a faint possibility that the bug came in before this series, that 0d940a9b270b somehow masked it (I don't see how: vmalloc_to_page() does sensible validation itself), and then 0d1c81edc61e unmasked it again - so that the bisection skipped over, and converged on the wrong point. But I'm thrashing about: I have no confidence that any of this info will help us. Sorry for wasting your time. Thanks, Hugh