On Mon, 2020-10-26 at 11:05 +0200, Mike Rapoport wrote:
> On Mon, Oct 26, 2020 at 01:13:52AM +0000, Edgecombe, Rick P wrote:
> > On Sun, 2020-10-25 at 12:15 +0200, Mike Rapoport wrote:
> > > Indeed, for architectures that define
> > > CONFIG_ARCH_HAS_SET_DIRECT_MAP it is possible that
> > > __kernel_map_pages() would fail, but since this function is
> > > void, the failure will go unnoticed.
> >
> > Could you elaborate on how this could happen? Do you mean during
> > runtime today or if something new was introduced?
>
> A failure in __kernel_map_pages() may happen today. For instance, on
> x86 if the kernel is built with DEBUG_PAGEALLOC.
>
> 	__kernel_map_pages(page, 1, 0);
>
> will need to split, say, 2M page and during the split an allocation
> of page table could fail.

On x86 at least, DEBUG_PAGEALLOC expects to never have to break a page
on the direct map and even disables locking in cpa because it assumes
this. If this is happening somehow anyway then we should probably fix
that. Even if it's a debug feature, it will not be as useful if it is
causing its own crashes.

I'm still wondering if there is something I'm missing here. It seems
like you are saying there is a bug in some arches, so let's add a WARN
in cross-arch code to log it as it crashes. A WARN and making things
clearer seem like good ideas, but if there is a bug we should fix it.
The code around the callers still functionally assumes re-mapping
can't fail.

> Currently, the only user of __kernel_map_pages() outside
> DEBUG_PAGEALLOC is hibernation, but I think it would be safer to
> entirely prevent usage of __kernel_map_pages() when
> DEBUG_PAGEALLOC=n.

I totally agree it's error prone FWIW. On x86, my mental model of how
it is supposed to work is: If a page is 4k and NP it cannot fail to be
remapped. set_direct_map_invalid_noflush() should result in 4k NP
pages, and DEBUG_PAGEALLOC should result in all 4k pages on the direct
map.
Are you seeing this violated or do I have wrong assumptions?

Beyond whatever you are seeing, for the latter case of new things
getting introduced to an interface with hidden dependencies... Another
edge case could be that a new caller of set_memory_np() results in
large NP pages. None of the callers today should cause this AFAICT,
but it's not great to rely on the callers to know these details.
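
To make the mental model above a bit more concrete, here is a toy
userspace sketch in C. It is not kernel code: every name in it
(toy_mapping, toy_set_present(), and so on) is made up for
illustration, and it only models the assumption that remapping can
fail solely when a large direct map entry has to be split and the page
table allocation for the split fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; all names here are hypothetical, not kernel APIs. */
enum toy_level { TOY_4K, TOY_2M };

struct toy_mapping {
	enum toy_level level;	/* size of the direct map entry covering the page */
	bool present;		/* false models an NP (not-present) entry */
};

/* Models a page table allocator that can run dry. */
struct toy_allocator {
	size_t free_tables;
};

static bool toy_alloc_table(struct toy_allocator *a)
{
	if (a->free_tables == 0)
		return false;
	a->free_tables--;
	return true;
}

/*
 * Change the present bit for one 4k page.
 * Returns 0 on success, -1 if a large entry had to be split and the
 * page table allocation for the split failed. A 4k entry never needs
 * an allocation, so it can never take the failure path.
 */
static int toy_set_present(struct toy_mapping *m, struct toy_allocator *a,
			   bool present)
{
	assert(m && a);

	if (m->level == TOY_2M) {
		if (!toy_alloc_table(a))
			return -1;	/* mapping is left untouched */
		m->level = TOY_4K;
	}
	m->present = present;
	return 0;
}

/* Models a void interface: the error above is silently dropped. */
static void toy_map_pages_void(struct toy_mapping *m, struct toy_allocator *a,
			       bool present)
{
	(void)toy_set_present(m, a, present);
}
```

In this sketch a 4k NP entry can always be made present again, even
with an empty allocator, while a 2M entry (present or NP) can fail and,
through the void wrapper, stay in its old state without the caller
noticing. That last case is the large NP page scenario I am worried
about for a hypothetical new set_memory_np() caller.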