On Thu, Nov 26, 2020 at 11:36:02AM +0200, Mike Rapoport wrote:
> I think it's invented by your BIOS vendor :)

BTW, all systems I use on a daily basis have that type 20... Only two
of them are reproducing the VM_BUG_ON on a weekly basis on v5.9.

If you search 'E820 "type 20"' you'll get plenty of hits, so it's not
just me at least :). In fact my guess is there are probably more
workstations/laptops with that type 20 than without. Maybe it only
shows up if booting with EFI?

Easy to check with `dmesg | grep "type 20"` after boot.

One guess why this wasn't frequently reproduced is that some desktop
distros are making the mistake of keeping THP enabled=madvise by
default, which reduces the overall compaction testing. Or maybe
they're not all setting DEBUG_VM=y (but I'm sure some other distro
ships v5.9 with DEBUG_VM=y). Often I hit this bug in kcompactd0 for
example, and that wouldn't happen with THP enabled=madvise.

The two bpf tracing tools below can prove that the current
defrag=madvise default only increases the allocation latency from a
few usec to a dozen usec. Only with defrag=always does the latency go
up to single digit milliseconds, because of the cost of direct
compaction, which is only worth paying for MADV_HUGEPAGE ranges doing
long-lived allocations (we know by now that defrag=always was a
suboptimal default).

https://www.kernel.org/pub/linux/kernel/people/andrea/ebpf/thp-comm.bp
https://www.kernel.org/pub/linux/kernel/people/andrea/ebpf/thp.bp

Since 3917c80280c93a7123f1a3a6dcdb10a3ea19737d even apps like Redis
that use fork for snapshotting purposes should prefer THP enabled
(besides, it would be better if it used uffd-wp as an alternative to
fork).

3917c80280c93a7123f1a3a6dcdb10a3ea19737d also resolved another
concern, because the decade old "fork() vs gup/O_DIRECT vs thread"
race was supposed to be unnoticeable from userland if the O_DIRECT min
I/O granularity was enforced to be >=PAGE_SIZE.
However, with THP backed anon memory, that minimum granularity
requirement increases to HPAGE_PMD_SIZE. Recent kernels are going in
the direction of solving that race by doing cow during fork, as was
originally proposed a long time ago
(https://lkml.kernel.org/r/20090311165833.GI27823@random.random), which
I suppose will solve the race with sub-PAGE_SIZE granularity too, but
3917c80280c93a7123f1a3a6dcdb10a3ea19737d alone is enough to reduce the
minimum I/O granularity requirement from HPAGE_PMD_SIZE to PAGE_SIZE
as some userland may have expected. The best of course is to fully
prevent that race condition by setting MADV_DONTFORK on the regions
under O_DIRECT (as qemu does for example).

Overall the only tangible concern left is potentially higher memory
usage for servers handling tiny object storage freed at PAGE_SIZE
granularity with MADV_DONTNEED (instead of having a way to copy and
defrag the virtual space of small objects through a callback that
updates the pointer to the object...).

Small object storage relying on jemalloc/tcmalloc for tiny object
management simply needs to selectively disable THP to avoid wasting
memory, either with MADV_NOHUGEPAGE or with the prctl
PR_SET_THP_DISABLE. Flipping a switch in the OCI schema allows
disabling THP too for those object storage apps making heavy use of
MADV_DONTNEED; not even a single line of code needs changing in the
app if it is deployed through the OCI container runtime.

Recent jemalloc uses MADV_NOHUGEPAGE. I didn't check exactly how it's
being used, but I hope it already does the right thing and separates
the small object arena, zapped with MADV_DONTNEED at PAGE_SIZE
granularity, from the large object arena where THP shall remain
enabled. glibc should also learn to separate small objects and big
objects in different arenas.
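
To make the small/large arena split concrete, here is a minimal
sketch. It is Python only for brevity (a real allocator would issue
the same madvise() calls from C), it assumes Linux and Python >= 3.8,
and new_arena(), zap_page() and ARENA_SIZE are names made up for the
example. It is not how jemalloc, tcmalloc or glibc actually do it.

```python
import mmap

PAGE_SIZE = mmap.PAGESIZE
ARENA_SIZE = 4 * 1024 * 1024  # arbitrary size picked for the example


def _advise(arena, name):
    """Apply the madvise() hint called `name`; return False if unsupported."""
    advice = getattr(mmap, name, None)  # constant may be missing off Linux
    if advice is None:
        return False
    try:
        arena.madvise(advice)
        return True
    except OSError:  # e.g. EINVAL on kernels built without THP
        return False


def new_arena(size=ARENA_SIZE, thp=False):
    """Private anonymous mapping; hypothetical helper, not an allocator API.

    thp=False -> MADV_NOHUGEPAGE (small objects, zapped page by page)
    thp=True  -> MADV_HUGEPAGE   (large, long-lived objects)
    """
    arena = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    hinted = _advise(arena, "MADV_HUGEPAGE" if thp else "MADV_NOHUGEPAGE")
    return arena, hinted


def zap_page(arena, page_index):
    """Free one PAGE_SIZE chunk of a small object arena with MADV_DONTNEED."""
    arena.madvise(mmap.MADV_DONTNEED, page_index * PAGE_SIZE, PAGE_SIZE)


if __name__ == "__main__":
    small, small_hinted = new_arena(thp=False)
    large, large_hinted = new_arena(thp=True)
    small[0:5] = b"hello"
    zap_page(small, 0)  # the page reads back as zeroes afterwards
    print(small_hinted, large_hinted, small[0:5])
```

With the small object arena marked MADV_NOHUGEPAGE, each zap_page()
gives back exactly one PAGE_SIZE page and no huge page has to be split
to do it. An app that can't be changed at all would use the prctl
PR_SET_THP_DISABLE route mentioned above instead.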

This has to be handled by the app, like it is already handled
optimally in scylladb, which in fact invokes MADV_HUGEPAGE. Plenty of
other databases use not just THP but also hugetlbfs, which certainly
won't fly if MADV_DONTNEED is attempted at PAGE_SIZE granularity. Or
Elasticsearch, which also gets a significant boost from THP, etc.

Thanks,
Andrea