Hi Andy, On Sat, May 09, 2020 at 12:05:29PM -0700, Andy Lutomirski wrote: > 1. Non-PAE. There is a single 4k top-level page table per mm, and > this table contains either 512 or 1024 entries total. Of those > entries, some fraction (half or less) control the kernel address > space, and some fraction of *that* is for vmalloc space. Those > entries are the *only* thing that needs syncing -- all mms will either > have null (not present) in those slots or will have pointers to the > *same* next-level-down directories. Not entirely correct, on 32-bit non-PAE there are two levels with 1024 entries each. In Linux these map to PMD and PTE levels. With 1024 entries each PMD maps 4MB of address space. How much of these 1024 top-level entries are used for the kernel is configuration dependent in Linux, it can be 768, 512, or 256. > 2. PAE. Depending on your perspective, there could be a grand total > of four top-level paging pointers, of which one (IIRC) is for the > kernel. That is again configuration dependent, up to 3 of the top-level pointers could be used for kernel-space. > That points to the same place for all mms. Or, if you look at it the > other way, PAE is just like #1 except that the top-level table has > only four entries and only one points to VMALLOC space. There are more differences. In PAE an entry is 64-bit, which means each PMD-entry only maps 2MB of address space, which means you need two PTE pages to map the same amount of address space you could map with one page without PAE paging. And as noted above, all 3 of the top-level pointers can point to vmalloc space (not exclusivly). > So, unless I'm missing something here, there is an absolute maximum of > 512 top-level entries that ever need to be synchronized. And here is where your assumption is wrong. On 32-bit PAE systems it is not the top-level entries that need to be synchronized for vmalloc, but the second-level entries. And dependent on the kernel configuration, there are (in total, not only vmalloc) 1536, 1024, or 512 of these second-level entries. How much of them are actually used for vmalloc depends on the size of the system RAM (but is at least 64), because the vmalloc area begins after the kernel direct-mapping (with an 8MB unmapped hole). With 512MB of RAM you need 256 entries for the kernel direct-mapping (I ignore the ISA hole here). After that there are 4 unmapped entries and the rest is vmalloc, minus some fixmap and ldt mappings at the end of the address space. I havn't looked up the exact number, but 4 entries (8MB) for this should be close to the real value. 1536 total PMD entries - 256 for direct-mapping - 4 for ISA hole - 4 for stuff at the end of the address space = 1272 PMD entries for VMALLOC space which need synchronization. With less RAM it is even more, and to get rid of faulting and synchronization you need to pre-allocate a 4k PTE page for each of these PMD entries. > Now, there's an additional complication. On x86_64, we have a rule: > those entries that need to be synced start out null and may, during > the lifetime of the system, change *once*. They are never unmapped or > modified after being allocated. This means that those entries can > only ever point to a page *table* and not to a ginormous page. So, > even if the hardware were to support ginormous pages (which, IIRC, it > doesn't), we would be limited to merely immense and not ginormous > pages in the vmalloc range. On x86_32, I don't think we have this > rule right now. And this means that it's possible for one of these > pages to be unmapped or modified. The reason for x86-32 being different is that the address space is orders of magnitude smaller than on x86-64. We just have 4 top-level entries with PAE paging and can't afford to partition kernel-adress space on that level like we do on x86-64. That is the reason the address space is partitioned on the second (PMD) level, which is also the reason vmalloc synchronization needs to happen on that level. And because that's not enough yet, its also the page-table level to map huge-pages. > So my suggestion is that just apply the x86_64 rule to x86_32 as well. > The practical effect will be that 2-level-paging systems will not be > able to use huge pages in the vmalloc range, since the rule will be > that the vmalloc-relevant entries in the top-level table must point to > page *tables* instead of huge pages. I could very well live with prohibiting huge-page ioremap mappings for x86-32. But as I wrote before, this doesn't solve the problems I am trying to address with this patch-set, or would only address them if significant amount of total system memory is used. The pre-allocation solution would work for x86-64, it would only need 256kb of preallocated memory for the vmalloc range to never synchronize or fault again. I have thought about that and did the math before writing this patch-set, but doing the math for 32 bit drove me away from it for reasons written above. And since a lot of the vmalloc_sync_(un)mappings problems I debugged were actually related to 32-bit, I want a solution that works for 32 and 64-bit x86 (at least until support for x86-32 is removed). And I think this patch-set provides a solution that works well for both. Joerg