[adding linux-mm]

On 21 Feb 19 15:41, Jerome Glisse wrote:
> On Wed, Feb 20, 2019 at 03:06:22PM -0800, Larry Bassel wrote:
> > I'm working on sharing page tables in the DAX/XFS/PMEM/PMD case.
> >
> > If multiple processes would use the identical page of PMDs corresponding
> > to a 1 GiB address range of DAX/XFS/PMEM/PMDs, presumably one can instead
> > of populating a new PUD, just atomically increment a refcount and point
> > to the same PUD in the next level above.

Thanks for your feedback. Some comments/clarification below.

>
> I think page table sharing was discussed several times in the past and
> the complexity involved versus the benefit was not clear. For 1GB
> of virtual address space you need:
>     #pte pages = 1G/(512 * 2^12)       = 512 pte pages
>     #pmd pages = 1G/(512 * 512 * 2^12) = 1 pmd page
>
> So if we were to share the pmd directory page we would be saving a
> total of 513 pages for every page table or ~2MB. This goes up with
> the number of processes that map the same range, i.e. if 10 processes map
> the same range and share the same pmd then you are saving 9 * 2MB =
> 18MB of memory. This seems a relatively modest saving.

The file blocksize = page size in what I am working on would
be 2 MiB (sharing puds/pages of pmds), I'm not trying to
support sharing pmds/pages of ptes. And yes, the savings in this
case are actually even less than in your example (but see my example below).

>
> AFAIK there is no hardware benefit from sharing the page table
> directory between different page tables. So the only benefit is the
> amount of memory we save.

Yes, in our use case (high end Oracle database using DAX/XFS/PMEM/PMD)
the main benefit would be memory savings:

A future system might have 6 TiB of PMEM on it and
there might be 10000 processes each mapping all of this 6 TiB.
Here the savings would be approximately
(6 TiB / 2 MiB) * 8 bytes (page table entry size) * 10000 = 240 GiB
(and these page tables themselves would be in non-PMEM (ordinary RAM)).
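
For anyone who wants to check the numbers, here is a rough sketch of the
arithmetic in both examples. It assumes x86-64 style paging (4 KiB page
table pages, 8-byte entries, 512 entries per table); the function and
constant names are made up for illustration and are not kernel identifiers.

```python
KIB, MIB, GIB, TIB = 1 << 10, 1 << 20, 1 << 30, 1 << 40

PAGE_SIZE = 4 * KIB                 # assumed base page size (x86-64)
ENTRY_SIZE = 8                      # assumed bytes per page table entry
ENTRIES = PAGE_SIZE // ENTRY_SIZE   # 512 entries per page table page
PMD_MAP = ENTRIES * PAGE_SIZE       # 2 MiB covered by one PMD entry


def table_pages_per_gib():
    """Page table pages needed to map 1 GiB with 4 KiB pages."""
    pte_pages = GIB // PMD_MAP               # one pte page per 2 MiB
    pmd_pages = GIB // (ENTRIES * PMD_MAP)   # one pmd page per 1 GiB
    return pte_pages, pmd_pages


def pmd_bytes(mapped_bytes):
    """Bytes of PMD entries one process needs to map a range with 2 MiB pages."""
    return (mapped_bytes // PMD_MAP) * ENTRY_SIZE


pte_pages, pmd_pages = table_pages_per_gib()
per_gib = (pte_pages + pmd_pages) * PAGE_SIZE
print(f"4 KiB case: {pte_pages} pte + {pmd_pages} pmd pages = "
      f"{per_gib // KIB} KiB per GiB")

per_process = pmd_bytes(6 * TIB)
total = per_process * 10000
print(f"2 MiB case: {per_process // MIB} MiB per process, "
      f"{total // MIB} MiB (~{total / GIB:.0f} GiB) for 10000 processes")
```

This counts only the PMD entries themselves and ignores the (much smaller)
PUD level above them. The total comes to 240000 MiB, i.e. roughly 234 GiB,
which is in line with the approximate figure above.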
>
> See below for comments on complexity to achieve this.
> [trim]
> >
> > If I have a mmap of a DAX/FS/PMEM file and I take
> > a page (either pte or PMD sized) fault on access to this file,
> > the page table(s) are set up in dax_iomap_fault() in fs/dax.c (correct?).
>
> Not exactly, the page tables are allocated long before dax_iomap_fault()
> gets called. They are allocated by handle_mm_fault() and its child
> functions.

Yes, I misstated this, the fault is handled there which may well
alter the PUD (in my case), but the original page tables are set up earlier.

> >
> > If the process later munmaps this file or exits but there are still
> > other users of the shared page of PMDs, I would need to
> > detect that this has happened and act accordingly (#3 above)
> >
> > Where will these page table entries be torn down?
> > In the same code where any other page table is torn down?
> > If this is the case, what would be the cleanest way of telling that these
> > page tables (PMDs, etc.) correspond to a DAX/FS/PMEM mapping
> > (look at the physical address pointed to?) so that
> > I could do the right thing here.
> >
> > I understand that I may have missed something obvious here.
> >
>
> There are many issues, here are the ones I can think of:
>   - finding a pmd/pud to share, you need to walk the reverse mapping
>     of the range you are mapping to find if any process or other
>     virtual address already has a pud or pmd you can reuse. This can
>     take more time than allocating page directory pages.
>   - if one process munmaps some portion of a shared pud you need to
>     break the sharing, this means that munmap (or mremap) would need
>     to handle this page table directory sharing case first
>   - many code paths in the kernel might need updating to understand this
>     shared page table thing (mprotect, userfaultfd, ...)
>   - the locking rules are bound to be painful
>   - this might not work on all architectures as some architectures do
>     associate information with page table directories and that can not
>     always be shared (it would need to be enabled arch by arch)

Yes, some architectures don't support DAX at all (note again that
I'm not trying to share non-DAX page tables here).

>
> The nice thing:
>   - unmapping for migration, when you unmap a shared pud/pmd you can
>     decrement mapcount by the shared pud/pmd count, this could speed up
>     migration

A followup question: the kernel does sharing of page tables for hugetlbfs
(also 2 MiB pages), why aren't the above issues relevant there as well
(or are they but we support it anyhow)?

>
> This is what I could think of off the top of my head but there might be
> other things. I believe the question is really one of benefit versus cost and
> to me at least the complexity cost outweighs the benefit for now.
> Kirill Shutemov proposed a rework of how we do page tables and this kind of
> rework might tip the balance the other way. So my suggestion would be to
> look into how the page table management can be changed in a beneficial
> way that could also achieve the page table sharing.
>
> Cheers,
> Jérôme

Thanks.

Larry