On Thu, Dec 16, 2021 at 12:37:37PM +0300, Kirill A. Shutemov wrote:
> On Wed, Dec 15, 2021 at 09:55:20PM +0000, Matthew Wilcox wrote:
> > I've been trying to understand whether we can simplify the mapcount
> > handling for folios from the current situation with THPs.  Let me
> > quote the commit message from 53f9263baba6:
> > 
> > > mm: rework mapcount accounting to enable 4k mapping of THPs
> > > 
> > > We're going to allow mapping of individual 4k pages of THP compound.
> > > It means we need to track mapcount on per small page basis.
> > > 
> > > Straight-forward approach is to use ->_mapcount in all subpages to
> > > track how many time this subpage is mapped with PMDs or PTEs
> > > combined.  But this is rather expensive: mapping or unmapping of a
> > > THP page with PMD would require HPAGE_PMD_NR atomic operations
> > > instead of single we have now.
> > > 
> > > The idea is to store separately how many times the page was mapped
> > > as whole -- compound_mapcount.  This frees up ->_mapcount in
> > > subpages to track PTE mapcount.
> > > 
> > > We use the same approach as with compound page destructor and
> > > compound order to store compound_mapcount: use space in first tail
> > > page, ->mapping this time.
> > > 
> > > Any time we map/unmap whole compound page (THP or hugetlb) -- we
> > > increment/decrement compound_mapcount.  When we map part of
> > > compound page with PTE we operate on ->_mapcount of the subpage.
> > > 
> > > page_mapcount() counts both: PTE and PMD mappings of the page.
> > > 
> > > Basically, we have mapcount for a subpage spread over two counters.
> > > It makes tricky to detect when last mapcount for a page goes away.
> > > 
> > > We introduced PageDoubleMap() for this.  When we split THP PMD for
> > > the first time and there's other PMD mapping left we offset up
> > > ->_mapcount in all subpages by one and set PG_double_map on the
> > > compound page.  These additional references go away with last
> > > compound_mapcount.
> > > 
> > > This approach provides a way to detect when last mapcount goes away
> > > on per small page basis without introducing new overhead for most
> > > common cases.
> > 
> > What breaks if we simply track any mapping (whether by PMD or PTE)
> > as an increment to the head page (aka folio's) refcount?
> 
> The obvious answer is CoW: as discussed yesterday we need exact mapcount
> to know if the page can be re-used or has to be copied.
> 
> Consider the case when you have folio mapped as PMD and then split into
> PTE page table (like with mprotect()). You get WP page fault on a page
> that has mapcount == 512. How would you know if we can re-use the 4k?

I was trying to say the exact opposite of that ... fortunately I
rephrased it below.  The scenario I think you're describing here is:

	p = mmap(x, 2MB, PROT_READ|PROT_WRITE, ...): THP allocated
	mprotect(p, 4KB, PROT_READ): THP split

And in that case, I would say the THP now has mapcount of 2 because
there are 2 VMAs mapping it.

> Also we need to detect case when the last mapping of a 4k in the folio
> has gone to trigger deferred_split_huge_page() logic.

I think you're referring to this logic in rmap.c:

		if (TestClearPageDoubleMap(page)) {
			/*
			 * Subpages can be mapped with PTEs too. Check how many of
			 * them are still mapped.
			 */
			for (i = 0, nr = 0; i < thp_nr_pages(page); i++) {
				if (atomic_add_negative(-1, &page[i]._mapcount))
					nr++;
			}

			/*
			 * Queue the page for deferred split if at least one small
			 * page of the compound page is unmapped, but at least one
			 * small page is still mapped.
			 */
			if (nr && nr < thp_nr_pages(page))
				deferred_split_huge_page(page);

The 'partial_mapcount' idea I mentioned below could help with this.
Checking that they're identical might be racy, though.

> > Essentially, we make the head mapcount 'the number of VMAs which
> > contain a reference to any page in this folio'.
> 
> Okay, so you will have mapcount == 2 or 3 for mprotect case above, not
> 512.
> But it doesn't help with answering question if the page can be
> re-used. You would need to do rmap walk to get the answer.
> 
> Note also that VMA lifecycle is different from page lifecycle:
> MADV_DONTNEED removes mapping, but leaves VMA intact. Who would
> decrement mapcount here?

I think page_remove_rmap() does when called from zap_huge_pmd().

The tricky part is handling partial DONTNEED calls.  eg, we could have
a 2MB page (if we're talking about shmem, it could even be mapped
askew), MADV_DONTNEED the first 512KB, then MADV_DONTNEED the last
512KB, then finally MADV_DONTNEED the middle 1MB, and only at the third
call should the mapcount be decremented.  So a zap call has to check
all the PTE/PMD entries in the range of the (folio intersect vma) to be
sure that there are no more references to any part of this folio.  It
might not be terribly fast, but it probably won't be noticeable compared
to all the other costs of doing a munmap().

> > We can remove PageDoubleMap.  The tail refcounts will all be 0.  If
> > it's useful, we could introduce a 'partial_mapcount' which would be
> > <= mapcount (but I don't know if it's useful).  Splitting a PMD would
> > not change ->_mapcount.  Splitting the folio already causes the folio
> > to be unmapped, so page faults will naturally re-increment ->_mapcount
> > of each subpage.
> > 
> > We might need some additional logic to treat a large folio (aka
> > compound page) as a single unit; that is, when we fault on one page,
> > we place entries for all pages in this folio (that fit ...) into the
> > page tables, so that we only account it once, even if it's not
> > compatible with using a PMD.
> 
> I still don't see a way to simplify mapcount for THP. But I'm
> preconceived because I'm the author of the current scheme.
> 
> Please, prove me wrong. I want to be mistaken. :)

I'm just trying to learn enough to make sensible suggestions for
simplification.
As yesterday's call proved, there are all kinds of corner cases when messing with mapcount and refcount.