On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote: > > On 27/11/2023 08:42, Barry Song wrote: > >>> + for (i = 0; i < nr; i++, page++) { > >>> + if (anon) { > >>> + /* > >>> + * If this page may have been pinned by the > >>> + * parent process, copy the page immediately for > >>> + * the child so that we'll always guarantee the > >>> + * pinned page won't be randomly replaced in the > >>> + * future. > >>> + */ > >>> + if (unlikely(page_try_dup_anon_rmap( > >>> + page, false, src_vma))) { > >>> + if (i != 0) > >>> + break; > >>> + /* Page may be pinned, we have to copy. */ > >>> + return copy_present_page( > >>> + dst_vma, src_vma, dst_pte, > >>> + src_pte, addr, rss, prealloc, > >>> + page); > >>> + } > >>> + rss[MM_ANONPAGES]++; > >>> + VM_BUG_ON(PageAnonExclusive(page)); > >>> + } else { > >>> + page_dup_file_rmap(page, false); > >>> + rss[mm_counter_file(page)]++; > >>> + } > >>> } > >>> - rss[MM_ANONPAGES]++; > >>> - } else if (page) { > >>> - folio_get(folio); > >>> - page_dup_file_rmap(page, false); > >>> - rss[mm_counter_file(page)]++; > >>> + > >>> + nr = i; > >>> + folio_ref_add(folio, nr); > >> > >> You're changing the order of mapcount vs. refcount increment. Don't. > >> Make sure your refcount >= mapcount. > >> > >> You can do that easily by doing the folio_ref_add(folio, nr) first and > >> then decrementing in case of error accordingly. Errors due to pinned > >> pages are the corner case. > >> > >> I'll note that it will make a lot of sense to have batch variants of > >> page_try_dup_anon_rmap() and page_dup_file_rmap(). > >> > > > > i still don't understand why it is not a entire map+1, but an increment > > in each basepage. > > Because we are PTE-mapping the folio, we have to account each individual page. > If we accounted the entire folio, where would we unaccount it? Each page can be > unmapped individually (e.g. munmap() part of the folio) so need to account each > page. When PMD mapping, the whole thing is either mapped or unmapped, and its > atomic, so we can account the entire thing. Hi Ryan, There is no problem. for example, a large folio is entirely mapped in process A with CONPTE, and only page2 is mapped in process B. then we will have entire_map = 0 page0.map = -1 page1.map = -1 page2.map = 0 page3.map = -1 .... > > > > > as long as it is a CONTPTE large folio, there is no much difference with > > PMD-mapped large folio. it has all the chance to be DoubleMap and need > > split. > > > > When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any > > similar things on a part of the large folio in process A, > > > > this large folio will have partially mapped subpage in A (all CONTPE bits > > in all subpages need to be removed though we only unmap a part of the > > large folioas HW requires consistent CONTPTEs); and it has entire map in > > process B(all PTEs are still CONPTES in process B). > > > > isn't it more sensible for this large folios to have entire_map = 0(for > > process B), and subpages which are still mapped in process A has map_count > > =0? (start from -1). > > > >> Especially, the batch variant of page_try_dup_anon_rmap() would only > >> check once if the folio maybe pinned, and in that case, you can simply > >> drop all references again. So you either have all or no ptes to process, > >> which makes that code easier. > > I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But > fundamentally you can only use entire_mapcount if its only possible to map and > unmap the whole folio atomically. My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise, it is partially mapped. if a large folio is mapped in one processes with all CONTPTEs and meanwhile in another process with partial mapping(w/o CONTPTE), it is DoubleMapped. Since we always hold ptl to set or drop CONTPTE bits, set/drop is still atomic in a spinlock area. > > >> > >> But that can be added on top, and I'll happily do that. > >> > >> -- > >> Cheers, > >> > >> David / dhildenb > > Thanks Barry