On 1/3/19 6:44 AM, Jerome Glisse wrote:
> On Thu, Jan 03, 2019 at 10:26:54AM +0100, Jan Kara wrote:
>> On Wed 02-01-19 20:55:33, Jerome Glisse wrote:
>>> On Wed, Dec 19, 2018 at 12:08:56PM +0100, Jan Kara wrote:
>>>> On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
>>>>> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
>>>>>> OK, so let's take another look at Jerome's _mapcount idea all by itself
>>>>>> (using *only* the tracking pinned pages aspect), given that it is the
>>>>>> lightest weight solution for that.
>>>>>>
>>>>>> So as I understand it, this would use page->_mapcount to store both the
>>>>>> real mapcount, and the dma pinned count (simply added together), but
>>>>>> only do so for file-backed (non-anonymous) pages:
>>>>>>
>>>>>> __get_user_pages()
>>>>>> {
>>>>>>     ...
>>>>>>     get_page(page);
>>>>>>
>>>>>>     if (!PageAnon(page))
>>>>>>         atomic_inc(&page->_mapcount);
>>>>>>     ...
>>>>>> }
>>>>>>
>>>>>> put_user_page(struct page *page)
>>>>>> {
>>>>>>     ...
>>>>>>     if (!PageAnon(page))
>>>>>>         atomic_dec(&page->_mapcount);
>>>>>>
>>>>>>     put_page(page);
>>>>>>     ...
>>>>>> }
>>>>>>
>>>>>> ...and then in the various consumers of the DMA pinned count, we use
>>>>>> page_mapped(page) to see if any mapcount remains, and if so, we treat
>>>>>> it as DMA pinned. Is that what you had in mind?
>>>>>
>>>>> Mostly, with the extra two observations:
>>>>>     [1] We only need to know the pin count when a write back kicks in
>>>>>     [2] We need to protect GUP code with wait_on_page_writeback() in
>>>>>         case GUP is racing with a write back that might not see the
>>>>>         elevated mapcount in time.
>>>>>
>>>>> So for [2]
>>>>>
>>>>> __get_user_pages()
>>>>> {
>>>>>     get_page(page);
>>>>>
>>>>>     if (!PageAnon(page)) {
>>>>>         atomic_inc(&page->_mapcount);
>>>>> +       if (PageWriteback(page)) {
>>>>> +           // Assume we are racing and the current write back will not
>>>>> +           // see the elevated mapcount, so wait for the current write
>>>>> +           // back and force a page fault
>>>>> +           wait_on_page_writeback(page);
>>>>> +           // force slow path that will fault again
>>>>> +       }
>>>>>     }
>>>>> }
>>>>
>>>> This is not needed AFAICT. __get_user_pages() gets the page reference
>>>> (and it should also increment page->_mapcount) under the PTE lock. So at
>>>> that point we are sure we have a writeable PTE nobody can change. So
>>>> page_mkclean() has to block on the PTE lock to make the PTE read-only,
>>>> and only after going through all PTEs like this can it check
>>>> page->_mapcount. So the PTE lock provides enough synchronization.
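
Just to restate Jan's point in code form, here is roughly the ordering being
described, as I understand it. This is only an illustrative, simplified
sketch loosely based on the gup slow path and page_mkclean_one(), with the
proposed _mapcount increment added; the *_sketch names and the omitted
details are mine, not actual kernel code:

/* GUP slow path, simplified: the pin is recorded under the PTE lock. */
static int follow_page_pte_sketch(struct vm_area_struct *vma,
                                  unsigned long address, pmd_t *pmd)
{
    spinlock_t *ptl;
    pte_t *ptep = pte_offset_map_lock(vma->vm_mm, pmd, address, &ptl);
    struct page *page = vm_normal_page(vma, address, *ptep);

    if (page) {
        get_page(page);
        if (!PageAnon(page))
            atomic_inc(&page->_mapcount);   /* the proposed pin count */
    }
    pte_unmap_unlock(ptep, ptl);
    return 0;
}

/* page_mkclean_one(), simplified: it must take the same PTE lock in order
 * to write-protect each PTE, so it serializes against the critical section
 * above. */
static bool page_mkclean_one_sketch(struct page *page,
                                    struct vm_area_struct *vma,
                                    unsigned long address, void *arg)
{
    struct page_vma_mapped_walk pvmw = {
        .page = page,
        .vma = vma,
        .address = address,
    };

    while (page_vma_mapped_walk(&pvmw)) {   /* takes the PTE lock */
        pte_t entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);

        entry = pte_wrprotect(entry);
        entry = pte_mkclean(entry);
        set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, entry);
    }
    return true;
}

In other words, as long as the _mapcount increment happens inside the region
covered by the PTE lock, the rmap walk in page_mkclean() cannot complete
while a racing gup is between "writeable PTE seen" and "mapcount bumped".
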
>>>>
>>>>> For [1], only needing the pin count during write back turns page_mkclean
>>>>> into the perfect spot to check for that, so:
>>>>>
>>>>> int page_mkclean(struct page *page)
>>>>> {
>>>>>     int cleaned = 0;
>>>>> +   int real_mapcount = 0;
>>>>>     struct address_space *mapping;
>>>>>     struct rmap_walk_control rwc = {
>>>>>         .arg = (void *)&cleaned,
>>>>>         .rmap_one = page_mkclean_one,
>>>>>         .invalid_vma = invalid_mkclean_vma,
>>>>> +       .mapcount = &real_mapcount,
>>>>>     };
>>>>>
>>>>>     BUG_ON(!PageLocked(page));
>>>>>
>>>>>     if (!page_mapped(page))
>>>>>         return 0;
>>>>>
>>>>>     mapping = page_mapping(page);
>>>>>     if (!mapping)
>>>>>         return 0;
>>>>>
>>>>>     // rmap_walk needs to change to count mappings and return the value
>>>>>     // in .mapcount -- easy one
>>>>>     rmap_walk(page, &rwc);
>>>>>
>>>>>     // Big fat comment to explain what is going on
>>>>> +   if ((page_mapcount(page) - real_mapcount) > 0) {
>>>>> +       SetPageDMAPinned(page);
>>>>> +   } else {
>>>>> +       ClearPageDMAPinned(page);
>>>>> +   }
>>>>
>>>> This is the detail I'm not sure about: why can't rmap_walk_file() race
>>>> with e.g. zap_pte_range(), which decrements page->_mapcount, so that the
>>>> check we do in page_mkclean() is wrong?
>>>>
>>>
>>> Ok so i found a solution for that. First, GUP must wait for racing
>>> write back. If GUP sees a valid write-able PTE and the page has the
>>> write back flag set, then it must back off as if the PTE was not
>>> valid, to force a fault. It is just a race with page_mkclean and we
>>> want ordering between the two. Note this is not strictly needed,
>>> so we can relax that, but i believe this ordering is better to do
>>> in GUP rather than having every single user of GUP test for this
>>> to avoid the race.
>>>
>>> GUP increases the mapcount only after checking that it is not racing
>>> with writeback; it also sets a page flag (SetPageDMAPinned(page)).
>>>
>>> When clearing a write-able pte we set a special entry inside the
>>> page table (might need a new special swap type for this) and change
>>> page_mkclean_one() to clear those special entries to 0.
>>>
>>> Now page_mkclean:
>>>
>>> int page_mkclean(struct page *page)
>>> {
>>>     int cleaned = 0;
>>> +   int real_mapcount = 0;
>>>     struct address_space *mapping;
>>>     struct rmap_walk_control rwc = {
>>>         .arg = (void *)&cleaned,
>>>         .rmap_one = page_mkclean_one,
>>>         .invalid_vma = invalid_mkclean_vma,
>>> +       .mapcount = &real_mapcount,
>>>     };
>>> +   int mapcount1, mapcount2;
>>>
>>>     BUG_ON(!PageLocked(page));
>>>
>>>     if (!page_mapped(page))
>>>         return 0;
>>>
>>>     mapping = page_mapping(page);
>>>     if (!mapping)
>>>         return 0;
>>>
>>> +   mapcount1 = page_mapcount(page);
>>>     // rmap_walk needs to change to count mappings and return the value
>>>     // in .mapcount -- easy one
>>>     rmap_walk(page, &rwc);
>>
>> So what prevents GUP_fast() from grabbing a reference here, so that the
>> test below would think the page is not pinned? Or do you assume that every
>> page_mkclean() call will be protected by PageWriteback (currently it is
>> not) so that GUP_fast() blocks / bails out?

Continuing this thread, still focusing only on the "how to maintain a
PageDmaPinned flag for each page" question (ignoring, for now, what to
actually *do* in response to that flag being set):

1. Jan's point above is still a problem: PageWriteback != "page_mkclean is
happening". This is probably less troubling than the next point, but it does
undermine all the complicated schemes involving PageWriteback that try to
synchronize gup() with page_mkclean().
2. Also, the mapcount approach here still does not reliably avoid false
negatives (that is, a page may have been gup'd, but page_mkclean() could miss
that): gup() can always jump in and increment the mapcount while
page_mkclean() is in the middle of making (wrong) decisions based on that
mapcount. There's no lock to prevent that.

Again: mapcount can go up *or* down, so I'm not seeing a true solution yet.

>
> So GUP_fast() becomes:
>
> GUP_fast_existing() { ... }
>
> GUP_fast()
> {
>     GUP_fast_existing();
>
>     for (i = 0; i < npages; ++i) {
>         if (PageWriteback(pages[i])) {
>             // need to force slow path for this page
>         } else {
>             SetPageDmaPinned(pages[i]);
>             atomic_inc(&pages[i]->_mapcount);
>         }
>     }
> }
>
> This is a minor slow down for GUP fast and it takes care of the write back
> race on behalf of the caller. This means that page_mkclean() cannot see a
> mapcount value that increases. This simplifies things; we can relax that
> later. Note that what this is doing is making sure that GUP_fast never gets
> lucky :) ie it never GUPs a page that is in the process of being written
> back but has not yet had its pte updated to reflect that.
>
>
>> But I think that detecting pinned pages with a small false positive rate
>> is OK. The extra page bouncing will cost some performance but if it is
>> rare, then we are OK. So I think we can go for the simple version of
>> detecting pinned pages as you mentioned in some earlier email. We just
>> have to be sure there are no false negatives.
>

Agree with that sentiment, but there are still false negatives, and I'm not
yet seeing any solutions for that.

thanks,
--
John Hubbard
NVIDIA