On Tue 18-12-18 21:07:24, Jerome Glisse wrote: > On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote: > > OK, so let's take another look at Jerome's _mapcount idea all by itself (using > > *only* the tracking pinned pages aspect), given that it is the lightest weight > > solution for that. > > > > So as I understand it, this would use page->_mapcount to store both the real > > mapcount, and the dma pinned count (simply added together), but only do so for > > file-backed (non-anonymous) pages: > > > > > > __get_user_pages() > > { > > ... > > get_page(page); > > > > if (!PageAnon) > > atomic_inc(page->_mapcount); > > ... > > } > > > > put_user_page(struct page *page) > > { > > ... > > if (!PageAnon) > > atomic_dec(&page->_mapcount); > > > > put_page(page); > > ... > > } > > > > ...and then in the various consumers of the DMA pinned count, we use page_mapped(page) > > to see if any mapcount remains, and if so, we treat it as DMA pinned. Is that what you > > had in mind? > > Mostly, with the extra two observations: > [1] We only need to know the pin count when a write back kicks in > [2] We need to protect GUP code with wait_for_write_back() in case > GUP is racing with a write back that might not the see the > elevated mapcount in time. > > So for [2] > > __get_user_pages() > { > get_page(page); > > if (!PageAnon) { > atomic_inc(page->_mapcount); > + if (PageWriteback(page)) { > + // Assume we are racing and curent write back will not see > + // the elevated mapcount so wait for current write back and > + // force page fault > + wait_on_page_writeback(page); > + // force slow path that will fault again > + } > } > } This is not needed AFAICT. __get_user_pages() gets page reference (and it should also increment page->_mapcount) under PTE lock. So at that point we are sure we have writeable PTE nobody can change. So page_mkclean() has to block on PTE lock to make PTE read-only and only after going through all PTEs like this, it can check page->_mapcount. So the PTE lock provides enough synchronization. > For [1] only needing pin count during write back turns page_mkclean into > the perfect spot to check for that so: > > int page_mkclean(struct page *page) > { > int cleaned = 0; > + int real_mapcount = 0; > struct address_space *mapping; > struct rmap_walk_control rwc = { > .arg = (void *)&cleaned, > .rmap_one = page_mkclean_one, > .invalid_vma = invalid_mkclean_vma, > + .mapcount = &real_mapcount, > }; > > BUG_ON(!PageLocked(page)); > > if (!page_mapped(page)) > return 0; > > mapping = page_mapping(page); > if (!mapping) > return 0; > > // rmap_walk need to change to count mapping and return value > // in .mapcount easy one > rmap_walk(page, &rwc); > > // Big fat comment to explain what is going on > + if ((page_mapcount(page) - real_mapcount) > 0) { > + SetPageDMAPined(page); > + } else { > + ClearPageDMAPined(page); > + } This is the detail I'm not sure about: Why cannot rmap_walk_file() race with e.g. zap_pte_range() which decrements page->_mapcount and thus the check we do in page_mkclean() is wrong? > > // Maybe we want to leverage the int nature of return value so that > // we can express more than cleaned/truncated and express cleaned/ > // truncated/pinned for benefit of caller and that way we do not > // even need one bit as page flags above. > > return cleaned; > } > > You do not want to change page_mapped() i do not see a need for that. > > Then the whole discussion between Jan and Dave seems to indicate that > the bounce mechanism will need to be in the fs layer and that we can > not reuse the bio bounce mechanism. This means that more work is needed > at the fs level for that (so that fs do not freak on bounce page). > > Note that they are few gotcha where we need to preserve the pin count > ie mostly in truncate code path that can remove page from page cache > and overwrite the mapcount in the process, this would need to be fixed > to not overwrite mapcount so that put_user_page does not set the map > count to an invalid value turning the page into a bad state that will > at one point trigger kernel BUG_ON(); > > I am not saying block truncate, i am saying make sure it does not > erase pin count and keep truncating happily. The how to handle truncate > is a per existing GUP user discussion to see what they want to do for > that. > > Obviously a bit deeper analysis of all spot that use mapcount is needed > to check that we are not breaking anything but from the top of my head > i can not think of anything bad (migrate will abort and other things will > assume the page is mapped even it is only in hardware page table, ...). Hum, grepping for page_mapped() and page_mapcount(), this is actually going to be non-trivial to get right AFAICT. Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR