On 12/12/18 9:49 AM, Dan Williams wrote:
> On Wed, Dec 12, 2018 at 9:02 AM Jerome Glisse <jglisse@xxxxxxxxxx> wrote:
>>
>> On Wed, Dec 12, 2018 at 08:27:35AM -0800, Dan Williams wrote:
>>> On Wed, Dec 12, 2018 at 7:03 AM Jerome Glisse <jglisse@xxxxxxxxxx> wrote:
>>>>
>>>> On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
>>>>> On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
>>>>>> Another crazy idea: why not treat GUP as another mapping of the
>>>>>> page, where the caller of GUP has to provide either a fake anon_vma
>>>>>> struct or a fake vma struct (or both, for a PRIVATE mapping of a
>>>>>> file, where you can have a mix of private and file pages; thus only
>>>>>> if it is a read-only GUP) that gets added to the list of existing
>>>>>> mappings.
>>>>>>
>>>>>> So the flow would be:
>>>>>>     somefunction_thatuse_gup()
>>>>>>     {
>>>>>>         ...
>>>>>>         GUP(_fast)(vma, ..., fake_anon, fake_vma);
>>>>>>         ...
>>>>>>     }
>>>>>>
>>>>>>     GUP(vma, ..., fake_anon, fake_vma)
>>>>>>     {
>>>>>>         if (vma->flags == ANON) {
>>>>>>             // Add the fake anon vma to the anon vma chain as a
>>>>>>             // child of the current vma
>>>>>>         } else {
>>>>>>             // Add the fake vma to the mapping tree
>>>>>>         }
>>>>>>
>>>>>>         // The existing GUP, except that it now increments mapcount,
>>>>>>         // not refcount
>>>>>>         GUP_old(..., &nanonymous, &nfiles);
>>>>>>
>>>>>>         atomic_add(nanonymous, &fake_anon->refcount);
>>>>>>         atomic_add(nfiles, &fake_vma->refcount);
>>>>>>
>>>>>>         return nanonymous + nfiles;
>>>>>>     }
>>>>>
>>>>> Thanks for your idea! This is actually something like what I was
>>>>> suggesting back at LSF/MM in Deer Valley. There were two downsides to
>>>>> this that I remember people pointing out:
>>>>>
>>>>> 1) This cannot really work with __get_user_pages_fast(). You're not
>>>>> allowed to take the locks necessary to insert a new entry into the
>>>>> VMA tree in that context. So essentially we'd lose
>>>>> get_user_pages_fast() functionality.
>>>>>
>>>>> 2) The overhead, e.g. for direct IO, may be noticeable. You need to
>>>>> allocate the fake tracking VMA, take the VMA interval tree lock, and
>>>>> insert into the tree. Then on IO completion you need to queue work to
>>>>> unpin the pages again, as you cannot remove the fake VMA directly
>>>>> from the interrupt context where the IO is completed.
>>>>>
>>>>> You are right that the cost could be amortized if gup() is called for
>>>>> multiple consecutive pages, but for small IOs there's no help...
>>>>>
>>>>> So this approach doesn't look like a win to me over using a counter
>>>>> in struct page, and I'd rather try looking into squeezing HMM public
>>>>> page usage of struct page so that we can fit that gup counter there
>>>>> as well. I know that it may be easier said than done...
>>>>
>>>> So I went back to the drawing board, and first I would like to
>>>> ascertain that we all agree on what the objectives are:
>>>>
>>>>     [O1] Avoid writeback of a page that is still being written to by
>>>>          a device, by direct I/O, or by any other existing user of
>>>>          GUP. This avoids possible filesystem corruption.
>>>>
>>>>     [O2] Avoid a crash when set_page_dirty() is called on a page that
>>>>          the core mm considers clean (the buffer heads have been
>>>>          removed, and with some filesystems this turns into an ugly
>>>>          mess).
>>>>
>>>>     [O3] The DAX and device block problems, i.e. with DAX the page
>>>>          mapped in userspace is the same as the block (persistent
>>>>          memory), and neither filesystems nor block devices understand
>>>>          pages as blocks, or as pinned blocks.
>>>>
>>>> For [O3] I don't think any pin count would help in any way.
>>>> I believe that the current long-term GUP API, which does not allow
>>>> GUP of DAX, is the only sane solution for now.
>>>
>>> No, that's not a sane solution, it's an emergency hack.
>>
>> Then how do you want to solve it? Knowing the pin count does not help
>> you, at least I do not see how it would help, and if it does, then my
>> solution lets you know the pin count: it is the difference between the
>> real mapping count and the mapcount value.
>
> True, the pin count doesn't help, and indefinite waits are intolerable,
> so I think we need to make "long term" GUP revocable, but otherwise
> hopefully use the put_user_page() scheme to replace the use of the pin
> count for dax_layout_busy_page().
>
>>>> The real fix would be to teach filesystems about DAX/pinned blocks
>>>> so that a pinned block is not reused by the filesystem.
>>>
>>> We have already taught filesystems about pinned dax pages, see
>>> dax_layout_busy_page(). As much as possible I want to eliminate the
>>> concept of "dax pages" as a special case that gets sprinkled
>>> throughout the mm.
>>>
>>>> For [O1] and [O2] I believe a solution with mapcount would work. So
>>>> no new struct, no fake vma, nothing like that. In GUP, for
>>>> file-backed pages
>>>
>>> With get_user_pages_fast() we don't know that we have a file-backed
>>> page, because we don't have a vma.
>>
>> You do not need a vma to know that; we have PageAnon() for that. So my
>> solution just adds this to the core GUP page table walker:
>>
>>     if (!PageAnon(page))
>>         atomic_inc(&page->_mapcount);
>
> Ah, OK, we would need to add proper mapcount manipulation for dax and
> audit that nothing makes page-cache assumptions based on a non-zero
> mapcount.
>
>> Then in put_user_page() you do the opposite. In page_mkclean() you
>> count the number of real mappings, and voilà ... you have an answer
>> for [O1]. You could use the same real-mapping count to get the pin
>> count in other places that care about it, but I fail to see why the
>> actual pin count value would matter to anyone.
>
> Sounds like it could work... the devil is in the details.
>

I also like this solution. It uses existing information and existing
pointers (PageAnon, page_mapping) plus a tiny bit more (another pointer
in struct address_space, probably, and that's not a size-contentious
struct) to figure out the DMA pinned count, which neatly avoids the
struct page contention that I ran into.

Thanks for coming up with this! I'll put together a patchset that shows
the details, so we can all take a closer look. This is at the top of my
list.
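For anyone following along, here is a minimal sketch of how I currently
read the proposal. The helper names (gup_pin_page(), page_is_gup_pinned())
are hypothetical, and the details will almost certainly change once the
patchset takes shape:

    /*
     * Sketch only: treat a GUP pin on a file-backed page as one extra
     * "mapping", so the pin count can later be recovered as the total
     * mapcount minus the number of real page table mappings.
     */

    /* In the GUP page table walker, for each page being pinned: */
    static void gup_pin_page(struct page *page)
    {
            if (!PageAnon(page))
                    atomic_inc(&page->_mapcount); /* pin counted as a mapping */
            get_page(page); /* existing refcount handling is unchanged */
    }

    /* Callers release pages with put_user_page() instead of put_page(): */
    void put_user_page(struct page *page)
    {
            if (!PageAnon(page))
                    atomic_dec(&page->_mapcount);
            put_page(page);
    }

    /*
     * page_mkclean()'s rmap walk visits every real page table mapping,
     * so counting those visits and comparing against page_mapcount()
     * exposes any outstanding GUP pins:
     */
    static bool page_is_gup_pinned(struct page *page, int real_mappings_seen)
    {
            return page_mapcount(page) > real_mappings_seen;
    }

That is, for [O1], writeback would skip or wait on any page for which
page_is_gup_pinned() returns true, rather than writing back a page that a
device or direct I/O is still filling in.

--
thanks,
John Hubbard
NVIDIA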