On Thu, Apr 19, 2018 at 3:44 AM, Jan Kara <jack@xxxxxxx> wrote:
> On Fri 13-04-18 15:03:51, Dan Williams wrote:
>> On Mon, Apr 9, 2018 at 9:51 AM, Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>> > On Mon, Apr 9, 2018 at 9:49 AM, Jan Kara <jack@xxxxxxx> wrote:
>> >> On Sat 07-04-18 12:38:24, Dan Williams wrote:
>> > [..]
>> >>> I wonder if this can be trivially solved by using srcu. I.e. we don't
>> >>> need to wait for a global quiescent state, just a
>> >>> get_user_pages_fast() quiescent state. ...or is that an abuse of the
>> >>> srcu api?
>> >>
>> >> Well, I'd rather use the percpu rwsemaphore (linux/percpu-rwsem.h) than
>> >> SRCU. It is a more-or-less standard locking mechanism rather than relying
>> >> on implementation properties of SRCU, which is a data structure protection
>> >> method. And the overhead of the percpu rwsemaphore for your use case
>> >> should be about the same as that of SRCU.
>> >
>> > I was just about to ask that. Yes, it seems they would share similar
>> > properties and it would be better to use the explicit implementation
>> > rather than a side effect of srcu.
>>
>> ...unfortunately:
>>
>>   BUG: sleeping function called from invalid context at
>>   ./include/linux/percpu-rwsem.h:34
>>   [..]
>>   Call Trace:
>>    dump_stack+0x85/0xcb
>>    ___might_sleep+0x15b/0x240
>>    dax_layout_lock+0x18/0x80
>>    get_user_pages_fast+0xf8/0x140
>>
>> ...and thinking about it more, srcu is a better fit. We don't need the
>> 100% exclusion provided by an rwsem; we only need the guarantee that
>> all cpus that might have been running get_user_pages_fast() have
>> finished it at least once.
>>
>> In my tests synchronize_srcu is a bit slower than unpatched for the
>> trivial 100 truncate test, but certainly not the 200x latency you were
>> seeing with synchronize_rcu.
>>
>> Elapsed time:
>> 0.006149178 unpatched
>> 0.009426360 srcu
>
> Hum, right. Yesterday I was looking into KSM for a different reason and
> I've noticed it also write-protects pages and deals with races with GUP.
> And what KSM relies on is:
>
>   write_protect_page()
>     ...
>     entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
>     /*
>      * Check that no O_DIRECT or similar I/O is in progress on the
>      * page
>      */
>     if (page_mapcount(page) + 1 + swapped != page_count(page)) {
>       page used -> bail

Slick.

>     }
>
> And this really works because gup_pte_range() does:
>
>   page = pte_page(pte);
>   head = compound_head(page);
>
>   if (!page_cache_get_speculative(head))
>     goto pte_unmap;
>
>   if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>     bail

Need to add a similar check to __gup_device_huge_pmd().

>   }
>
> So either write_protect_page() sees the elevated reference, or
> gup_pte_range() bails because it sees the pte changed.
>
> In the truncate path things are a bit different, but in principle the
> same should work - once truncate blocks page faults and unmaps pages from
> the page tables, we can be sure GUP will not grab the page anymore, or
> we'll see the elevated page count. So IMO there's no need for any
> additional locking against the GUP path (but a comment explaining this is
> highly desirable I guess).

Yes, those "pte_val(pte) != pte_val(*ptep)" checks should be documented
for the same reason we require comments on rmb/wmb pairs.

I'll take a look, thanks Jan.
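
To make that concrete, the kind of comment I have in mind for the
existing gup_pte_range() re-check is roughly the below (just a sketch,
wording up for debate):

	/*
	 * Pairs with the pte clear + page_count() check in the
	 * write-protect / truncate paths: either we took our page
	 * reference before the pte was cleared, in which case the
	 * other side observes the elevated count, or the pte changed
	 * underneath us and we drop the reference and bail.
	 */
	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
		put_page(head);
		goto pte_unmap;
	}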
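
...and for the __gup_device_huge_pmd() side, something along these lines
(completely untested sketch; it assumes the caller threads the pmd_t
pointer down so the entry can be re-validated after the device pages have
been reference-counted, and it unwinds with undo_dev_pagemap() the same
way the missing-pagemap case does):

static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
		unsigned long end, struct page **pages, int *nr)
{
	unsigned long fault_pfn;
	int nr_start = *nr;

	fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
		return 0;

	/*
	 * Re-check the pmd after the device pages have been counted,
	 * same as gup_pte_range(): if the entry changed, drop the
	 * references so the unmap / write-protect side wins the race.
	 */
	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
		undo_dev_pagemap(nr, nr_start, pages);
		return 0;
	}

	return 1;
}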