On Tue, Jul 12, 2022 at 06:09:35PM -0700, Nadav Amit wrote: > On Jul 12, 2022, at 7:56 AM, Peter Xu <peterx@xxxxxxxxxx> wrote: > > > Hi, Nadav, > > > > On Tue, Jul 12, 2022 at 06:19:08AM +0000, Nadav Amit wrote: > >> On Jun 22, 2022, at 11:50 AM, Nadav Amit <nadav.amit@xxxxxxxxx> wrote: > >> > >>> From: Nadav Amit <namit@xxxxxxxxxx> > >>> > >>> Using a PTE on x86 with cleared access-bit (aka young-bit) > >>> takes ~600 cycles more than when the access bit is set. At the same > >>> time, setting the access-bit for memory that is not used (e.g., > >>> prefetched) can introduce greater overheads, as the prefetched memory is > >>> reclaimed later than it should be. > >>> > >>> Userfaultfd currently does not set the access-bit (excluding the > >>> huge-pages case). Arguably, it is best to let the user control whether > >>> the access bit should be set or not. The expected use is to request > >>> userfaultfd to set the access-bit when the copy/wp operation is done to > >>> resolve a page-fault, and not to set the access-bit when the memory is > >>> prefetched. > >>> > >>> Introduce UFFDIO_[op]_ACCESS_LIKELY to enable userspace to request the > >>> young bit to be set. > >> > >> I reply to my own email, but this mostly addresses the concerns that Peter > >> has raised. > >> > >> So I ran the test below on my Haswell (x86), which showed two things: > >> > >> 1. Accessing an address using a clean PTE or old PTE takes ~500 cycles > >> more than with dirty+young (depending on the access, of course: dirty > >> does not matter for read, dirty+young both matter for write). > >> > >> 2. I made a mistake in my implementation. PTEs are - at least on x86 - > >> created as young with mk_pte(). So the logic should be similar to > >> do_set_pte(): > >> > >> if (prefault && arch_wants_old_prefaulted_pte()) > >> entry = pte_mkold(entry); > >> else > >> entry = pte_sw_mkyoung(entry); > >> > >> Based on these results, I will send another version for both young and > >> dirty. Let me know if these results are not convincing. > > > > Thanks for trying to verify this idea, but I'm not fully sure this is what > > my concern was on WRITE_LIKELY. > > > > AFAICT the test below was trying to measure the overhead of hardware > > setting either access or dirty or both bits when they're not set for > > read/write. > > Indeed. > > > > > What I wanted as a justification is whether WRITE_LIKELY would be helpful > > in any real world scenario at all. AFAIK the only way to prove it so far > > is to measure any tlb flush difference (probably only on x86, since that > > tlb code is only compiled on x86) that may trigger with W=0,D=1 but may not > > trigger with W=0,D=0 (where W stands for "write bit", and D stands for > > "dirty bit"). > > > > It's not about the slowness when D is cleared. > > > > The core thing is (sorry to rephrase, but just hope we're on the same page) > > we'll set D bit always for all uffd pages so far. Even if we want to > > change that behavior so we skip setting D bit for RO pages (we'll need to > > convert the dirty bit into PageDirty though), we'll still always set D bit > > for writable pages. So we always set D bit as long as possible and we'll > > never suffer from hardware overhead on setting D bit for uffd pages. > > Thanks as usual for your clarifications. As you see, I also do my best to be > on the same page with, even if from time to time I fail. I had some recent > communication challenges on lkml, so I hope that you understand that > everything I say is said with full respect, and if I use double-quotes while > arguing with you, it is in good spirit, and I really appreciate your > feedback. > > Ok. So there is a lot to digest in what you just said, and I politely > disagree with some of the assertions that you made. You focus on discussing > the issue of whether we set the dirty bit for RO pages, which in my opinion > is so intuitively wrong. But, I think that discussing this issue really > digress us from the benefits of not setting the D-bit when it is unnecessary > for RW pages, which is the main question. > > But before we get to it, I want to argue with some of the “facts” that you > present: > > 1. "D bit always for all uffd pages” - This is true for almost all UFFD > operations, but not true to write-unprotect. The moment we use David’s > MM_CP_TRY_CHANGE_WRITABLE in UFFD, the PTE would be writable but not dirty. > [Yes, the patches that I sent do not deal with that: as I noted before, I > want to send it as part of v2.] Arguably, one of the places that setting the > D-bit matters the most if when you change PTE from RO->RW. And anyhow, as > you can see the API is inconsistent. Yes. This point is WRITE_HINT irrelevant, afaict, but I fully agree we should fix it. It'll be great if you'll cover that upon David's patch. [1] > > 2. "we'll still always set D bit for writable pages” - To continue my > previous bullet - why? Why would we? Besides uffd-wrunprotect path, why > would we always set it for MCOPY_CONTINUE? (more to follow on this one) I wanted to mean that we used to always set both D bits when W=1. IIUC that applies to not only uffd but any general pgtable accesses. E.g. in do_anonymous_page() we'll set both D if W=1. Same to anywhere I can find across the kernel. > > 3. "measure any tlb flush difference … that may trigger with W=0,D=1 but may > not trigger with W=0,D=0” - This is really a boring case, which even if was > underoptimized could have been resolved by its own. This is certainly not > the motivation for the write hints. Hmm? I don't see what benefit we can get from the new WRITE_HINT besides avoiding tlb flushes, or do we? See right below.. > > > So please allow me to go back to the reasons for why you want write-hints, > and RO entries would be mostly left out and implicitly included in the > other arguments - which are mainly about RW entries: > > 1. You can avoid TLB flushes when write-protecting clean writable PTEs (on > x86). Such PTEs can be created when userspace monitor prefaults or > speculatively write-unprotects memory that might be needed. When a monitor > removes the mapping, using MADV_DONTNEED, it would not need to flush > anything. Yep, but afaict this is the "avoid tlb flush benefit" we talked above.. > > 2. Hopefully you agree that write-hints are needed if during > uffd-write-unprotect I think I agree on the unprotect above [1] on fixing D bit set if unprotected. If with that fix then I think whether WRITE_HINT would help or not here for unprotect is the same as whether WRITE_HINT would help in other cases like UFFDIO_COPY. So these are two problems to me: (1) Whether we should fix change_pte_range() to set D properly when unprotect (W=1), again I fully agree. (2) Whether WRITE_HINT helps for unprotect is to be discussed here, and we're trying to reach a consensus.. > we actually write-unprotect the PTE (not just clearing > the logical flag). So you do need write-hint for zero-page (to get a > clear-page) and for write-unprotect, so why not to be consistent and provide > it for all operations? I always agree on the zeropage case. But IMHO that's the only special case here. > > 3. For UFFDIO_CONTINUE. Admittedly, I am not very familiar with > UFFDIO_CONTINUE, but from the discussion (and skimming the code) I > understood that you can be used for prefetching. In such case, why would you > assume that the page is dirty? Again, I think it's the same as when we faulting in a normal anonymous page: when we grant write bit we always assume it's dirty. Yes it may not really be written, but don't you think that's a more common question to ask rather than uffd only? IOW, why we always set D bit in do_anonymous_page() even if there's possibility that the page may not be written at all? Actually I'd say we'll have problem if we don't set dirty, because then we won't be able to track soft-dirty anymore just like below - we need an explicit page fault to trap soft-dirty. With W=1, how do we do that? > > 4. It allows you to treat softdirty properly. If softdirty is on, > presumably, if you did not get a WRITE_HINT, you would keep it > writeprotected. Sorry I don't get why WRITE_HINT helps soft-dirty. As I mentioned before, I think it will start to complicate soft-dirty rather than making it better. Basically when the hints are enabled in feature bits and when the user has !WRITE_HINT, we'll need to wr-protect the page for soft-dirty if it's enabled. We don't need that extra complexity if we don't apply WRITE_HINT outside ZEROPAGE. That's really even more than "hardware setting D bit overhead" because it'll be a full path page fault just to trap soft-dirty. I am not sure we're really on the same page here, but I think I'll just wait and read your code if it's coming and comment there if I see anything wrong. Maybe that'll be more straightforward. > > 5. Consistency with UFFDIO_ZERO that needs a write-hint to clear a page. This one is actually point 2 above. Taken. > > Now I will “kill myself” over support of write-hints. Not everything that > I mentioned here is in v1 that I sent. > > But, if you decide you do not want write-hints, I would still need to leave > the UFFDIO_ZERO write-hints (that clear the page) and to find some solution > for UFFDIO_WRITEPROTECT (that should set the dirty-bit for better performance > of WP page-fault handling). > > If you decide you do want it, I would run some tests to check that indeed > the access-time of UFFDIO_WRITEPROTECT reflects the fact no TLB flush was > needed. > > > The other worry of having WRITE_HINT is, after we have it we probably need > > to _not_ apply dirty bit when WRITE_HINT is not set (which is actually a > > very light ABI change since we used to always set it), then I'll start to > > worry the hardware setting D bit overhead you just measured because we'll > > have that overhead when user didn't specify WRITE_HINT with the old code. > > Good point, which I have already got to. So, because I was mistaken on the > ACCESS_HINT understanding (young-bit is set by default), this would mean > that the hints should only be regarded when the “hints” feature is enabled > for backward compatibility. I got this code ready for v2. Sounds good. > > > > > So again, I'm totally fine if you want to start with ACCESS_HINT only, but > > I still don't see why we should need WRITE_HINT too.. > > I really hate how the “features” evolve that I think it is better to decide > now. > > I think that in the lack of agreement, the best thing to do is to put all of > the write-hints now in the API, and only regard them for UFFDIO_ZERO (which > would clear the page) and UFFDIO_WRITE(un)PROTECT (that would set the old > bit). Yes let's fix unprotect on D bit first, because that's inconsistent. Please put it as the 1st patch if you plan to post them together, I think we should have it even earlier if not together with the uffd-hint series. Then, there's one thing we can do as you said - we can introduce WRITE_HINT but only apply it to ZEROPAGE only (note: I still don't see why unprotect is special here, it just has one perf bug to fix). It can be bound to ACCESS_HINT with the same feature bit then we think about whether we want to extend it to other uffd ioctls besides ZEROPAGE. That's probably a good split on the differences. > > Again, thanks for the feedback, and hopefully (at least) I understood you > this time. What you suggested at the end sounds good to me, thanks for being persistent. The only thing uncertain seems to be whether we need WRITE_HINT for UFFDIO_WRITEPROTECT, all the rest sounds a good plan. Thanks, -- Peter Xu