Will Deacon's on August 29, 2019 2:12 am:
> Hi Nick,
>
> On Wed, Aug 28, 2019 at 10:35:24AM +1000, Nicholas Piggin wrote:
>> Will Deacon's on August 27, 2019 11:18 pm:
>> > So far, so good, but the final piece of the puzzle isn't quite so rosy.
>> >
>> > *** Other architecture maintainers -- start here! ***
>> >
>> > In the case that one CPU maps a page and then sets a flag to tell another
>> > CPU:
>> >
>> >         CPU 0
>> >         -----
>> >
>> >         MOV     X0, <valid pte>
>> >         STR     X0, [Xptep]     // Store new PTE to page table
>> >         DSB     ISHST
>> >         ISB
>> >         MOV     X1, #1
>> >         STR     X1, [Xflag]     // Set the flag
>> >
>> >         CPU 1
>> >         -----
>> >
>> > loop:   LDAR    X0, [Xflag]     // Poll flag with Acquire semantics
>> >         CBZ     X0, loop
>> >         LDR     X1, [X2]        // Translates using the new PTE
>> >
>> > then the final load on CPU 1 can raise a translation fault for the same
>> > reasons as mentioned at the start of this description.
>>
>> powerpc's ptesync instruction is defined to order MMU memory accesses on
>> all other CPUs. ptesync does not go out to the fabric though. How does
>> it work then?
>>
>> Because the MMU coherency problem (at least the one we have) is not
>> that the load will begin to "partially" execute ahead of the store,
>> enough to kick off a table walk that goes ahead of the store, but not
>> so much that it violates the regular CPU barriers. It's just that the
>> loads from the MMU don't participate in the LSU pipeline: they don't
>> snoop the store queues and aren't inserted into load queues, so the
>> regular memory barrier instructions won't see stores from other
>> threads causing ordering violations.
>>
>> In your first example, if powerpc just has a normal memory barrier
>> there instead of a ptesync, it could all execute completely
>> non-speculatively and in-order but still cause a fault, because the
>> table walker's loads didn't see the store in the store queue.
>
> Ah, so I think this is where our DSB comes in, as opposed to our usual
> DMB. DMB provides ordering guarantees and is generally the only barrier
> instruction you need for shared memory communication. DSB, on the other
> hand, has additional properties such as making page-table updates visible
> to the table walker and completing pending TLB invalidation.
>
> So in practice, DSB is likely to drain the store buffer to ensure that
> pending page-table writes are visible at L1, which is coherent with all
> CPUs (and their page-table walkers).
>
>> From the other side of the fabric you have no such problem. The table
>> walker is cache coherent apart from the local stores, so we don't need a
>> special barrier on the other side. That's why ptesync doesn't broadcast.
>
> Curious: but do you need to do anything extra to take into account
> instruction fetch on remote CPUs if you're mapping an executable page?
> We added an IPI to flush_icache_range() in 3b8c9f1cdfc5 to handle this,
> because our broadcast I-cache maintenance doesn't force a pipeline flush
> for remote CPUs (and may even execute as a NOP on recent cores).

Ah, I think the tlbie does not force re-fetch indeed. We may need
something like that as well. What do you do on the user side? Require
threads to ISB themselves?
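For what it's worth, the kind of thing I imagine we would need is
roughly the sketch below. This is purely illustrative -- it is not the
actual arm64 code from 3b8c9f1cdfc5, and the function names are
invented -- but the idea would be that broadcast cache maintenance
alone is not enough, and each remote CPU also has to pass through a
context synchronizing event before it can safely execute from the new
mapping:

	/*
	 * Illustrative sketch only (invented names): after writing the
	 * new executable mapping and doing the usual cache maintenance,
	 * IPI every other CPU so that it performs a context
	 * synchronizing operation and discards any stale
	 * speculatively-fetched instructions.
	 */
	#include <linux/smp.h>
	#include <asm/barrier.h>

	static void ipi_context_sync(void *unused)
	{
		isb();	/* isb on arm64; isync would be the powerpc analogue */
	}

	static void sync_pipeline_all_cpus(void)
	{
		/* Run ipi_context_sync() on every online CPU and wait. */
		on_each_cpu(ipi_context_sync, NULL, 1);
	}

In practice I guess the exception entry/return for the IPI itself may
already be context synchronizing, which would make the explicit isb
redundant, but it makes the intent clearer here.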
>> I would be surprised if ARM's issue is different, but interested to
>> hear if it is.
>
> If you take the four scenarios of Map/Unmap for the UP/SMP cases:
>
>         Map+UP          // Some sort of fence instruction (DSB;ISB/ptesync)
>         Map+SMP         // Same as above
>         Unmap+UP        // Local TLB invalidation
>         Unmap+SMP       // Broadcast TLB invalidation
>
> then the most interesting case is Map+SMP, where one CPU transitions a PTE
> from invalid to valid and executes its DSB;ISB/PTESYNC sequence without
> affecting the remote CPU. That's what I'm trying to get at in my example
> below:
>
>> >         CPU 0                           CPU 1
>> >         -----                           -----
>> >         spin_lock(&lock);               spin_lock(&lock);
>> >         set_fixmap(0, paddr, prot);     if (mapped)
>> >         mapped = true;                          foo = *fix_to_virt(0);
>> >         spin_unlock(&lock);             spin_unlock(&lock);
>> >
>> > could fault.
>>
>> This is kind of a different issue, or part of a wider one at least.
>> Consider speculative execution more generally, any branch mispredict can
>> send us off to crazy town executing instructions using nonsense register
>> values. CPU0 does not have to be in the picture, or any kernel page
>> table modification at all, CPU1 alone will be doing speculative loads
>> wildly all over the kernel address space and trying to access pages with
>> no pte.
>>
>> Yet we don't have to flush TLB when creating a new kernel mapping, and
>> we don't get spurious kernel faults. The page table walker won't
>> install negative entries, at least not "architectural" i.e., that cause
>> faults and require flushing. My guess is ARM is similar, or you would
>> have seen bigger problems by now?
>
> Right, we don't allow negative (invalid) entries to be cached in the TLB,
> but CPUs can effectively cache the result of a translation for a load/store
> instruction whilst that instruction flows down the pipe after its virtual
> address is known. /That/ caching is not restricted to valid translations.

Ah, I misunderstood you, sorry. Yeah that is interesting, I don't think
that is explicitly prohibited in the power ISA either. I believe CPU1
would have to do a ptesync to avoid the fault there.

> For example, if we just take a simple message passing example where we have
> two global variables: a 'mapped' flag (initialised to zero) and a pointer
> (initialised to point at a page that is not yet mapped):
>
> [please excuse the mess I've made of your assembly language]
>
>         CPU 0
>
>         // Set PTE which maps the page pointed to by pointer
>         stw     r5, 0(r4)
>         ptesync
>         lwsync
>
>         // Set 'mapped' flag to 1
>         li      r9, 1
>         stb     r9, 0(r3)
>
>
>         CPU 1
>
>         // Poll for 'mapped' flag to be set
> loop:   lbz     r9, 0(r3)
>         lwsync
>         cmpdi   cr7, r9, 0
>         beq     cr7, loop
>
>         // Dereference pointer
>         lwz     r4, 0(r5)
>
>
> So in this example, I think it's surprising that CPU 1's dereference of
> 'pointer' can fault. The way this happens on arm64 is that CPU 1 can
> translate the 'pointer' dereference before the load of the 'mapped' flag has
> returned its data. The walker may come back with a fault, but then if the
> flag data later comes back as 1, the fault will be taken when the lwz
> commits. In other words, translation table walks can occur out-of-order
> with respect to the accesses they are translating for, even in the presence
> of memory barriers.
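Just to check my understanding, in C the pattern you are describing is
basically the sketch below (hypothetical; 'mapped', 'ptr', writer() and
reader() are invented names, and the PTE store is reduced to a plain
WRITE_ONCE()). Even with a release/acquire pair on the flag, the
translation for the final dereference can be performed early, so the
reader can still take a translation fault:

	#include <linux/types.h>
	#include <linux/compiler.h>	/* WRITE_ONCE() */
	#include <asm/barrier.h>	/* smp_store_release()/smp_load_acquire() */
	#include <asm/processor.h>	/* cpu_relax() */

	static int mapped;	/* the 'mapped' flag, initially 0 */
	static int *ptr;	/* points into the not-yet-mapped page */

	/* CPU 0: publish the new PTE, then set the flag. */
	static void writer(u64 *ptep, u64 pteval)
	{
		WRITE_ONCE(*ptep, pteval);	/* stw: store the valid PTE */
		/* arch barrier for the walker goes here: DSB;ISB / ptesync */
		smp_store_release(&mapped, 1);	/* stb: set the flag */
	}

	/* CPU 1: poll the flag, then dereference the new mapping. */
	static int reader(void)
	{
		while (!smp_load_acquire(&mapped))	/* lbz + lwsync */
			cpu_relax();

		/*
		 * The acquire orders the data accesses, but the table walk
		 * for this load may already have been performed (and
		 * faulted) before the flag was observed as 1; the fault is
		 * then delivered when the load commits.
		 */
		return *ptr;			/* lwz: can still fault */
	}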
> In practice, I think this kind of code sequence is unusual and triggering
> the problem relies on CPU 1 knowing the virtual address it's going to
> dereference before it's actually mapped. However, this could conceivably
> happen with the fixmap and possibly also if the page-table itself was
> being written concurrently using cmpxchg(), in which case you might use
> the actual pte value in the same way as the 'mapped' flag above.
>
> But yes, adding the spurious fault handler is belt and braces, which is
> why I've kept a WARN_RATELIMIT() in there if it ever triggers.

This could be a theoretical problem for powerpc too, I think. Maybe.
I'll ask around, I might not be understanding the architecture or
Linux code properly.

Thanks,
Nick