Excerpts from Andy Lutomirski's message of July 16, 2020 3:18 pm:
>
>
>> On Jul 15, 2020, at 9:15 PM, Nicholas Piggin <npiggin@xxxxxxxxx> wrote:
>>
>> Excerpts from Mathieu Desnoyers's message of July 14, 2020 12:13 am:
>>> ----- On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npiggin@xxxxxxxxx wrote:
>>>
>>>> Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm:
>>>>> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am:
>>>>>> Also, as it stands, I can easily see in_irq() ceasing to promise to
>>>>>> serialize. There are older kernels for which it does not promise to
>>>>>> serialize. And I have plans to make it stop serializing in the
>>>>>> nearish future.
>>>>>
>>>>> You mean x86's return from interrupt? Sounds fun... you'll know where
>>>>> to update the membarrier sync code, at least :)
>>>>
>>>> Oh, I should actually say Mathieu recently clarified a return from
>>>> interrupt doesn't fundamentally need to serialize in order to support
>>>> membarrier sync core.
>>>
>>> Clarification to your statement:
>>>
>>> Return from interrupt to kernel code does not need to be context
>>> serializing as long as kernel serializes before returning to
>>> user-space.
>>>
>>> However, return from interrupt to user-space needs to be context
>>> serializing.
>>
>> Hmm, I'm not sure it's enough even with the sync in exit_lazy_tlb in
>> the right places.
>>
>> A kernel thread does a use_mm, then it blocks and the user process with
>> the same mm runs on that CPU, and then it calls into the kernel, blocks,
>> the kernel thread runs again, another CPU issues a membarrier which does
>> not IPI this one because it's running a kthread, and then the kthread
>> switches back to the user process (still without having unused the mm),
>> and then the user process returns from syscall without having done a
>> core synchronising instruction.
>>
>> The cause of the problem is you want to avoid IPI'ing kthreads. Why?
>> I'm guessing it really only matters as an optimisation in case of idle
>> threads. Idle thread is easy (well, easier) because it won't use_mm, so
>> you could check for rq->curr == rq->idle in your loop (in a suitable
>> sched accessor function).
>>
>> But... I'm not really liking this subtlety in the scheduler for all this
>> (the scheduler still needs the barriers when switching out of idle).
>>
>> Can it be improved somehow? Let me forget the x86 core sync problem for
>> now (that _may_ be a bit harder), and step back and look at what we're
>> doing. The memory barrier case would actually suffer from the same
>> problem as core sync, because in the same situation it has no implicit
>> mmdrop in the scheduler switch code either.
>>
>> So what are we doing with membarrier? We want any activity caused by
>> the set of CPUs/threads specified that can be observed by this thread
>> before calling membarrier to be appropriately fenced from activity that
>> can be observed to happen after the call returns.
>>
>> CPU0                     CPU1
>>                          1. user stuff
>> a. membarrier()          2. enter kernel
>> b. read rq->curr         3. rq->curr switched to kthread
>> c. is kthread, skip IPI  4. switch_to kthread
>> d. return to user        5. rq->curr switched to user thread
>>                          6. switch_to user thread
>>                          7. exit kernel
>>                          8. more user stuff
>>
>> As far as I can see, the problem is CPU1 might reorder step 5 and step
>> 8, which is why you have the mmdrop of the lazy mm be a mb after step 6.
>>
>> But why? The membarrier call only cares that there is a full barrier
>> between 1 and 8, right? Which it will get from the previous context
>> switch to the kthread.
>>
>> I must say the memory barrier comments in membarrier could be improved
>> a bit (unless I'm missing where the main comment is). It's fine to know
>> what barriers pair with one another, but we need to know which exact
>> memory accesses are being ordered:
>>
>>         /*
>>          * Matches memory barriers around rq->curr modification in
>>          * scheduler.
>>          */
>>
>> Sure, but it doesn't say what else is being ordered. I think it's just
>> the user memory accesses, but it would be nice to make that a bit more
>> explicit. If we had such comments then we might know this case is safe.
>>
>> I think the funny powerpc barrier is a similar case of this. If we
>> ever see remote_rq->curr->flags & PF_KTHREAD, then we _know_ that
>> CPU has or will have issued a memory barrier between running user
>> code.
>>
>> So AFAIKS all this membarrier stuff in kernel/sched/core.c could
>> just go away. Except on x86, because thread switch doesn't imply core
>> sync, so CPU1 may never issue a core sync instruction between 1 and 8,
>> whereas a context switch must always be a full mb.
>>
>> Before getting to x86 -- Am I right, or way off track here?
>
> I find it hard to believe that this is x86 only. Why would thread
> switch imply core sync on any architecture? Is x86 unique in having a
> stupid expensive core sync that is heavier than smp_mb()?

It's not the thread switch but the return from kernel to user -- at
least among the architectures that implement membarrier SYNC_CORE, x86
can do that return without serializing.

The thread switch is muddying the waters a bit; it's not the actual
thread switch we care about, that just happens to be used as the point
where we try to catch the membarrier IPIs that were skipped due to the
PF_KTHREAD optimisation.

I think that doing said check in the lazy tlb exit code is both
unnecessary for the memory ordering and insufficient for pipeline
serialization.
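
To make that concrete, the property I think we actually want lives on
the exit-to-user path. Very rough illustration, not real entry code and
uncompiled -- the TIF flag name is invented, and sync_core() stands in
for whatever context-serializing instruction the arch uses (e.g. IRET
or CPUID on x86):

	/*
	 * The context switch already acts as smp_mb(), so the plain
	 * memory-ordering side of membarrier is covered by it.
	 * SYNC_CORE additionally needs a context-serializing
	 * instruction somewhere between membarrier() observing this
	 * CPU and the next user instruction running -- and
	 * exit-to-user is the natural place for that, not the lazy
	 * tlb switch, which can run long before we actually return.
	 */
	static void exit_to_user_mode_sketch(void)
	{
		if (test_and_clear_thread_flag(TIF_SYNC_CORE_PENDING))
			sync_core();
		/* ... arch specific return to user mode ... */
	}

Done that way, it wouldn't matter how many kthread switches happened
in between.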
> But I’m wondering if all this deferred sync stuff is wrong. In the
> brave new world of io_uring and such, perhaps kernel accesses matter
> too. Heck, even:
>
> int a[2];
>
> Thread A:
> a[0] = 1;
> a[1] = 2;
>
> Thread B:
>
> write(fd, a, sizeof(a));
>
> Doesn’t do what thread A is expecting. Admittedly this particular
> example is nonsense, but maybe there are sensible cases that matter
> to someone.

I think kernel accesses probably do matter (or at least they should,
by the principle of least surprise), so I was doubly misleading by
labeling it "user stuff". I should have distinguished between previous
user or kernel accesses, as opposed to the kernel accesses done
specifically to implement the membarrier call itself.

So I think the membarrier code gets *that* part right (modulo what we
have seen already) if the kernel access is being done from process
context.

But yes, if the access is coming from io_uring after it has done
kthread_use_mm, or from some other random code running in a kernel
thread that works on get_user_pages memory or any similar shared
vm_insert_pfn memory, then it goes completely to hell. So good catch:
the PF_KTHREAD check is problematic there even if no actual users
exist today.

The rq->curr == rq->idle test might be better, but can we have
interrupts writing completions into user memory? For performance I
would hope so, and that makes even that test problematic.

Maybe membarrier should close that gap entirely, and work around the
performance issue by adding _USER_ONLY flags which explicitly order
only user mode accesses against other user accesses (rough sketch at
the end of this mail).

Thanks,
Nick
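
---

For illustration, the _USER_ONLY idea in (uncompiled) code. The flag
name and value are invented, user_only would be derived from the new
command, and the loop is in the style of the existing global expedited
one:

	#define MEMBARRIER_CMD_PRIVATE_EXPEDITED_USER_ONLY	(1U << 9)

	for_each_online_cpu(cpu) {
		struct task_struct *p;

		p = rcu_dereference(cpu_rq(cpu)->curr);
		/*
		 * USER_ONLY keeps the kthread skip: a kthread can
		 * only touch user memory via kthread_use_mm(), and
		 * the caller has explicitly said it does not care
		 * about ordering those accesses.  The existing
		 * commands would have to IPI kthreads too, closing
		 * the io_uring / kthread_use_mm hole above.
		 */
		if (user_only && (p->flags & PF_KTHREAD))
			continue;
		__cpumask_set_cpu(cpu, tmpmask);
	}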