On 29/05/2017 17:50, Paul E. McKenney wrote:
> On Mon, May 29, 2017 at 05:05:47PM +0200, Paolo Bonzini wrote:
>> Adding Paul McKenney...
>>
>> On 29/05/2017 16:30, Linu Cherian wrote:
>>> In the KVM irq injection path, irqfd_wakeup in interrupt context
>>> calls srcu_read_lock(&kvm->irq_srcu). A worker thread, irqfd_inject
>>> in process context, also calls srcu_read_lock inside kvm_set_irq.
>>>
>>> This can lead to a race condition while incrementing the
>>> lock_count (using __this_cpu_inc), since atomic instructions are
>>> not used.
>>>
>>> Replacing __this_cpu_inc with this_cpu_inc (the atomic version)
>>> works around the problem, which backs up this analysis.
>>
>> Paul, how would you feel about using this_cpu_inc in
>> __srcu_read_lock and __srcu_read_unlock? Many architectures already
>> provide an efficient this_cpu_inc, and any architecture that has
>> LL/SC could provide it easily.
>
> You actually are not supposed to call __srcu_read_lock() and
> __srcu_read_unlock() from irq context. (Unless you know that you
> interrupted some context that is guaranteed never to use the
> srcu_struct in question or some such.)
>
> Given that this is the KVM irq injection path, I am guessing that it
> is executed quite frequently, so that you do need a low-cost, highly
> scalable solution. However, you only really need to exclude between
> process level and the various interrupts, so there are three ways
> that an architecture could handle this:
>
> 1. Single memory-to-memory increment instructions, as you say.
>
> 2. Atomic read-modify-write instructions, again as you say.
>
> 3. Disable interrupts across __srcu_read_lock() and
>    __srcu_read_unlock().
>
> Different architectures would want different approaches, depending on
> what instructions are available and the relative speed of these
> instructions. It would not be hard to make this an architecture
> choice on the one hand, or to provide an srcu_read_lock_irq() and
> srcu_read_unlock_irq() on the other. But I recently got some serious
> pushback on RCU API proliferation on the one hand and SRCU in
> particular on the other. So it might be good to look at some
> alternatives as well.

Yeah, I liked Linu's suggestion of this_cpu_inc because it's already
used in srcu_read_unlock, and because it can fall back to (3)
automatically. Relaxed atomics weren't much of a proposal, rather a
way to implement this_cpu_inc on some architectures that do not
support it yet (notably 32-bit ARM, MIPS and PPC). The disadvantage
would be slowing down SRCU on machines where neither (1) nor (2) is
possible and local_irq_save/restore is slower than disabling
preemption.

But I can certainly wrap ->irq_srcu lock/unlock into my own helpers
that take care of saving/restoring the interrupt state, like

  #define kvm_irq_srcu_read_lock_irqsave(kvm, flags)
  #define kvm_irq_srcu_read_unlock_irqrestore(kvm, flags)

This would be very simple, confined to KVM and respectful of the SRCU
API requirements.
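Something like this rough, untested sketch (note that the unlock
helper would also need the index returned by srcu_read_lock, so I have
added it to the parameter list here):

/*
 * Untested sketch: disable interrupts across the process-level
 * ->irq_srcu read-side critical section, so that the non-atomic
 * per-CPU counter increments in __srcu_read_lock/__srcu_read_unlock
 * cannot race with irqfd_wakeup running in irq context on this CPU.
 */
#define kvm_irq_srcu_read_lock_irqsave(kvm, flags)		\
({								\
	int __idx;						\
	local_irq_save(flags);					\
	__idx = srcu_read_lock(&(kvm)->irq_srcu);		\
	__idx;							\
})

#define kvm_irq_srcu_read_unlock_irqrestore(kvm, idx, flags)	\
do {								\
	srcu_read_unlock(&(kvm)->irq_srcu, idx);		\
	local_irq_restore(flags);				\
} while (0)

The process-level users (e.g. the srcu_read_lock in kvm_set_irq) would
switch to these helpers, with an "unsigned long flags" local, while
irqfd_wakeup keeps using plain srcu_read_lock/srcu_read_unlock from
its interrupt handler.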
> But first, one question... Does the entire interrupt run in the guest
> in interrupt context in the host?

It's interrupt context in the host.

> Or is there some sort of wakeup sequence
> where the guest's interrupt handler runs in process context from the
> host's perspective? From a quick glance at the code, it looks like
> the guest's interrupt handler runs in process context (from the host's
> perspective) and that the SRCU read-side critical section is confined
> to the host's interrupt handler.

Correct.

> Another (stupid) question... Can the process-level uses of ->irq_srcu
> instead use ->srcu?

I suppose you would synchronize_srcu on both in the write side?
Probably not, because ->srcu has really, really long read-side
critical sections: basically the entire EQS (extended quiescent state)
of RCU is a read-side ->srcu critical section. Unfortunately the write
side of ->irq_srcu is also relatively performance-critical, and since
->irq_srcu read-side critical sections are very short, having a
separate SRCU instance solved the issue quite nicely.

Originally Christian Borntraeger had proposed turning the relevant
synchronize_rcu into synchronize_rcu_expedited, but as you said it's
not good for real-time, so here we are.

Paolo

> As the code is written now, it is OK to use
> srcu_read_lock() and srcu_read_unlock() from interrupt handlers as
> long as they are never used from either process level or from nested
> interrupt handlers. Does that work for you?
>
> If this is the case, another approach would be for updates to wait for
> both an SRCU grace period and an RCU-sched grace period, perhaps with
> the latter expedited, though that won't make the real-time guys happy.
>
> 							Thanx, Paul
>
>> Thanks,
>>
>> Paolo
>>
>>> Possible solution
>>> -----------------
>>> One way is to avoid the srcu_read_lock/unlock usage in irq context.
>>> On the arm/arm64 architecture,
>>>
>>>   if (kvm_arch_set_irq_inatomic(&irq, kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 1,
>>>                                 false) == -EWOULDBLOCK)
>>>
>>> in irqfd_wakeup is always true, and hence schedule_work can be
>>> called directly:
>>>
>>> @@ -195,6 +195,11 @@ int __attribute__((weak)) kvm_arch_set_irq_inatomic(
>>>  	int idx;
>>>
>>>  	if (flags & POLLIN) {
>>> +		if (kvm_arch_set_irq_will_block_always()) {
>>> +			schedule_work(&irqfd->inject);
>>> +			goto skiplock;
>>> +		}
>>> +
>>>  		idx = srcu_read_lock(&kvm->irq_srcu);
>>>  		do {
>>>  			seq = read_seqcount_begin(&irqfd->irq_entry_sc);
>>> @@ -208,6 +213,7 @@ int __attribute__((weak)) kvm_arch_set_irq_inatomic(
>>>  		srcu_read_unlock(&kvm->irq_srcu, idx);
>>>  	}
>>>
>>> +skiplock:
>>>  	if (flags & POLLHUP) {
>>>  		/* The eventfd is closing, detach from KVM */
>>>  		unsigned long flags;
>>>
>>> This works without giving any warnings as well.
>>>
>>> Is a patch welcome in that direction?
>>>
>>> Appreciate your feedback on this.
>>
>