On 3/29/21 8:16 PM, Jonas Malaco wrote:
> On Mon, Mar 29, 2021 at 06:01:00PM -0700, Guenter Roeck wrote:
>> On 3/29/21 5:21 PM, Jonas Malaco wrote:
>>> On Mon, Mar 29, 2021 at 02:53:39PM -0700, Guenter Roeck wrote:
>>>> On Mon, Mar 29, 2021 at 05:22:01AM -0300, Jonas Malaco wrote:
>>>>> To avoid a spinlock, the driver explores concurrent memory accesses
>>>>> between _raw_event and _read, having the former updating fields on a
>>>>> data structure while the latter could be reading from them. Because
>>>>> these are "plain" accesses, those are data races according to the Linux
>>>>> kernel memory model (LKMM).
>>>>>
>>>>> Data races are undefined behavior in both C11 and LKMM. In practice,
>>>>> the compiler is free to make optimizations assuming there is no data
>>>>> race, including load tearing, load fusing and many others,[1] most of
>>>>> which could result in corruption of the values reported to user-space.
>>>>>
>>>>> Prevent undesirable optimizations to those concurrent accesses by
>>>>> marking them with READ_ONCE() and WRITE_ONCE(). This also removes the
>>>>> data races, according to the LKMM, because both loads and stores to each
>>>>> location are now "marked" accesses.
>>>>>
>>>>> As a special case, use smp_load_acquire() and smp_store_release() when
>>>>> loading and storing ->updated, as it is used to track the validity of
>>>>> the other values, and thus has to be stored after and loaded before
>>>>> them. These imply READ_ONCE()/WRITE_ONCE() but also ensure the desired
>>>>> order of memory accesses.
>>>>>
>>>>> [1] https://lwn.net/Articles/793253/
>>>>>
>>>>
>>>> I think you lost me a bit there. What out-of-order accesses that would be
>>>> triggered by a compiler optimization are you concerned about here ?
>>>> The only "problem" I can think of is that priv->updated may have been
>>>> written before the actual values. The impact would be ... zero. An
>>>> attribute read would return "stale" data for a few microseconds.
>>>> Why is that a concern, and what difference does it make ?
>>>
>>> The impact of out-of-order accesses to priv->updated is indeed minimal.
>>>
>>> That said, smp_load_acquire() and smp_store_release() were meant to
>>> prevent reordering at runtime, and only affect architectures other than
>>> x86. READ_ONCE() and WRITE_ONCE() would already prevent reordering from
>>> compiler optimizations, and x86 provides the load-acquire/store-release
>>> semantics by default.
>>>
>>> But the reordering issue is not a concern to me, I got carried away when
>>> adding READ_ONCE()/WRITE_ONCE(). While smp_load_acquire() and
>>> smp_store_release() make the code work more like I intend it to, they
>>> are (small) costs we can spare.
>>>
>>> I still think that READ_ONCE()/WRITE_ONCE() are necessary, including for
>>> priv->updated. Do you agree?
>>>
>>
>> No. What is the point ? The order of writes doesn't matter, the writes won't
>> be randomly dropped, and it doesn't matter if the reader reports old values
>> for a couple of microseconds either. This would be different if the values
>> were used as synchronization primitives or similar, but that isn't the case
>> here. As for priv->updated, if you are concerned about lost reports and
>> the 4th report is received a few microseconds before the read, I'd suggest
>> to loosen the interval a bit instead.
>>
>> Supposedly we are getting reports every 500ms. We have two situations:
>> - More than three reports are lost, making priv->updated somewhat relevant.
>>   In this case, it doesn't matter if outdated values are reported for
>>   a few uS since most/many/some reports are outdated more than a second
>>   anyway.
>> - A report is received but old values are reported for a few uS. That
>>   doesn't matter either because reports are always outdated anyway by
>>   much more than a few uS anyway, and the code already tolerates up to
>>   2 seconds of lost reports.
>>
>> Sorry, I completely fail to see the problem you are trying to solve here.
>
> Please disregard the out-of-order accesses, I agree that preventing them
> "are a (small) cost we can spare".
>
> The main problem I still would like to address are the data races.
> While the stores and loads cannot be dropped, and we can tolerate their
> reordering, they could still be torn, fused, perhaps invented...
> According to [1] these types of optimizations are not unheard of.
>
> Load tearing alone could easily produce values that are not stale, but
> wrong. Do we also tolerate wrong values, even if they are infrequent?
>
> Another detail I should have mentioned sooner is that READ_ONCE() and
> WRITE_ONCE() cause only minor (gcc) to no (clang) changes to the
> generated code for x86_64 and i386.[2] While this seems contrary to the
> point I am trying to make, I want to show that, for the most part, these
> changes just lock in a reasonable compiler behavior.
>
> Specifically, on x86_64/gcc (the most relevant arch/compiler for this
> driver) the changes are restricted to kraken2_read:
>
>  1. Loading of priv->updated and jiffies are reordered, because
>     (with READ_ONCE()) both are volatile and time_after(a, b) is
>     defined as b - a.
>

Are you really trying to claim that much of the time_after() code in the
Linux kernel is wrong ?

>  2. When loading priv->fan_input[channel],
>         movzx  eax,WORD PTR [rdx+rcx*2+0x14]
>     is split into
>         add    rcx,0x8
>         movzx  eax,WORD PTR [rdx+rcx*2+0x4]
>     for no reason I could find in the x86 manual.
>
>  3. Similarly, when loading priv->temp_input[channel]
>         movsxd rax,DWORD PTR [rdx+rcx*4+0x10]
>     turns into
>         add    rcx,0x4
>         movsxd rax,DWORD PTR [rdx+rcx*4]
>

I hardly see how this matters. In both cases, rax ends up with the same value.
Maybe rcx is reused later on. If not, maybe the compiler had a bad day.
But that is not an argument for using READ_ONCE/WRITE_ONCE; after all,
the same will happen with all other indexed accesses.

Guenter

> Both 2 and 3 admittedly get a bit worse with READ_ONCE()/WRITE_ONCE().
> But that is on gcc, and with the data race it could very well decide to
> produce much worse code than that at any moment.
>
> On Arm64 the code does get a lot more ordered, which we have already
> agreed is not really necessary. But removing smp_load_acquire() and
> smp_store_release() should still allow the CPU to reorder those,
> mitigating some of the impact.
>
> I hope this email clarifies what I am concerned about.
>
> Thanks for the patience,
> Jonas
>
> P.S. Tested with gcc 10.2.0 and clang 11.1.0.
>
> [1] https://lwn.net/Articles/793253/
> [2] (outdated, still with smp_*()): https://github.com/jonasmalacofilho/patches/tree/master/linux/nzxt-kraken2-mark-and-order-concurrent-accesses/objdumps
>
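
For reference, the marked-access pattern debated in this thread can be
sketched in user-space C roughly as below. This is only an illustration
under stated assumptions, not the driver code: the READ_ONCE()/WRITE_ONCE()
macros are simplified volatile-cast stand-ins for the kernel ones (scalar
types only, relying on the gcc/clang __typeof__ extension), TIME_AFTER()
mirrors the wraparound-safe comparison idea behind time_after(), and
struct sensor_state, state_update(), state_read_temp() and max_age are
hypothetical names made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel macros (assumption: scalars only). */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Wraparound-safe "a is after b", in the spirit of time_after(). */
#define TIME_AFTER(a, b)   ((long)((b) - (a)) < 0)

/* Hypothetical state shared between a report handler and a reader. */
struct sensor_state {
	int32_t temp_input;
	uint16_t fan_input;
	unsigned long updated;	/* timestamp of the last report */
};

/* Writer side: each field is stored with a single marked access. */
static void state_update(struct sensor_state *s, int32_t temp,
			 uint16_t fan, unsigned long now)
{
	assert(s);
	WRITE_ONCE(s->temp_input, temp);
	WRITE_ONCE(s->fan_input, fan);
	WRITE_ONCE(s->updated, now);
}

/*
 * Reader side: returns 0 and fills *temp, or -1 if the last report is
 * older than max_age. No ordering between the loads is enforced here,
 * only that each load is a single marked access.
 */
static int state_read_temp(struct sensor_state *s, unsigned long now,
			   unsigned long max_age, int32_t *temp)
{
	assert(s && temp);
	if (TIME_AFTER(now, READ_ONCE(s->updated) + max_age))
		return -1;
	*temp = READ_ONCE(s->temp_input);
	return 0;
}
```

The volatile casts only constrain the compiler (each access is emitted
once, at full width); they do not order the accesses at runtime, which
matches the thread's conclusion that the acquire/release pair is a
separate question.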