On 2018/10/18 08:15:19 -0700, Paul E. McKenney wrote: > On Thu, Oct 18, 2018 at 10:03:56PM +0900, Akira Yokosawa wrote: >> On 2018/10/17 17:37:39 -0700, Paul E. McKenney wrote: >>> On Thu, Oct 18, 2018 at 07:21:38AM +0900, Akira Yokosawa wrote: >>>> On 2018/10/17 08:10:52 -0700, Paul E. McKenney wrote: >>>>> On Tue, Oct 16, 2018 at 08:04:00AM +0900, Akira Yokosawa wrote: >>>>>> >From 7b01fc0f19cfa010536d7eb53e4d0cda1e6b801f Mon Sep 17 00:00:00 2001 >>>>>> From: Akira Yokosawa <akiyks@xxxxxxxxx> >>>>>> Date: Mon, 15 Oct 2018 23:46:52 +0900 >>>>>> Subject: RFC [PATCH] count_lim_sig: Add pair of smp_wmb() and smp_rmb() >>>>>> >>>>>> This message-passing pattern requires smp_wmb()--smp_rmb() pairing. >>>>>> >>>>>> Signed-off-by: Akira Yokosawa <akiyks@xxxxxxxxx> >>>>>> --- >>>>>> Hi Paul, >>>>>> >>>>>> I'm not sure this addition of memory barriers is actually required, >>>>>> but it does look like so. >>>>>> >>>>>> And I'm aware that you have avoided using weaker memory barriers in >>>>>> CodeSamples. >>>>>> >>>>>> Thoughts? >>>>> >>>>> Hello, Akira, >>>>> >>>>> I might be missing something, but it looks to me like this ordering is >>>>> covered by heavyweight ordering in the signal handler entry/exit and >>>>> the gblcnt_mutex. So what sequence of events leads to the failiure >>>>> scenario that you are seeing? >>>> >>>> So the fastpaths in add_count() and sub_count() are not protected by >>>> glbcnt_mutex. The slowpath in flush_local_count() waits the transition >>>> of theft from REQ to READY, clears counter and countermax, and finally >>>> assign IDLE to theft. >>>> >>>> So, the fastpaths can see (theft == IDLE) but see a non-zero value of >>>> counter or countermax, can't they? >>> >>> Maybe, maybe not. Please lay out a sequence of events showing a problem, >>> as in load by load, store by store, line by line. Intuition isn't as >>> helpful as one might like for this kind of stuff. ;-) >> >> Gotcha! >> >> I've not exhausted the timing variations, but now I see when >> split_local_count() sees (*theft@[t] == THEFT_READY), counter part of >> add_count() or sub_count() has exited the fastpath (marked by >> counting == 1). >> >> So the race I imagined has never existed. > > I know that feeling!!! > >> Thanks for your nice suggestion! > > Well, there might well be another race. My main concern is whether or not > signal-handler entry/exit really provides full ordering on all platforms. > > Thoughts? Does your concern related to the lack of memory barrier at the entry of flush_local_count_sig() in Listing 5.17? Akira > > Thanx, Paul > >>>> One theory to prevent this from happening is because all the per-thread >>>> variables of a thread reside in a single cache line, and if the fastpaths >>>> see the updated value of theft, they are guaranteed to see the latest >>>> values of both counter and countermax. >>> >>> Good point, but we need to avoid that sort of assumption unless we >>> placed the variables into a struct and told the compiler to align it >>> appropriately. And even then, hardware architectures normally don't >>> make this sort of guarantee. There is too much that can go wrong, from >>> ECC errors to interrupts at just the wrong time, and much else besides. >> >> Absolutely! >> >> Thanks, Akira >> >>> >>> Thanx, Paul >>> >>>> I might be completely missing something, though. >>>> >>>> Thanks, Akira >>>> >>>>> >>>>> Thanx, Paul >>>>> [...]