Add a short member for proper alignment and one will probably pop out.

Sent from my tablet, pardon any formatting problems.

> On Sep 8, 2014, at 19:56, James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
>
>> On Mon, 2014-09-08 at 19:30 -0400, Peter Hurley wrote:
>> On 09/08/2014 01:50 AM, James Bottomley wrote:
>>>> Two things: I think that gcc has given up on combining adjacent
>>>> writes, perhaps because unaligned writes on some arches are
>>>> prohibitive, so whatever minor optimization was believed to be
>>>> gained was quickly lost, multi-fold.  (Although this might not be
>>>> the reason, since one would expect that gcc would know the
>>>> alignment of the promoted store.)
>>>
>>> Um, gcc assumes architecturally correct alignment; that's why it
>>> pads structures.  Only when accessing packed structures will it use
>>> the lowest unit load/store.
>>>
>>> If you have a struct { char a; char b; }; and store first a then b
>>> with a constant, gcc will obligingly optimise to a short store.
>>
>> Um, please re-look at the test above.  The exact test you describe
>> is coded above and compiled with the gcc 4.6.3 cross-compiler for
>> parisc using the kernel compiler options.
>>
>> In the generated code, please note the _absence_ of a combined write
>> for the two adjacent byte stores.
>
> So one version of gcc doesn't do it.  Others do, because I've been
> surprised seeing it in assembly.
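[For reference, a minimal sketch tying this to the "add a short member
for proper alignment" suggestion at the top of this mail.  The struct
and function names below are invented; whether gcc actually combines
the two byte stores depends on the compiler version and target, and
the parisc gcc 4.6.3 build Peter describes did not.]

        struct two_bytes {
                short pad;      /* the "short member" for alignment */
                char a;
                char b;
        };

        void store_both(struct two_bytes *p)
        {
                /*
                 * With the short member present the struct is 2-byte
                 * aligned, so a and b sit on a halfword boundary and
                 * the compiler may emit the two assignments below as
                 * one aligned 16-bit store; without it, it only knows
                 * about byte alignment.  Either form is legal here,
                 * since no barrier separates the two stores, and the
                 * combined store is still atomic.
                 */
                p->a = 1;
                p->b = 2;
        }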
>>>> But additionally, even if gcc combines adjacent writes _that are
>>>> part of the program flow_, then I believe the situation is no
>>>> worse than would otherwise exist.
>>>>
>>>> For instance, given the following:
>>>>
>>>> struct x {
>>>>         spinlock_t lock;
>>>>         long a;
>>>>         byte b;
>>>>         byte c;
>>>> };
>>>>
>>>> void locked_store_b(struct x *p)
>>>> {
>>>>         spin_lock(&p->lock);
>>>>         p->b = 1;
>>>>         spin_unlock(&p->lock);
>>>>         p->c = 2;
>>>> }
>>>>
>>>> Granted, the author probably expects ordered writes of
>>>>         STORE B
>>>>         STORE C
>>>> but that's not guaranteed because there is no memory barrier
>>>> ordering B before C.
>>>
>>> Yes, there is: loads and stores may not migrate into or out of
>>> critical sections.
>>
>> That's a common misconception.
>>
>> The processor is free to re-order this to:
>>
>>         STORE C
>>         STORE B
>>         UNLOCK
>>
>> That's because the unlock() only guarantees that:
>>
>>         Stores before the unlock in program order are guaranteed to
>>         complete before the unlock completes.  Stores after the
>>         unlock _may_ complete before the unlock completes.
>>
>> My point was that even if compiler barriers had the same semantics
>> as memory barriers, the situation would be no worse.  That is, code
>> that is sensitive to memory barriers (like the example I gave above)
>> would merely have the same fragility with one-way compiler barriers
>> (with respect to the compiler combining writes).
>>
>> That's what I meant by "no worse than would otherwise exist".
>
> Actually, that's not correct.  This is actually deja vu with me on
> the other side of the argument.  When we first did spinlocks on PA, I
> argued as you did: lock is only a barrier for code after it, and
> unlock for code before it.  The failing case is that you can have a
> critical section which performs an atomically required operation and
> a following unit which depends on it being performed.  If you begin
> the following unit before the atomic requirement, you may end up
> losing.  It turns out this kind of pattern is inherent in a lot of
> mailbox device drivers: you need to set up the mailbox atomically,
> then poke it.  Setup is usually atomic; deciding which mailbox to
> prime and actually poking it is in the following unit.  Priming often
> involves an I/O bus transaction, and if you poke before priming, you
> get a misfire.
>
>>>> I see no problem with gcc write-combining in the absence of
>>>> memory barriers (as long as alignment is still respected,
>>>> i.e., the size-promoted write is still naturally aligned).  The
>>>> combined write is still atomic.
>>>
>>> Actual alignment is pretty irrelevant.  That's why all
>>> architectures which require alignment also have to implement
>>> misaligned traps ... this is a fundamental requirement of the
>>> networking code, for instance.
>>>
>>> The main problem is gcc thinking there's a misalignment (or a
>>> packed structure).  That causes splitting of loads and stores, and
>>> that destroys atomicity.
>>
>> Yeah, the extra requirement I added is basically nonsense, since the
>> only issue is what instructions the compiler is emitting.  So if the
>> compiler thinks the alignment is natural and combines the writes --
>> ok.  If the compiler thinks the alignment is off and doesn't combine
>> the writes -- also ok.
>
> Yes, I think I can agree that the only real problem is gcc thinking
> the store or load needs splitting.
>
> James
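
[To make the mailbox failure mode concrete, a minimal, hypothetical
sketch follows.  The mbox, prime and doorbell names are invented for
illustration; the point is only that, if spin_unlock() provides
release semantics alone, as Peter describes, then on a weakly ordered
CPU the poke can become visible before the setup unless something
stronger orders them.]

        #include <linux/spinlock.h>

        struct mbox {
                spinlock_t lock;
                int prime;      /* set up under the lock */
                int doorbell;   /* poked in the following unit */
        };

        void mbox_kick(struct mbox *m)
        {
                spin_lock(&m->lock);
                m->prime = 1;           /* atomically required setup */
                spin_unlock(&m->lock);  /* orders the store above, not below */

                /*
                 * Without a full barrier here, the doorbell store may
                 * be observed before the prime store -- poking the
                 * device before it is primed is the "misfire" James
                 * describes.  Doing the poke while still holding the
                 * lock, or adding smp_mb() as below, closes the
                 * window.
                 */
                smp_mb();
                m->doorbell = 1;
        }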