Re: lockless case of retain_dentry() (was Re: [PATCH 09/15] fold the call of retain_dentry() into fast_dput())

On Sun, Nov 26, 2023 at 11:39:37AM -0500, Guo Ren wrote:
> On Wed, Nov 22, 2023 at 09:20:53AM -0800, Linus Torvalds wrote:
> > On Tue, 21 Nov 2023 at 23:19, Guo Ren <guoren@xxxxxxxxxx> wrote:
> > >
> > > We discussed x86 qspinlock code generation. It didn't look as bad as I
> > > thought, because qspinlock_spin_value_unlocked is much cheaper than the
> > > ticket-lock. But the riscv ticket-lock code generation is terrible
> > > because of the 16-bit left and right shifts.
> > > https://lore.kernel.org/all/ZNG2tHFOABSXGCVi@xxxxxxxxx
> > 
> > No, it's not the 16-bit shifts in the spin_value_unlocked() check,
> > that just generates simple and straightforward code:
> > 
> >   a0:   0107569b                srlw    a3,a4,0x10
> >   a4:   00c77733                and     a4,a4,a2
> >   a8:   04e69063                bne     a3,a4,e8 <.L12>
> > 
> > (plus two stupid instructions for generating the immediate in a2 for
> > 0xffff, but hey, that's the usual insane RISC-V encoding thing - you
> > can load a 20-bit U-immediate only shifted up by 12; if it's in the
> > lower bits you're kind of screwed and limited to 12-bit immediates).
> > 
> > The *bad* code generation is from the much simpler
> > 
> >         new.count++;
> > 
> > which sadly neither gcc nor clang is quite smart enough to understand
> > that "hey, I can do that in 64 bits".
> > 
> > It's incrementing the higher 32-bit word in a 64-bit union, and with a
> > smarter compiler it *should* basically become
> > 
> >         lock_count += 1 << 32;
> > 
> > but the compiler isn't that clever, so it splits the 64-bit word into
> > two 32-bit words, increments one of them, and then merges the two
> > words back into 64 bits:
> > 
> >   98:   4207d693                sra     a3,a5,0x20
> >   9c:   02079713                sll     a4,a5,0x20
> >   a0:   0016869b                addw    a3,a3,1
> >   a4:   02069693                sll     a3,a3,0x20
> >   a8:   02075713                srl     a4,a4,0x20
> >   ac:   00d76733                or      a4,a4,a3
> > 
> > which is pretty sad.
> 9c & a8 are for word zero-extension; riscv will have zext.w in the future.
> Your patch may improve the above to:
> 	li      a4,1
> 	slli    a4,a4,32
> 	add     a4,a5,a4
> 
> v.s.
> 	sra     a3,a5,0x20
> 	zext.w	a4,a5
> 	addw    a3,a3,1
> 	or      a4,a4,a3
> You win one instruction "or a4,a4,a3", which is less than one cycle.
Sorry, I forgot "sll     a3,a3,0x20", so it's 1.5 cycles, but that doesn't
affect my opinion here: local core operations are a lower optimization
priority than memory transactions.
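
For anyone modelling this outside the kernel, here is a minimal standalone
sketch of the layout being discussed and of the hand-optimized increment
(assuming the usual little-endian 64-bit struct lockref layout; the names
are illustrative, not the kernel's):

#include <stdint.h>

/*
 * Little-endian, 64-bit: the spinlock sits in the low 32 bits of
 * lock_count and the reference count in the high 32 bits, so
 * "count++" can be folded into a single 64-bit add of 1 << 32.
 */
union lockref_model {
	uint64_t lock_count;
	struct {
		uint32_t lock;		/* low word: the spinlock value */
		int32_t  count;		/* high word: the reference count */
	};
};

static inline void model_count_inc(union lockref_model *l)
{
	/* what we would like the compiler to emit for "l->count++" */
	l->lock_count += 1ULL << 32;
}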

> 
> The zext.w is important, and it could replace a lot of sll+srl pairs, so I
> think this is a shortcoming of the current ISA design.
> 
> Here, what I want to improve is preventing stack frame setup in the fast
> path; that's the biggest benefit my patch can give. Unnecessary memory
> access is the most important performance killer in SMP.
> 
> My patch removes the stack frame setup from the fast path.
> void lockref_get(struct lockref *lockref)
> {
>   78:   00053783                ld      a5,0(a0)
> 000000000000007c <.LBB212>:
>   7c:   00010637                lui     a2,0x10
> 
> 0000000000000080 <.LBE212>:
>   80:   06400593                li      a1,100
> 
> 0000000000000084 <.LBB216>:
>   84:   fff60613                add     a2,a2,-1 # ffff <.LLST8+0xf4aa>
> 
> 0000000000000088 <.L8>:
>   88:   0007871b                sext.w  a4,a5
> 
> 000000000000008c <.LBB217>:
>   8c:   0107d69b                srlw    a3,a5,0x10
>   90:   00c77733                and     a4,a4,a2
>   94:   04e69063                bne     a3,a4,d4 <.L12> --------+
> 						      		|
> 0000000000000098 <.LBB218>:					|
>   98:   4207d693                sra     a3,a5,0x20		|
>   9c:   02079713                sll     a4,a5,0x20		|
>   a0:   0016869b                addw    a3,a3,1			|
>   a4:   02069693                sll     a3,a3,0x20		|
>   a8:   02075713                srl     a4,a4,0x20		|
>   ac:   00d76733                or      a4,a4,a3		|
> 								|
> 00000000000000b0 <.L0^B1>:					|
>   b0:   100536af                lr.d    a3,(a0)			|
>   b4:   00f69863                bne     a3,a5,c4 <.L1^B1>	|
>   b8:   1ae5382f                sc.d.rl a6,a4,(a0)		|
>   bc:   fe081ae3                bnez    a6,b0 <.L0^B1>		|
>   c0:   0330000f                fence   rw,rw			|
> 								|
> 00000000000000c4 <.L1^B1>:					|
>   c4:   04d78a63                beq     a5,a3,118 <.L18>	|
> 								|
> 00000000000000c8 <.LBE228>:					|
>   c8:   fff5859b                addw    a1,a1,-1		|	
> 								|
> 00000000000000cc <.LBB229>:					|
>   cc:   00068793                mv      a5,a3			|
> 								|
> 00000000000000d0 <.LBE229>:					|
>   d0:   fa059ce3                bnez    a1,88 <.L8>		|
> 						     		|
> 00000000000000d4 <.L12>: <--------------------------------------+
> {						      slow_path
>   d4:   fe010113                add     sp,sp,-32
>   d8:   00113c23                sd      ra,24(sp)
>   dc:   00813823                sd      s0,16(sp)
>   e0:   02010413                add     s0,sp,32
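
One way to keep that frame setup out of the fast path, shown here only as a
rough sketch (not necessarily what the patch under discussion does), is to
push the locked fallback into a noinline helper, so the caller stays a leaf
whenever the cmpxchg loop succeeds:

/* Hypothetical helper in the context of lib/lockref.c; sketch only. */
static noinline void lockref_get_slowpath(struct lockref *lockref)
{
	spin_lock(&lockref->lock);
	lockref->count++;
	spin_unlock(&lockref->lock);
}

void lockref_get(struct lockref *lockref)
{
	CMPXCHG_LOOP(
		new.count++;
	,
		return;		/* fast path: no stack frame, no spinlock */
	);

	lockref_get_slowpath(lockref);
}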
> 
> 
> > 
> > If you want to do the optimization that the compiler misses by hand,
> > it would be something like the attached patch.
> > 
> > NOTE! Very untested. But that *should* cause the compiler to just
> > generate a single "add" instruction (in addition to generating the
> > constant 0x100000000, of course).
> > 
> > Of course, on a LL/SC architecture like RISC-V, in an *optimal* world,
> > the whole sequence would actually be done with one single LL/SC,
> > rather than the "load,add,cmpxchg" thing.
> > 
> > But then you'd have to do absolutely everything by hand in assembly.
> No, it's not worth doing that.
>  - There are only atomic primitives in Linux, but no ll/sc primitive in
>    the real world. The world belongs to AMO, and the only usage of ll/sc
>    is to implement AMO and CAS.
>  - Using a single ll/sc sequence instead of cmpxchg is similar to your
>    patch, and you may win one cycle or not (a rough sketch follows below).
>  - The critical work here is reducing bus transactions, preventing the
>    cache dance, and guaranteeing forward progress.
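
For concreteness, a rough sketch of what that hand-written single-ll/sc
fast path could look like on RISC-V, assuming the asm-generic ticket lock
in the low 32 bits (owner in bits 15:0, next in bits 31:16) and the count
in the high 32 bits. Illustrative only: it is not from any posted patch,
has no retry bound, and makes no forward-progress or ordering claims.

/* Hypothetical sketch; kernel context assumed for bool/inline. */
static inline bool lockref_get_ll_sc(unsigned long *lock_count)
{
	unsigned long old, new, inc = 1UL << 32;
	bool ok = false;

	__asm__ __volatile__(
	"0:	lr.d	%[old], %[mem]\n"
	"	srliw	%[new], %[old], 16\n"		/* next            */
	"	slliw	t0, %[old], 16\n"
	"	srliw	t0, t0, 16\n"			/* owner           */
	"	bne	%[new], t0, 1f\n"		/* locked -> bail  */
	"	add	%[new], %[old], %[inc]\n"	/* count++, high word */
	"	sc.d	t0, %[new], %[mem]\n"
	"	bnez	t0, 0b\n"
	"	li	%[ok], 1\n"
	"1:\n"
	: [old] "=&r" (old), [new] "=&r" (new), [ok] "+r" (ok),
	  [mem] "+A" (*lock_count)
	: [inc] "r" (inc)
	: "t0", "memory");

	return ok;	/* false: lock was held, take the spinlock slow path */
}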
> 
> Here is my optimization advice:
> 
> #define CMPXCHG_LOOP(CODE, SUCCESS) do {                                        \
>         int retry = 100;                                                        \
>         struct lockref old;                                                     \
>         BUILD_BUG_ON(sizeof(old) != 8);                                         \
> +       prefetchw(lockref);                                                     \
>         old.lock_count = READ_ONCE(lockref->lock_count);                        \
>         while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {     \
>                 struct lockref new = old;                                       \
>                 CODE                                                            \
>                 if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,          \
>                                                  &old.lock_count,               \
>                                                  new.lock_count))) {            \
>                         SUCCESS;                                                \
>                 }                                                               \
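
In userspace terms, the pattern above (a prefetch-for-write before entering
the compare-and-exchange loop) looks roughly like the sketch below;
__builtin_prefetch() and __atomic_compare_exchange_n() stand in for the
kernel's prefetchw() and try_cmpxchg64_relaxed(), and the "unlocked" test
is a simplified ticket-lock check:

#include <stdbool.h>
#include <stdint.h>

static bool model_lockref_get(uint64_t *lock_count)
{
	uint64_t old, new;
	int retry = 100;

	__builtin_prefetch(lock_count, 1);	/* 1 = prefetch for write */
	old = __atomic_load_n(lock_count, __ATOMIC_RELAXED);

	/* ticket lock unlocked: owner (bits 15:0) == next (bits 31:16) */
	while ((uint16_t)old == (uint16_t)(old >> 16)) {
		new = old + (1ULL << 32);	/* count++ in the high word */
		if (__atomic_compare_exchange_n(lock_count, &old, new,
						false, __ATOMIC_RELAXED,
						__ATOMIC_RELAXED))
			return true;		/* got the reference locklessly */
		if (!--retry)
			break;
	}
	return false;				/* fall back to the spinlock */
}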
> 
> The micro-architecture could give prefetchw stronger guarantees:
>  - prefetch.w must guarantee cache-line exclusiveness, even on a hit to a
>    cache line in a shareable state.
>  - Hold the exclusive cache line for several cycles, until the next
>    store or a timeout.
>  - Mask interrupts during the holding cycles (optional).
> 
> The lockref slow path is eliminated on such a micro-architecture, which
> means there is no chance of falling back to the spinlock.
> 
> I've written down more details in my ppt:
> https://docs.google.com/presentation/d/1UudBcj4cL_cjJexMpZNF9ppRzYxeYqsdBotIvU7sO2Q/edit?usp=sharing
> 
> This type of prefetchw could help large atomic operations that fit within
> one cache line. Compared to the transactional memory model, prefetchw can
> give a forward-progress guarantee and land in Linux more easily, without
> any new primitive.
> 
> > 
> >                   Linus
> 
> >  lib/lockref.c | 17 ++++++++++++++---
> >  1 file changed, 14 insertions(+), 3 deletions(-)
> > 
> > diff --git a/lib/lockref.c b/lib/lockref.c
> > index 2afe4c5d8919..481b102a6476 100644
> > --- a/lib/lockref.c
> > +++ b/lib/lockref.c
> > @@ -26,6 +26,17 @@
> >  	}									\
> >  } while (0)
> >  
> > +/*
> > + * The compiler isn't smart enough to do the count
> > + * increment in the high 32 bits of the 64-bit value,
> > + * so do this optimization by hand.
> > + */
> > +#if defined(__LITTLE_ENDIAN) && BITS_PER_LONG == 64
> > + #define LOCKREF_INC(n) ((n).lock_count += 1ul<<32)
> > +#else
> > + #define LOCKREF_INC(n) ((n).count++)
> > +#endif
> > +
> >  #else
> >  
> >  #define CMPXCHG_LOOP(CODE, SUCCESS) do { } while (0)
> > @@ -42,7 +53,7 @@
> >  void lockref_get(struct lockref *lockref)
> >  {
> >  	CMPXCHG_LOOP(
> > -		new.count++;
> > +		LOCKREF_INC(new);
> >  	,
> >  		return;
> >  	);
> > @@ -63,7 +74,7 @@ int lockref_get_not_zero(struct lockref *lockref)
> >  	int retval;
> >  
> >  	CMPXCHG_LOOP(
> > -		new.count++;
> > +		LOCKREF_INC(new);
> >  		if (old.count <= 0)
> >  			return 0;
> >  	,
> > @@ -174,7 +185,7 @@ int lockref_get_not_dead(struct lockref *lockref)
> >  	int retval;
> >  
> >  	CMPXCHG_LOOP(
> > -		new.count++;
> > +		LOCKREF_INC(new);
> >  		if (old.count < 0)
> >  			return 0;
> >  	,
> 
> 



