Re: [PATCH] arm64: enable GENERIC_FIND_FIRST_BIT

Alexander Lobakin <alobakin@xxxxx> · Thu, 25 Feb 2021 11:53:29 +0000

From: Yury Norov <yury.norov@xxxxxxxxx>
Date: Wed, 24 Feb 2021 07:44:16 -0800

> On Wed, Feb 24, 2021 at 11:52:55AM +0000, Alexander Lobakin wrote:
> > From: Yury Norov <yury.norov@xxxxxxxxx>
> > Date: Sat, 5 Dec 2020 08:54:06 -0800
> >
> > Hi,
> >
> > > ARM64 doesn't implement find_first_{zero}_bit in arch code and doesn't
> > > enable it in config. It leads to using find_next_bit() which is less
> > > efficient:
> > >
> > > 0000000000000000 <find_first_bit>:
> > >    0:	aa0003e4 	mov	x4, x0
> > >    4:	aa0103e0 	mov	x0, x1
> > >    8:	b4000181 	cbz	x1, 38 <find_first_bit+0x38>
> > >    c:	f9400083 	ldr	x3, [x4]
> > >   10:	d2800802 	mov	x2, #0x40                  	// #64
> > >   14:	91002084 	add	x4, x4, #0x8
> > >   18:	b40000c3 	cbz	x3, 30 <find_first_bit+0x30>
> > >   1c:	14000008 	b	3c <find_first_bit+0x3c>
> > >   20:	f8408483 	ldr	x3, [x4], #8
> > >   24:	91010045 	add	x5, x2, #0x40
> > >   28:	b50000c3 	cbnz	x3, 40 <find_first_bit+0x40>
> > >   2c:	aa0503e2 	mov	x2, x5
> > >   30:	eb02001f 	cmp	x0, x2
> > >   34:	54ffff68 	b.hi	20 <find_first_bit+0x20>  // b.pmore
> > >   38:	d65f03c0 	ret
> > >   3c:	d2800002 	mov	x2, #0x0                   	// #0
> > >   40:	dac00063 	rbit	x3, x3
> > >   44:	dac01063 	clz	x3, x3
> > >   48:	8b020062 	add	x2, x3, x2
> > >   4c:	eb02001f 	cmp	x0, x2
> > >   50:	9a829000 	csel	x0, x0, x2, ls  // ls = plast
> > >   54:	d65f03c0 	ret
> > >
> > >   ...
> > >
> > > 0000000000000118 <_find_next_bit.constprop.1>:
> > >  118:	eb02007f 	cmp	x3, x2
> > >  11c:	540002e2 	b.cs	178 <_find_next_bit.constprop.1+0x60>  // b.hs, b.nlast
> > >  120:	d346fc66 	lsr	x6, x3, #6
> > >  124:	f8667805 	ldr	x5, [x0, x6, lsl #3]
> > >  128:	b4000061 	cbz	x1, 134 <_find_next_bit.constprop.1+0x1c>
> > >  12c:	f8667826 	ldr	x6, [x1, x6, lsl #3]
> > >  130:	8a0600a5 	and	x5, x5, x6
> > >  134:	ca0400a6 	eor	x6, x5, x4
> > >  138:	92800005 	mov	x5, #0xffffffffffffffff    	// #-1
> > >  13c:	9ac320a5 	lsl	x5, x5, x3
> > >  140:	927ae463 	and	x3, x3, #0xffffffffffffffc0
> > >  144:	ea0600a5 	ands	x5, x5, x6
> > >  148:	54000120 	b.eq	16c <_find_next_bit.constprop.1+0x54>  // b.none
> > >  14c:	1400000e 	b	184 <_find_next_bit.constprop.1+0x6c>
> > >  150:	d346fc66 	lsr	x6, x3, #6
> > >  154:	f8667805 	ldr	x5, [x0, x6, lsl #3]
> > >  158:	b4000061 	cbz	x1, 164 <_find_next_bit.constprop.1+0x4c>
> > >  15c:	f8667826 	ldr	x6, [x1, x6, lsl #3]
> > >  160:	8a0600a5 	and	x5, x5, x6
> > >  164:	eb05009f 	cmp	x4, x5
> > >  168:	540000c1 	b.ne	180 <_find_next_bit.constprop.1+0x68>  // b.any
> > >  16c:	91010063 	add	x3, x3, #0x40
> > >  170:	eb03005f 	cmp	x2, x3
> > >  174:	54fffee8 	b.hi	150 <_find_next_bit.constprop.1+0x38>  // b.pmore
> > >  178:	aa0203e0 	mov	x0, x2
> > >  17c:	d65f03c0 	ret
> > >  180:	ca050085 	eor	x5, x4, x5
> > >  184:	dac000a5 	rbit	x5, x5
> > >  188:	dac010a5 	clz	x5, x5
> > >  18c:	8b0300a3 	add	x3, x5, x3
> > >  190:	eb03005f 	cmp	x2, x3
> > >  194:	9a839042 	csel	x2, x2, x3, ls  // ls = plast
> > >  198:	aa0203e0 	mov	x0, x2
> > >  19c:	d65f03c0 	ret
> > >
> > >  ...
> > >
> > > 0000000000000238 <find_next_bit>:
> > >  238:	a9bf7bfd 	stp	x29, x30, [sp, #-16]!
> > >  23c:	aa0203e3 	mov	x3, x2
> > >  240:	d2800004 	mov	x4, #0x0                   	// #0
> > >  244:	aa0103e2 	mov	x2, x1
> > >  248:	910003fd 	mov	x29, sp
> > >  24c:	d2800001 	mov	x1, #0x0                   	// #0
> > >  250:	97ffffb2 	bl	118 <_find_next_bit.constprop.1>
> > >  254:	a8c17bfd 	ldp	x29, x30, [sp], #16
> > >  258:	d65f03c0 	ret
> > >
> > > Enabling these functions would also benefit for_each_{set,clear}_bit().
> > > Would it make sense to enable this config for all such architectures by
> > > default?
> >
> > I confirm that GENERIC_FIND_FIRST_BIT also produces faster, better
> > optimized code on MIPS (32 R2), which also has no architecture-specific
> > bit-searching routines.
> > So, if it's okay with other folks, I'd suggest going for it and
> > enabling it for all similar arches.
>
> As far as I understand the idea of GENERIC_FIND_FIRST_BIT=n, it's
> intended to save some space in .text. But in fact it bloats the
> kernel:
>
>         yury:linux$ scripts/bloat-o-meter vmlinux vmlinux.ffb
>         add/remove: 4/1 grow/shrink: 19/251 up/down: 564/-1692 (-1128)
>         ...

Same for MIPS: enabling GENERIC_FIND_FIRST_BIT saves a fair amount of
.text even though it adds new symbols.

> For the next cycle, I'm going to submit a patch that removes the
> GENERIC_FIND_FIRST_BIT completely and forces all architectures to
> use find_first{_zero}_bit().

I like that idea. I'm almost sure there'll be no arch that benefits
from CONFIG_GENERIC_FIND_FIRST_BIT=n (and has no arch-optimized
versions).
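
For anyone following along, here is a rough user-space sketch of why the
dedicated routine comes out smaller. It is not the lib/find_bit.c code:
the sketch_* names are made up for illustration, and __builtin_ctzl()
assumes GCC or Clang. The first-bit search needs neither a start offset
nor a mask on the first word, which is the extra work visible in the
_find_next_bit disassembly quoted above.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Hypothetical stand-in for a dedicated first-bit search: walk the words
 * from the start, no offset handling and no first-word mask.
 * Returns 'size' if no bit is set.
 */
static unsigned long sketch_find_first_bit(const unsigned long *addr,
					   unsigned long size)
{
	unsigned long idx;

	for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
		if (addr[idx]) {
			/* __builtin_ctzl() is a GCC/Clang builtin (assumption) */
			unsigned long bit = idx * BITS_PER_LONG +
				(unsigned long)__builtin_ctzl(addr[idx]);

			return bit < size ? bit : size;
		}
	}
	return size;
}

/*
 * Hypothetical stand-in for the general search: it has to locate the
 * word holding 'start' and mask off the bits below 'start' before the
 * loop begins, even when the caller passes start == 0.
 */
static unsigned long sketch_find_next_bit(const unsigned long *addr,
					  unsigned long size,
					  unsigned long start)
{
	unsigned long tmp;

	if (start >= size)
		return size;

	tmp = addr[start / BITS_PER_LONG];
	tmp &= ~0UL << (start % BITS_PER_LONG);	/* drop bits below 'start' */
	start -= start % BITS_PER_LONG;		/* round down to word boundary */

	while (!tmp) {
		start += BITS_PER_LONG;
		if (start >= size)
			return size;
		tmp = addr[start / BITS_PER_LONG];
	}

	start += (unsigned long)__builtin_ctzl(tmp);
	return start < size ? start : size;
}
```

Both return the same index when the second is called with start == 0;
the first just gets there with less setup, so it can shrink every call
site that only ever wants the first bit.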

> > (otherwise, I'll send a separate patch for mips-next after the
> >  5.12-rc1 release and mention you in "Suggested-by:")
>
> I think it's worth enabling GENERIC_FIND_FIRST_BIT for mips and arm now
> and seeing how it works for people. If there are no complaints, I'll
> remove the config entirely. I'm OK if you submit the patch for mips now,
> or we can make a series and submit together. Works either way.

Let's make a series and see how it goes. I'll send you the MIPS part soon.

Al