Re: [PATCH v2] x86: bring back rep movsq for user access on CPUs without ERMS

On 8/30/23, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> Side note, if you want to play around with the user copy routines (or
> maybe Borislav wants to), I have a patch that handles a couple of
> common cases statically.
>

No thanks mate, I'm good ;) (but see below)

> In particular, it should help with copying structures - notably the
> 'stat' structure in cp_new_stat().
>

So my first side note is that I agree this is going to help a bunch of
cases, but I disagree about this one.

cp_new_stat and the counterpart for statx can dodge this rep mov by
filling user memory directly.

I know for a fact statx already had that implemented, but that was
before (or around the time) SMAP was introduced -- afterwards all the
__put_user calls were shafting performance big time. As the kernel did
not yet have support for delineating several user accesses in a row,
the code got reworked into the usual pattern of assembling the result
locally and doing one copyout. I can't be arsed to check, but I suspect
it was the same story with cp_new_stat. The necessary machinery has
been in place for quite some time now, but nobody has revisited the area.
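
To illustrate, the shape I have in mind is roughly the following (just
a sketch off the top of my head, not a patch -- the helper name is made
up and most of the field handling is elided):

static int cp_new_stat_direct(struct kstat *stat, struct stat __user *ubuf)
{
	/* one SMAP open/close for the whole struct instead of per field */
	if (!user_access_begin(ubuf, sizeof(*ubuf)))
		return -EFAULT;
	unsafe_put_user(stat->ino, &ubuf->st_ino, Efault);
	unsafe_put_user(stat->size, &ubuf->st_size, Efault);
	unsafe_put_user(stat->blocks, &ubuf->st_blocks, Efault);
	/* ... remaining fields, dev/rdev encoding and so on ... */
	user_access_end();
	return 0;
Efault:
	user_access_end();
	return -EFAULT;
}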

I'm going to patch this, but first I want to address the bigger
problem of glibc implementing fstat as newfstatat, which demolishes the
perf of that op. In their defense, they currently have no choice, as
this is the only exporter of the "new" struct stat. I'll be sending a
long email to fsdevel soon(tm) with a proposed fix.

> The attached patch is entirely untested, except for me checking code
> generation for some superficial sanity in a couple of places.
>
> I'm not convinced that
>
>     len >= 64 && !(len & 7)
>
> is necessarily the "correct" option, but I resurrected an older patch
> for this, and decided to use that as the "this is what
> rep_movs_alternative would do anyway" test.
>
> And obviously I expect that FSRM also does ok with "rep movsq", even
> if technically "movsb" is the simpler case (because it doesn't have
> the alignment issues that "rep movsq" has).
>

So I was wondering whether rep movsq is any worse than ERMS'ed rep
movsb when there is no tail to handle and the buffer is page-aligned,
or more to the point whether clear_page gets any benefit from going
with the ERMS byte variant (rep stosb in its case).

The routine got patched to use it in e365c9df2f2f ("x86, mem:
clear_page_64.S: Support clear_page() with enhanced REP MOVSB/STOSB"),
which was part of a larger patchset sprinkling ERMS over the string
ops. The commit does not come with any numbers for it though.
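
For reference, the two variants being compared boil down to something
like this (a standalone inline-asm sketch, not the actual
clear_page_64.S):

/* pre-ERMS variant: clear the page 8 bytes at a time */
static void clear_page_stosq(void *page)
{
	unsigned long cnt = 4096 / 8;

	asm volatile("rep stosq"
		     : "+D" (page), "+c" (cnt)
		     : "a" (0UL)
		     : "memory");
}

/* ERMS variant: byte-granular, the hardware picks the strategy */
static void clear_page_stosb(void *page)
{
	unsigned long cnt = 4096;

	asm volatile("rep stosb"
		     : "+D" (page), "+c" (cnt)
		     : "a" (0UL)
		     : "memory");
}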

I did a quick test with page_fault1 from will-it-scale with both
variants on Sapphire Rapids and got the same result, but then again
that's just one arch -- maybe something really old suffers here?

Not having to mess with alternatives would be nice here. That said,
I'm just throwing it out there; I have no intention of pursuing it at
the moment.

As a heads up, what I *do* intend to do later is add rep back to
memcpy and memset for sizes >= 64.

Gathering stats over a kernel build:

bpftrace -e 'kprobe:memcpy,kprobe:memset { @ = lhist(arg2, 64, 16384, 512); }'

@[kprobe:memcpy]:
(..., 64)        3477477 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 576)        1245601 |@@@@@@@@@@@@@@@@@@                                  |
[576, 1088)        44061 |                                                    |
[1088, 1600)          11 |                                                    |
[1600, 2112)           9 |                                                    |
[2624, 3136)           1 |                                                    |
[3136, 3648)           2 |                                                    |
[3648, 4160)         148 |                                                    |
[4160, 4672)           0 |                                                    |

@[kprobe:memset]:
(..., 64)       20094711 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 576)       11141718 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[576, 1088)        28948 |                                                    |
[1088, 1600)       59981 |                                                    |
[1600, 2112)       11362 |                                                    |
[2112, 2624)       13218 |                                                    |
[2624, 3136)       14781 |                                                    |
[3136, 3648)       15839 |                                                    |
[3648, 4160)       48670 |                                                    |
[7744, 8256)           1 |                                                    |
[16K, ...)         32491 |                                                    |


[I removed rows with no hits; also I had a crap hack so that bpftrace
can attach to these]

This buckets sizes from 64 up with a 512-byte step (everything smaller
lands in the first row). While the vast majority of calls are for very
small sizes, multikilobyte uses keep showing up, making this worthwhile.
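
Just to illustrate the dispatch I have in mind, in C rather than the
actual asm (the cutoff and the tail handling here are only a sketch):

#include <stddef.h>

static void *memcpy_rep_sketch(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	if (len >= 64) {
		size_t qwords = len / 8;

		/* bulk of the copy with rep movsq; the setup cost is
		 * amortized over at least 64 bytes */
		asm volatile("rep movsq"
			     : "+D" (d), "+S" (s), "+c" (qwords)
			     :
			     : "memory");
		len &= 7;
	}
	/* tail, if any, byte by byte */
	while (len--)
		*d++ = *s++;
	return dst;
}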

I'm going to try to work out an alternative_jump which avoids the
current extra nops and works with both gcc and clang (or in the worst
case I'll massage what you had for gcc and let clang eat the nops, with
the macro hiding the ugliness).

I make no commitments as to when I'll get to it, but probably early September.

btw these string ops would probably benefit from overlapping stores
instead of 1- or 8-byte loops, but I'm not submitting any patches on
that front for now (and in fact would appreciate it if someone else
sorted it out) :)
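
For the record, by "overlapping stores" I mean the usual trick,
sketched here for a tail in the 8..16 byte range (the fixed-size memcpy
calls compile down to single 8-byte loads/stores; the same idea scales
up with 16/32-byte chunks):

#include <stdint.h>
#include <string.h>

static void copy_tail_8_16(void *dst, const void *src, size_t len)
{
	uint64_t head, tail;

	/* two 8-byte copies, the second one overlapping the first
	 * whenever len < 16 */
	memcpy(&head, src, 8);
	memcpy(&tail, (const char *)src + len - 8, 8);
	memcpy(dst, &head, 8);
	memcpy((char *)dst + len - 8, &tail, 8);
}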

Cheers,
-- 
Mateusz Guzik <mjguzik gmail.com>


