Re: [RFC][CFT][PATCHSET v1] uaccess unification

Russell King - ARM Linux <linux@xxxxxxxxxxxxxxx> · Fri, 31 Mar 2017 00:21:47 +0100

On Thu, Mar 30, 2017 at 01:59:58PM -0700, Linus Torvalds wrote:
> On Thu, Mar 30, 2017 at 1:40 PM, Vineet Gupta
> <Vineet.Gupta1@xxxxxxxxxxxx> wrote:
> >
> > So it's a mixed bag really. Maybe we need some better directed test to really drill
> > it down.
> 
> As mentioned in the discussion about ARM, I seriously doubt that the
> inlining will even be noticeable compared to other effects here.

(Sorry to switch sub-threads.)

I'm running tests on that point, concentrating on hdparm -T and perfing
that.  You're right in so far as perf identifies the hotspot as the
copy_to_user() function for that workload, rather than the inlined bits
- the top hits in perf of hdparm -T are:

+   66.52%  hdparm  [k] __copy_to_user_std
+    8.49%  hdparm  [k] generic_file_read_iter
+    3.82%  hdparm  [k] lock_acquire
+    2.80%  hdparm  [k] copy_page_to_iter
+    2.49%  hdparm  [k] find_get_entry
+    1.19%  hdparm  [k] lock_release

Note: perf on ARM is affected by IRQ-disabled regions, so hotspots
can be off.

The generic_file_read_iter() one is definitely affected by an IRQ-
disabled region in there.

Here's the average hdparm -T transfer rates and standard deviation over
20 samples:

Unpatched:        Average=320.42 MB/s sigma=0.878657
Uaccess+inline:   Average=318.77 MB/s sigma=1.003332
Uaccess+noinline: Average=319.40 MB/s sigma=1.088354

The noinline version sits between the inlined version and the
unpatched version in all the measurements I've done so far, which
points to inlining that code having a slight detrimental effect.
What we don't know is whether uninlining the code without Al's patch
would see a slight boost, but I'm not about to go there.

So this all points towards a very slight advantage in dropping
INLINE_COPY_TO_USER and INLINE_COPY_FROM_USER for ARM, but I'd say
it's really down in the noise - I'm not concerned.
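
For anyone wanting to put the three hdparm figures above into relative
terms, here is a rough sketch.  It assumes the quoted sigma is the
standard deviation of the 20 individual samples and that the runs are
independent; neither is stated above, and the names used are purely
illustrative.

```python
import math

N = 20  # samples per configuration, as stated above

# (average MB/s, sigma) copied from the hdparm -T table above
results = {
    "unpatched": (320.42, 0.878657),
    "uaccess+inline": (318.77, 1.003332),
    "uaccess+noinline": (319.40, 1.088354),
}


def std_error(sigma, n=N):
    """Standard error of a mean, assuming n independent samples."""
    return sigma / math.sqrt(n)


def compare(base, other):
    """Return (relative change in %, difference in MB/s,
    standard error of the difference in MB/s) of `other` vs `base`."""
    (m0, s0), (m1, s1) = results[base], results[other]
    diff = m1 - m0
    se = math.hypot(std_error(s0), std_error(s1))
    return 100.0 * diff / m0, diff, se


for name in ("uaccess+inline", "uaccess+noinline"):
    rel, diff, se = compare("unpatched", name)
    print(f"{name}: {diff:+.2f} MB/s ({rel:+.2f}%), "
          f"std. error of difference {se:.2f} MB/s")
```

Under those assumptions both patched configurations land within
roughly half a percent of the unpatched rate.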

> (On ARM, hopefully the UAO bit is faster to set, but it's still
> "another instruction before and after", so even if it's not as
> expensive as clac/stac are on current x86 chips, it's an argument
> against inlining)

The UAO set/clear does show up as a hotspot within copy_page_to_iter(),
but as we can see, that function is only about 3% of the workload
overall.  Within copy_page_to_iter(), it's the __put_user() based loop
inside fault_in_pages_writeable() which has the hotspot, due to the
repeated enable+disable sequence (more precisely, the instruction
barriers that we need.)

Perf reports that the barriers account for 8.33% and 17.59% of the
time spent within that function, so we're actually talking about
maybe 0.25% and 0.5% of this workload spent doing the UAO thing.
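
The arithmetic behind that estimate, as a small sketch.  It uses the
2.80% that the perf listing above gives for copy_page_to_iter() rather
than the rounded "about 3%"; the variable and function names are only
illustrative.

```python
# Share of the whole workload that perf attributes to copy_page_to_iter(),
# taken from the listing above (rounded to "about 3%" in the text).
FUNC_SHARE_PCT = 2.80

# Share of the time inside that function attributed to the two barriers.
BARRIER_SHARE_PCT = (8.33, 17.59)


def workload_share(func_pct, inner_pct):
    """Scale a within-function percentage to a whole-workload percentage."""
    return func_pct * inner_pct / 100.0


shares = [workload_share(FUNC_SHARE_PCT, p) for p in BARRIER_SHARE_PCT]

for inner, share in zip(BARRIER_SHARE_PCT, shares):
    print(f"{inner:.2f}% of copy_page_to_iter() is {share:.2f}% of the workload")
```

That gives roughly 0.23% and 0.49%, in line with the rounded figures
above.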

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.