On Thu, Mar 30, 2017 at 01:59:58PM -0700, Linus Torvalds wrote:
> On Thu, Mar 30, 2017 at 1:40 PM, Vineet Gupta
> <Vineet.Gupta1@xxxxxxxxxxxx> wrote:
> >
> > So it's a mix bag really. Maybe we need some better directed test to really drill
> > it down.
>
> As mentioned in the discussion about ARM, I seriously doubt that the
> inlining will even be noticeable compared to other effects here.

(Sorry to switch sub-threads.)

I'm running tests on that point, concentrating on hdparm -T and perfing
that.  You're right in so far as perf identifies the hotspot as the
copy_to_user() function for that workload, rather than the inlined bits -
the top hits in perf of hdparm -T are:

+   66.52%  hdparm  [k] __copy_to_user_std
+    8.49%  hdparm  [k] generic_file_read_iter
+    3.82%  hdparm  [k] lock_acquire
+    2.80%  hdparm  [k] copy_page_to_iter
+    2.49%  hdparm  [k] find_get_entry
+    1.19%  hdparm  [k] lock_release

Note: perf on ARM is affected by IRQ-disabled regions, so hotspots can
be off.  The generic_file_read_iter() one is definitely affected by an
IRQ-disabled region in there.

Here are the average hdparm -T transfer rates and standard deviations
over 20 samples:

Unpatched:         Average=320.42 MB/s  sigma=0.878657
Uaccess+inline:    Average=318.77 MB/s  sigma=1.003332
Uaccess+noinline:  Average=319.40 MB/s  sigma=1.088354

This pattern - where the noinline version sits between the inlined
version and the unpatched version - shows up in all the measurements
I've done so far, and it points to inlining that code having a slight
detrimental effect.  What we don't know is whether uninlining the code
without Al's patch would see a slight boost, but I'm not about to go
there.

However, this all points towards there being a very slight advantage to
dropping INLINE_COPY_TO_USER and INLINE_COPY_FROM_USER for ARM, but I'd
say it's really down in the noise - I'm not concerned.
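For anyone wanting to put numbers on "very slight", the differences can be expressed relative to the unpatched result with a few lines of Python. This is only a sketch: the means and sigmas are copied from the figures above, and compare() and summarize() are illustrative helper names, not part of any tool used for these measurements.

```python
import math

# (mean MB/s, standard deviation) copied from the hdparm -T figures above
results = {
    "Unpatched": (320.42, 0.878657),
    "Uaccess+inline": (318.77, 1.003332),
    "Uaccess+noinline": (319.40, 1.088354),
}


def compare(baseline, other):
    """Return (relative change in %, difference in units of per-run sigma)."""
    base_mean, base_sigma = baseline
    mean, sigma = other
    delta = mean - base_mean
    rel_percent = 100.0 * delta / base_mean
    # Root-mean-square of the two per-run standard deviations, used as a
    # rough yardstick for run-to-run scatter (an assumption of this sketch).
    rms_sigma = math.sqrt((base_sigma ** 2 + sigma ** 2) / 2.0)
    return rel_percent, delta / rms_sigma


def summarize():
    """Compare every patched configuration against the unpatched one."""
    base = results["Unpatched"]
    return {
        name: compare(base, r)
        for name, r in results.items()
        if name != "Unpatched"
    }


if __name__ == "__main__":
    for name, (rel, sig) in summarize().items():
        print(f"{name:18s} {rel:+.2f}%  {sig:+.2f} sigma")
```

The first value is the change relative to the unpatched mean; the second is the same difference measured against the run-to-run scatter. Both patched configurations come out at around half a percent or less below the unpatched rate.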
> (On ARM, hopefully the UAO bit is faster to set, but it's still
> "another instruction before and after", so even if it's not as
> expensive as clac/stac are on current x86 chips, it's an argument
> against inlining)

The UAO set/clear does show up as a hotspot within copy_page_to_iter(),
but as we can see, overall it's about 3% of the workload.  Within
copy_page_to_iter(), it's the __put_user() based loop inside
fault_in_pages_writeable() which has the hotspot, due to the repeated
enable+disable sequence (more specifically, the instruction barriers
that we need).  Perf reports that the barriers account for 8.33% and
17.59% of the time spent within that function, so we're actually talking
about maybe 0.25% and 0.5% of this workload spent doing the UAO thing.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.