On 16/01/2019 15:56, Julien Thierry wrote:
> On 14/01/2019 12:26, Mark Rutland wrote:
>> On Mon, Jan 14, 2019 at 11:13:59PM +1100, Balbir Singh wrote:
>>> On Fri, Jan 04, 2019 at 05:50:18PM +0000, Mark Rutland wrote:
>>>> Hi Torsten,
>>>>
>>>> On Fri, Jan 04, 2019 at 03:10:53PM +0100, Torsten Duwe wrote:
>>>>> Use -fpatchable-function-entry (gcc8) to add 2 NOPs at the beginning
>>>>> of each function. Replace the first NOP thus generated with a quick LR
>>>>> saver (move it to scratch reg x9), so the 2nd replacement insn, the call
>>>>> to ftrace, does not clobber the value. Ftrace will then generate the
>>>>> standard stack frames.
>>>
>>> Do we know what the overhead would be, if this was a link time change
>>> for the first instruction?
>>
>> No, but it should be possible to benchmark that for a given workload,
>> which is what I'd like to see.
>>
>
> So, I hacked up something to have the -fpatchable-function-entry=2 in the
> build and then have ftrace_init() patch in the "mov x9, lr" in the first
> nop of the function preludes.
>
> I tested it on an 8 x Cortex-A57 machine and compared with a version
> that just has the two nops in the function prelude.
>
> On workloads like hackbench, the average difference is within the noise
> (<1%). Time results below are in seconds.
>
>         +------------+--------------------+
>         | "nop; nop" | "mov x9, lr; nop"  |
>         +------------+--------------------+
>         | 43.497     | 42.694             |
>         | 43.464     | 43.148             |
>         | 43.599     | 43.131             |
>         | 43.785     | 43.63              |
>         | 43.458     | 43.281             |
>         | 44.3       | 43.328             |
>         | 43.541     | 43.059             |
>         | 43.529     | 43.298             |
>         | 43.58      | 43.937             |
>         | 43.385     | 43.122             |
>         | 43.514     | 43.825             |
>         | 45.508     | 43.268             |
>         | 43.757     | 43.316             |
>         | 43.392     | 43.146             |
>         | 44.029     | 43.236             |
>         | 43.515     | 43.139             |
>         | 43.22      | 43.108             |
>         | 43.496     | 43.836             |
>         | 43.669     | 43.083             |
>         | 43.388     | 43.38              |
>         +------------+--------------------+
> average | 43.6813    | 43.29825           |
>         +------------+--------------------+
>

Here are also some results running hackbench on 4 x Cortex-A53 (pay no
attention to the fact that the timescales are similar, I changed the
number of iterations done by hackbench so it wouldn't take too long):

          +------------+-------------------+
          | "nop; nop" | "mov x9, lr; nop" |
          +------------+-------------------+
          | 43.815     | 44.455            |
          | 43.758     | 45.173            |
          | 44.075     | 43.95             |
          | 44.021     | 44.185            |
          | 43.959     | 44.826            |
          | 44.039     | 44.478            |
          | 43.836     | 44.626            |
          | 44.071     | 45.177            |
          | 43.619     | 45.033            |
          | 44.052     | 45.095            |
          | 43.903     | 44.802            |
          | 43.773     | 44.955            |
          | 43.908     | 45.02             |
          | 43.441     | 44.986            |
          | 44.167     | 45.182            |
          | 44.106     | 45.229            |
          | 43.974     | 45.07             |
          | 43.859     | 45.283            |
          | 43.706     | 44.892            |
          | 43.897     | 44.194            |
          +------------+-------------------+
average   | 43.899     | 44.835            |
          +------------+-------------------+

So, in this case the performance takes a ~2% hit from keeping the mov
always present in the function prelude instead of a nop. That makes it a
bit less obvious whether always having that mov there (whether patched
at build time or run time) is good enough.

Cheers,

--
Julien Thierry
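
For anyone who wants to re-check the averages and the relative
differences quoted above, here is a minimal Python sketch. It only
restates the hackbench timings from the two tables; the list names and
the relative_change() helper are illustrative labels, not part of
hackbench or ftrace.

```python
from statistics import mean, stdev

# hackbench wall-clock times in seconds, copied from the tables above.
a57_nop = [43.497, 43.464, 43.599, 43.785, 43.458, 44.3, 43.541, 43.529,
           43.58, 43.385, 43.514, 45.508, 43.757, 43.392, 44.029, 43.515,
           43.22, 43.496, 43.669, 43.388]
a57_mov = [42.694, 43.148, 43.131, 43.63, 43.281, 43.328, 43.059, 43.298,
           43.937, 43.122, 43.825, 43.268, 43.316, 43.146, 43.236, 43.139,
           43.108, 43.836, 43.083, 43.38]
a53_nop = [43.815, 43.758, 44.075, 44.021, 43.959, 44.039, 43.836, 44.071,
           43.619, 44.052, 43.903, 43.773, 43.908, 43.441, 44.167, 44.106,
           43.974, 43.859, 43.706, 43.897]
a53_mov = [44.455, 45.173, 43.95, 44.185, 44.826, 44.478, 44.626, 45.177,
           45.033, 45.095, 44.802, 44.955, 45.02, 44.986, 45.182, 45.229,
           45.07, 45.283, 44.892, 44.194]


def relative_change(baseline, patched):
    """Percent change of mean(patched) relative to mean(baseline)."""
    base = mean(baseline)
    return (mean(patched) - base) / base * 100.0


# Print mean, sample standard deviation and relative change per machine.
for name, nop, mov in (("Cortex-A57", a57_nop, a57_mov),
                       ("Cortex-A53", a53_nop, a53_mov)):
    print(f"{name}: nop mean={mean(nop):.3f} (sd {stdev(nop):.3f}), "
          f"mov mean={mean(mov):.3f} (sd {stdev(mov):.3f}), "
          f"change={relative_change(nop, mov):+.2f}%")
```

A positive change means the "mov x9, lr; nop" build was slower. The
sample standard deviation is printed only as a rough gauge of
run-to-run noise; it is not a significance test.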