On 2018-03-09 10:13 AM, Johannes Thumshirn wrote:
On Tue, Mar 06, 2018 at 10:21:33PM -0500, Douglas Gilbert wrote:ChangeLog: - add likely() and unlikely() hints to conditionals on or near the fastpath - fix a comment in scsi_io_completion_action()Do you have any numbers what this patch buys us performance wise?
How does 7 nanoseconds sound? Measuring the time taken by scsi_io_completion() gives figures that jump around seemingly randomly between 1.2 and 2.0 microseconds (when the system is under heavy load). Turns out that almost all of that is caused by the call to scsi_end_request() (and the functions it calls). So the timing I was after was from the call into scsi_io_completion() to the point on the fastpath just before scsi_end_request() is called. ftrace function-graph doesn't even show this time (perhaps it has a lower threshold value). So I wrote my own timer, starting and stopping at those points, reading the hr timer with ktime and only timing on one CPU (to avoid atomics an mutexes). The input was a fio script used previously for testing scsi_mq. The "before" timing was on a kernel build with MKP's tree which now has my "[PATCH v4] Make SCSI Status CONDITION MET equivalent to GOOD" patch applied, but not this patchset. Also some of the time being measured is from the measuring code itself :-) Here is the "before", averaged over 100 millisecond windows: hrtimer: interrupt took 16045 ns print_timings: cum_time=100000156 ns, count=630299, avg=158 ns print_timings: cum_time=100000126 ns, count=623939, avg=160 ns print_timings: cum_time=100000020 ns, count=625046, avg=159 ns print_timings: cum_time=100000098 ns, count=624375, avg=160 ns print_timings: cum_time=100000060 ns, count=619438, avg=161 ns print_timings: cum_time=100000132 ns, count=616604, avg=162 ns print_timings: cum_time=100000050 ns, count=617839, avg=161 ns print_timings: cum_time=100000079 ns, count=623943, avg=160 ns print_timings: cum_time=100000128 ns, count=617545, avg=161 ns print_timings: cum_time=100000083 ns, count=617902, avg=161 ns Here is the "after": print_timings: cum_time=100000022 ns, count=628977, avg=158 ns print_timings: cum_time=100000021 ns, count=642545, avg=155 ns print_timings: cum_time=100000042 ns, count=647620, avg=154 ns print_timings: cum_time=100000115 ns, count=649420, avg=153 ns print_timings: cum_time=100000001 ns, count=646527, avg=154 ns print_timings: cum_time=100000103 ns, count=648608, avg=154 ns print_timings: cum_time=100000074 ns, count=650341, avg=153 ns print_timings: cum_time=100000115 ns, count=652359, avg=153 ns print_timings: cum_time=100000031 ns, count=636437, avg=157 ns print_timings: cum_time=100000096 ns, count=632879, avg=158 ns Attached is some ftrace function-graph output of scsi_io_completion() which is gzipped for brevity. Looked at in lines of C code, in the window I'm interested in the "before" code does: - 4 local variable initializations - zeros two locals, sets false in another - wait_for = (cmd->allowed + 1) * req->timeout; (local initialization) - 5 failed "if"s (one is "else if") with no hints - fails to trigger on BUG_ON - doesn't print a SCSI_LOG_HLCOMPLETE - before it gets to the fastpath exit: if (!(blk_rq_bytes(req) == 0 && error) && !scsi_end_request(req, error, good_bytes, 0)) return; The "after" code does: - 4 local variable initializations - 2 failed "if"s with hints - one failed "if" with hint in place of BUG_ON - doesn't print a SCSI_LOG_HLCOMPLETE - before it gets to the fastpath exit which has hints: if (likely(blk_rq_bytes(req) > 0 || blk_stat == BLK_STS_OK)) { if (likely(!scsi_end_request(req, blk_stat, good_bytes, 0))) return; /* no bytes remaining */ } So the difference measured on a Lenovo X270 (i5-7200U) amounts to around 7 nanoseconds. Doug Gilbert
Attachment:
scsi_io_completion_ftrace.txt.gz
Description: application/gzip