Re: [PATCH v2 for-next 0/8] io_uring: tw contention improvments

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 6/22/22 7:40 AM, Dylan Yudaken wrote:
> Task work currently uses a spin lock to guard task_list and
> task_running. Some use cases such as networking can trigger task_work_add
> from multiple threads all at once, which suffers from contention here.
> 
> This can be changed to use a lockless list which seems to have better
> performance. Running the micro benchmark in [1] I see 20% improvment in
> multithreaded task work add. It required removing the priority tw list
> optimisation, however it isn't clear how important that optimisation is.
> Additionally it has fairly easy to break semantics.
> 
> Patch 1-2 remove the priority tw list optimisation
> Patch 3-5 add lockless lists for task work
> Patch 6 fixes a bug I noticed in io_uring event tracing
> Patch 7-8 adds tracing for task_work_run

I ran some IRQ driven workloads on this. Basic 512b random read, DIO,
IRQ, and then at queue depths 1-64, doubling every time. Once we get to
QD=8, start doing submit/complete batch of 1/4th of the QD so we ramp up
there too. Results below, first set is 5.19-rc3 + for-5.20/io_uring,
second set is that plus this series.

This is what I ran:

sudo taskset -c 12 t/io_uring -d<QD> -b512 -s<batch> -c<batch> -p0 -F1 -B1 -n1 -D0 -R0 -X1 -R1 -t1 -r5 /dev/nvme0n1

on a gen2 optane drive.

tldr - looks like an improvement there too, and no ill effects seen on
latency.

5.19-rc3 + for-5.20/io_uring:

QD1, Batch=1
Maximum IOPS=244K
1509: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 3996],  5.0000th=[ 3996], 10.0000th=[ 3996],
     | 20.0000th=[ 4036], 30.0000th=[ 4036], 40.0000th=[ 4036],
     | 50.0000th=[ 4036], 60.0000th=[ 4036], 70.0000th=[ 4036],
     | 80.0000th=[ 4076], 90.0000th=[ 4116], 95.0000th=[ 4196],
     | 99.0000th=[ 4437], 99.5000th=[ 5421], 99.9000th=[ 7590],
     | 99.9500th=[ 9518], 99.9900th=[32289]

QD=2, Batch=1
Maximum IOPS=483K
1533: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 3714],  5.0000th=[ 3755], 10.0000th=[ 3795],
     | 20.0000th=[ 3795], 30.0000th=[ 3835], 40.0000th=[ 3955],
     | 50.0000th=[ 4036], 60.0000th=[ 4076], 70.0000th=[ 4076],
     | 80.0000th=[ 4076], 90.0000th=[ 4116], 95.0000th=[ 4156],
     | 99.0000th=[ 4518], 99.5000th=[ 6144], 99.9000th=[ 7510],
     | 99.9500th=[ 9839], 99.9900th=[32289]

QD=4, Batch=1
Maximum IOPS=907K
1583: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 3393],  5.0000th=[ 3514], 10.0000th=[ 3594],
     | 20.0000th=[ 3634], 30.0000th=[ 3795], 40.0000th=[ 3875],
     | 50.0000th=[ 3955], 60.0000th=[ 4076], 70.0000th=[ 4156],
     | 80.0000th=[ 4277], 90.0000th=[ 4397], 95.0000th=[ 4477],
     | 99.0000th=[ 5120], 99.5000th=[ 5903], 99.9000th=[ 9357],
     | 99.9500th=[11004], 99.9900th=[32289]

QD=8, Batch=2
Maximum IOPS=1688K
1631: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 3353],  5.0000th=[ 3554], 10.0000th=[ 3634],
     | 20.0000th=[ 3755], 30.0000th=[ 3875], 40.0000th=[ 4036],
     | 50.0000th=[ 4156], 60.0000th=[ 4277], 70.0000th=[ 4437],
     | 80.0000th=[ 4678], 90.0000th=[ 4839], 95.0000th=[ 5040],
     | 99.0000th=[ 6305], 99.5000th=[ 7028], 99.9000th=[10080],
     | 99.9500th=[15502], 99.9900th=[32932]

QD=16, Batch=4
Maximum IOPS=2613K
1680: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 3955],  5.0000th=[ 4397], 10.0000th=[ 4558],
     | 20.0000th=[ 4759], 30.0000th=[ 4959], 40.0000th=[ 5120],
     | 50.0000th=[ 5261], 60.0000th=[ 5502], 70.0000th=[ 5743],
     | 80.0000th=[ 5903], 90.0000th=[ 6305], 95.0000th=[ 6706],
     | 99.0000th=[ 8393], 99.5000th=[ 8955], 99.9000th=[11325],
     | 99.9500th=[31968], 99.9900th=[34217]

QD=32, Batch=8
Maximum IOPS=3573K
1706: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 4919],  5.0000th=[ 5662], 10.0000th=[ 5903],
     | 20.0000th=[ 6144], 30.0000th=[ 6465], 40.0000th=[ 6626],
     | 50.0000th=[ 6867], 60.0000th=[ 7188], 70.0000th=[ 7510],
     | 80.0000th=[ 7992], 90.0000th=[ 8714], 95.0000th=[ 9357],
     | 99.0000th=[11325], 99.5000th=[11967], 99.9000th=[16626],
     | 99.9500th=[34217], 99.9900th=[37108]

QD=64, Batch=16
Maximum IOPS=3953K
1735: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 6626],  5.0000th=[ 7188], 10.0000th=[ 7510],
     | 20.0000th=[ 7992], 30.0000th=[ 8393], 40.0000th=[ 9116],
     | 50.0000th=[10160], 60.0000th=[11164], 70.0000th=[11646],
     | 80.0000th=[12128], 90.0000th=[12931], 95.0000th=[13735],
     | 99.0000th=[15984], 99.5000th=[16787], 99.9000th=[34217],
     | 99.9500th=[38072], 99.9900th=[40964]


============


5.19-rc3 + for-5.20/io_uring + this series:

QD=1, Batch=1
Maximum IOPS=246K
909: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 3955],  5.0000th=[ 3996], 10.0000th=[ 3996],
     | 20.0000th=[ 3996], 30.0000th=[ 3996], 40.0000th=[ 3996],
     | 50.0000th=[ 3996], 60.0000th=[ 3996], 70.0000th=[ 4036],
     | 80.0000th=[ 4036], 90.0000th=[ 4076], 95.0000th=[ 4116],
     | 99.0000th=[ 4196], 99.5000th=[ 5341], 99.9000th=[ 7590],
     | 99.9500th=[ 9357], 99.9900th=[32289]

QD=2, Batch=1
Maximum IOPS=487K
932: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 3714],  5.0000th=[ 3755], 10.0000th=[ 3755],
     | 20.0000th=[ 3755], 30.0000th=[ 3795], 40.0000th=[ 3795],
     | 50.0000th=[ 3996], 60.0000th=[ 4036], 70.0000th=[ 4036],
     | 80.0000th=[ 4036], 90.0000th=[ 4076], 95.0000th=[ 4116],
     | 99.0000th=[ 4437], 99.5000th=[ 6224], 99.9000th=[ 7510],
     | 99.9500th=[ 9598], 99.9900th=[32289]

QD=4, Batch=1
aximum IOPS=921K
955: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 3393],  5.0000th=[ 3433], 10.0000th=[ 3514],
     | 20.0000th=[ 3594], 30.0000th=[ 3674], 40.0000th=[ 3795],
     | 50.0000th=[ 3875], 60.0000th=[ 3996], 70.0000th=[ 4036],
     | 80.0000th=[ 4156], 90.0000th=[ 4317], 95.0000th=[ 4678],
     | 99.0000th=[ 5120], 99.5000th=[ 5903], 99.9000th=[ 9116],
     | 99.9500th=[10522], 99.9900th=[32289]

QD=8, Batch=2
Maximum IOPS=1658K
981: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 3313],  5.0000th=[ 3514], 10.0000th=[ 3594],
     | 20.0000th=[ 3714], 30.0000th=[ 3835], 40.0000th=[ 3996],
     | 50.0000th=[ 4116], 60.0000th=[ 4196], 70.0000th=[ 4397],
     | 80.0000th=[ 4598], 90.0000th=[ 4718], 95.0000th=[ 4919],
     | 99.0000th=[ 6385], 99.5000th=[ 6947], 99.9000th=[10000],
     | 99.9500th=[15180], 99.9900th=[32932]

QD=16, Batch=4
Maximum IOPS=2749K
1010: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 3955],  5.0000th=[ 4437], 10.0000th=[ 4558],
     | 20.0000th=[ 4759], 30.0000th=[ 4959], 40.0000th=[ 5120],
     | 50.0000th=[ 5261], 60.0000th=[ 5502], 70.0000th=[ 5743],
     | 80.0000th=[ 5903], 90.0000th=[ 6224], 95.0000th=[ 6626],
     | 99.0000th=[ 8313], 99.5000th=[ 9036], 99.9000th=[11967],
     | 99.9500th=[32289], 99.9900th=[34217]

QD=32, Batch=8
Maximum IOPS=3583K
1050: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 4879],  5.0000th=[ 5582], 10.0000th=[ 5903],
     | 20.0000th=[ 6224], 30.0000th=[ 6465], 40.0000th=[ 6626],
     | 50.0000th=[ 6787], 60.0000th=[ 7028], 70.0000th=[ 7349],
     | 80.0000th=[ 7911], 90.0000th=[ 8634], 95.0000th=[ 9196],
     | 99.0000th=[11164], 99.5000th=[11967], 99.9000th=[16305],
     | 99.9500th=[34217], 99.9900th=[37108]

QD=64, Batch=16
Maximum IOPS=3959K
1081: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 6546],  5.0000th=[ 7108], 10.0000th=[ 7429],
     | 20.0000th=[ 7992], 30.0000th=[ 8313], 40.0000th=[ 8955],
     | 50.0000th=[10000], 60.0000th=[11004], 70.0000th=[11646],
     | 80.0000th=[12128], 90.0000th=[12931], 95.0000th=[13735],
     | 99.0000th=[15984], 99.5000th=[16787], 99.9000th=[33253],
     | 99.9500th=[38072], 99.9900th=[41446]

-- 
Jens Axboe




[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux