On 6/22/22 7:40 AM, Dylan Yudaken wrote: > Task work currently uses a spin lock to guard task_list and > task_running. Some use cases such as networking can trigger task_work_add > from multiple threads all at once, which suffers from contention here. > > This can be changed to use a lockless list which seems to have better > performance. Running the micro benchmark in [1] I see 20% improvment in > multithreaded task work add. It required removing the priority tw list > optimisation, however it isn't clear how important that optimisation is. > Additionally it has fairly easy to break semantics. > > Patch 1-2 remove the priority tw list optimisation > Patch 3-5 add lockless lists for task work > Patch 6 fixes a bug I noticed in io_uring event tracing > Patch 7-8 adds tracing for task_work_run I ran some IRQ driven workloads on this. Basic 512b random read, DIO, IRQ, and then at queue depths 1-64, doubling every time. Once we get to QD=8, start doing submit/complete batch of 1/4th of the QD so we ramp up there too. Results below, first set is 5.19-rc3 + for-5.20/io_uring, second set is that plus this series. This is what I ran: sudo taskset -c 12 t/io_uring -d<QD> -b512 -s<batch> -c<batch> -p0 -F1 -B1 -n1 -D0 -R0 -X1 -R1 -t1 -r5 /dev/nvme0n1 on a gen2 optane drive. tldr - looks like an improvement there too, and no ill effects seen on latency. 5.19-rc3 + for-5.20/io_uring: QD1, Batch=1 Maximum IOPS=244K 1509: Latency percentiles: percentiles (nsec): | 1.0000th=[ 3996], 5.0000th=[ 3996], 10.0000th=[ 3996], | 20.0000th=[ 4036], 30.0000th=[ 4036], 40.0000th=[ 4036], | 50.0000th=[ 4036], 60.0000th=[ 4036], 70.0000th=[ 4036], | 80.0000th=[ 4076], 90.0000th=[ 4116], 95.0000th=[ 4196], | 99.0000th=[ 4437], 99.5000th=[ 5421], 99.9000th=[ 7590], | 99.9500th=[ 9518], 99.9900th=[32289] QD=2, Batch=1 Maximum IOPS=483K 1533: Latency percentiles: percentiles (nsec): | 1.0000th=[ 3714], 5.0000th=[ 3755], 10.0000th=[ 3795], | 20.0000th=[ 3795], 30.0000th=[ 3835], 40.0000th=[ 3955], | 50.0000th=[ 4036], 60.0000th=[ 4076], 70.0000th=[ 4076], | 80.0000th=[ 4076], 90.0000th=[ 4116], 95.0000th=[ 4156], | 99.0000th=[ 4518], 99.5000th=[ 6144], 99.9000th=[ 7510], | 99.9500th=[ 9839], 99.9900th=[32289] QD=4, Batch=1 Maximum IOPS=907K 1583: Latency percentiles: percentiles (nsec): | 1.0000th=[ 3393], 5.0000th=[ 3514], 10.0000th=[ 3594], | 20.0000th=[ 3634], 30.0000th=[ 3795], 40.0000th=[ 3875], | 50.0000th=[ 3955], 60.0000th=[ 4076], 70.0000th=[ 4156], | 80.0000th=[ 4277], 90.0000th=[ 4397], 95.0000th=[ 4477], | 99.0000th=[ 5120], 99.5000th=[ 5903], 99.9000th=[ 9357], | 99.9500th=[11004], 99.9900th=[32289] QD=8, Batch=2 Maximum IOPS=1688K 1631: Latency percentiles: percentiles (nsec): | 1.0000th=[ 3353], 5.0000th=[ 3554], 10.0000th=[ 3634], | 20.0000th=[ 3755], 30.0000th=[ 3875], 40.0000th=[ 4036], | 50.0000th=[ 4156], 60.0000th=[ 4277], 70.0000th=[ 4437], | 80.0000th=[ 4678], 90.0000th=[ 4839], 95.0000th=[ 5040], | 99.0000th=[ 6305], 99.5000th=[ 7028], 99.9000th=[10080], | 99.9500th=[15502], 99.9900th=[32932] QD=16, Batch=4 Maximum IOPS=2613K 1680: Latency percentiles: percentiles (nsec): | 1.0000th=[ 3955], 5.0000th=[ 4397], 10.0000th=[ 4558], | 20.0000th=[ 4759], 30.0000th=[ 4959], 40.0000th=[ 5120], | 50.0000th=[ 5261], 60.0000th=[ 5502], 70.0000th=[ 5743], | 80.0000th=[ 5903], 90.0000th=[ 6305], 95.0000th=[ 6706], | 99.0000th=[ 8393], 99.5000th=[ 8955], 99.9000th=[11325], | 99.9500th=[31968], 99.9900th=[34217] QD=32, Batch=8 Maximum IOPS=3573K 1706: Latency percentiles: percentiles (nsec): | 1.0000th=[ 4919], 5.0000th=[ 5662], 10.0000th=[ 5903], | 20.0000th=[ 6144], 30.0000th=[ 6465], 40.0000th=[ 6626], | 50.0000th=[ 6867], 60.0000th=[ 7188], 70.0000th=[ 7510], | 80.0000th=[ 7992], 90.0000th=[ 8714], 95.0000th=[ 9357], | 99.0000th=[11325], 99.5000th=[11967], 99.9000th=[16626], | 99.9500th=[34217], 99.9900th=[37108] QD=64, Batch=16 Maximum IOPS=3953K 1735: Latency percentiles: percentiles (nsec): | 1.0000th=[ 6626], 5.0000th=[ 7188], 10.0000th=[ 7510], | 20.0000th=[ 7992], 30.0000th=[ 8393], 40.0000th=[ 9116], | 50.0000th=[10160], 60.0000th=[11164], 70.0000th=[11646], | 80.0000th=[12128], 90.0000th=[12931], 95.0000th=[13735], | 99.0000th=[15984], 99.5000th=[16787], 99.9000th=[34217], | 99.9500th=[38072], 99.9900th=[40964] ============ 5.19-rc3 + for-5.20/io_uring + this series: QD=1, Batch=1 Maximum IOPS=246K 909: Latency percentiles: percentiles (nsec): | 1.0000th=[ 3955], 5.0000th=[ 3996], 10.0000th=[ 3996], | 20.0000th=[ 3996], 30.0000th=[ 3996], 40.0000th=[ 3996], | 50.0000th=[ 3996], 60.0000th=[ 3996], 70.0000th=[ 4036], | 80.0000th=[ 4036], 90.0000th=[ 4076], 95.0000th=[ 4116], | 99.0000th=[ 4196], 99.5000th=[ 5341], 99.9000th=[ 7590], | 99.9500th=[ 9357], 99.9900th=[32289] QD=2, Batch=1 Maximum IOPS=487K 932: Latency percentiles: percentiles (nsec): | 1.0000th=[ 3714], 5.0000th=[ 3755], 10.0000th=[ 3755], | 20.0000th=[ 3755], 30.0000th=[ 3795], 40.0000th=[ 3795], | 50.0000th=[ 3996], 60.0000th=[ 4036], 70.0000th=[ 4036], | 80.0000th=[ 4036], 90.0000th=[ 4076], 95.0000th=[ 4116], | 99.0000th=[ 4437], 99.5000th=[ 6224], 99.9000th=[ 7510], | 99.9500th=[ 9598], 99.9900th=[32289] QD=4, Batch=1 aximum IOPS=921K 955: Latency percentiles: percentiles (nsec): | 1.0000th=[ 3393], 5.0000th=[ 3433], 10.0000th=[ 3514], | 20.0000th=[ 3594], 30.0000th=[ 3674], 40.0000th=[ 3795], | 50.0000th=[ 3875], 60.0000th=[ 3996], 70.0000th=[ 4036], | 80.0000th=[ 4156], 90.0000th=[ 4317], 95.0000th=[ 4678], | 99.0000th=[ 5120], 99.5000th=[ 5903], 99.9000th=[ 9116], | 99.9500th=[10522], 99.9900th=[32289] QD=8, Batch=2 Maximum IOPS=1658K 981: Latency percentiles: percentiles (nsec): | 1.0000th=[ 3313], 5.0000th=[ 3514], 10.0000th=[ 3594], | 20.0000th=[ 3714], 30.0000th=[ 3835], 40.0000th=[ 3996], | 50.0000th=[ 4116], 60.0000th=[ 4196], 70.0000th=[ 4397], | 80.0000th=[ 4598], 90.0000th=[ 4718], 95.0000th=[ 4919], | 99.0000th=[ 6385], 99.5000th=[ 6947], 99.9000th=[10000], | 99.9500th=[15180], 99.9900th=[32932] QD=16, Batch=4 Maximum IOPS=2749K 1010: Latency percentiles: percentiles (nsec): | 1.0000th=[ 3955], 5.0000th=[ 4437], 10.0000th=[ 4558], | 20.0000th=[ 4759], 30.0000th=[ 4959], 40.0000th=[ 5120], | 50.0000th=[ 5261], 60.0000th=[ 5502], 70.0000th=[ 5743], | 80.0000th=[ 5903], 90.0000th=[ 6224], 95.0000th=[ 6626], | 99.0000th=[ 8313], 99.5000th=[ 9036], 99.9000th=[11967], | 99.9500th=[32289], 99.9900th=[34217] QD=32, Batch=8 Maximum IOPS=3583K 1050: Latency percentiles: percentiles (nsec): | 1.0000th=[ 4879], 5.0000th=[ 5582], 10.0000th=[ 5903], | 20.0000th=[ 6224], 30.0000th=[ 6465], 40.0000th=[ 6626], | 50.0000th=[ 6787], 60.0000th=[ 7028], 70.0000th=[ 7349], | 80.0000th=[ 7911], 90.0000th=[ 8634], 95.0000th=[ 9196], | 99.0000th=[11164], 99.5000th=[11967], 99.9000th=[16305], | 99.9500th=[34217], 99.9900th=[37108] QD=64, Batch=16 Maximum IOPS=3959K 1081: Latency percentiles: percentiles (nsec): | 1.0000th=[ 6546], 5.0000th=[ 7108], 10.0000th=[ 7429], | 20.0000th=[ 7992], 30.0000th=[ 8313], 40.0000th=[ 8955], | 50.0000th=[10000], 60.0000th=[11004], 70.0000th=[11646], | 80.0000th=[12128], 90.0000th=[12931], 95.0000th=[13735], | 99.0000th=[15984], 99.5000th=[16787], 99.9000th=[33253], | 99.9500th=[38072], 99.9900th=[41446] -- Jens Axboe