From: John Hubbard <jhubbard@xxxxxxxxxx> Hi, here is fodder for conversation during LPC. Changes since v1: a) Uses a simpler set/clear/test bit approach in the page flags. b) Added some initial performance results in the cover letter here, below. c) Rebased to latest linux.git d) Puts pages back on the LRU when its done with them. Here is an updated proposal for tracking pages that have had get_user_pages*() called on them. This is in support of fixing the problem discussed in [1]. This RFC only shows how to set up a reliable PageDmaPinned flag. What to *do* with that flag is left for a later discussion. The sequence would be: -- apply patches 1-2, -- convert the rest of the subsystems to call put_user_page*(). Patch #3 is an example of that. It converts infiniband. -- apply patches 4-6, -- apply more patches, to actually use the new PageDmaPinned flag. One question up front was, "how do we ensure that either put_user_page() or put_page() are called, depending on whether the page came from get_user_pages() or not?". From this series, you can see that: -- It's possible to assert within put_page(), that we are probably in the right place. In practice this assertion definitely helps. -- In the other direction, if put_user_page() is called when put_page() should have been used instead, then a "clear" report of LRU list corruption shows up reliably, because the first thing put_user_page() attempts is to decrement the lru's list.prev pointer, and so you'll see pointer values that are one less than an aligned pointer value. This is not great, but it's usable. So I think that the conversion will turn into an exercise of trying to get code coverage, and that should work out. I have lots of other patches, not shown here, in various stages of polish, to convert enough of the kernel in order to run fio [2]. I did that, and got some rather noisy performance results. Here they are again: Performance notes: 1. These fio results are noisy. The std deviation is large enough that some of this could be noise. In order to highlight that, I did 5 runs each of with, and without the patch, and while there is definitely a performance drop on average, it's also true that there is overlap in the results. In other words, the best "with patch" run is about the same as the worst "without patch" run. 2. Initial profiling shows that we're adding about 1% total to the this particular test run...I think. I may have to narrow this down some more, but I don't seem to see any real lock contention. Hints or ideas on measurement methods are welcome, btw. -- 0.59% in put_user_page -- 0.2% (or 0.54%, depending on how you read the perf out below) in get_user_pages_fast: 1.36%--iov_iter_get_pages | --1.27%--get_user_pages_fast | --0.54%--pin_page_for_dma 0.59%--put_user_page 1.34% 0.03% fio [kernel.vmlinux] [k] _raw_spin_lock 0.95% 0.55% fio [kernel.vmlinux] [k] do_raw_spin_lock 0.17% 0.03% fio [kernel.vmlinux] [k] isolate_lru_page 0.06% 0.00% fio [kernel.vmlinux] [k] putback_lru_page 4. Here's some raw fio data: one run without the patch, and two with the patch: ------------------------------------------------------ WITHOUT the patch: ------------------------------------------------------ reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64 fio-3.3 Starting 1 process Jobs: 1 (f=1): [R(1)][100.0%][r=55.5MiB/s,w=0KiB/s][r=14.2k,w=0 IOPS][eta 00m:00s] reader: (groupid=0, jobs=1): err= 0: pid=1750: Tue Nov 6 20:18:06 2018 read: IOPS=13.9k, BW=54.4MiB/s (57.0MB/s)(1024MiB/18826msec) slat (usec): min=25, max=4870, avg=68.19, stdev=85.21 clat (usec): min=74, max=19814, avg=4525.40, stdev=1844.03 lat (usec): min=183, max=19927, avg=4593.69, stdev=1866.65 clat percentiles (usec): | 1.00th=[ 3687], 5.00th=[ 3720], 10.00th=[ 3720], 20.00th=[ 3752], | 30.00th=[ 3752], 40.00th=[ 3752], 50.00th=[ 3752], 60.00th=[ 3785], | 70.00th=[ 4178], 80.00th=[ 4490], 90.00th=[ 6652], 95.00th=[ 8225], | 99.00th=[13173], 99.50th=[14353], 99.90th=[16581], 99.95th=[17171], | 99.99th=[18220] bw ( KiB/s): min=49920, max=59320, per=100.00%, avg=55742.24, stdev=2224.20, samples=37 iops : min=12480, max=14830, avg=13935.35, stdev=556.05, samples=37 lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01% lat (msec) : 2=0.01%, 4=68.78%, 10=28.14%, 20=3.08% cpu : usr=2.39%, sys=95.30%, ctx=669, majf=0, minf=72 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=64 Run status group 0 (all jobs): READ: bw=54.4MiB/s (57.0MB/s), 54.4MiB/s-54.4MiB/s (57.0MB/s-57.0MB/s), io=1024MiB (1074MB), run=18826-18826msec Disk stats (read/write): nvme0n1: ios=259490/1, merge=0/0, ticks=14822/0, in_queue=19241, util=100.00% ------------------------------------------------------ With the patch applied: ------------------------------------------------------ reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64 fio-3.3 Starting 1 process Jobs: 1 (f=1): [R(1)][100.0%][r=51.2MiB/s,w=0KiB/s][r=13.1k,w=0 IOPS][eta 00m:00s] reader: (groupid=0, jobs=1): err= 0: pid=2568: Tue Nov 6 20:03:50 2018 read: IOPS=12.8k, BW=50.1MiB/s (52.5MB/s)(1024MiB/20453msec) slat (usec): min=33, max=4365, avg=74.05, stdev=85.79 clat (usec): min=39, max=19818, avg=4916.61, stdev=1961.79 lat (usec): min=100, max=20002, avg=4990.78, stdev=1985.23 clat percentiles (usec): | 1.00th=[ 4047], 5.00th=[ 4080], 10.00th=[ 4080], 20.00th=[ 4080], | 30.00th=[ 4113], 40.00th=[ 4113], 50.00th=[ 4113], 60.00th=[ 4146], | 70.00th=[ 4178], 80.00th=[ 4817], 90.00th=[ 7308], 95.00th=[ 8717], | 99.00th=[14091], 99.50th=[15270], 99.90th=[17433], 99.95th=[18220], | 99.99th=[19006] bw ( KiB/s): min=45370, max=55784, per=100.00%, avg=51332.33, stdev=1843.77, samples=40 iops : min=11342, max=13946, avg=12832.83, stdev=460.92, samples=40 lat (usec) : 50=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=96.44%, 20=3.53% cpu : usr=2.91%, sys=95.18%, ctx=398, majf=0, minf=73 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=64 Run status group 0 (all jobs): READ: bw=50.1MiB/s (52.5MB/s), 50.1MiB/s-50.1MiB/s (52.5MB/s-52.5MB/s), io=1024MiB (1074MB), run=20453-20453msec Disk stats (read/write): nvme0n1: ios=261399/0, merge=0/0, ticks=16019/0, in_queue=20910, util=100.00% ------------------------------------------------------ OR, here's a better run WITH the patch applied, and you can see that this is nearly as good as the "without" case: ------------------------------------------------------ reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64 fio-3.3 Starting 1 process Jobs: 1 (f=1): [R(1)][100.0%][r=53.2MiB/s,w=0KiB/s][r=13.6k,w=0 IOPS][eta 00m:00s] reader: (groupid=0, jobs=1): err= 0: pid=2521: Tue Nov 6 20:01:33 2018 read: IOPS=13.4k, BW=52.5MiB/s (55.1MB/s)(1024MiB/19499msec) slat (usec): min=30, max=12458, avg=69.71, stdev=88.01 clat (usec): min=39, max=25590, avg=4687.42, stdev=1925.29 lat (usec): min=97, max=25704, avg=4757.25, stdev=1946.06 clat percentiles (usec): | 1.00th=[ 3884], 5.00th=[ 3884], 10.00th=[ 3916], 20.00th=[ 3916], | 30.00th=[ 3916], 40.00th=[ 3916], 50.00th=[ 3949], 60.00th=[ 3949], | 70.00th=[ 3982], 80.00th=[ 4555], 90.00th=[ 6915], 95.00th=[ 8848], | 99.00th=[13566], 99.50th=[14877], 99.90th=[16909], 99.95th=[17695], | 99.99th=[24249] bw ( KiB/s): min=48905, max=58016, per=100.00%, avg=53855.79, stdev=2115.03, samples=38 iops : min=12226, max=14504, avg=13463.79, stdev=528.76, samples=38 lat (usec) : 50=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01% lat (msec) : 2=0.01%, 4=71.80%, 10=24.66%, 20=3.51%, 50=0.02% cpu : usr=3.47%, sys=94.61%, ctx=370, majf=0, minf=73 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=64 Run status group 0 (all jobs): READ: bw=52.5MiB/s (55.1MB/s), 52.5MiB/s-52.5MiB/s (55.1MB/s-55.1MB/s), io=1024MiB (1074MB), run=19499-19499msec Disk stats (read/write): nvme0n1: ios=260720/0, merge=0/0, ticks=15036/0, in_queue=19876, util=100.00% [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()" [2] fio: https://github.com/axboe/fio John Hubbard (6): mm/gup: finish consolidating error handling mm: introduce put_user_page*(), placeholder versions infiniband/mm: convert put_page() to put_user_page*() mm: introduce page->dma_pinned_flags, _count mm: introduce zone_gup_lock, for dma-pinned pages mm: track gup pages with page->dma_pinned_* fields drivers/infiniband/core/umem.c | 7 +- drivers/infiniband/core/umem_odp.c | 2 +- drivers/infiniband/hw/hfi1/user_pages.c | 11 +- drivers/infiniband/hw/mthca/mthca_memfree.c | 6 +- drivers/infiniband/hw/qib/qib_user_pages.c | 11 +- drivers/infiniband/hw/qib/qib_user_sdma.c | 6 +- drivers/infiniband/hw/usnic/usnic_uiom.c | 7 +- include/linux/mm.h | 13 ++ include/linux/mm_types.h | 22 +++- include/linux/mmzone.h | 6 + include/linux/page-flags.h | 61 +++++++++ mm/gup.c | 58 +++++++- mm/memcontrol.c | 8 ++ mm/page_alloc.c | 1 + mm/swap.c | 138 ++++++++++++++++++++ 15 files changed, 320 insertions(+), 37 deletions(-) -- 2.19.1