On 11/11/24 7:10 AM, Jens Axboe wrote:
> On 11/11/24 6:04 AM, Stefan Metzmacher wrote:
>> Hi Jens,
>>
>>> If the same test case is run with RWF_UNCACHED set for the buffered read,
>>> the output looks as follows:
>>>
>>> Reading bs 65536, uncached 1
>>> 1s: 153144MB/sec
>>> 2s: 156760MB/sec
>>> 3s: 158110MB/sec
>>> 4s: 158009MB/sec
>>> 5s: 158043MB/sec
>>> 6s: 157638MB/sec
>>> 7s: 157999MB/sec
>>> 8s: 158024MB/sec
>>> 9s: 157764MB/sec
>>> 10s: 157477MB/sec
>>> 11s: 157417MB/sec
>>> 12s: 157455MB/sec
>>> 13s: 157233MB/sec
>>> 14s: 156692MB/sec
>>>
>>> which is just chugging along at ~155GB/sec of read performance. Looking
>>> at top, we see:
>>>
>>>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>>    7961 root      20   0  267004      0      0 S  3180   0.0   5:37.95 uncached
>>>    8024 axboe     20   0   14292   4096      0 R   1.0   0.0   0:00.13 top
>>>
>>> where just the test app is using CPU, and no reclaim is taking place
>>> outside of the main thread. Not only is performance 65% better, it's
>>> also using half the CPU to do it.
>>
>> Do you have numbers for similar code using O_DIRECT, just to
>> see the impact of the memcpy from the page cache to the userspace
>> buffer?
>
> I don't, but I can surely generate those. I didn't consider them that
> interesting for this comparison, which is why I didn't do them. O_DIRECT
> reads for bigger block sizes (or even smaller block sizes, if using
> io_uring + registered buffers) will definitely have lower overhead than
> uncached and buffered IO. Copying at 160GB/sec isn't free :-)
>
> For writes it's a bit more complicated to do an apples-to-apples
> comparison, as uncached IO isn't synchronous like O_DIRECT is. It only
> kicks off the IO, it doesn't wait for it.

Here's the read side - same test as above, using 64K O_DIRECT reads:

1s: 24947MB/sec
2s: 24840MB/sec
3s: 24666MB/sec
4s: 24549MB/sec
5s: 24575MB/sec
6s: 24669MB/sec
7s: 24611MB/sec
8s: 24369MB/sec
9s: 24261MB/sec
10s: 24125MB/sec

which is in fact pretty depressing. As before, this is 32 threads, each
reading a file from a separate XFS mount point, so 32 file systems in
total. If I bump the read size to 128K, it's about 42GB/sec, and 256K
gets you to 71-72GB/sec. Just goes to show you need parallelism to get
the best performance out of the devices with O_DIRECT.

If I run io_uring + dio + registered buffers, I can get ~172GB/sec out
of reading the same 32 files from 32 threads.

-- 
Jens Axboe
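
For context, here is a minimal sketch of what an uncached buffered read
looks like from userspace: an ordinary buffered open plus preadv2() with
the RWF_UNCACHED flag from the patch series under discussion. The flag
value and the file path below are assumptions for illustration (the flag
is not in released uapi headers at this point), and this is not the
actual test tool behind the numbers above.

/* Sketch: buffered read with RWF_UNCACHED via preadv2(2).
 * RWF_UNCACHED is from the patch series under discussion; the value
 * below is an assumption and may not match your kernel headers. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED	0x80	/* assumed value from the series */
#endif

int main(void)
{
	size_t bs = 65536;		/* 64K reads, as in the test above */
	void *buf = malloc(bs);
	struct iovec iov = { .iov_base = buf, .iov_len = bs };
	off_t off = 0;
	ssize_t ret;
	int fd;

	fd = open("/mnt/xfs0/testfile", O_RDONLY);	/* hypothetical path */
	if (fd < 0 || !buf)
		return 1;

	/* reads go through the page cache, but the pages aren't kept
	 * around for caching once the data has been copied out */
	do {
		ret = preadv2(fd, &iov, 1, off, RWF_UNCACHED);
		off += ret;
	} while (ret > 0);

	close(fd);
	free(buf);
	return 0;
}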
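
And for comparison, a rough sketch of the io_uring + O_DIRECT +
registered (fixed) buffer combination mentioned at the end, using stock
liburing. The path and buffer size are made up, and this issues a single
IO rather than the 32-thread loop used for the numbers above; the point
is just that the buffer is registered once up front, so the per-IO page
pinning cost goes away.

/* Sketch: one O_DIRECT read via io_uring with a registered (fixed)
 * buffer, using liburing. Illustrative only, not the actual test tool. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <liburing.h>

#define BUF_SIZE	(256 * 1024)	/* 256K, one of the sizes above */

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct iovec iov;
	void *buf;
	int fd;

	/* O_DIRECT wants suitably aligned buffers and offsets */
	if (posix_memalign(&buf, 4096, BUF_SIZE))
		return 1;
	iov.iov_base = buf;
	iov.iov_len = BUF_SIZE;

	fd = open("/mnt/xfs0/testfile", O_RDONLY | O_DIRECT);	/* hypothetical path */
	if (fd < 0)
		return 1;

	if (io_uring_queue_init(8, &ring, 0))
		return 1;

	/* pin and map the buffer once, then reference it by index */
	if (io_uring_register_buffers(&ring, &iov, 1))
		return 1;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read_fixed(sqe, fd, buf, BUF_SIZE, 0, 0);
	io_uring_submit(&ring);

	if (!io_uring_wait_cqe(&ring, &cqe)) {
		printf("read returned %d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	close(fd);
	free(buf);
	return 0;
}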