On 11/11/24 7:10 AM, Jens Axboe wrote:
> On 11/11/24 6:04 AM, Stefan Metzmacher wrote:
>> Hi Jens,
>>
>>> If the same test case is run with RWF_UNCACHED set for the buffered read,
>>> the output looks as follows:
>>>
>>> Reading bs 65536, uncached 1
>>> 1s: 153144MB/sec
>>> 2s: 156760MB/sec
>>> 3s: 158110MB/sec
>>> 4s: 158009MB/sec
>>> 5s: 158043MB/sec
>>> 6s: 157638MB/sec
>>> 7s: 157999MB/sec
>>> 8s: 158024MB/sec
>>> 9s: 157764MB/sec
>>> 10s: 157477MB/sec
>>> 11s: 157417MB/sec
>>> 12s: 157455MB/sec
>>> 13s: 157233MB/sec
>>> 14s: 156692MB/sec
>>>
>>> which is just chugging along at ~155GB/sec of read performance. Looking
>>> at top, we see:
>>>
>>>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>>    7961 root      20   0  267004      0      0 S  3180   0.0   5:37.95 uncached
>>>    8024 axboe     20   0   14292   4096      0 R   1.0   0.0   0:00.13 top
>>>
>>> where just the test app is using CPU, and no reclaim is taking place
>>> outside of the main thread. Not only is performance 65% better, it's
>>> also using half the CPU to do it.
>>
>> Do you have numbers for similar code using O_DIRECT, just to
>> see the impact of the memcpy from the page cache to the userspace
>> buffer?
>
> I don't, but I can surely generate those. I didn't consider them that
> interesting for this comparison, which is why I didn't do them. O_DIRECT
> reads for bigger block sizes (or even smaller block sizes, if using
> io_uring + registered buffers) will definitely have lower overhead than
> uncached and buffered IO. Copying at 160GB/sec isn't free :-)
>
> For writes it's a bit more complicated to do an apples-to-apples
> comparison, as uncached IO isn't synchronous like O_DIRECT is. It only
> kicks off the IO, it doesn't wait for it.

Here's the read side - same test as above, using 64K O_DIRECT reads:

1s: 24947MB/sec
2s: 24840MB/sec
3s: 24666MB/sec
4s: 24549MB/sec
5s: 24575MB/sec
6s: 24669MB/sec
7s: 24611MB/sec
8s: 24369MB/sec
9s: 24261MB/sec
10s: 24125MB/sec

which is in fact pretty depressing. As before, this is 32 threads, each
reading a file from a separate XFS mount point, so 32 file systems in
total. If I bump the read size to 128K, it's about 42GB/sec, and 256K
gets you to 71-72GB/sec. Just goes to show you need parallelism to get
the best performance out of the devices with O_DIRECT.

If I run io_uring + dio + registered buffers, I can get ~172GB/sec out
of reading the same 32 files from 32 threads.

-- 
Jens Axboe
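
For context, here is a minimal sketch of what an uncached buffered read
looks like from userspace: an ordinary buffered open plus preadv2() with
the RWF_UNCACHED flag from the patch series under discussion. The flag
value and the file path below are assumptions for illustration (the flag
is not in released uapi headers at this point), and this is not the
actual test tool behind the numbers above.

/* Sketch: buffered read with RWF_UNCACHED via preadv2(2).
 * RWF_UNCACHED is from the patch series under discussion; the value
 * below is an assumption and may not match your kernel headers. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED	0x80	/* assumed value from the series */
#endif

int main(void)
{
	size_t bs = 65536;		/* 64K reads, as in the test above */
	void *buf = malloc(bs);
	struct iovec iov = { .iov_base = buf, .iov_len = bs };
	off_t off = 0;
	ssize_t ret;
	int fd;

	fd = open("/mnt/xfs0/testfile", O_RDONLY);	/* hypothetical path */
	if (fd < 0 || !buf)
		return 1;

	/* reads go through the page cache, but the pages aren't kept
	 * around for caching once the data has been copied out */
	do {
		ret = preadv2(fd, &iov, 1, off, RWF_UNCACHED);
		off += ret;
	} while (ret > 0);

	close(fd);
	free(buf);
	return 0;
}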
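
And for comparison, a rough sketch of the io_uring + O_DIRECT +
registered (fixed) buffer combination mentioned at the end, using stock
liburing. The path and buffer size are made up, and this issues a single
IO rather than the 32-thread loop used for the numbers above; the point
is just that the buffer is registered once up front, so the per-IO page
pinning cost goes away.

/* Sketch: one O_DIRECT read via io_uring with a registered (fixed)
 * buffer, using liburing. Illustrative only, not the actual test tool. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <liburing.h>

#define BUF_SIZE	(256 * 1024)	/* 256K, one of the sizes above */

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct iovec iov;
	void *buf;
	int fd;

	/* O_DIRECT wants suitably aligned buffers and offsets */
	if (posix_memalign(&buf, 4096, BUF_SIZE))
		return 1;
	iov.iov_base = buf;
	iov.iov_len = BUF_SIZE;

	fd = open("/mnt/xfs0/testfile", O_RDONLY | O_DIRECT);	/* hypothetical path */
	if (fd < 0)
		return 1;

	if (io_uring_queue_init(8, &ring, 0))
		return 1;

	/* pin and map the buffer once, then reference it by index */
	if (io_uring_register_buffers(&ring, &iov, 1))
		return 1;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read_fixed(sqe, fd, buf, BUF_SIZE, 0, 0);
	io_uring_submit(&ring);

	if (!io_uring_wait_cqe(&ring, &cqe)) {
		printf("read returned %d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	close(fd);
	free(buf);
	return 0;
}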