On Mon, Nov 13, 2023 at 05:57:52PM -0800, Ming Lin wrote:
> Hi,
>
> We are currently conducting performance tests on an application that
> involves writing/reading data to/from ext4 or a raw block device.
> Specifically, for raw block device access, we have implemented a
> simple "userspace filesystem" directly on top of it.
>
> All write/read operations are being tested using buffer_io. However,
> we have observed that the ext4+buffer_io performance significantly
> outperforms raw_block_device+buffer_io:
>
> ext4: write 18G/s, read 40G/s
> raw block device: write 18G/s, read 21G/s

Can you share your exact test case?

I tried the following fio tests on both ext4 over nvme and raw nvme,
and the result is the opposite: raw block device throughput is 2X that
of ext4, and it can be observed in both VM and real hardware.

1) raw NVMe

fio --direct=0 --size=128G --bs=64k --runtime=20 --numjobs=8 --ioengine=psync \
    --group_reporting=1 --filename=/dev/nvme0n1 --name=test-read --rw=read

2) ext4

fio --size=1G --time_based --bs=4k --runtime=20 --numjobs=8 \
    --ioengine=psync --directory=$DIR --group_reporting=1 \
    --unlink=0 --direct=0 --fsync=0 --name=f1 --stonewall --rw=read

>
> We are exploring potential reasons for this difference. One hypothesis
> is related to the page cache radix tree being per inode. Could it be
> that, for the raw_block_device, there is only one radix tree, leading
> to increased lock contention during write/read buffer_io operations?

'perf record/report' should show the hot spot if lock contention is the
reason.

Thanks,
Ming
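
As a minimal sketch of that kind of check (assuming the raw-device fio
read job above is left running in another shell, and that perf is
installed on the test box), a system-wide profile over the 20s run can
show whether lock or page cache functions dominate:

perf record -a -g -- sleep 20    # sample all CPUs with call graphs for 20s
perf report                      # inspect the hottest symbols / call chains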