Hi,

We are running performance tests on an application that reads and writes data either through ext4 or directly on a raw block device. For the raw block device case, we have implemented a simple "userspace filesystem" directly on top of the device. All reads and writes in both setups go through buffered I/O.

We have observed that ext4 with buffered I/O significantly outperforms the raw block device with buffered I/O:

  ext4:              write 18 GB/s, read 40 GB/s
  raw block device:  write 18 GB/s, read 21 GB/s

We are exploring possible reasons for this difference. One hypothesis involves the page cache radix tree being per inode: on ext4 every file has its own inode and therefore its own tree, whereas all buffered I/O to the raw block device goes through the single block device inode. Could that one shared radix tree lead to increased lock contention during concurrent buffered reads and writes?

Your insights on this matter would be greatly appreciated.

Thanks,
Ming
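
P.S. In case it helps make the setup concrete, below is a minimal sketch of the kind of read test we run (the binary name, device path, thread count, and region sizes are illustrative placeholders, not our exact harness). Each thread issues buffered pread()s over its own region: pointed at N separate ext4 files, the threads hit N page-cache trees (one per inode); pointed at disjoint offsets of one raw block device, they all go through the single block device inode's tree.

#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREADS  8
#define REGION_SZ (1UL << 30)  /* 1 GiB region per thread (placeholder) */
#define BUF_SZ    (1UL << 20)  /* 1 MiB per pread() (placeholder)       */

struct job {
	const char *path; /* one ext4 file per thread, or the raw device for all */
	off_t       base; /* start offset of this thread's region                */
};

static void *reader(void *arg)
{
	struct job *j = arg;
	char *buf = malloc(BUF_SZ);
	int fd = open(j->path, O_RDONLY); /* no O_DIRECT: buffered I/O */

	if (fd < 0 || !buf) {
		perror(j->path);
		exit(1);
	}
	for (off_t off = 0; off < (off_t)REGION_SZ; off += BUF_SZ)
		if (pread(fd, buf, BUF_SZ, j->base + off) < 0) {
			perror("pread");
			exit(1);
		}
	close(fd);
	free(buf);
	return NULL;
}

int main(int argc, char **argv)
{
	/* Mode A: ./rdtest /mnt/ext4/f0 ... /mnt/ext4/f7  -> one inode per thread  */
	/* Mode B: ./rdtest /dev/nvme0n1                   -> one shared bdev inode */
	pthread_t tid[NTHREADS];
	struct job jobs[NTHREADS];

	if (argc < 2) {
		fprintf(stderr, "usage: %s <device> | <file0>..<file%d>\n",
			argv[0], NTHREADS - 1);
		return 1;
	}
	for (int i = 0; i < NTHREADS; i++) {
		jobs[i].path = (argc > NTHREADS) ? argv[1 + i] : argv[1];
		jobs[i].base = (argc > NTHREADS) ? 0 : (off_t)i * (off_t)REGION_SZ;
		pthread_create(&tid[i], NULL, reader, &jobs[i]);
	}
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}

Build with: gcc -O2 -pthread rdtest.c -o rdtest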