On Thu, 14 Jan 2021, Mikulas wrote:

>> I'm working with Mingkai on optimizations for Ext4-dax.
>
> What specific patch are you working on? Please, post it somewhere.

Here is the work-in-progress patch:
https://ipads.se.sjtu.edu.cn:1312/opensource/linux/-/tree/ext4-read
For now it only contains the "read" implementation for Ext4-dax; we will
add the other optimizations to it later.

> What happens if you use this trick ( https://lkml.org/lkml/2021/1/11/1612 )
> - detect in the "read_iter" method that there is just one segment and
> treat it like a "read" method. I think that it should improve performance
> for your case.

Note that the original Ext4-dax does not implement the "read" method.
Instead, it calls the "dax_iomap_rw" method provided by VFS. So, as a
baseline for comparison, we first rewrote the "read_iter" method so that
it iterates over struct iov_iter and calls our "read" method for each
segment.

Overall time of 2^26 4KB reads:

"read_iter" method with dax_iomap_rw (original)              - 36.477s
"read_iter" method wrapping our "read" method                - 28.950s
"read_iter" method testing for one entry (Mikulas' proposal) - 27.947s
"read" method                                                - 26.899s

As we mentioned in the previous email
(https://lkml.org/lkml/2021/1/12/710), the overhead mainly consists of
two parts. The first is constructing struct iov_iter and iterating over
it (i.e., new_sync, _copy_mc_to_iter and iov_iter_init). The second is
the dax io mechanism provided by VFS (i.e., dax_iomap_rw, iomap_apply and
ext4_iomap_begin).

For Ext4-dax, the overhead of dax_iomap_rw is significant compared to the
overhead of struct iov_iter: replacing dax_iomap_rw saves about 7.5s
(36.477s -> 28.950s), while removing the iterator on top of that saves
about 2s (28.950s -> 26.899s). Although the methods proposed by Mikulas
eliminate the iov_iter overhead well, they cannot be applied to Ext4-dax
unless we implement an internal "read" method in Ext4-dax.

For Ext4-dax, there are two possible approaches to optimization:

1) implementing the internal "read" method without the complexity of
   iterators and dax_iomap_rw;
2) optimizing how dax_iomap_rw works.

Since dax_iomap_rw requires ext4_iomap_begin, which in turn involves the
iomap structure and other machinery (e.g., journaling status locks in
Ext4), we think implementing the internal "read" method would be easier.

As for whether the external .read interface in VFS should be kept: since
there is still a performance gap (3.9%) between the "read" method and the
optimized "read_iter" method, we think keeping it is better.

Thanks,
Zhongwei
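
For reference, here is a minimal user-space sketch of the one-segment test
discussed above. It is not taken from the patch: demo_file, demo_iter,
demo_read and demo_read_iter are hypothetical names, and a plain memory
buffer stands in for the DAX mapping. It only illustrates the control
flow (single segment goes straight to "read", otherwise fall back to a
per-segment loop) and says nothing about the real kernel code paths or
their cost.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical stand-in for a DAX-mapped file: just a memory region. */
struct demo_file {
	const char *data;
	size_t size;
};

/* Hypothetical, much simplified stand-in for the kernel's iov_iter. */
struct demo_iter {
	const struct iovec *iov;
	size_t nr_segs;
};

/* Direct "read": one bounds check and one copy, no iterator involved. */
static ssize_t demo_read(const struct demo_file *f, void *buf,
			 size_t len, off_t *pos)
{
	size_t avail;

	if (*pos < 0)
		return -1;
	if ((size_t)*pos >= f->size)
		return 0;			/* at or past EOF */

	avail = f->size - (size_t)*pos;
	if (len > avail)
		len = avail;

	memcpy(buf, f->data + *pos, len);
	*pos += (off_t)len;
	return (ssize_t)len;
}

/* "read_iter" that treats the single-segment case like a plain "read". */
static ssize_t demo_read_iter(const struct demo_file *f,
			      const struct demo_iter *it, off_t *pos)
{
	ssize_t total = 0;
	size_t i;

	/* Fast path: exactly one segment, skip the iteration entirely. */
	if (it->nr_segs == 1)
		return demo_read(f, it->iov[0].iov_base,
				 it->iov[0].iov_len, pos);

	/* General path: call "read" once per segment. */
	for (i = 0; i < it->nr_segs; i++) {
		ssize_t n = demo_read(f, it->iov[i].iov_base,
				      it->iov[i].iov_len, pos);

		if (n < 0)
			return total ? total : n;
		total += n;
		if ((size_t)n < it->iov[i].iov_len)
			break;			/* short read: hit EOF */
	}
	return total;
}
```

In this sketch the fast path and the general path return the same result
for a single segment; the only difference is that the fast path avoids
the loop setup, which is the effect the "tests for one entry" row in the
timings above is measuring in the real kernel.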