Ted,

Thanks a lot for your reply!

On Fri, Oct 16, 2009 at 12:15 PM, Theodore Tso <tytso@xxxxxxx> wrote:
> On Fri, Oct 09, 2009 at 04:34:08PM -0700, Jiaying Zhang wrote:
>> Recently, we have been evaluating ext4 performance on a high-speed SSD.
>> One problem we found is that ext4 performance doesn't scale well with
>> multiple threads or multiple AIOs reading a single file with O_DIRECT.
>> E.g., with a 4k block size, multi-threaded DIO AIO random reads on ext4
>> can lose up to 50% of the throughput we get via raw IO.
>>
>> After some initial analysis, we think the ext4 performance problem is
>> caused by the use of the i_mutex lock during DIO reads. I.e., during a
>> DIO read, we grab the i_mutex lock in __blockdev_direct_IO because ext4
>> uses the default DIO_LOCKING from the generic fs code. I did a quick
>> test by calling blockdev_direct_IO_no_locking() in ext4_direct_IO() and
>> saw ext4 DIO reads reach 99% of the raw IO performance.
>
> Your diagnosis makes sense, and the test of using
> blockdev_direct_IO_no_locking() tends to confirm things, but before
> we start surgery, there's a very easy way of confirming that i_mutex
> is under contention. Enable CONFIG_LOCK_STAT, do "echo 0 >
> /proc/lock_stat" before running your benchmark, and then capture the
> contents of /proc/lock_stat after running your benchmark. For more
> information see Documentation/lockstat.txt in the kernel sources.
> It's always nice to grab before and after snapshots, and it's a handy
> way of finding other contention points.
>
I tried lock_stat as you suggested. I did see a lot of contention with
multiple threads reading a single file via DIO AIO, and a large amount
of time was spent waiting on the i_mutex. I think this confirms our
theory that the use of the i_mutex lock during DIO caused most of the
overhead we saw on the ext4 SSD.

>> That uninitialized extent will be marked as initialized at the
>> end_io callback. We are wondering whether we can extend this idea to
>> buffered writes as well. I.e., we always allocate an uninitialized
>> extent first during any write and convert it to initialized at
>> end_io callback time. This will eliminate the need to hold the
>> i_mutex lock during direct reads because a DIO read should never see
>> a block marked as initialized before the block has been written with
>> new data.
>
> Obviously this will only work for extent-mapped inodes. So it means
> complicating the callback code somewhat, but in theory, this should
> work.
>
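To make the idea a bit more concrete, below is a rough sketch of the kind
of completion callback we have in mind. It is only a sketch, not a patch:
the callback signature and the conversion helper are the ones I believe
the new DIO write path uses (ext4_convert_unwritten_extents()), and they
may differ depending on the tree; a real implementation would also have
to defer the conversion to a workqueue, since the completion can run in
interrupt context.

#include <linux/aio.h>		/* struct kiocb */
#include "ext4.h"		/* ext4_convert_unwritten_extents() */

/*
 * Sketch only: every write first allocates an *uninitialized*
 * (unwritten) extent, and only I/O completion flips it to initialized,
 * so an unlocked DIO read can never see an initialized extent whose
 * data has not yet been written out.
 */
static void sketch_end_io(struct kiocb *iocb, loff_t offset,
			  ssize_t size, void *private)
{
	struct inode *inode = iocb->ki_filp->f_mapping->host;

	if (size <= 0)
		return;

	/*
	 * Mark the unwritten extent covering [offset, offset + size) as
	 * initialized now that the data for that range has been written.
	 * In a real implementation this conversion would be queued to a
	 * workqueue rather than done here directly.
	 */
	ext4_convert_unwritten_extents(inode, offset, size);
}

The interesting part is making the buffered write path allocate its
blocks as uninitialized extents too and hooking the same conversion into
its write completion, which is where the callback code gets more
complicated, as you said.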
> On Thu, Oct 15, 2009 at 06:27:12PM -0700, Jiaying Zhang wrote:
>> Thinking about this again, I think there is another benefit to always
>> allocating uninitialized blocks first in all kinds of writes,
>> including size-extending writes.
>>
>> Currently, we avoid exposing stale data during a size-extending
>> write by adding the inode to the orphan list before changing the
>> file size. This helps us with journaled ext4 because the journal
>> guarantees that the on-disk orphan list will be updated before
>> data is written to disk and i_size is updated. But with non-journal
>> ext4, we don't update the orphan list during size-extending writes
>> because, without a journal, there is no guarantee on the ordering of
>> writes to disk. So if the system crashes after the on-disk i_size is
>> updated but before the data hits the disk (this is rare but can
>> happen during fsync), we may end up exposing stale data with
>> non-journal ext4.
>
> Actually, that's not quite correct. That's the plan for how this works
> in ext3's guarded mode (which hasn't been merged into mainline nor
> ported to ext4), but in ext3's and ext4's ordered mode the way we
> avoid exposing stale data is by forcing data blocks to disk before the
> commit is allowed to complete. You're right that either way, it
> doesn't help you if you're not using a journal. (This is also going to
> be a problem in supporting TRIM/DISCARD, since we are depending on the
> commit mechanism to determine when it's safe to send the TRIM command
> to the block device --- but that's a separate issue.)
>
>> I think allocating an uninitialized extent first during such a
>> size-extending write and then converting the extent to initialized at
>> end_io time can help close this security hole with non-journal ext4.
>
> Well..... strictly speaking, using end_io won't guarantee that stale
> data can never be exposed, since the end_io callback is called when
> the data has been transferred to the disk controller, and not when the
> data has actually been written to iron oxide. In other words, since
> there's no barrier operation, there are no guarantees on a power drop.
> Still, it's certainly better than nothing, and it will make things
> much better than they were previously.
>
I agree. It will be hard to eliminate the stale data exposure problem in
the non-journal case as long as the disk has a volatile write cache, but
I think it can help us with SSDs or battery-backed disks.

>
> By the way, something which would be worth testing is making sure that
> a direct I/O read racing with a buffered write with delayed allocation
> works correctly today. What should happen (although it may not be
> what application programmers expect) is that if a direct I/O read hits
> part of a file which has just been written via a buffered write with
> delayed allocation enabled, and the block hasn't been allocated yet,
> the direct I/O read should behave like a read from a hole. (XFS
> behaves the same way; we don't guarantee cache coherence when direct
> I/O and non-direct I/O are mixed.)
>
Hmm, do you mean that if a DIO read happens right after a buffered write
with delayed allocation, it should return zeros instead of the data
written by the buffered write?

Jiaying

> - Ted
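P.S. In case it helps anyone reproduce the numbers, the "quick test"
mentioned at the top was essentially the one-line change below in
ext4_direct_IO(). This is paraphrased from memory rather than copied
from the actual change, so treat it as a sketch and not as something
mergeable; it simply drops the i_mutex locking that DIO_LOCKING does on
the read path, which is exactly what the uninitialized-extent scheme is
meant to make safe.

-	ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
-				 offset, nr_segs, ext4_get_block, NULL);
+	ret = blockdev_direct_IO_no_locking(rw, iocb, inode,
+					    inode->i_sb->s_bdev, iov,
+					    offset, nr_segs,
+					    ext4_get_block, NULL);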