Re: ext4 DIO read performance issue on SSD

Mingming <cmm@xxxxxxxxxx> · Thu, 15 Oct 2009 16:28:24 -0700



On Thu, 2009-10-15 at 13:07 -0700, Jiaying Zhang wrote:
> On Thu, Oct 15, 2009 at 10:31 AM, Mingming <cmm@xxxxxxxxxx> wrote:
> > On Wed, 2009-10-14 at 22:14 -0700, Jiaying Zhang wrote:
> >> Mingming,
> >>
> >
> > Hi Jiaying,
> >
> >> On Wed, Oct 14, 2009 at 11:48 AM, Mingming <cmm@xxxxxxxxxx> wrote:
> >> > On Fri, 2009-10-09 at 16:34 -0700, Jiaying Zhang wrote:
> >> >> Hello,
> >> >>
> >> >> Recently, we are evaluating the ext4 performance on a high speed SSD.
> >> >> One problem we found is that ext4 performance doesn't scale well with
> >> >> multiple threads or multiple AIOs reading a single file with O_DIRECT.
> >> >> E.g., with 4k block size, multiple-thread DIO AIO random read on ext4
> >> >> can lose up to 50% throughput compared to the results we get via RAW IO.
> >> >>
> >> >> After some initial analysis, we think the ext4 performance problem is caused
> >> >> by the use of i_mutex lock during DIO read. I.e., during DIO read, we grab
> >> >> the i_mutex lock in __blockdev_direct_IO because ext4 uses the default
> >> >> DIO_LOCKING from the generic fs code. I did a quick test by calling
> >> >> blockdev_direct_IO_no_locking() in ext4_direct_IO() and I saw ext4 DIO read
> >> >> got 99% performance as raw IO.
> >> >>
> >> >
> >> > This is very interesting...and impressive number.
> >> >
> >> > I tried to change ext4 to call blockdev_direct_IO_no_locking() directly,
> >> > but then realize that we can't do this all the time, as ext4 support
> >> > ext3 non-extent based files, and uninitialized extent is not support on
> >> > ext3 format file.
> >> >
> >> >> As we understand, the reason why we want to take i_mutex lock during DIO
> >> >> read is to prevent it from accessing stale data that may be exposed by a
> >> >> simultaneous write. We saw that Mingming Cao has implemented a patch set
> >> >> with which when a get_block request comes from direct write, ext4 only
> >> >> allocates or splits an uninitialized extent. That uninitialized extent
> >> >> will be marked as initialized at the end_io callback.
> >> >
> >> > Though I need to clarify that with all the patches in mainline, we only
> >> > treat new allocated blocks form direct io write to holes, not to writes
> >> > to the end of file. I actually have proposed to treat the write to the
> >> > end of file also as unintialized extents, but there is some concerns
> >> > that this getting tricky with updating inode size when it is async IO
> >> > direct IO. So it didn't getting done yet.
> >>
> >> I read you previous email thread again. As I understand, the main
> >> concern for allocating uninitialized blocks in i_size extending write
> >> is that we may end up having uninitialized blocks beyond i_size
> >> if the system crashes during write. Can we protect this case by
> >> adding the inode into the orphan list in ext4_ext_direct_IO,
> >> i.e., same as we have done in ext4_ind_direct_IO?
> >>
> >
> > Sure we could do that, though initially I hoped we could get rid of
> > that:)
> >
> > The tricky part is async direct write to the end of file. By the time
> > the IO is completed, the inode may be truncated or extended larger.
> > Updating the most "safe" size is the part I haven't thought through yet.
> >
> 
> Ok. I think I understand the problem better now :).
> 
> Looking at the __blockdev_direct_IO(), I saw it actually treats
> size-extending aio dio write as synchronous and expects the dio to
> complete before return (fs/direct-io.c line 1204 and line 1056-1061).

Oh? It seems it will keep the write async as long as it's not a partial
write 
        /*
         * The only time we want to leave bios in flight is when a successful
         * partial aio read or full aio write have been setup.  In that case
         * bio completion will call aio_complete.  The only time it's safe to
         * call aio_complete is when we return -EIOCBQUEUED, so we key on that.
         * This had *better* be the only place that raises -EIOCBQUEUED.
         */
        BUG_ON(ret == -EIOCBQUEUED);
        if (dio->is_async && ret == 0 && dio->result &&
            ((rw & READ) || (dio->result == dio->size)))
                ret = -EIOCBQUEUED;

        if (ret != -EIOCBQUEUED)
                dio_await_completion(dio);

> Can we follow the same rule and check whether it is a size-extending
> aio write in ext4_end_io_dio()? In such cases, we can call
> ext4_end_aio_dio_nolock() synchronously instead of queuing
> the work. I think this will protect us from truncate because we
> are still holding i_mutex and i_alloc_sem.
> 
> Jiaying
> 
> >
> >
> >> Jiaying
> >>
> >> >
> >> >>  We are wondering
> >> >> whether we can extend this idea to buffer write as well. I.e., we always
> >> >> allocate an uninitialized extent first during any write and convert it
> >> >> as initialized at the time of end_io callback. This will eliminate the need
> >> >> to hold i_mutex lock during direct read because a DIO read should never get
> >> >> a block marked initialized before the block has been written with new data.
> >> >>
> >> >
> >> > Oh I don't think so. For buffered IO, the data is being copied to
> >> > buffer, direct IO read would first flush what's in page cache to disk,
> >> > then read from disk. So if there is concurrent buffered write and direct
> >> > read, removing the i_mutex locks from the direct IO path should still
> >> > gurantee the right order, without having to treat buffered allocation
> >> > with uninitialized extent/end_io.
> >> >
> >> > The i_mutex lock, from my understanding, is there to protect direct IO
> >> > write to hole and concurrent direct IO read, we should able to remove
> >> > this lock for extent based ext4 file.
> >> >
> >> >> We haven't implemented anything yet because we want to ask here first to
> >> >> see whether this proposal makes sense to you.
> >> >>
> >> >
> >> > It does make sense to me.
> >> >
> >> > Mingming
> >> >> Regards,
> >> >>
> >> >> Jiaying
> >> >> --
> >> >> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> >> >> the body of a message to majordomo@xxxxxxxxxxxxxxx
> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >
> >> >
> >> >
> >
> >
> >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html