On Tue, Aug 16, 2011 at 8:03 AM, Tao Ma <tm@xxxxxx> wrote: > On 08/16/2011 09:53 PM, Jan Kara wrote: >> On Mon 15-08-11 16:53:34, Jiaying Zhang wrote: >>> On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <mjt@xxxxxxxxxx> wrote: >>>> 15.08.2011 12:00, Michael Tokarev wrote: >>>> [....] >>>> >>>> So, it looks like this (starting with cold cache): >>>> >>>> 1. rename the redologs and copy them over - this will >>>> make a hot copy of redologs >>>> 2. startup oracle - it will complain that the redologs aren't >>>> redologs, the header is corrupt >>>> 3. shut down oracle, start it up again - it will succeed. >>>> >>>> If between 1 and 2 you'll issue sync(1) everything will work. >>>> When shutting down, oracle calls fsync(), so that's like >>>> sync(1) again. >>>> >>>> If there will be some time between 1. and 2., everything >>>> will work too. >>>> >>>> Without dioread_nolock I can't trigger the problem no matter >>>> how I tried. >>>> >>>> >>>> A smaller test case. I used redo1.odf file (one of the >>>> redologs) as a test file, any will work. >>>> >>>> $ cp -p redo1.odf temp >>>> $ dd if=temp of=foo iflag=direct count=20 >>> Isn't this the expected behavior here? When doing >>> 'cp -p redo1.odf temp', data is copied to temp through >>> buffer write, but there is no guarantee when data will be >>> actually written to disk. Then with 'dd if=temp of=foo >>> iflag=direct count=20', data is read directly from disk. >>> Very likely, the written data hasn't been flushed to disk >>> yet so ext4 returns zero in this case. >> No it's not. Buffered and direct IO is supposed to work correctly >> (although not fast) together. In this particular case we take care to flush >> dirty data from page cache before performing direct IO read... But >> something is broken in this path obviously. I see. Thanks a lot for the explanation. I wonder whether the following patch will solve the problem: diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c index 6c27111..ca90d73 100644 --- a/fs/ext4/indirect.c +++ b/fs/ext4/indirect.c @@ -800,12 +800,17 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb, } retry: - if (rw == READ && ext4_should_dioread_nolock(inode)) + if (rw == READ && ext4_should_dioread_nolock(inode)) { + if (unlikely(!list_empty(&ei->i_completed_io_list))) { + mutex_lock(&inode->i_mutex); + ext4_flush_completed_IO(inode); + mutex_unlock(&inode->i_mutex); + } ret = __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, offset, nr_segs, ext4_get_block, NULL, NULL, 0); - else { + } else { ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, offset, nr_segs, I tested the patch a little bit and it seems to resolve the race on dioread_nolock in my case. Michael, I would very appreciate if you can try this patch with your test case and see whether it works. >> >> I don't have time to dig into this in detail now but what seems to be the >> problem is that with dioread_nolock option, we don't acquire i_mutex for >> direct IO reads anymore. Thus these reads can compete with >> ext4_end_io_nolock() called from ext4_end_io_work() (this is called under >> i_mutex so without dioread_nolock the race cannot happen). >> >> Hmm, the new writepages code seems to be broken in combination with direct >> IO. Direct IO code expects that when filemap_write_and_wait() finishes, >> data is on disk but with new bio submission code this is not true because >> we clear PageWriteback bit (which is what filemap_fdatawait() waits for) in >> ext4_end_io_buffer_write() but do extent conversion only after that in >> convert workqueue. So the race seems to be there all the time, just without >> dioread_nolock it's much smaller. I think ext4_end_io_buffer_write() is only called with dioread_nolock enabled. So I think we should be ok with the default mount options. Jiaying > You are absolutely right. The really problem is that ext4_direct_IO > begins to work *after* we clear the page writeback flag and *before* we > convert unwritten extent to a valid state. Some of my trace does show > that. I am working on it now. >> >> Fixing this is going to be non-trivial - I'm not sure we can really move >> clearing of PageWriteback bit to conversion workqueue. I thienk we already >> tried that once but it caused deadlocks for some reason... > I just did as what you described and yes I met with another problem and > try to resolve it now. Once it is OK, I will send out the patch. > > Thanks > Tao > -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html