writing processes are blocking in log_wait_common with data=ordered

akpm@zip.com.au (Andrew Morton) · Fri, 03 May 2002 13:15:42 -0700

"Stephen C. Tweedie" wrote:
> 
> Hi,
> 
> On Thu, May 02, 2002 at 01:29:49PM -0400, David Mansfield wrote:
> 
> > The last thing I tried is running the program on a server here running the
> > beta of 'Red Hat Advanced Server' which is running their kernel
> > '2.4.9-26beta.31' and I got:
> >
> > 1399 blks/sec ext3-ordered !!!
> >
> > It turns out that this kernel ignores the O_SYNC on open... ha.
> 
> It looks like a generic 2.4.9 bug.
> 
> What's happening is that ext3 has its own extra logic for tracking
> buffers when it comes to writing pages from the page cache, and part
> of what the jbd layer does marks the pages as dirty.
> 
> Then, the core VFS __block_prepare_write comes along, and it does
> 
>                         if (!atomic_set_buffer_dirty(bh)) {
>                                 __mark_dirty(bh);
>                                 buffer_insert_inode_queue(bh, inode);
>                                 need_balance_dirty = 1;
>                         }
> 
> so if the buffer becomes dirty, it gets hooked onto the appropriate
> inode's queue where O_SYNC can find it.  Unfortunately, ext3 has
> already marked it dirty, so this code doesn't run.

I found the same bug in reiserfs the other day ;)

> It looks as if the same problem is still present in 2.4.18, except
> ameliorated somewhat by new code in ext3_file_write:
> 
>         /*
>          * Nasty: if the file is subject to synchronous writes then we need
>          * to force generic_osync_inode() to call ext3_write_inode().
>          * We do that by marking the inode dirty.  This adds much more
>          * computational expense than we need, but we're going to sync
>          * anyway.
>          */
>         if (IS_SYNC(inode) || (file->f_flags & O_SYNC))
>                 mark_inode_dirty(inode);
> 
> That's actually the wrong fix --- for a multi-page write, it only
> guarantees that the first page gets flushed (and not even that if
> another task happens to flush the inode out before the page write
> completes); and it forces an unnecessary inode sync even when we
> haven't modified i_size and don't need to sync metadata.

Yeah, it's a hack.  It's a way of forcing a commit on
the way out of write(2).  I'm not sure why I didn't
just test O_SYNC _after_ running generic_file_write(),
and force a commit if needed.

I don't quite understand your patch. These tests:

+       if (bh->b_inode != inode)

are a performance optimisation only, yes?

+               buffer_insert_inode_queue(bh, inode);

Should this not be buffer_insert_inode_data_queue()?

I think what you're doing here is allowing generic_osync_inode
to perform the writeback, and to optimise away a commit if
no bitmaps or indirects or inodes were altered, yes?