"Stephen C. Tweedie" wrote: > > Hi, > > On Thu, May 02, 2002 at 01:29:49PM -0400, David Mansfield wrote: > > > The last thing I tried is running the program on a server here running the > > beta of 'Red Hat Advanced Server' which is running their kernel > > '2.4.9-26beta.31' and I got: > > > > 1399 blks/sec ext3-ordered !!! > > > > It turns out that this kernel ignores the O_SYNC on open... ha. > > It looks like a generic 2.4.9 bug. > > What's happening is that ext3 has its own extra logic for tracking > buffers when it comes to writing pages from the page cache, and part > of what the jbd layer does marks the pages as dirty. > > Then, the core VFS __block_prepare_write comes along, and it does > > if (!atomic_set_buffer_dirty(bh)) { > __mark_dirty(bh); > buffer_insert_inode_queue(bh, inode); > need_balance_dirty = 1; > } > > so if the buffer becomes dirty, it gets hooked onto the appropriate > inode's queue where O_SYNC can find it. Unfortunately, ext3 has > already marked it dirty, so this code doesn't run. I found the same bug in reiserfs the other day ;) > It looks as if the same problem is still present in 2.4.18, except > ameliorated somewhat by new code in ext3_file_write: > > /* > * Nasty: if the file is subject to synchronous writes then we need > * to force generic_osync_inode() to call ext3_write_inode(). > * We do that by marking the inode dirty. This adds much more > * computational expense than we need, but we're going to sync > * anyway. > */ > if (IS_SYNC(inode) || (file->f_flags & O_SYNC)) > mark_inode_dirty(inode); > > That's actually the wrong fix --- for a multi-page write, it only > guarantees that the first page gets flushed (and not even that if > another task happens to flush the inode out before the page write > completes); and it forces an unnecessary inode sync even when we > haven't modified i_size and don't need to sync metadata. Yeah, it's a hack. It's a way of forcing a commit on the way out of write(2). I'm not sure why I didn't just test O_SYNC _after_ running generic_file_write(), and force a commit if needed. I don't quite understand your patch. These tests: + if (bh->b_inode != inode) are a performance optimisation only, yes? + buffer_insert_inode_queue(bh, inode); Should this not be buffer_insert_inode_data_queue()? I think what you're doing here is allowing generic_osync_inode to perform the writeback, and to optimise away a commit if no bitmaps or indirects or inodes were altered, yes?