Re: background on the ext3 batching performance issue

Jan Kara <jack@xxxxxxx> · Thu, 28 Feb 2008 17:41:05 +0100



> Josef Bacik wrote:
> >On Thursday 28 February 2008 10:05:11 am Josef Bacik wrote:
> >>On Thursday 28 February 2008 7:09:17 am Ric Wheeler wrote:
> >>>At the LSF workshop, I mentioned that we have tripped across an
> >>>embarrassing performance issue in the jbd transaction code which is
> >>>clearly not tuned for low latency devices.
> >>>
> >>>The short summary is that we can do, say, 800 10k files/sec in a
> >>>write/fsync/close loop with a single thread, but drop down to under 250
> >>>files/sec with 2 or more threads.
> >>>
> >>>This is pretty easy to reproduce with any small file write synchronous
> >>>workload (i.e., fsync() each file before close).  We used my fs_mark
> >>>tool to reproduce.
> >>>
> >>>The core of the issue is the call in the jbd transaction code to
> >>>schedule_timeout_uninterruptible(1), which causes us to sleep for 4ms:
> >>>
> >>>        pid = current->pid;
> >>>        if (handle->h_sync && journal->j_last_sync_writer != pid) {
> >>>                journal->j_last_sync_writer = pid;
> >>>                do {
> >>>                        old_handle_count = transaction->t_handle_count;
> >>>                        schedule_timeout_uninterruptible(1);
> >>>                } while (old_handle_count != transaction->t_handle_count);
> >>>        }
> >>>
> >>>This is quite topical to the concern we had with low latency devices in
> >>>general, but specifically things like SSDs.
> >>Your testcase does in fact show a weakness in this optimization, but look
> >>at the more likely case, where you have multiple writers on the same
> >>filesystem rather than one guy doing write/fsync.  If we wait we could
> >>potentially add quite a few more buffers to this transaction before
> >>flushing it, rather than flushing a buffer or two at a time.  What would
> >>you propose as a solution?
> >>
> >
> >Forgive me, I said that badly, now that I've had my morning coffee let me 
> >try again.  You are ping-ponging the j_last_sync_writer back and forth 
> >between the two threads, so you don't get the speedup you would get with 
> >one thread where we would just bypass the next sleep since we know we've 
> >got one thread doing write/sync.  So this brings up the question, should 
> >we try and figure out if we have the situation where we have multiple 
> >threads doing write/sync and therefore exploiting the weakness in this 
> >optimization, and if we should, how would we do this properly?  The only 
> >thing I can think to do is to track sync writers on a transaction, and if 
> >it's more than one, bypass this little snippet.  In fact I think I'll go 
> >ahead and do that and see what fs_mark comes up with.  Thank you,
> >
> >Josef
> >
> 
> One more thought - what we really want here is to have a sense of the 
> latency of the device. In the S-ATA disk case, this optimization works 
> well for batching since we "spend" an extra 4ms worst case in the chance 
> of combining multiple, slow 18ms operations.
> 
> With the clariion box we tested, the optimization fails badly since the 
> cost is only 1.3 ms so we optimize by waiting 3-4 times longer than it 
> would take to do the operation immediately.
> 
> This also seems to me to be the same problem that IO schedulers have
> with plugging - we want to dynamically figure out when to plug and
> unplug here without hard coding in device specific tunings.
> 
> If we bypass the snippet for multi-threaded writers, we would probably 
> slow down this workload on normal S-ATA/ATA drives (or even higher 
> performance non-RAID disks).
  Exactly. I can run some tests next week, but I guess that for the standard
disk you have in your desktop this optimization is really worthwhile, since
a transaction commit has a significant cost on such a drive (and we already
suck at fsync() performance in ext3 for other reasons, so I wouldn't like
to make it even worse ;).
  The question is how we could tell in JBD whether the optimization is
worth it or not. A journal flag (settable via tune2fs) is always an option,
but if somebody has a better idea... If mkfs did some magic and
automatically set the flag when it found that the device has low latency,
that might actually be quite a satisfactory solution. This option might
also be useful for people who prefer lower fsync latency over general
throughput.
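  To make the flag idea a bit more concrete, here is a rough userspace
sketch in C, not a patch. JFLAG_NO_SYNC_BATCH, struct journal_model and
both helper names are made up for illustration; nothing like them exists
in JBD today. It only models the decision: skip the batching sleep when
the flag is set, and let mkfs set the flag when a commit looks cheaper
than the sleep itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical journal flag: skip the sync-writer batching sleep. */
#define JFLAG_NO_SYNC_BATCH 0x1u

/* Minimal stand-in for the fields used here; this is not the real journal_t. */
struct journal_model {
	unsigned int flags;
	int last_sync_writer;
};

/*
 * mkfs-time guess (an assumption, not an existing heuristic): batching
 * cannot pay off if one commit is cheaper than the sleep we would add.
 */
bool device_is_low_latency(unsigned int commit_us, unsigned int sleep_us)
{
	return commit_us < sleep_us;
}

/*
 * Returns true if a handle from 'pid' should sleep to let other handles
 * join the transaction. Mirrors the quoted snippet, plus the flag check.
 */
bool should_batch_wait(struct journal_model *j, bool h_sync, int pid)
{
	assert(j != NULL);

	/* Non-sync handles never wait; the flag disables waiting entirely. */
	if (!h_sync || (j->flags & JFLAG_NO_SYNC_BATCH))
		return false;

	/* The same writer syncing again bypasses the sleep, as today. */
	if (j->last_sync_writer == pid)
		return false;

	j->last_sync_writer = pid;
	return true;
}
```

With the numbers from this thread and a 4ms sleep,
device_is_low_latency(18000, 4000) is false for the S-ATA disk, so
batching stays on, while device_is_low_latency(1300, 4000) is true for
the Clariion box, so mkfs would set the flag.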

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SuSE CR Labs