After several years of helping tune file systems for normal (ATA/S-ATA)
drives, we have been doing some performance work on ext3 & reiserfs on
disk arrays.
One thing that jumps out is that the way we currently batch synchronous
workloads into transactions does really horrible things to performance
on storage devices which have very low latency.
For example, on a mid-range Clariion box, we can use a single thread to
write around 750 files/sec (10240 bytes each) to a single directory in
ext3. That gives us an average time of around 1.3ms per file.
With 2 threads writing to the same directory, we instantly drop down to
234 files/sec.
The culprit seems to be the assumptions in journal_stop() which throw in
a call to schedule_timeout_uninterruptible(1):
/*
 * Implement synchronous transaction batching.  If the handle
 * was synchronous, don't force a commit immediately.  Let's
 * yield and let another thread piggyback onto this transaction.
 * Keep doing that while new threads continue to arrive.
 * It doesn't cost much - we're about to run a commit and sleep
 * on IO anyway.  Speeds up many-threaded, many-dir operations
 * by 30x or more...
 *
 * But don't do this if this process was the most recent one to
 * perform a synchronous write.  We do this to detect the case where a
 * single process is doing a stream of sync writes.  No point in waiting
 * for joiners in that case.
 */
pid = current->pid;
if (handle->h_sync && journal->j_last_sync_writer != pid) {
        journal->j_last_sync_writer = pid;
        do {
                old_handle_count = transaction->t_handle_count;
                schedule_timeout_uninterruptible(1);
        } while (old_handle_count != transaction->t_handle_count);
}
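
To put the one-tick sleep in perspective: schedule_timeout_uninterruptible(1)
sleeps for at least one jiffy, and the length of a jiffy depends on the
kernel's HZ setting, which is not stated in this mail. The helpers below
are only a back-of-the-envelope sketch. The function names are made up
for illustration and HZ=250 is an assumption, not a measured value. Under
that assumption a jiffy is 4ms, roughly three times the 1.3ms per file
seen with one thread, and close to the roughly 4.3ms per file implied by
234 files/sec.

```c
#include <assert.h>

/* Length of one timer tick (jiffy) in microseconds for a given HZ.
 * HZ is a kernel build-time setting; the value passed in is an assumption. */
static unsigned long jiffy_usecs(unsigned int hz)
{
        assert(hz > 0);
        return 1000000UL / hz;
}

/* Average time per file in microseconds for a measured files/sec rate
 * (integer division, so the result is rounded down). */
static unsigned long usecs_per_file(unsigned int files_per_sec)
{
        assert(files_per_sec > 0);
        return 1000000UL / files_per_sec;
}
```

For example, jiffy_usecs(250) gives 4000, usecs_per_file(750) gives 1333
and usecs_per_file(234) gives 4273.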
reiserfs and ext4 have similar if not exactly the same logic.
What seems to be needed here is either a static per file system/storage
device tunable to allow us to change this timeout (maybe with "0"
defaulting back to the old reiserfs trick of simply doing a yield()?) or
a more dynamic, per device way to keep track of the average time it
takes to commit a transaction to disk. Based on that rate, we could
dynamically adjust our logic to account for lower latency devices.
A couple of last thoughts. One, if for some reason you don't have a low
latency storage array handy and want to test this for yourselves, you
can test the worst case by using a ram disk.
The test we used was fs_mark with 10240-byte files, writing to one
shared directory and varying the number of threads from 1 up to 40. In
the ext3 case, it takes 8 concurrent threads to catch up to the single
thread writing case.
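
For anyone without fs_mark handy, the per-thread work can be
approximated with something like the sketch below. This is not fs_mark
itself: the function name and file naming scheme are made up, and it
assumes each file is made durable with fsync() before close. Run one
copy per thread or process against the same directory and time it
externally.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define FILE_SIZE 10240 /* bytes per file, as in the test above */

/* Create `count` files of FILE_SIZE bytes in `dir`, calling fsync() on
 * each before closing it.  Stops at the first error and returns the
 * number of files written successfully. */
static int write_synced_files(const char *dir, int count)
{
        char buf[FILE_SIZE];
        char path[4096];
        int i, done = 0;

        assert(dir != NULL && count >= 0);
        memset(buf, 'x', sizeof(buf));

        for (i = 0; i < count; i++) {
                int fd;

                /* The pid keeps names unique across concurrent processes. */
                snprintf(path, sizeof(path), "%s/f%d.%d",
                         dir, (int)getpid(), i);
                fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
                if (fd < 0)
                        break;
                if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf) ||
                    fsync(fd) != 0) {
                        close(fd);
                        break;
                }
                close(fd);
                done++;
        }
        return done;
}
```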
We are continuing to play with the code and try out some ideas, but I
wanted to bounce this off the broader list to see if this makes sense...
ric