Josef Bacik wrote:
On Wednesday 05 March 2008 2:19:48 pm Ric Wheeler wrote:
After the IO/FS workshop last week, I posted some details on the slowdown
we see with ext3 when we have a low-latency back end instead of a
normal local disk (SCSI/S-ATA/etc).
...
...
...
It would be really interesting to rerun some of these tests on XFS,
which, as Dave explained in the thread last week, has a more self-tuning
way of batching up transactions....
Note that all of those poor users who have a synchronous write workload
today are in the "1" row for each of the above tables.
Mind giving this a whirl? The fastest thing I've got here is an Apple X RAID,
and it's being used for something else at the moment, so I've only tested this
on a local disk to make sure it didn't make local performance suck (which it
doesn't, btw). This should be equivalent to what David says XFS does. Thanks much,
Josef
diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index c6cbb6c..4596e1c 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -1333,8 +1333,7 @@ int journal_stop(handle_t *handle)
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal = transaction->t_journal;
-	int old_handle_count, err;
-	pid_t pid;
+	int err;
 
 	J_ASSERT(journal_current_handle() == handle);
 
@@ -1353,32 +1352,22 @@ int journal_stop(handle_t *handle)
 
 	jbd_debug(4, "Handle %p going down\n", handle);
 
-	/*
-	 * Implement synchronous transaction batching.  If the handle
-	 * was synchronous, don't force a commit immediately.  Let's
-	 * yield and let another thread piggyback onto this transaction.
-	 * Keep doing that while new threads continue to arrive.
-	 * It doesn't cost much - we're about to run a commit and sleep
-	 * on IO anyway.  Speeds up many-threaded, many-dir operations
-	 * by 30x or more...
-	 *
-	 * But don't do this if this process was the most recent one to
-	 * perform a synchronous write.  We do this to detect the case where a
-	 * single process is doing a stream of sync writes.  No point in waiting
-	 * for joiners in that case.
-	 */
-	pid = current->pid;
-	if (handle->h_sync && journal->j_last_sync_writer != pid) {
-		journal->j_last_sync_writer = pid;
-		do {
-			old_handle_count = transaction->t_handle_count;
-			schedule_timeout_uninterruptible(1);
-		} while (old_handle_count != transaction->t_handle_count);
-	}
-
 	current->journal_info = NULL;
 	spin_lock(&journal->j_state_lock);
 	spin_lock(&transaction->t_handle_lock);
+
+	if (journal->j_committing_transaction && handle->h_sync) {
+		tid_t tid = journal->j_committing_transaction->t_tid;
+
+		spin_unlock(&transaction->t_handle_lock);
+		spin_unlock(&journal->j_state_lock);
+
+		err = log_wait_commit(journal, tid);
+
+		spin_lock(&journal->j_state_lock);
+		spin_lock(&transaction->t_handle_lock);
+	}
+
 	transaction->t_outstanding_credits -= handle->h_buffer_credits;
 	transaction->t_updates--;
 	if (!transaction->t_updates) {
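To make the change concrete: the old heuristic slept a tick and re-checked
t_handle_count while new sync writers kept joining, whereas the patch has a
sync handle simply wait on the transaction that is already committing, so
every sync writer arriving during that window lands in the running
transaction and is covered by a single later commit. Below is a minimal
user-space analogue of that control flow, assuming a toy journal model; the
names (sync_stop, do_commit, running_tid, etc.) are hypothetical stand-ins,
not the kernel code:

/* batch_demo.c - user-space sketch of the patch's sync batching. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;
static long running_tid = 1;	/* transaction currently accepting updates */
static long committing_tid;	/* transaction being written out, 0 if none */
static long committed_tid;	/* last transaction safely "on disk" */

/* Commit transaction tid and every update batched into it (lock held). */
static void do_commit(long tid)
{
	committing_tid = tid;
	running_tid = tid + 1;		/* new arrivals join the next tid */
	pthread_mutex_unlock(&lock);
	usleep(1000);			/* stand-in for the journal I/O */
	pthread_mutex_lock(&lock);
	committed_tid = tid;
	committing_tid = 0;
	pthread_cond_broadcast(&cv);
}

/* Analogue of journal_stop() with handle->h_sync set. */
static void sync_stop(void)
{
	pthread_mutex_lock(&lock);
	long my_tid = running_tid;	/* our update lands in this transaction */

	/*
	 * Patched behaviour: if a commit is already in flight, wait for it
	 * (like log_wait_commit()).  Every sync writer arriving during this
	 * window joins the same running transaction, so one later commit
	 * covers them all - that is the batching.
	 */
	while (committing_tid)
		pthread_cond_wait(&cv, &lock);

	if (committed_tid < my_tid) {
		if (running_tid == my_tid)
			do_commit(my_tid);	/* we drive the commit */
		while (committed_tid < my_tid)
			pthread_cond_wait(&cv, &lock);
	}
	pthread_mutex_unlock(&lock);
}

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < 100; i++)
		sync_stop();		/* one "synchronous write" each */
	return NULL;
}

int main(void)
{
	pthread_t t[8];
	for (int i = 0; i < 8; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < 8; i++)
		pthread_join(t[i], NULL);
	/* 800 sync operations should need far fewer than 800 commits */
	printf("800 sync writes, %ld commits\n", committed_tid);
	return 0;
}

Running the sketch, the commit count comes out far below the 800 sync
operations; that collapsing of many fsyncs into few commits is the effect
the fs_mark numbers below are measuring.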
Running with Josef's patch, I saw a clear improvement from batching these
synchronous operations on ext3 with the RAM disk and the array. It is not
often that you get to make such a simple change and see a 27x improvement ;-)
On the down side, the local disk case took as much as a 30% drop in
performance. The specific disk is not one that I have a lot of
experience with; I would like to retry on a disk that has been qualified
by our group (i.e., one where we have reasonable confidence that there
are no firmware issues, etc.).
Now for the actual results. Each figure is the average of 5 runs at each
thread count; the Baseline and Josef columns are in files/sec, as reported
by fs_mark.
Type         Threads   Baseline     Josef   Speedup (Josef/Baseline)
array              1      320.5     325.4      1.01
array              2      174.9     351.9      2.01
array              4      382.7     593.5      1.55
array              8      644.1     963.0      1.49
array             10      842.9    1038.7      1.23
array             20     1319.6    1432.3      1.08
RAM disk           1     5621.4    5595.1      0.99
RAM disk           2      281.5    7613.3     27.04
RAM disk           4      579.9    9111.5     15.71
RAM disk           8      891.1    9357.3     10.50
RAM disk          10     1116.3    9873.6      8.84
RAM disk          20     1952.0   10703.6      5.48
S-ATA disk         1       19.0      15.1      0.79
S-ATA disk         2       19.9      14.4      0.72
S-ATA disk         4       41.0      27.9      0.68
S-ATA disk         8       60.4      43.2      0.71
S-ATA disk        10       67.1      48.7      0.72
S-ATA disk        20      102.7      74.0      0.72
Background on the tests:
All of this is measured on three devices: a relatively old & slow
array, the local (slow!) 2.5" S-ATA disk in the box, and a RAM disk.
These numbers were generated by using fs_mark to write 4096-byte files
with the following commands:
fs_mark -d /home/test/t -s 4096 -n 40000 -N 50 -D 64 -t 1
...
fs_mark -d /home/test/t -s 4096 -n 20000 -N 50 -D 64 -t 2
...
fs_mark -d /home/test/t -s 4096 -n 10000 -N 50 -D 64 -t 4
...
fs_mark -d /home/test/t -s 4096 -n 5000 -N 50 -D 64 -t 8
...
fs_mark -d /home/test/t -s 4096 -n 4000 -N 50 -D 64 -t 10
...
fs_mark -d /home/test/t -s 4096 -n 2000 -N 50 -D 64 -t 20
...
Note that this spreads the files across 64 subdirectories; each thread
writes 50 files in one directory and then moves on to the next directory
in round-robin fashion. A rough sketch of the per-thread loop follows.
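For reference, here is roughly what each fs_mark thread above is doing,
assuming fs_mark's default fsync-before-close behaviour; the directory
naming and helper names below are illustrative, not fs_mark's actual
layout:

/* workload_sketch.c - hypothetical per-thread loop mirroring the
 * fs_mark options above (-s 4096 -N 50 -D 64). */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define FILE_SIZE 4096	/* -s 4096: bytes per file */
#define PER_DIR     50	/* -N 50: files written before switching dirs */
#define NUM_DIRS    64	/* -D 64: subdirectories to spread across */

static void write_files(const char *base, int nfiles)
{
	char buf[FILE_SIZE] = {0}, path[256];
	int dir = 0;

	for (int i = 0; i < nfiles; i++) {
		/* round-robin: next directory after every PER_DIR files */
		if (i && i % PER_DIR == 0)
			dir = (dir + 1) % NUM_DIRS;
		snprintf(path, sizeof(path), "%s/%02d/file.%d", base, dir, i);

		int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
		if (fd < 0) {
			perror(path);
			return;
		}
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
			perror("write");
		fsync(fd);	/* the synchronous step that forces a commit */
		close(fd);
	}
}

int main(void)
{
	char d[256];

	/* create the 64 subdirectories under the test directory */
	for (int i = 0; i < NUM_DIRS; i++) {
		snprintf(d, sizeof(d), "/home/test/t/%02d", i);
		mkdir(d, 0755);
	}
	/* e.g. the single-thread run: -n 40000 files of 4096 bytes */
	write_files("/home/test/t", 40000);
	return 0;
}

With -t greater than 1, fs_mark runs this loop in several threads at once,
which is exactly the many-threaded synchronous-write pattern the batching
patch targets.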
ric