Re: transaction batching performance & multi-threaded synchronous writers

Ric Wheeler <rwheeler@xxxxxxxxxx> · Tue, 15 Jul 2008 07:29:21 -0400

Andreas Dilger wrote:
On Jul 14, 2008  12:58 -0400, Josef Bacik wrote:

Perhaps we track the average time a commit takes to occur, and then if
the current transaction's age (time since it started) is less than the
average commit time we sleep and wait for more things to join the
transaction, and then we commit.
How does that idea sound?  Thanks,
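
A rough user-space sketch of the proposal above, only to make the
bookkeeping concrete. The struct, the function names and the 3/4 : 1/4
weighting of the running average are all assumptions for illustration;
none of this is existing jbd code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-journal statistics; not a real jbd structure. */
struct commit_stats {
	uint64_t avg_commit_ns;	/* running average of commit duration */
};

/*
 * Fold one measured commit duration into the running average.
 * The first sample seeds the average; later samples get weight 1/4
 * (an arbitrary choice for this sketch).
 */
static void commit_stats_update(struct commit_stats *s, uint64_t commit_ns)
{
	if (s->avg_commit_ns == 0)
		s->avg_commit_ns = commit_ns;
	else
		s->avg_commit_ns = (3 * s->avg_commit_ns + commit_ns) / 4;
}

/*
 * Sleep for more joiners only while the running transaction is
 * younger than one average commit.
 */
static bool should_wait_for_joiners(const struct commit_stats *s,
				    uint64_t now_ns, uint64_t trans_start_ns)
{
	return (now_ns - trans_start_ns) < s->avg_commit_ns;
}
```

A caller would update the average once per commit and consult
should_wait_for_joiners() at the point where a sync handle is about to
force the commit.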

The drawback of this approach is that if the thread waits an extra "average
transaction time" for the transaction to commit then this will increase the
average transaction time each time, and it still won't tell you if there
needs to be a wait at all.

What might be more interesting is tracking how many processes had sync
handles on the previous transaction(s), and once that number of processes
have done that work, or the timeout reached, the transaction is committed.

While this might seem like a hack for the particular benchmark, this
will also optimize real-world workloads like mailserver, NFS/fileserver,
http where the number of threads running at one time is generally fixed.

The best way to do that would be to keep a field in the task struct to
track whether a given thread has participated in transaction "T" when
it starts a new handle, and if not then increment the "number of sync
threads on this transaction" counter.

In journal_stop() if t_num_sync_thr >= prev_num_sync_thr then
the transaction can be committed earlier, and if not then it does a
wait_event_interruptible_timeout(cur_num_sync_thr >= prev_num_sync_thr, 1).

While the number of sync threads is growing or constant the commits will 
be rapid, and any "slow" threads will block on the next transaction and
increment its num_sync_thr until the thread count stabilizes (i.e. a small
number of transactions at startup).  After that the wait will be exactly
as long as needed for each thread to participate.  If some threads are
too slow, or stop processing then there will be a single sleep and the
next transaction will wait for fewer threads the next time.
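
The counting scheme described above could look roughly like this in
user-space C. It is only a sketch: struct txn, struct task, last_tid and
the function names are made up for illustration, locking is omitted, and
the timed wait is left to the caller (in the kernel that would be the
wait_event_interruptible_timeout() mentioned earlier).

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the transaction and the task struct field. */
struct txn {
	unsigned int tid;		/* id of the running transaction */
	unsigned int num_sync_thr;	/* sync threads seen on this one */
	unsigned int prev_num_sync_thr;	/* sync threads on the previous one */
};

struct task {
	unsigned int last_tid;	/* last transaction this thread joined, 0 = none */
};

/* Starting a handle: count each thread once per transaction. */
static void handle_start(struct txn *t, struct task *tsk)
{
	if (tsk->last_tid != t->tid) {
		tsk->last_tid = t->tid;
		t->num_sync_thr++;
	}
}

/* In journal_stop(): commit at once if as many threads joined as last time. */
static bool can_commit_now(const struct txn *t)
{
	return t->num_sync_thr >= t->prev_num_sync_thr;
}

/* At commit: the current count becomes the target for the next transaction. */
static void txn_commit(struct txn *t)
{
	t->prev_num_sync_thr = t->num_sync_thr;
	t->num_sync_thr = 0;
	t->tid++;
}
```

If can_commit_now() is false the caller sleeps until it becomes true or
the timeout fires, then commits anyway. A commit after a timeout records
the lower count, so the next transaction waits for fewer threads.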

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

This really sounds like one of those math problems (queuing theory?)
that I never was able to completely wrap my head around back at
university, but the basic things that we have are:

   (1) the average time it takes to complete an independent 
transaction. This will be different for each target device and will 
possibly change over time (specific odd case is a shared disk, like an 
array).
   (2) the average cost it takes to add "one more" thread to a 
transaction. I think that the assumption is that this cost is close to zero.
   (3) the rate of arrival of threads trying to join a transaction.
   (4) some knowledge of the history of which threads did the past
transactions. It is quite reasonable to never wait if a single thread is
the author of the last (most of the last?) sequence of transactions, which
is the good thing in the code now.
   (5) the minimum time we can effectively wait with a given mechanism 
(4ms or 1ms for example depending on the HZ in the code today)

I think the trick here is to try and get a heuristic that works without 
going nuts in complexity.

The obvious thing we need to keep is the heuristic to not wait if we 
detect a single thread workload.

It would seem reasonable not to wait if the latency of the device (1 
above) is lower than the time the chosen mechanism can wait (5). For 
example, if transactions are done in microseconds like for a ramdisk, 
just blast away ;-)

What would be left would be the need to figure out if (3) arrival rate 
would predict a new thread will come along before we would be able to 
finish the current transaction without waiting.
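
Putting the three checks above in order, a sketch of the decision might
be the following. The struct and field names are hypothetical, (2) is
assumed to be zero as discussed, and how the averages are actually
measured is left open.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical inputs, numbered as in the list above. */
struct batch_inputs {
	uint64_t avg_commit_ns;		/* (1) average independent commit time */
	uint64_t avg_interarrival_ns;	/* (3) mean gap between joining threads */
	bool single_writer;		/* (4) one thread did the recent transactions */
	uint64_t min_wait_ns;		/* (5) shortest sleep the mechanism allows */
};

/* Return true if it looks worthwhile to sleep before committing. */
static bool should_wait(const struct batch_inputs *in)
{
	/* Keep the existing behaviour: never wait for a single-thread workload. */
	if (in->single_writer)
		return false;

	/* Device is faster than the shortest possible sleep: just commit. */
	if (in->avg_commit_ns < in->min_wait_ns)
		return false;

	/* Wait only if another thread is expected before the commit would finish. */
	return in->avg_interarrival_ns < in->avg_commit_ns;
}
```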

Does this make any sense? This sounds close to the idea that Josef 
proposed above, we would just tweak his proposal to avoid sleeping in 
the single threaded case.

Ric
