Hi Ted,

Appreciate your time; I will get the log from a normal workload. BTW, the
previous JBD2 log is from the node which is in the bad state. A normal node
finishes within 3 seconds, while on this node it takes 10 seconds.

Thank you very much!

Roy

----------------------------------------
> Date: Wed, 12 Aug 2015 18:34:56 -0400
> From: tytso@xxxxxxx
> To: bsdnet@xxxxxxxxxxx
> CC: linux-ext4@xxxxxxxxxxxxxxx
> Subject: Re: fsync stuck at jbd2_log_wait_commit on NVMe devices
>
> On Wed, Aug 12, 2015 at 02:43:22PM -0700, Roy Yang wrote:
>> Hi Ted,
>>
>> I got some JBD2 tracing logs by running our test program; most of them
>> take 10 seconds or so. On the normal system, by comparison, this test
>> only takes 2 seconds. We saw the spike of running, but even after
>> checking the kernel source I don't get what it means. Any insights?
>
> You need to get the trace on your normal system --- the system where,
> under load, you are seeing the long times for a journal commit to
> complete. The jbd2_run_stats tracepoint gives us critical insight into
> what might be going wrong --- but just as it doesn't help much to use a
> diagnostic tool on a pet dog when you are trying to determine why the
> child is running a fever, it doesn't help much to take traces on a
> test program/system when we're trying to actually debug why your
> production server is having trouble....
>
>> <...>-220601 [015] .... 181555.139321: jbd2_run_stats: dev 259,3 tid 1064541 wait 0 request_delay 0 running 0 locked 0 flushing 0 logging 0 handle_count 4 blocks 1 blocks_logged 2
>
> There will be one of these trace points emitted for each jbd2
> transaction commit. Most of the fields, with the exception of the
> last three (handle_count, blocks, and blocks_logged), are in units of
> 1000 HZ "jiffies". So, for example, if you don't have any fsync()s,
> transactions by default will run for 5 seconds before being committed,
> so the "running" time will be a bit larger than 5000 jiffies (the
> timer goes off at 5 seconds, but there may be some delay before the
> jbd2 kernel thread gets scheduled):
>
> jbd2/sda3-8-275 [003] ...2 70449.637938: jbd2_run_stats: dev 8,3 tid 789663 wait 0 request_delay 0 running 5030 locked 0 flushing 0 logging 0 handle_count 18 blocks 5 blocks_logged 6
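
[A minimal sketch of one way to stream these jbd2_run_stats (and
jbd2_handle_stats) lines on the node that is actually showing the slow
commits. It assumes tracefs is mounted at /sys/kernel/tracing (older
setups may have it under /sys/kernel/debug/tracing) and that it runs as
root; trace-cmd or plain shell redirection of the same files would work
just as well.]

/*
 * Sketch only: stream jbd2 commit and handle statistics from tracefs.
 * Assumes tracefs is mounted at /sys/kernel/tracing and that this runs
 * as root on the node that shows the slow commits.
 */
#include <stdio.h>
#include <stdlib.h>

#define TRACEFS "/sys/kernel/tracing"

static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(1);
	}
	fputs(val, f);
	fclose(f);
}

int main(void)
{
	char line[4096];
	FILE *pipe;

	/* enable the per-commit and per-handle statistics events */
	write_str(TRACEFS "/events/jbd2/jbd2_run_stats/enable", "1");
	write_str(TRACEFS "/events/jbd2/jbd2_handle_stats/enable", "1");
	write_str(TRACEFS "/tracing_on", "1");

	/* trace_pipe blocks until events arrive; one line per event */
	pipe = fopen(TRACEFS "/trace_pipe", "r");
	if (!pipe) {
		perror(TRACEFS "/trace_pipe");
		return 1;
	}
	while (fgets(line, sizeof(line), pipe))
		fputs(line, stdout);

	fclose(pipe);
	return 0;
}
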
> The "wait" field indicates the maximum time that a handle associated
> with a transaction had to wait before it could get started. Typically
> this is because we couldn't start a new transaction because the old
> transaction was still in the process of being committed.
>
> The "request_delay" field indicates the delay between when a transaction
> commit was requested (typically due to an fsync) and when the jbd2
> kernel thread actually started working on the commit. (Again, all time
> fields are in units of jiffies, so you can read the integer as
> milliseconds: 5030 jiffies == 5030 milliseconds == 5.03 seconds.)
>
> The "running" field indicates how long the transaction was running; as
> mentioned above, this will typically be around 5000 jiffies unless
> your workload was doing fsync()s, or the transaction was big enough
> that it was forced to close for journal space reasons.
>
> The "locked" field indicates how long it takes for all of the pending
> handles to be completed. When the jbd2 layer starts working on the
> commit, the first thing it does is lock down the transaction so no new
> handles can be started. It then waits for all currently started
> handles to complete. While we are waiting for handles to close, no
> other handles can be started (see the discussion of the maximum wait
> time above). We try to keep handle processing time as short as possible
> (for example, by prefetching blocks before starting a handle), but
> there are cases where this is not possible, or at least not easy.
> Analysis of the jbd2_handle_stats traces can help identify
> longer-running handles. If these long-running handles happen right
> before a journal commit (and for a workload that is overly fsync-happy,
> that may be most handles), they can prolong how long a journal commit
> stays in the "locked" phase.
>
> The "flushing" field indicates how long the transaction commit spent
> writing out data blocks. In data=ordered mode, we only need to write
> out data blocks associated with inodes which had new blocks allocated
> during the last transaction.
>
> The "logging" field indicates how long it took to write out the blocks
> that are actually written to the journal. The number of blocks
> actually written is reflected in the "blocks" and "blocks_logged"
> fields, where the "blocks_logged" field includes the overhead blocks
> (jbd2 descriptor blocks, jbd2 revoke blocks, etc.).
>
> The "handle_count" field indicates how many handles are associated
> with a particular transaction. In general, each simple system call
> (chmod, creat, chown, symlink, read, etc.) that modifies the file
> system requires at most a single handle. The big exceptions to this
> rule of thumb are the truncate(2) and unlink(2) system calls, which may
> require multiple handles in order to complete. The truncate() and
> unlink() system calls are also the ones which tend to hold a handle
> open for longer-than-average amounts of time.
>
> By the way, this is why calling truncate(), doing block allocations,
> and doing fsync() a lot is *not* free. If you are purchasing $$$ NVMe
> storage, in general it's a wise idea for your application to simply do
> non-allocating writes (i.e., overwriting previous blocks in a
> preallocated file) and to track which blocks are used for what in
> your application, instead of trying to ask general-purpose file system
> code to do this for you. If you can use a specialized data structure
> (say, a circular buffer) and manage the head and tail pointers
> yourself, it will almost always require much less overhead than what
> the file system has to do, since the file system doesn't have the
> advantage of specialized knowledge of exactly what your requirements
> might be for your workload.
>
> Regards,
>
> 					- Ted
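
[To make the last suggestion concrete, a rough sketch of the
preallocated-file / circular-buffer pattern described above. The file
name, ring size, and record size are invented for the example; a real
application would also track a tail pointer and persist both pointers
somewhere.]

/*
 * Rough sketch: preallocate a file once, then reuse it as a circular
 * buffer with in-place, non-allocating writes. Names and sizes are
 * made-up example values.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define RING_BYTES   (64 * 1024 * 1024)	/* preallocated once, reused forever */
#define RECORD_BYTES 4096		/* fixed-size, block-aligned records */

int main(void)
{
	int fd = open("ring.dat", O_RDWR | O_CREAT, 0644);
	off_t head = 0;
	char record[RECORD_BYTES];
	int i;

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* allocate all blocks up front so later writes never extend the file */
	if (fallocate(fd, 0, 0, RING_BYTES) < 0) {
		perror("fallocate");
		return 1;
	}

	for (i = 0; i < 32; i++) {
		memset(record, 'A' + (i % 26), sizeof(record));

		/* write in place within the preallocated range; no append, no truncate */
		if (pwrite(fd, record, sizeof(record), head) != (ssize_t)sizeof(record)) {
			perror("pwrite");
			return 1;
		}
		/* durability point; data-only sync, since the file size never changes */
		if (fdatasync(fd) < 0) {
			perror("fdatasync");
			return 1;
		}
		head = (head + RECORD_BYTES) % RING_BYTES;
	}

	close(fd);
	return 0;
}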