RE: fsync stuck at jbd2_log_wait_commit on NVMe devices

Hi Ted, 

   I appreciate your time, and will get the log from a normal workload.

   BTW, the previous JBD2 log is from the node which is in a bad state.
  A normal node finishes within 3 seconds, while this node takes 10 seconds.
 
   Thank you very much!

   Roy

----------------------------------------
> Date: Wed, 12 Aug 2015 18:34:56 -0400
> From: tytso@xxxxxxx
> To: bsdnet@xxxxxxxxxxx
> CC: linux-ext4@xxxxxxxxxxxxxxx
> Subject: Re: fsync stuck at jbd2_log_wait_commit on NVMe devices
>
> On Wed, Aug 12, 2015 at 02:43:22PM -0700, Roy Yang wrote:
>> Hi Ted,
>>
>> I got some JBD2 tracing logs by running our test program; most of the commits take 10 seconds or so, while on the normal system the same test only takes 2 seconds.
>> We saw the spike in the "running" time, but after checking the kernel source I still don't get what it means. Any insights?
>
> You need to get the trace on your normal system, that is, the system
> where, under load, you are seeing the long times for a journal commit
> to complete. The jbd2_run_stats tracepoint gives us critical insight
> into what might be going wrong. But just as it doesn't help much to
> use a diagnostic tool on a pet dog when you are trying to determine
> why the child is running a fever, it doesn't help much to take traces
> on a test program/system when we're trying to actually debug why your
> production server is having trouble....
>
>> <...>-220601 [015] .... 181555.139321: jbd2_run_stats: dev 259,3 tid 1064541 wait 0 request_delay 0 running 0 locked 0 flushing 0 logging 0 handle_count 4 blocks 1 blocks_logged 2
>
> There will be one of these trace points emitted for each jbd2
> transaction commit. Most of the fields, with the exception of the
> last three (handle_count, blocks, and blocks_logged), are in units of
> 1000 HZ "jiffies". So for example, if you don't have any fsync()'s,
> transactions by default will run for 5 seconds before being committed,
> so the "running" time will be a bit larger than 5000 jiffies (the
> timer goes off at 5 seconds, but there may be some delay before the
> jbd2 kernel thread gets scheduled):
>
> jbd2/sda3-8-275 [003] ...2 70449.637938: jbd2_run_stats: dev 8,3 tid 789663 wait 0 request_delay 0 running 5030 locked 0 flushing 0 logging 0 handle_count 18 blocks 5 blocks_logged 6
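>
> To collect lines like these, you can enable the tracepoint via
> tracefs. A minimal sketch in C of doing that (it assumes tracefs is
> mounted at /sys/kernel/tracing, which on older kernels is under
> /sys/kernel/debug/tracing instead, and that you are running as root):
>
>     /* Enable the jbd2_run_stats tracepoint; output then appears in
>      * the trace ring buffer at /sys/kernel/tracing/trace. */
>     #include <fcntl.h>
>     #include <stdio.h>
>     #include <string.h>
>     #include <unistd.h>
>
>     static void write_str(const char *path, const char *val)
>     {
>         int fd = open(path, O_WRONLY);
>
>         if (fd < 0) {
>             perror(path);
>             return;
>         }
>         if (write(fd, val, strlen(val)) < 0)
>             perror(path);
>         close(fd);
>     }
>
>     int main(void)
>     {
>         write_str("/sys/kernel/tracing/events/jbd2/jbd2_run_stats/enable", "1");
>         write_str("/sys/kernel/tracing/tracing_on", "1");
>         /* ... run the workload, then read /sys/kernel/tracing/trace ... */
>         return 0;
>     }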
>
> The "wait" field indicates the maximum time that a handle associated
> with a transaction had to wait before it could get started. Typically
> this is because we couldn't start a new transactions because the old
> transaction was still in the process of being committed.
>
> The "request_delay" field indicates delay between when transaction
> commit was requested (typically due to an fsync), and when the jbd2
> kernel thread actually started working on the commit. (Again, all time
> fields are in units of jiffies, so you can read the integer in
> milliseconds --> 5030 jiffies === 5030 milliseconds == 5.03 seconds.
>
> The "running" field indicates how long transaction was running; as
> mentioned above, this will typically be around 5000 jiffies unless
> your workload was doing fsync()'s, or if the transaction was big
> enough that it was forced to close because journal space reasons.
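>
> To illustrate, here is a minimal sketch of an fsync-heavy loop that
> would force commits well before the 5-second timer (a made-up test
> program; the path is hypothetical):
>
>     #include <fcntl.h>
>     #include <stdio.h>
>     #include <unistd.h>
>
>     int main(void)
>     {
>         /* hypothetical scratch file on the filesystem under test */
>         int fd = open("/mnt/test/scratch", O_CREAT | O_WRONLY, 0644);
>         char buf[4096] = { 0 };
>
>         if (fd < 0) {
>             perror("open");
>             return 1;
>         }
>         for (int i = 0; i < 1000; i++) {
>             if (write(fd, buf, sizeof(buf)) < 0)
>                 perror("write");
>             fsync(fd);   /* each fsync() requests a journal commit,
>                           * so "running" stays far below 5000 jiffies */
>         }
>         close(fd);
>         return 0;
>     }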
>
> The "locked" field indicates how long it takes for all of the pending
> handles to be completed. When the jbd2 layer starts working on the
> commit, the first thing it does is lock down the transaction so no new
> handles can be started. It then waits for all curently started
> handles can complete. While we are waiting for handles to close, no
> other handles can be started (see discussion of the maximum wait time
> above). We try to keep handle processing time as short as possible
> (for example, by prefetching blocks before starting a handle), but
> there are cases where this is not possible, or at least not easy.
> Analysis of the jbd2_handle_stats trances can help identify
> longer-running handles. If these long-running handles happen right
> before journal commit (and for a workload that is overly fsync-happy,
> that may be most handles), they can prolong how long a journal commit
> stays in the "locked" phase.
>
> The "flushing" field indicates how long the transaction commit spent
> writing out data blocks. We only need to write out data blocks in
> data=ordered mode associated with inodes which have new blocks
> allocated during the last transaction.
>
> The "logging" field indicates how long it took to write out the blocks
> that are actually written to the journal. The number of blocks
> actually written are reflected in the "blocks" and "blocks_logged"
> fields, where the "blocks_logged" fields includes the overhead blocks
> (jbd2 descriptor blocks, jbd2 revoke blocks, etc.)
>
> The "handle_count" field indicates how many handles are associated
> with a particular transaction. In general each simple system call
> (chmod, creat, chown, symlink, read, etc.) that modifies the file
> system requires at most a single handle. The big exception to this
> rule of thumb is the truncate(2) and unlink(2) system calls, which may
> require multiple handles in order to complete. The truncate() and
> unlink system calls are also the ones which tend to hold a handle open
> for longer-than-average amounts of time.
>
> By the way, this is why calling truncate() and doing block
> allocations and doing fsync() a lot is *not* free. If you are
> purchasing $$$ NVMe storage, in general it's a wise idea for your
> application to simply do non-allocating writes (i.e., overwriting
> previous blocks in a preallocated file) and to track which blocks are
> used for what in your application, instead of asking general purpose
> file system code to do this for you. If you can use a specialized
> data structure (say, a circular buffer) and manage the head and tail
> pointers yourself, it will almost always require much less overhead
> than what the file system has to do, since the file system doesn't
> have the advantage of specialized knowledge of exactly what your
> workload requires.
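>
> A minimal sketch of that pattern (the path, sizes, and loop count are
> made up; it assumes the filesystem supports posix_fallocate()):
>
>     #include <fcntl.h>
>     #include <stdio.h>
>     #include <string.h>
>     #include <unistd.h>
>
>     #define LOG_SIZE (64L * 1024 * 1024)   /* 64MB circular log */
>     #define BLK      4096
>
>     int main(void)
>     {
>         int fd = open("/mnt/nvme/ring.log", O_CREAT | O_RDWR, 0644);
>         char buf[BLK];
>         off_t head = 0;
>
>         if (fd < 0) {
>             perror("open");
>             return 1;
>         }
>         /* allocate all the blocks once, up front */
>         if (posix_fallocate(fd, 0, LOG_SIZE) != 0) {
>             fprintf(stderr, "posix_fallocate failed\n");
>             return 1;
>         }
>         memset(buf, 'x', sizeof(buf));
>         for (int i = 0; i < 1000; i++) {
>             /* non-allocating overwrite: no new blocks, so the
>              * following fdatasync() has little journal work to do */
>             if (pwrite(fd, buf, BLK, head) < 0)
>                 perror("pwrite");
>             fdatasync(fd);
>             head = (head + BLK) % LOG_SIZE;   /* advance head pointer */
>         }
>         close(fd);
>         return 0;
>     }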
>
> Regards,
>
> - Ted