Hi Ted,

Appreciate your time; I will get the log from a normal workload. BTW, the
previous JBD2 log is from the node which is in the bad state. A normal node
finishes within 3 seconds, while on this node it takes 10 seconds.

Thank you very much!

Roy

----------------------------------------
> Date: Wed, 12 Aug 2015 18:34:56 -0400
> From: tytso@xxxxxxx
> To: bsdnet@xxxxxxxxxxx
> CC: linux-ext4@xxxxxxxxxxxxxxx
> Subject: Re: fsync stuck at jbd2_log_wait_commit on NVMe devices
>
> On Wed, Aug 12, 2015 at 02:43:22PM -0700, Roy Yang wrote:
>> Hi Ted,
>>
>> I got some JBD2 tracing logs by running our test program; most of them
>> take 10 seconds or so. On the normal system, by comparison, this test
>> only takes 2 seconds. We saw the spike of running, but even after
>> checking the kernel source I don't get what it means. Any insights?
>
> You need to get the trace on your normal system --- the system where,
> under load, you are seeing the long times for a journal commit to
> complete. The jbd2_run_stats tracepoint gives us critical insight into
> what might be going wrong --- but just as it doesn't help much to use a
> diagnostic tool on a pet dog when you are trying to determine why the
> child is running a fever, it doesn't help much to take traces on a
> test program/system when we're trying to actually debug why your
> production server is having trouble....
>
>> <...>-220601 [015] .... 181555.139321: jbd2_run_stats: dev 259,3 tid 1064541 wait 0 request_delay 0 running 0 locked 0 flushing 0 logging 0 handle_count 4 blocks 1 blocks_logged 2
>
> There will be one of these trace points emitted for each jbd2
> transaction commit. Most of the fields, with the exception of the
> last three (handle_count, blocks, and blocks_logged), are in units of
> 1000 HZ "jiffies". So, for example, if you don't have any fsync()s,
> transactions by default will run for 5 seconds before being committed,
> so the "running" time will be a bit larger than 5000 jiffies (the
> timer goes off at 5 seconds, but there may be some delay before the
> jbd2 kernel thread gets scheduled):
>
> jbd2/sda3-8-275 [003] ...2 70449.637938: jbd2_run_stats: dev 8,3 tid 789663 wait 0 request_delay 0 running 5030 locked 0 flushing 0 logging 0 handle_count 18 blocks 5 blocks_logged 6
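
[A minimal sketch of one way to stream these jbd2_run_stats (and
jbd2_handle_stats) lines on the node that is actually showing the slow
commits. It assumes tracefs is mounted at /sys/kernel/tracing (older
setups may have it under /sys/kernel/debug/tracing) and that it runs as
root; trace-cmd or plain shell redirection of the same files would work
just as well.]

/*
 * Sketch only: stream jbd2 commit and handle statistics from tracefs.
 * Assumes tracefs is mounted at /sys/kernel/tracing and that this runs
 * as root on the node that shows the slow commits.
 */
#include <stdio.h>
#include <stdlib.h>

#define TRACEFS "/sys/kernel/tracing"

static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(1);
	}
	fputs(val, f);
	fclose(f);
}

int main(void)
{
	char line[4096];
	FILE *pipe;

	/* enable the per-commit and per-handle statistics events */
	write_str(TRACEFS "/events/jbd2/jbd2_run_stats/enable", "1");
	write_str(TRACEFS "/events/jbd2/jbd2_handle_stats/enable", "1");
	write_str(TRACEFS "/tracing_on", "1");

	/* trace_pipe blocks until events arrive; one line per event */
	pipe = fopen(TRACEFS "/trace_pipe", "r");
	if (!pipe) {
		perror(TRACEFS "/trace_pipe");
		return 1;
	}
	while (fgets(line, sizeof(line), pipe))
		fputs(line, stdout);

	fclose(pipe);
	return 0;
}
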
> The "wait" field indicates the maximum time that a handle associated
> with a transaction had to wait before it could get started. Typically
> this is because we couldn't start a new transaction because the old
> transaction was still in the process of being committed.
>
> The "request_delay" field indicates the delay between when a transaction
> commit was requested (typically due to an fsync) and when the jbd2
> kernel thread actually started working on the commit. (Again, all time
> fields are in units of jiffies, so you can read the integer as
> milliseconds: 5030 jiffies == 5030 milliseconds == 5.03 seconds.)
>
> The "running" field indicates how long the transaction was running; as
> mentioned above, this will typically be around 5000 jiffies unless
> your workload was doing fsync()s, or the transaction was big enough
> that it was forced to close for journal space reasons.
>
> The "locked" field indicates how long it takes for all of the pending
> handles to be completed. When the jbd2 layer starts working on the
> commit, the first thing it does is lock down the transaction so no new
> handles can be started. It then waits for all currently started
> handles to complete. While we are waiting for handles to close, no
> other handles can be started (see the discussion of the maximum wait
> time above). We try to keep handle processing time as short as possible
> (for example, by prefetching blocks before starting a handle), but
> there are cases where this is not possible, or at least not easy.
> Analysis of the jbd2_handle_stats traces can help identify
> longer-running handles. If these long-running handles happen right
> before a journal commit (and for a workload that is overly fsync-happy,
> that may be most handles), they can prolong how long a journal commit
> stays in the "locked" phase.
>
> The "flushing" field indicates how long the transaction commit spent
> writing out data blocks. In data=ordered mode, we only need to write
> out data blocks associated with inodes which had new blocks allocated
> during the last transaction.
>
> The "logging" field indicates how long it took to write out the blocks
> that are actually written to the journal. The number of blocks
> actually written is reflected in the "blocks" and "blocks_logged"
> fields, where the "blocks_logged" field includes the overhead blocks
> (jbd2 descriptor blocks, jbd2 revoke blocks, etc.).
>
> The "handle_count" field indicates how many handles are associated
> with a particular transaction. In general, each simple system call
> (chmod, creat, chown, symlink, read, etc.) that modifies the file
> system requires at most a single handle. The big exceptions to this
> rule of thumb are the truncate(2) and unlink(2) system calls, which may
> require multiple handles in order to complete. The truncate() and
> unlink() system calls are also the ones which tend to hold a handle
> open for longer-than-average amounts of time.
>
> By the way, this is why calling truncate(), doing block allocations,
> and doing fsync() a lot is *not* free. If you are purchasing $$$ NVMe
> storage, in general it's a wise idea for your application to simply do
> non-allocating writes (i.e., overwriting previous blocks in a
> preallocated file) and to track which blocks are used for what in
> your application, instead of trying to ask general-purpose file system
> code to do this for you. If you can use a specialized data structure
> (say, a circular buffer) and manage the head and tail pointers
> yourself, it will almost always require much less overhead than what
> the file system has to do, since the file system doesn't have the
> advantage of specialized knowledge of exactly what your requirements
> might be for your workload.
>
> Regards,
>
> 					- Ted
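
[To make the last suggestion concrete, a rough sketch of the
preallocated-file / circular-buffer pattern described above. The file
name, ring size, and record size are invented for the example; a real
application would also track a tail pointer and persist both pointers
somewhere.]

/*
 * Rough sketch: preallocate a file once, then reuse it as a circular
 * buffer with in-place, non-allocating writes. Names and sizes are
 * made-up example values.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define RING_BYTES   (64 * 1024 * 1024)	/* preallocated once, reused forever */
#define RECORD_BYTES 4096		/* fixed-size, block-aligned records */

int main(void)
{
	int fd = open("ring.dat", O_RDWR | O_CREAT, 0644);
	off_t head = 0;
	char record[RECORD_BYTES];
	int i;

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* allocate all blocks up front so later writes never extend the file */
	if (fallocate(fd, 0, 0, RING_BYTES) < 0) {
		perror("fallocate");
		return 1;
	}

	for (i = 0; i < 32; i++) {
		memset(record, 'A' + (i % 26), sizeof(record));

		/* write in place within the preallocated range; no append, no truncate */
		if (pwrite(fd, record, sizeof(record), head) != (ssize_t)sizeof(record)) {
			perror("pwrite");
			return 1;
		}
		/* durability point; data-only sync, since the file size never changes */
		if (fdatasync(fd) < 0) {
			perror("fdatasync");
			return 1;
		}
		head = (head + RECORD_BYTES) % RING_BYTES;
	}

	close(fd);
	return 0;
}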