On Tue, Oct 08, 2019 at 09:22:40AM -0400, Brian Foster wrote: > On Tue, Oct 08, 2019 at 01:51:57PM +1100, Dave Chinner wrote: > > On Fri, Oct 04, 2019 at 07:50:01AM -0400, Brian Foster wrote: > > > On Fri, Oct 04, 2019 at 12:27:55PM +1000, Dave Chinner wrote: > > > > On Thu, Oct 03, 2019 at 10:41:14AM -0400, Brian Foster wrote: > > > > > Hmm, I'm also not sure the lockless reservation algorithm is totally > > > > > immune to increased concurrency in this regard. What prevents multiple > > > > > tasks from racing through xlog_grant_head_check() and blowing past the > > > > > log head, for example? > > > > > > > > Nothing. Debug kernels even emit a "xlog_verify_grant_tail: space > > > > > BBTOB(tail_blocks)" messages when that happens. It's pretty > > > > difficult to do this in real world conditions, even when there is > > > > lots of concurrency being used. > > > > > > > > > > Hm, Ok. Though I've seen that alert enough times that I > > > (unintentionally) ignore it at this point, so it can't be that hard to > > > reproduce. ;) That is usually during fstests however, and not a typical > > > workload that I recall. > > > > I can't say I've seen it for a long time now - I want to say "years" > > but I may well have simply missed it on the rare occasion it has > > occurred and fstests hasn't captured it. i.e. fstests is supposed to > > capture unusual things like this appearing in dmesg during a > > test.... > > > > > Of course, there's a difference between > > > reproducing the basic condition and taking it to the point where it > > > manifests into a problem. > > > > *nod* > > > > > > But here's the rub: it's not actually the end of the world because > > > > the reservation doesn't actually determine how much of the log is > > > > currently being used by running transactions - the reservation is > > > > for a maximal rolling iteration of a permanent transaction, not the > > > > initial transaction will be running. Hence if we overrun > > > > occassionally we don't immediately run out of log space and corrupt > > > > the log. > > > > > > > > > > Ok, that much is evident from the amount of time this mechanism has been > > > in place without any notable issues. > > > > > > > Yes, if none of the rolling transactions complete and they all need > > > > to use their entire reservation, and the tail of the log cannot be > > > > moved forward because it is pinned by one of the transactions that > > > > is running, then we'll likely get a log hang on a regrant on the > > > > write head. But if any of the transactions don't use all of their > > > > reservation, then the overrun gets soaked up by the unused parts of > > > > the transactions that are completed and returned to reservation > > > > head, and nobody even notices taht there was a temporary overrun of > > > > the grant head space. > > > > > > > > > > Ok, I didn't expect this to be some catastrophic problem or really a > > > problem with your patch simply based on the lifetime of the code and how > > > the grant heads are actually used. I was going to suggest an assert or > > > something to detect whether batching behavior as a side effect of the > > > commit throttle would ever increase likelihood of this situation, but it > > > looks like the grant verify function somewhat serves that purpose > > > already. > > > > Yeah - xlog_verify_grant_tail() will the report reservation > > overruns, but the serious log space problems (i.e. head overwritting > > the tail) are detected by xlog_verify_tail_lsn() when we stamp the > > tail_lsn into the current iclog header. That's still done under the > > icloglock and the AIL lock, so the comparison of the tail with the > > current log head is still completely serialised. > > > > > I'd _prefer_ to see something, at least in DEBUG mode, that indicates > > > the frequency of the fundamental incorrect accounting condition as > > > opposed to just the side effect of blowing the tail (because the latter > > > depends on other difficult to reproduce factors), but I'd have to think > > > about that some more as it would need to balance against normal/expected > > > execution flow. Thanks for the background. > > > > You can test that just by removing the XLOG_TAIL_WARN flag setting, > > then it will warn on every reservation overrun rather than just the > > first. > > > > That's not quite what I'm after. That just removes the oneshot nature of > the existing check. The current debug check looks for a side effect of > the current algorithm in the form of overrunning the tail. What I would > like to see, somehow or another, is something that provides earlier and > more useful information on the frequency/scale of occurrence where > multiple reservations have been made based on the same grant baseline. > > This is not so much an error check so not something that should be an > assert or warning, but rather more of a metric that provides a base for > comparison whenever we might have code changes or workloads that > potentially change this behavior. For example, consider a debug mode > stat or sysfs file that could be used to dump out a counter of "same > value" grant head samples after a benchmark workload or fstests run. The > fact that this value might go up or down is not necessarily a problem. > But that would provide some debug mode data to identify and analyze > potentially unintended side effects like transient spikes in concurrent > grant head checks due to blocking/scheduling changes in log reservation > tasks. > > See the appended RFC for a quick idea of what I mean. This is slightly > racy, but I think conservative enough to provide valid values with > respect to already racy reservation implementation. On my slow 4xcpu vm, > I see occasional sample counts of 2 running a -p8 fsstress. If I stick a > smallish delay in xfs_log_reserve(), the frequency of the output > increases and I see occasional bumps to 3, a spike of 8 in one case, > etc. Of course I'd probably factor the atomic calls into DEBUG only > inline functions that can be compiled out on production kernels and > replace the tracepoint with a global counter (exported via stats or > sysfs knob), but this just illustrates the idea. The global counter > could also be replaced with (or kept in addition to) something that > tracks a max concurrency value if that is more useful. Any thoughts on > something like this? > BTW, this doesn't necessarily have to be a task based counter. Another approach could be to keep a "lockless reservation slop" counter that is incremented by the ticket reservation size down in xlog_grant_head_check() (only in the lockless case) and flagged/decremented once the reservation is added to the grant head. Somewhere between the atomic add/sub we check the max slop value we've encountered and either track it and dump it in a sysfs file and/or just assert that it never reaches something unexpected/outrageous (based on some % of the log size, for example). Brian > > > > Hence occasional overruns on the reservation head before they start > > > > blocking isn't really a problem in practice because the probability > > > > of all the transaction reservation of all transactions running being > > > > required to make forwards progress is extremely small. > > > > > > > > Basically, we gave up "perfect reservation space grant accounting" > > > > because performance was extremely important and risk of log hangs as > > > > a result of overruns was considered to be extremely low and worth > > > > taking for the benefits the algorithm provided. This was just a > > > > simple, pragmatic risk based engineering decision. > > > > > > > > > > FWIW, the comment for xlog_verify_tail() also suggests the potential for > > > false positives and references a panic tag, which all seems kind of > > > erratic and misleading compared to what you explain here. > > > > Well, it's fundamentally an unserialised check, so it can race with > > other reservation grants, commits that release grant space and tail > > lsn updates. Hence it's not a 100% reliable debug check. > > > > Right. Yet there is a panic knob... > > > It also used to be run at all times, not just under > > XFS_CONFIG_DEBUG=y, which is why it has a panic tag associated with > > it. When we first deployed it, we weren't 100% sure there weren't > > customer workloads that would trip over this and hang the log, so > > we gave ourselves a way of triggering kernel dumps the instant an > > overrun was detected. Hence a site that had log hangs with this > > message in the logs could turn on the panic tag and we'd get a > > kernel dump to analyse... > > > > Ok, makes sense. This kind of speaks to the earlier point around having > more useful data. While this isn't necessarily a problem right now, we > have no real data to tell us whether some particular code change alters > this behavior. If this was enough of a concern when the change was first > put in place to insert a panic hook, then it stands to reason it's > something we should at least be cognizant of when making changes that > could impact its behavior. > > > Since then, this code has been relegated to debug code but the panic > > tag still exists. It could be turned back into a ASSERT now, but > > it's still useful the way it is as it means debug kernels don't fall > > over the moment a spurious overrun occurs... > > > > Yeah, ISTM the panic bit could be removed at this point. The warning (as > opposed to an assert) is probably reasonable so long as the check itself > is racy so as to not cause false positive panics with fatal asserts > enabled. > > Brian > > --- 8< --- > > RFC: crude concurrency stat on reserve grant head > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c > index a2beee9f74da..1d3056176e6e 100644 > --- a/fs/xfs/xfs_log.c > +++ b/fs/xfs/xfs_log.c > @@ -175,6 +175,7 @@ xlog_grant_head_init( > xlog_assign_grant_head(&head->grant, 1, 0); > INIT_LIST_HEAD(&head->waiters); > spin_lock_init(&head->lock); > + atomic64_set(&head->sample_cnt, 0); > } > > STATIC void > @@ -446,6 +447,7 @@ xfs_log_reserve( > struct xlog_ticket *tic; > int need_bytes; > int error = 0; > + int64_t sample_cnt; > > ASSERT(client == XFS_TRANSACTION || client == XFS_LOG); > > @@ -465,13 +467,19 @@ xfs_log_reserve( > > error = xlog_grant_head_check(log, &log->l_reserve_head, tic, > &need_bytes); > + atomic64_inc(&log->l_reserve_head.sample_cnt); > if (error) > goto out_error; > > + sample_cnt = atomic64_xchg(&log->l_reserve_head.sample_cnt, 0); > xlog_grant_add_space(log, &log->l_reserve_head.grant, need_bytes); > xlog_grant_add_space(log, &log->l_write_head.grant, need_bytes); > trace_xfs_log_reserve_exit(log, tic); > xlog_verify_grant_tail(log); > + > + if (sample_cnt > 1) > + trace_printk("%d: %lld\n", __LINE__, sample_cnt); > + > return 0; > > out_error: > diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h > index b880c23cb6e4..62e4949f91c4 100644 > --- a/fs/xfs/xfs_log_priv.h > +++ b/fs/xfs/xfs_log_priv.h > @@ -339,6 +339,7 @@ struct xlog_grant_head { > spinlock_t lock ____cacheline_aligned_in_smp; > struct list_head waiters; > atomic64_t grant; > + atomic64_t sample_cnt; > }; > > /*