On Fri, Sep 23, 2016 at 7:28 AM, Glauber Costa <glauber@xxxxxxxxxxxx> wrote:
> On Sep 23, 2016 6:22 AM, "Paolo Valente" <paolo.valente@xxxxxxxxxx> wrote:
>>
>>
>> > On 23 Sep 2016, at 02:59, Glauber Costa <glauber@xxxxxxxxxxxx> wrote:
>> >
>> > While debugging timeouts happening in my application workload
>> > (ScyllaDB), I have observed calls to open() taking a long time,
>> > ranging everywhere from 2 seconds - the first ones that are enough
>> > to time out my application - to more than 30 seconds.
>> >
>> > The problem seems to happen because XFS may block on pending
>> > metadata updates under certain circumstances, and that's confirmed
>> > by the following backtrace, taken with the offcputime tool
>> > (iovisor/bcc):
>> >
>> >     ffffffffb90c57b1 finish_task_switch
>> >     ffffffffb97dffb5 schedule
>> >     ffffffffb97e310c schedule_timeout
>> >     ffffffffb97e1f12 __down
>> >     ffffffffb90ea821 down
>> >     ffffffffc046a9dc xfs_buf_lock
>> >     ffffffffc046abfb _xfs_buf_find
>> >     ffffffffc046ae4a xfs_buf_get_map
>> >     ffffffffc046babd xfs_buf_read_map
>> >     ffffffffc0499931 xfs_trans_read_buf_map
>> >     ffffffffc044a561 xfs_da_read_buf
>> >     ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
>> >     ffffffffc0452b90 xfs_dir2_leaf_lookup_int
>> >     ffffffffc0452e0f xfs_dir2_leaf_lookup
>> >     ffffffffc044d9d3 xfs_dir_lookup
>> >     ffffffffc047d1d9 xfs_lookup
>> >     ffffffffc0479e53 xfs_vn_lookup
>> >     ffffffffb925347a path_openat
>> >     ffffffffb9254a71 do_filp_open
>> >     ffffffffb9242a94 do_sys_open
>> >     ffffffffb9242b9e sys_open
>> >     ffffffffb97e42b2 entry_SYSCALL_64_fastpath
>> >     00007fb0698162ed [unknown]
>> >
>> > Inspecting my run with blktrace, I can see that the xfsaild kthread
>> > exhibits very high "Dispatch wait" times, in the dozens-of-seconds
>> > range and consistent with the open() times I saw in that run.
>> >
>> > Still from the blktrace output, after searching a bit we can
>> > identify the request that wasn't dispatched:
>> >
>> >   8,0   11     152    81.092472813   804  A  WM 141698288 + 8 <- (8,1) 141696240
>> >   8,0   11     153    81.092472889   804  Q  WM 141698288 + 8 [xfsaild/sda1]
>> >   8,0   11     154    81.092473207   804  G  WM 141698288 + 8 [xfsaild/sda1]
>> >   8,0   11     206    81.092496118   804  I  WM 141698288 + 8 (22911) [xfsaild/sda1]
>> >       <==== 'I' means Inserted (into the IO scheduler) ====>
>> >   8,0    0  289372    96.718761435     0  D  WM 141698288 + 8 (15626265317) [swapper/0]
>> >       <==== Only 15s later the CFQ scheduler dispatches the request ====>
>> >
>> > As we can see above, in this particular example CFQ took 15 seconds
>> > to dispatch this request. Going back to the full trace, we can see
>> > that the xfsaild queue had plenty of opportunity to run, and it was
>> > selected as the active queue many times. It would just always be
>> > preempted by something else (example):
>> >
>> >   8,0    1    0    81.117912979     0  m   N cfq1618SN / insert_request
>> >   8,0    1    0    81.117913419     0  m   N cfq1618SN / add_to_rr
>> >   8,0    1    0    81.117914044     0  m   N cfq1618SN / preempt
>> >   8,0    1    0    81.117914398     0  m   N cfq767A  / slice expired t=1
>> >   8,0    1    0    81.117914755     0  m   N cfq767A  / resid=40
>> >   8,0    1    0    81.117915340     0  m   N / served: vt=1948520448 min_vt=1948520448
>> >   8,0    1    0    81.117915858     0  m   N cfq767A  / sl_used=1 disp=0 charge=0 iops=1 sect=0
>> >
>> > where cfq767 is the xfsaild queue and cfq1618 corresponds to one of
>> > the ScyllaDB IO dispatchers.
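
[Aside for readers of the archive: the "preempt" event in the trace
above comes from cfq_should_preempt(). Below is a minimal sketch of the
rule responsible, paraphrased from the 4.8-era block/cfq-iosched.c; the
if-condition is the exact line this patch later modifies, while the
rest of the function is abbreviated, so treat this as an illustration
rather than the literal code.]

/*
 * Sketch of the sync-over-async preemption rule in CFQ (circa v4.8).
 * "new_cfqq" owns the incoming request "rq"; "cfqq" is the queue
 * currently being serviced.  Other preemption conditions are elided.
 */
static bool cfq_should_preempt(struct cfq_data *cfqd,
			       struct cfq_queue *new_cfqq,
			       struct request *rq)
{
	struct cfq_queue *cfqq = cfqd->active_queue;

	/* ... RT-class and other checks elided ... */

	/*
	 * If the new request is sync, but the currently running queue
	 * is not, let the sync request have priority.  This is the
	 * branch that expires the async xfsaild slice in the trace.
	 */
	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
		return true;

	/* ... remaining checks elided ... */
	return false;
}

Nothing bounds how often this branch fires, which is why a steady
stream of sync requests can keep expiring the async queue's slice
before it dispatches anything.
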
>> >
>> > The requests preempting the xfsaild queue are synchronous requests.
>> > That's a characteristic of ScyllaDB workloads, as we only ever
>> > issue O_DIRECT requests. While it can be argued that preempting
>> > ASYNC requests in favor of SYNC ones is part of the CFQ logic, I
>> > don't believe that doing so for 15+ seconds is anyone's goal.
>> >
>> > Moreover, unless I am misunderstanding something, that breaks the
>> > expectation set by the "fifo_expire_async" tunable, which in my
>> > system is set to the default.
>> >
>> > Looking at the code, it seems to me that the issue is that after we
>> > make an async queue active, there is no guarantee that it will
>> > execute any request.
>> >
>> > When the queue tests whether it may dispatch (cfq_may_dispatch()),
>> > it can bail if it sees SYNC requests in flight. An incoming request
>> > from another queue can also preempt it in such a situation before
>> > we have the chance to execute anything (as seen in the trace
>> > above).
>> >
>>
>> The BFQ scheduler should not suffer from this issue, because it does
>> not consider this condition at all for preemption. Even better, BFQ
>> performs preemption as a function of just one condition, which is
>> formally proven not to break any request-delay guarantee.
>>
>> Would you be willing to give it a try? It might solve or at least
>> mitigate the specific latency problem addressed by your patch, and at
>> the same time provide many other benefits, in terms of low latency
>> and high throughput.
>
> We are definitely interested in trying different elevators that could
> benefit our workload, especially because right now we don't really
> have a default recommendation.
>
> But just to be clear, I saw very early on that the problem didn't
> happen with noop or deadline, so if this were just a characteristic of
> CFQ I would have stopped there.
>
> I do believe, however, as I have said in my commit log message, that
> this is just a bug. And no bug shall remain unfixed in my presence.

Sorry, this was sent in HTML. I just got mail from vger.kernel.org
saying the message was refused because of that. I haven't dealt with
LKML in ages, and my Gmail phone interface was sending HTML messages
by default.

>
>>
>> If you are willing to try it, then in [1] you can find the latest BFQ
>> development branch, rebased against 4.8-rc7. These commits add BFQ as
>> an extra scheduler.
>>
>> Otherwise you can find BFQ for 4.7 here [2], and BFQ for 4.6 here [3].
>>
>> If you prefer patches, then you can find them in [4] for 4.7, or in
>> [5] for 4.6.
>>
>> Finally, if you need details about BFQ, have a look at [6], or just ask!
>>
>> Thanks,
>> Paolo
>>
>> [1] https://github.com/linusw/linux-bfq/tree/bfq-v8
>> [2] https://github.com/linusw/linux-bfq/tree/bfq-v8-v4.7
>> [3] https://github.com/linusw/linux-bfq/tree/bfq-v8-v4.6
>> [4] http://algogroup.unimore.it/people/paolo/disk_sched/patches/4.7.0-v8r3/
>> [5] http://algogroup.unimore.it/people/paolo/disk_sched/patches/4.6.0-v8r3/
>> [6] http://algogroup.unimore.it/people/paolo/disk_sched/
>>
>> > This patch sets the must_dispatch flag if we notice that we have
>> > requests that are already fifo_expired. This flag is always cleared
>> > after cfq_dispatch_request() returns from cfq_dispatch_requests(),
>> > so it won't pin the queue for subsequent requests (unless they are
>> > themselves expired).
>> >
>> > Care is taken during preempt to still allow rt requests to preempt
>> > us regardless.
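
[Aside: "fifo_expired" above refers to the check performed by
cfq_check_fifo(), whose tail is visible in the first hunk of the patch
below. Here is a sketch of the whole helper as it reads in the 4.8-era
sources, with explanatory comments added; treat the details as
approximate. rq->fifo_time is stamped at insertion time as "now +
cfq_fifo_expire[rq_is_sync(rq)]", so a non-NULL return means the
oldest request has outlived its fifo_expire_{sync,async} deadline.]

/*
 * Sketch of cfq_check_fifo() (circa v4.8), annotations added.
 * Returns the head-of-fifo request iff it has already expired.
 */
static struct request *cfq_check_fifo(struct cfq_queue *cfqq)
{
	struct request *rq = NULL;

	/* only look at the fifo once per slice */
	if (cfq_cfqq_fifo_expire(cfqq))
		return NULL;
	cfq_mark_cfqq_fifo_expire(cfqq);

	if (list_empty(&cfqq->fifo))
		return NULL;

	rq = rq_entry_fifo(cfqq->fifo.next);	/* oldest queued request */
	if (ktime_get_ns() < rq->fifo_time)	/* deadline not reached yet */
		rq = NULL;

	return rq;
}

Since the expiry decision is per-queue, marking must_dispatch off the
result of this check is what lets an async queue with an overdue
request finally get one dispatch through.
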
>> >
>> > Testing my workload with this patch applied produces much better
>> > results. From the application side I see no timeouts, and the
>> > open() latency histogram generated by systemtap looks much better,
>> > with the worst outlier at 131ms:
>> >
>> > Latency histogram of xfs_buf_lock acquisition (microseconds):
>> >   value |-------------------------------------------------- count
>> >       0 |                                                      11
>> >       1 |@@@@                                                 161
>> >       2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1966
>> >       4 |@                                                     54
>> >       8 |                                                      36
>> >      16 |                                                       7
>> >      32 |                                                       0
>> >      64 |                                                       0
>> >       ~
>> >    1024 |                                                       0
>> >    2048 |                                                       0
>> >    4096 |                                                       1
>> >    8192 |                                                       1
>> >   16384 |                                                       2
>> >   32768 |                                                       0
>> >   65536 |                                                       0
>> >  131072 |                                                       1
>> >  262144 |                                                       0
>> >  524288 |                                                       0
>> >
>> > Signed-off-by: Glauber Costa <glauber@xxxxxxxxxxxx>
>> > CC: Jens Axboe <axboe@xxxxxxxxx>
>> > CC: linux-block@xxxxxxxxxxxxxxx
>> > CC: linux-kernel@xxxxxxxxxxxxxxx
>> > ---
>> >  block/cfq-iosched.c | 13 ++++++++++---
>> >  1 file changed, 10 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> > index cc2f6db..5e24d88 100644
>> > --- a/block/cfq-iosched.c
>> > +++ b/block/cfq-iosched.c
>> > @@ -3042,7 +3042,6 @@ static struct request *cfq_check_fifo(struct cfq_queue *cfqq)
>> >  	if (ktime_get_ns() < rq->fifo_time)
>> >  		rq = NULL;
>> >
>> > -	cfq_log_cfqq(cfqq->cfqd, cfqq, "fifo=%p", rq);
>> >  	return rq;
>> >  }
>> >
>> > @@ -3420,6 +3419,9 @@ static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>> >  {
>> >  	unsigned int max_dispatch;
>> >
>> > +	if (cfq_cfqq_must_dispatch(cfqq))
>> > +		return true;
>> > +
>> >  	/*
>> >  	 * Drain async requests before we start sync IO
>> >  	 */
>> > @@ -3511,15 +3513,20 @@ static bool cfq_dispatch_request(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>> >
>> >  	BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list));
>> >
>> > +	rq = cfq_check_fifo(cfqq);
>> > +	if (rq)
>> > +		cfq_mark_cfqq_must_dispatch(cfqq);
>> > +
>> >  	if (!cfq_may_dispatch(cfqd, cfqq))
>> >  		return false;
>> >
>> >  	/*
>> >  	 * follow expired path, else get first next available
>> >  	 */
>> > -	rq = cfq_check_fifo(cfqq);
>> >  	if (!rq)
>> >  		rq = cfqq->next_rq;
>> > +	else
>> > +		cfq_log_cfqq(cfqq->cfqd, cfqq, "fifo=%p", rq);
>> >
>> >  	/*
>> >  	 * insert request into driver dispatch list
>> > @@ -3989,7 +3996,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
>> >  	 * if the new request is sync, but the currently running queue is
>> >  	 * not, let the sync request have priority.
>> >  	 */
>> > -	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
>> > +	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq) && !cfq_cfqq_must_dispatch(cfqq))
>> >  		return true;
>> >
>> >  	/*
>> > --
>> > 2.5.5
>>
>> --
>> Paolo Valente
>> Algogroup
>> Dipartimento di Scienze Fisiche, Informatiche e Matematiche
>> Via Campi 213/B
>> 41125 Modena - Italy
>> http://algogroup.unimore.it/people/paolo/
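
[Aside: the commit message's claim that the flag "is always cleared
after cfq_dispatch_request() returns" rests on the pre-existing caller,
cfq_dispatch_requests(), which already drops must_dispatch after each
successful dispatch. A sketch of that caller, paraphrased from the
4.8-era sources with the forced-dispatch path and slice accounting
elided; treat it as an illustration, not the literal code.]

/*
 * Sketch of cfq_dispatch_requests() (circa v4.8), showing why the new
 * must_dispatch marking cannot pin the queue: it survives exactly one
 * dispatch before being cleared again.
 */
static int cfq_dispatch_requests(struct request_queue *q, int force)
{
	struct cfq_data *cfqd = q->elevator->elevator_data;
	struct cfq_queue *cfqq;

	if (!cfqd->busy_queues)
		return 0;

	/* ... forced-dispatch path elided ... */

	cfqq = cfq_select_queue(cfqd);
	if (!cfqq)
		return 0;

	/*
	 * Dispatch one request from the selected queue, if allowed.
	 */
	if (!cfq_dispatch_request(cfqd, cfqq))
		return 0;

	cfqq->slice_dispatch++;
	cfq_clear_cfqq_must_dispatch(cfqq);	/* flag lives for one dispatch */

	/* ... slice-expiry handling elided ... */

	return 1;
}

So a queue marked must_dispatch for an overdue request gets exactly one
guaranteed dispatch, after which the normal sync-over-async policy
applies again until the next request exceeds its fifo deadline.
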