> Il giorno 28 giu 2019, alle ore 06:57, Guenter Roeck <linux@xxxxxxxxxxxx> ha scritto: > > On Thu, Jun 27, 2019 at 09:44:09PM -0700, Douglas Anderson wrote: >> In reboot tests on several devices we were seeing a "use after free" >> when slub_debug or KASAN was enabled. The kernel complained about: >> >> Unable to handle kernel paging request at virtual address 6b6b6c2b >> >> ...which is a classic sign of use after free under slub_debug. The >> stack crawl in kgdb looked like: >> >> 0 test_bit (addr=<optimized out>, nr=<optimized out>) >> 1 bfq_bfqq_busy (bfqq=<optimized out>) >> 2 bfq_select_queue (bfqd=<optimized out>) >> 3 __bfq_dispatch_request (hctx=<optimized out>) >> 4 bfq_dispatch_request (hctx=<optimized out>) >> 5 0xc056ef00 in blk_mq_do_dispatch_sched (hctx=0xed249440) >> 6 0xc056f728 in blk_mq_sched_dispatch_requests (hctx=0xed249440) >> 7 0xc0568d24 in __blk_mq_run_hw_queue (hctx=0xed249440) >> 8 0xc0568d94 in blk_mq_run_work_fn (work=<optimized out>) >> 9 0xc024c5c4 in process_one_work (worker=0xec6d4640, work=0xed249480) >> 10 0xc024cff4 in worker_thread (__worker=0xec6d4640) >> >> Digging in kgdb, it could be found that, though bfqq looked fine, >> bfqq->bic had been freed. >> >> Through further digging, I postulated that perhaps it is illegal to >> access a "bic" (AKA an "icq") after bfq_exit_icq() had been called >> because the "bic" can be freed at some point in time after this call >> is made. I confirmed that there certainly were cases where the exact >> crashing code path would access the "bic" after bfq_exit_icq() had >> been called. Sspecifically I set the "bfqq->bic" to (void *)0x7 and > ^^^ > >> saw that the bic was 0x7 at the time of the crash. >> >> To understand a bit more about why this crash was fairly uncommon (I >> saw it only once in a few hundred reboots), you can see that much of >> the time bfq_exit_icq_fbqq() fully frees the bfqq and thus it can't >> access the ->bic anymore. The only case it doesn't is if >> bfq_put_queue() sees a reference still held. >> >> However, even in the case when bfqq isn't freed, the crash is still >> rare. Why? I tracked what happened to the "bic" after the exit >> routine. It doesn't get freed right away. Rather, >> put_io_context_active() eventually called put_io_context() which >> queued up freeing on a workqueue. The freeing then actually happened >> later than that through call_rcu(). Despite all these delays, some >> extra debugging showed that all the hoops could be jumped through in >> time and the memory could be freed causing the original crash. Phew! >> >> To make a long story short, assuming it truly is illegal to access an >> icq after the "exit_icq" callback is finished, this patch is needed. >> Nice work! I think this bug has been there since when BFQ was born ... Reviewed-by: Paolo Valente <paolo.valente@xxxxxxxxxx> Thanks, Paolo >> Signed-off-by: Douglas Anderson <dianders@xxxxxxxxxxxx> > > Nicely done ... thanks! > > Reviewed-by: Guenter Roeck <groeck@xxxxxxxxxxxx> > >> --- >> Most of the testing of this was done on the Chrome OS 4.19 kernel with >> BFQ backported (thanks to Paolo's help). I did manage to reproduce a >> crash on mainline Linux (v5.2-rc6) though. >> >> To see some of the techniques used to debug this, see >> <https://crrev.com/c/1679134> and <https://crrev.com/c/1681258/1>. >> >> I'll also note that on linuxnext (next-20190627) I saw some other >> use-after-frees that seemed related to BFQ but haven't had time to >> debug. They seemed unrelated. >> >> block/bfq-iosched.c | 1 + >> 1 file changed, 1 insertion(+) >> >> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c >> index f8d430f88d25..6c0cff03f8f6 100644 >> --- a/block/bfq-iosched.c >> +++ b/block/bfq-iosched.c >> @@ -4584,6 +4584,7 @@ static void bfq_exit_icq_bfqq(struct bfq_io_cq *bic, bool is_sync) >> unsigned long flags; >> >> spin_lock_irqsave(&bfqd->lock, flags); >> + bfqq->bic = NULL; >> bfq_exit_bfqq(bfqd, bfqq); >> bic_set_bfqq(bic, NULL, is_sync); >> spin_unlock_irqrestore(&bfqd->lock, flags); >> -- >> 2.22.0.410.gd8fdbe21b5-goog