Re: [PATCH] nbd: make starting request more reasonable

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 3/3/20 8:08 AM, Yufen Yu wrote:
Our test robot reported a warning for refcount_dec trying to decrease
value '0'. The reason is that blk_mq_dispatch_rq_list() try to complete
the failed request from nbd driver, while the request have finished in
nbd timeout handle function. The race as following:

CPU1                             CPU2

//req->ref = 1
blk_mq_dispatch_rq_list
nbd_queue_rq
   nbd_handle_cmd
     blk_mq_start_request
                                  blk_mq_check_expired
                                    //req->ref = 2
                                    blk_mq_rq_timed_out
                                      nbd_xmit_timeout
                                        blk_mq_complete_request
                                          //req->ref = 1
                                          refcount_dec_and_test(&req->ref)

                                    refcount_dec_and_test(&req->ref)
                                    //req->ref = 0
                                      __blk_mq_free_request(req)
   ret = BLK_STS_IOERR
blk_mq_end_request
// req->ref = 0, req have been free
refcount_dec_and_test(&rq->ref)

In fact, the bug also have been reported by syzbot:
   https://lkml.org/lkml/2018/12/5/1308

Since the request have been freed by timeout handle, it can be reused
by others. Then, blk_mq_end_request() may get the re-initialized request
and free it, which is unexpected.

To fix the problem, we move blk_mq_start_request() down until the driver
will handle the request actully. If .queue_rq return something error in
preparation phase, timeout handle may don't need. Thus, moving start
request down may be more reasonable. Then, nbd_queue_rq() will not return
BLK_STS_IOERR after starting request.


This won't work, you have to have the request started if you return an error because of this in blk_mq_dispatch_rq_list

                if (unlikely(ret != BLK_STS_OK)) {
                        errors++;
                        blk_mq_end_request(rq, BLK_STS_IOERR);
                        continue;
                }

The request has to be started before we return an error, pushing it down means we have all of these error cases where we haven't started the request. Thanks,

Josef



[Index of Archives]     [Linux RAID]     [Linux SCSI]     [Linux ATA RAID]     [IDE]     [Linux Wireless]     [Linux Kernel]     [ATH6KL]     [Linux Bluetooth]     [Linux Netdev]     [Kernel Newbies]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Device Mapper]

  Powered by Linux