Re: [Update PATCH V3] md: don't unregister sync_thread with reconfig_mutex held

Jan Kara <jack@xxxxxxx> · Thu, 26 May 2022 13:53:36 +0200

[Added couple of CCs since this seems to be an issue in the generic block
layer]

On Wed 25-05-22 12:22:06, Logan Gunthorpe wrote:
> On 2022-05-25 03:04, Guoqing Jiang wrote:
> > I would prefer to focus on block tree or md tree. With latest block tree
> > (commit 44d8538d7e7dbee7246acda3b706c8134d15b9cb), I get below
> > similar issue as Donald reported, it happened with the cmd (which did
> > work with 5.12 kernel).
> > 
> > vm79:~/mdadm> sudo ./test --dev=loop --tests=05r1-add-internalbitmap
> 
> Ok, so this test passes for me, but my VM was not running with bfq. It
> also seems we have layers upon layers of different bugs to untangle.
> Perhaps you can try the tests with bfq disabled to make progress on the
> other regression I reported.
> 
> If I enable bfq and set the loop devices to the bfq scheduler, then I
> hit the same bug as you and Donald. It's clearly a NULL pointer
> de-reference in the bfq code, which seems to be triggered on the
> partition read after mdadm opens a block device (not sure if it's the md
> device or the loop device but I suspect the latter seeing it's not going
> through any md code).
> 
> Simplifying things down a bit, the null pointer dereference can be
> triggered by creating an md device with loop devices that have bfq
> scheduler set:
> 
>   mdadm --create --run /dev/md0 --level=1 -n2 /dev/loop0 /dev/loop1
> 
> The crash occurs in bfq_bio_bfqg() with blkg_to_bfqg() returning NULL.
> It's hard to trace where the NULL comes from in there -- the code is a
> bit complex.
> 
> I've found that the bfq bug exists in current md-next (42b805af102) but
> did not trigger in the base tag of v5.18-rc3. Bisecting revealed the bug
> was introduced by:
> 
>   4e54a2493e58 ("bfq: Get rid of __bio_blkcg() usage")
> 
> Reverting that commit and the next commit (075a53b7) on top of md-next
> was confirmed to fix the bug.
> 
> I've copied Jan, Jens and Paolo who can hopefully help with this. A
> cleaned up stack trace follows this email for their benefit.

So I've debugged this. The crash happens on the very first bio submitted to
the md0 device. The problem is that this bio gets remapped to loop0 - this
happens through bio_alloc_clone() -> __bio_clone() which ends up calling
bio_clone_blkg_association(). Now the resulting bio is inconsistent - it's
dst_bio->bi_bdev is pointing to loop0 while dst_bio->bi_blkg is pointing to
blkcg_gq associated with md0 request queue. And this breaks BFQ because
when this bio is inserted to loop0 request queue, BFQ looks at
bio->bi_blkg->q (it is a bit more complex than that but this is the gist
of the problem), expects its data there but BFQ is not initialized for md0
request_queue.

Now I think this is a bug in __bio_clone() but the inconsistency in the bio
is very much what we asked bio_clone_blkg_association() to do so maybe I'm
missing something and bios that are associated with one bdev but pointing
to blkg of another bdev are fine and controllers are supposed to handle
that (although I'm not sure how should they do that). So I'm asking here
before I just go and delete bio_clone_blkg_association() from
__bio_clone()...

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR