Still poking around here; I think my previous emails were leading me down a dead end, at least w.r.t. unsafe locking. The one code path I identified below (blk_bio_list_merge() via __blk_mq_sched_bio_merge()) won't actually get hit, since we'll instead call bfq_bio_merge(). The crashes I'm seeing have been on the read side (stacks below):

[160595.656560] bfq_add_bfqq_busy+0x110/0x1ec
[160595.661142] bfq_add_request+0x6bc/0x980
[160595.666602] bfq_insert_request+0x8ec/0x1240
[160595.671762] bfq_insert_requests+0x58/0x9c
[160595.676420] blk_mq_sched_insert_request+0x11c/0x198
[160595.682107] blk_mq_submit_bio+0x270/0x62c
[160595.686759] __submit_bio_noacct_mq+0xec/0x178
[160595.691926] submit_bio+0x120/0x184
[160595.695990] ext4_mpage_readpages+0x77c/0x7c8
[160595.701026] ext4_readpage+0x60/0xb0
[160595.705158] filemap_read_page+0x54/0x114
[160595.711961] filemap_fault+0x228/0x5f4
[160595.716272] do_read_fault+0xe0/0x1f0
[160595.720487] do_fault+0x40/0x1c8

or

[28497.344552] bfq_add_bfqq_busy+0x110/0x1ec
[28497.348787] bfq_add_request+0x6bc/0x980
[28497.352845] bfq_insert_request+0x8ec/0x1240
[28497.357265] bfq_insert_requests+0x58/0x9c
[28497.361503] blk_mq_sched_insert_requests+0x8c/0x180
[28497.366647] blk_mq_flush_plug_list+0x15c/0x1e0
[28497.371376] blk_flush_plug_list+0xf0/0x124
[28497.375877] blk_finish_plug+0x34/0x48
[28497.379846] read_pages+0x21c/0x288
[28497.383510] page_cache_ra_unbounded+0x1f0/0x24c
[28497.388346] do_page_cache_ra+0x48/0x54
[28497.392388] do_sync_mmap_readahead+0x190/0x1e0
[28497.397150] filemap_fault+0x150/0x5f4
[28497.401111] do_read_fault+0xe0/0x1f0
[28497.404948] do_fault+0x40/0x1c8

In the crashes I'm looking at, the inconsistent bfq_queue appears to be oom_bfqq (waker_bfqq is NULL, but woken_list_node is hashed).

I'm looking at bfq_init_rq(), where we set bfqq->waker_bfqq (potentially to NULL) and only ever hlist_add_head(), never hlist_del_init(). The logic here seems to assume that bfq_get_bfqq_handle_split() returns a freshly allocated bfq_queue. However, it can also return oom_bfqq, which may already be on a waker's woken_list - and then we overwrite waker_bfqq without removing woken_list_node, which can produce exactly the inconsistency seen in the crash. wdyt?

	bfqq = bfq_split_bfqq(bic, bfqq);
	split = true;

	if (!bfqq) {
		/* this can also return oom_bfqq */
		bfqq = bfq_get_bfqq_handle_split(bfqd, bic, bio,
						 true, is_sync,
						 NULL);
		/* this can be set to NULL */
		bfqq->waker_bfqq = old_bfqq->waker_bfqq;
		bfqq->tentative_waker_bfqq = NULL;

(A rough sketch of the shape of fix I'm imagining is at the bottom of this mail.)

An aside, looking at the async case, since I'd like to understand this code better.

---

The call to bfq_split_bfqq() doesn't make sense to me: for bfq_get_bfqq_handle_split() et al. we pass the is_sync parameter, but bfq_split_bfqq() only ever clears bic->bfqq for sync == true. My understanding of the intent is: we decide to split, bfq_split_bfqq() sets bic->bfqq[is_sync] to NULL, and then bfq_get_bfqq_handle_split() sees that it's NULL and allocates a brand-new queue. But for async, bfq_split_bfqq() returns NULL while bic->bfqq[0] was never cleared, so bfq_get_bfqq_handle_split() finds it and returns that already-existing bfqq - and we then go on to modify a bfqq that was not freshly allocated. Should bfq_split_bfqq() take an is_sync parameter here and clear the corresponding bic->bfqq? (Sketch of what I mean at the bottom as well.)
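To make the woken_list point concrete, here's a rough, untested sketch of the shape of fix I have in mind for that branch of bfq_init_rq(). The surrounding lines are just the excerpt above; the hlist_del_init() placement is my guess, not something in the tree:

	if (!bfqq) {
		bfqq = bfq_get_bfqq_handle_split(bfqd, bic, bio,
						 true, is_sync,
						 NULL);
		/*
		 * bfqq may be oom_bfqq here, and oom_bfqq can already be
		 * hashed on some waker's woken_list.  Unhash it before we
		 * overwrite waker_bfqq (possibly with NULL) so that
		 * waker_bfqq and woken_list_node stay consistent.
		 * hlist_del_init() is a no-op for an unhashed node.
		 */
		hlist_del_init(&bfqq->woken_list_node);

		bfqq->waker_bfqq = old_bfqq->waker_bfqq;
		bfqq->tentative_waker_bfqq = NULL;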
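And for the aside, this is roughly what I'd naively expect bfq_split_bfqq() to look like if the intent really is "clear the slot so bfq_get_bfqq_handle_split() allocates a fresh queue". The body is paraphrased from my reading of the function and abbreviated, so treat it as a sketch rather than a patch:

	static struct bfq_queue *bfq_split_bfqq(struct bfq_io_cq *bic,
						struct bfq_queue *bfqq,
						bool is_sync)
	{
		if (bfqq_process_refs(bfqq) == 1) {
			bfqq->pid = current->pid;
			bfq_clear_bfqq_coop(bfqq);
			bfq_clear_bfqq_split_coop(bfqq);
			return bfqq;
		}

		/* today this is bic_set_bfqq(bic, NULL, 1), i.e. sync only */
		bic_set_bfqq(bic, NULL, is_sync);

		bfq_put_cooperator(bfqq);
		bfq_put_queue(bfqq);
		return NULL;
	}

with the caller in bfq_init_rq() becoming:

	bfqq = bfq_split_bfqq(bic, bfqq, is_sync);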
---
On Mon, Oct 31, 2022 at 10:05 PM Khazhy Kumykov <khazhy@xxxxxxxxxxxx> wrote:
>
> On Mon, Oct 31, 2022 at 9:37 PM Khazhy Kumykov <khazhy@xxxxxxxxxxxx> wrote:
> >
> > I'm investigating a NULL deref crash in bfq_add_bfqq_busy(), wherein
> > bfqq->woken_list_node is hashed, but bfqq->waker_bfqq is NULL - which
> > seems inconsistent per my reading of the code.