On Tue, 21 Nov 2017, Mike Snitzer wrote:

> On Tue, Nov 21 2017 at  4:23pm -0500,
> Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
> 
> > This is not correct:
> > 
> >    2206 static void dm_wq_work(struct work_struct *work)
> >    2207 {
> >    2208         struct mapped_device *md = container_of(work, struct mapped_device, work);
> >    2209         struct bio *bio;
> >    2210         int srcu_idx;
> >    2211         struct dm_table *map;
> >    2212 
> >    2213         if (!bio_list_empty(&md->rescued)) {
> >    2214                 struct bio_list list;
> >    2215                 spin_lock_irq(&md->deferred_lock);
> >    2216                 list = md->rescued;
> >    2217                 bio_list_init(&md->rescued);
> >    2218                 spin_unlock_irq(&md->deferred_lock);
> >    2219                 while ((bio = bio_list_pop(&list)))
> >    2220                         generic_make_request(bio);
> >    2221         }
> >    2222 
> >    2223         map = dm_get_live_table(md, &srcu_idx);
> >    2224 
> >    2225         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> >    2226                 spin_lock_irq(&md->deferred_lock);
> >    2227                 bio = bio_list_pop(&md->deferred);
> >    2228                 spin_unlock_irq(&md->deferred_lock);
> >    2229 
> >    2230                 if (!bio)
> >    2231                         break;
> >    2232 
> >    2233                 if (dm_request_based(md))
> >    2234                         generic_make_request(bio);
> >    2235                 else
> >    2236                         __split_and_process_bio(md, map, bio);
> >    2237         }
> >    2238 
> >    2239         dm_put_live_table(md, srcu_idx);
> >    2240 }
> > 
> > You can see that if we are in dm_wq_work in __split_and_process_bio, we
> > will not process md->rescued list.
> 
> Can you elaborate further?  We cannot be "in dm_wq_work in
> __split_and_process_bio" simultaneously.  Do you mean as a side-effect
> of scheduling away from __split_and_process_bio?
> 
> The more detail you can share the better.

Suppose this scenario:

* dm_wq_work calls __split_and_process_bio
* __split_and_process_bio eventually reaches the function snapshot_map
* snapshot_map attempts to take the snapshot lock
* the snapshot lock could be released only if some bios submitted by the
  snapshot driver to the underlying device complete
* the bios submitted to the underlying device were already offloaded by
  some other task and they are waiting on the list md->rescued
* the bios waiting on md->rescued are not processed, because dm_wq_work
  is blocked in snapshot_map (called from __split_and_process_bio)

> > The processing of md->rescued is also wrong - bios for different devices
> > must be offloaded to different helper threads, so that processing a bio
> > for a lower device doesn't depend on processing a bio for a higher
> > device. If you offload all the bios on current->bio_list to the same
> > thread, the bios still depend on each other and the deadlock will still
> > happen.
> 
> Commit 325738403 ("dm: revise 'rescue' strategy for bio-based bioset
> allocations") speaks to this with:
> 
> "Note that only current->bio_list[0] is offloaded.  current->bio_list[1]
> contains bios that were scheduled *before* the current one started, so
> they must have been submitted from higher up the stack, and we cannot be
> waiting for them here (thanks to the "dm: ensure bio submission follows
> a depth-first tree walk" commit).  Also, we now rescue *all* bios on the
> list as there is nothing to be gained by being more selective."

I think you are right - if we only offload current->bio_list[0], then
mixing of dependent bios on the offloaded list won't happen.

> And again: this patchset passes your dm-snapshot deadlock test.  Is
> that test somehow lacking?

With your patchset, the deadlock would happen only if bios are queued on
&md->deferred - and that happens only in case of resume or if we are
processing REQ_PREFLUSH with non-zero data size.
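For reference, the flush-with-data case ends up on md->deferred via
dec_pending() in drivers/md/dm.c, roughly like this (paraphrased from
that era's code, not a verbatim quote):

	/* the empty-flush part of a flush-with-data has completed */
	if ((bio->bi_opf & REQ_PREFLUSH) && bio->bi_iter.bi_size) {
		/* reissue the data part without REQ_PREFLUSH */
		bio->bi_opf &= ~REQ_PREFLUSH;
		queue_io(md, bio);	/* adds the bio to md->deferred */
	}

and dm_wq_work then picks that bio up from md->deferred.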
So, the simple test that I wrote doesn't trigger it, but a more complex
test involving REQ_PREFLUSH could.

> Or do you see a hypothetical case where a deadlock is still possible?
> That is of less concern.  I'd prefer that we tackle problems for
> targets, and associated scenarios, that we currently support.
> 
> Either way, happy to review this with you further.  Any fixes are
> welcomed too.  But I'd like us to head in a direction that this patchset
> is taking us.  Specifically: away from DM relying on BIOSET_NEED_RESCUER.
> 
> Thanks,
> Mike

Mikulas
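P.S. To make the "different helper threads" idea above concrete, here is a
purely illustrative sketch (not a patch; the names dm_rescuer,
dm_rescue_work and dm_offload_to_rescuer are made up for the example).
The point is that each device would drain its own rescued list from its
own workqueue, independent of md->wq, so rescuing bios destined for a
lower device never waits on a worker that is blocked inside a higher
device's map function:

	struct dm_rescuer {
		spinlock_t lock;
		struct bio_list rescued;
		struct work_struct work;
		struct workqueue_struct *wq;	/* one per device, WQ_MEM_RECLAIM */
	};

	static void dm_rescue_work(struct work_struct *work)
	{
		struct dm_rescuer *r = container_of(work, struct dm_rescuer, work);
		struct bio_list list;
		struct bio *bio;

		/* take the whole rescued list and resubmit it */
		spin_lock_irq(&r->lock);
		list = r->rescued;
		bio_list_init(&r->rescued);
		spin_unlock_irq(&r->lock);

		while ((bio = bio_list_pop(&list)))
			generic_make_request(bio);
	}

	static void dm_offload_to_rescuer(struct dm_rescuer *r, struct bio *bio)
	{
		/* route the bio to the rescuer of the device it targets */
		spin_lock_irq(&r->lock);
		bio_list_add(&r->rescued, bio);
		spin_unlock_irq(&r->lock);
		queue_work(r->wq, &r->work);
	}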