On Tue, Nov 21 2017 at 8:21pm -0500, Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote: > > > On Tue, 21 Nov 2017, Mike Snitzer wrote: > > > On Tue, Nov 21 2017 at 4:23pm -0500, > > Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote: > > > > > This is not correct: > > > > > > 2206 static void dm_wq_work(struct work_struct *work) > > > 2207 { > > > 2208 struct mapped_device *md = container_of(work, struct mapped_device, work); > > > 2209 struct bio *bio; > > > 2210 int srcu_idx; > > > 2211 struct dm_table *map; > > > 2212 > > > 2213 if (!bio_list_empty(&md->rescued)) { > > > 2214 struct bio_list list; > > > 2215 spin_lock_irq(&md->deferred_lock); > > > 2216 list = md->rescued; > > > 2217 bio_list_init(&md->rescued); > > > 2218 spin_unlock_irq(&md->deferred_lock); > > > 2219 while ((bio = bio_list_pop(&list))) > > > 2220 generic_make_request(bio); > > > 2221 } > > > 2222 > > > 2223 map = dm_get_live_table(md, &srcu_idx); > > > 2224 > > > 2225 while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) { > > > 2226 spin_lock_irq(&md->deferred_lock); > > > 2227 bio = bio_list_pop(&md->deferred); > > > 2228 spin_unlock_irq(&md->deferred_lock); > > > 2229 > > > 2230 if (!bio) > > > 2231 break; > > > 2232 > > > 2233 if (dm_request_based(md)) > > > 2234 generic_make_request(bio); > > > 2235 else > > > 2236 __split_and_process_bio(md, map, bio); > > > 2237 } > > > 2238 > > > 2239 dm_put_live_table(md, srcu_idx); > > > 2240 } > > > > > > You can see that if we are in dm_wq_work in __split_and_process_bio, we > > > will not process md->rescued list. > > > > Can you elaborate further? We cannot be "in dm_wq_work in > > __split_and_process_bio" simultaneously. Do you mean as a side-effect > > of scheduling away from __split_and_process_bio? > > > > The more detail you can share the better. > > Suppose this scenario: > > * dm_wq_work calls __split_and_process_bio Right, I later realized this was the call chain you were referring to. Not sure how I missed it the first time around. > * __split_and_process_bio eventually reaches the function snapshot_map > * snapshot_map attempts to take the snapshot lock > > * the snapshot lock could be released only if some bios submitted by the > snapshot driver to the underlying device complete > * the bios submitted to the underlying device were already offloaded by > some other task and they are waiting on the list md->rescued > * the bios waiting on md->rescued are not processed, because dm_wq_work is > blocked in snapshot_map (called from __split_and_process_bio) SO you're saying the case that Neil doesn't think should happen: https://lkml.org/lkml/2017/11/21/658 ...can happen. > > > The processing of md->rescued is also wrong - bios for different devices > > > must be offloaded to different helper threads, so that processing a bio > > > for a lower device doesn't depend on processing a bio for a higher device. > > > If you offload all the bios on current->bio_list to the same thread, the > > > bios still depend on each other and the deadlock will still happen. > > > > Commit 325738403 ("dm: revise 'rescue' strategy for bio-based bioset > > allocations") speaks to this with: > > > > "Note that only current->bio_list[0] is offloaded. current->bio_list[1] > > contains bios that were scheduled *before* the current one started, so > > they must have been submitted from higher up the stack, and we cannot be > > waiting for them here (thanks to the "dm: ensure bio submission follows > > a depth-first tree walk" commit). Also, we now rescue *all* bios on the > > list as there is nothing to be gained by being more selective." > > I think you are right - if we only offload current->bio_list[0], then > mixing of dependent bios on the offloaded list won't happen. > > > And again: this patchset passes your dm-snapshot deadlock test. Is > > that test somehow lacking? > > With your patchset, the deadlock would happen only if bios are queued on > &md->deferred - and that happens only in case of resume or if we are > processing REQ_PREFLUSH with non-zero data size. > > So, the simple test that I wrote doesn't trigger it, but a more complex > test involving REQ_PREFLUSH could. Makes sense. But I need to think further about _why_ bios submitted to the snapshot driver's underlying device would end up on md->rescued (like you suggested above). Again, Neil thinks it not possible. Neil said: "they will not be recursive calls, so nothing will be added to current->bio_list[0] and nothing will be moved to md->rescued. Each generic_make_request() will completely submit the request in the lower level devel." Mike