On Wed, Nov 22 2017, Mikulas Patocka wrote: > On Wed, 22 Nov 2017, NeilBrown wrote: > >> On Tue, Nov 21 2017, Mikulas Patocka wrote: >> >> > On Tue, 21 Nov 2017, Mike Snitzer wrote: >> > >> >> On Tue, Nov 21 2017 at 4:23pm -0500, >> >> Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote: >> >> >> >> > This is not correct: >> >> > >> >> > 2206 static void dm_wq_work(struct work_struct *work) >> >> > 2207 { >> >> > 2208 struct mapped_device *md = container_of(work, struct mapped_device, work); >> >> > 2209 struct bio *bio; >> >> > 2210 int srcu_idx; >> >> > 2211 struct dm_table *map; >> >> > 2212 >> >> > 2213 if (!bio_list_empty(&md->rescued)) { >> >> > 2214 struct bio_list list; >> >> > 2215 spin_lock_irq(&md->deferred_lock); >> >> > 2216 list = md->rescued; >> >> > 2217 bio_list_init(&md->rescued); >> >> > 2218 spin_unlock_irq(&md->deferred_lock); >> >> > 2219 while ((bio = bio_list_pop(&list))) >> >> > 2220 generic_make_request(bio); >> >> > 2221 } >> >> > 2222 >> >> > 2223 map = dm_get_live_table(md, &srcu_idx); >> >> > 2224 >> >> > 2225 while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) { >> >> > 2226 spin_lock_irq(&md->deferred_lock); >> >> > 2227 bio = bio_list_pop(&md->deferred); >> >> > 2228 spin_unlock_irq(&md->deferred_lock); >> >> > 2229 >> >> > 2230 if (!bio) >> >> > 2231 break; >> >> > 2232 >> >> > 2233 if (dm_request_based(md)) >> >> > 2234 generic_make_request(bio); >> >> > 2235 else >> >> > 2236 __split_and_process_bio(md, map, bio); >> >> > 2237 } >> >> > 2238 >> >> > 2239 dm_put_live_table(md, srcu_idx); >> >> > 2240 } >> >> > >> >> > You can see that if we are in dm_wq_work in __split_and_process_bio, we >> >> > will not process md->rescued list. >> >> >> >> Can you elaborate further? We cannot be "in dm_wq_work in >> >> __split_and_process_bio" simultaneously. Do you mean as a side-effect >> >> of scheduling away from __split_and_process_bio? >> >> >> >> The more detail you can share the better. >> > >> > Suppose this scenario: >> > >> > * dm_wq_work calls __split_and_process_bio >> > * __split_and_process_bio eventually reaches the function snapshot_map >> > * snapshot_map attempts to take the snapshot lock >> > >> > * the snapshot lock could be released only if some bios submitted by the >> > snapshot driver to the underlying device complete >> > * the bios submitted to the underlying device were already offloaded by >> > some other task and they are waiting on the list md->rescued >> > * the bios waiting on md->rescued are not processed, because dm_wq_work is >> > blocked in snapshot_map (called from __split_and_process_bio) >> >> Yes, I think you are right. >> >> I think the solution is to get rid of the dm_offload() infrastructure >> and make it not necessary. >> i.e. discard my patches >> dm: prepare to discontinue use of BIOSET_NEED_RESCUER >> and >> dm: revise 'rescue' strategy for bio-based bioset allocations >> >> And build on "dm: ensure bio submission follows a depth-first tree walk" >> which was written after those and already makes dm_offload() less >> important. >> >> Since that "depth-first" patch, every request to the dm device, after >> the initial splitting, allocates just one dm_target_io structure, and >> makes just one __map_bio() call, and so will behave exactly the way >> generic_make_request() expects and copes with - thus avoiding awkward >> dependencies and deadlocks. Except.... >> >> a/ If any target defines ->num_write_bios() to return >1, >> __clone_and_map_data_bio() will make multiple calls to alloc_tio() >> and __map_bio(), which might need rescuing. >> But no target defines num_write_bios, and none have since it was >> removed from dm-cache 4.5 years ago. >> Can we discard num_write_bios?? >> >> b/ If any target sets any of num_{flush,discard,write_same,write_zeroes}_bios >> to a value > 1, then __send_duplicate_bios() will also make multiple >> calls to alloc_tio() and __map_bio(). >> Some do. >> dm-cache-target: flush=2 >> dm-snap: flush=2 >> dm-stripe: discard, write_same, write_zeroes all set to 'stripes'. >> >> These will only be a problem if the second (or subsequent) alloc_tio() >> blocks waiting for an earlier allocation to complete. This will only >> be a problem if multiple threads are each trying to allocate multiple >> dm_target_io from the same bioset at the same time. >> This is rare and should be easier to address than the current >> dm_offload() approach. >> One possibility would be to copy the approach taken by >> crypt_alloc_buffer() which needs to allocate multiple entries from a >> mempool. >> It first tries the with GFP_NOWAIT. If that fails it take a mutex and >> tries with GFP_NOIO. This mean only one thread will try to allocate >> multiple bios at once, so there can be no deadlock. >> >> Below are two RFC patches. The first removes num_write_bios. >> The second is incomplete and makes a stab are allocating multiple bios >> at once safely. >> A third would be needed to remove dm_offload() etc... but I cannot quite >> fit that in today - must be off. >> >> Thanks, >> NeilBrown > > Another problem is this: > > struct bio *b = bio_clone_bioset(bio, GFP_NOIO, md->queue->bio_split); > bio_advance(b, (bio_sectors(b) - ci.sector_count) << 9); > bio_chain(b, bio); > > What if it blocks because the bioset is exhausted? > > The code basically builds a chain of bios of unlimited length (suppose for > example a case when we are splitting on every sector boundary, so there > will be one bio for every sector in the original bio), it could exhaust > the bioset easily. > > It would be better to use mechanism from md-raid that chains all the > sub-bios to the same master bio and doesn't create long chains of bios: > > if (max_sectors < bio_sectors(bio)) { > struct bio *split = bio_split(bio, max_sectors, > gfp, conf->bio_split); > bio_chain(split, bio); > generic_make_request(bio); > bio = split; > r1_bio->master_bio = bio; > r1_bio->sectors = max_sectors; > } > > Mikulas Yes, you are right something like that would be better. Also send_changing_extent_only allocates bios in a loop which can cause problems. I think we need to get __split_and_process_non_flush(), in all its branches, to check if len is too large for a single request, and if it is, create a clone for the prefix, attached that to the ci and map it, advance the original bio, and call generic_make_request on it. That shouldn't be too hard, but it is a change that would touch a few places. I'll see if I can write something. Thanks, NeilBrown
Attachment:
signature.asc
Description: PGP signature