On Wed, 15 Aug 2012, Kent Overstreet wrote: > > Both md and dm use __GFP_WAIT allocations from mempools in > > generic_make_request. > > > > I think you found an interesting bug here. Suppose that we have three > > stacked devices: d1 depends on d2 and d2 depends on d3. > > > > Now, a bio b1 comes to d1. d1 splits it to two bios: b2.1 and b2.2 and > > sends them to the device d2 - these bios end up in current->bio_list. The > > driver for d2 receives bio b2.1 and sends bio b3.1 to d3. Now, > > current->bio_list contains bios b2.2, b3.1. Now, bio b2.2 is popped off > > the bio list and the driver for d2 is called with b2.2 - suppose that for > > some reason mempool in d2 is exhausted and the driver needs to wait until > > b2.1 finishes. b2.1 never finishes, because b2.1 depends on b3.1 and b3.1 > > is still in current->bio_list. So it deadlocks. > > > > Turning off __GFP_WAIT fixes nothing - it just turns one bug (a possible > > deadlock) into another bug (a possible bio failure with -ENOMEM). > > > > Increasing mempool sizes doesn't fix it either, the bio would just have to > > be split to more pieces in the above example to make it deadlock. > > > > I think the above possible deadlock scenario could be fixed by reversing > > current->bio_list processing - i.e. when some device's make_request_fn > > adds some bios to current->bio_list, these bios are processed before other > > bios that were on the list before. This way, bio list would contain "b3.1, > > b2.2" instead of "b2.2, b3.1" in the above example and the deadlock would > > not happen. > > Your patch isn't sufficient in the case where a bio may be split > multiple times (I'm not sure if it's sufficient in the case where bios > are only split once; trying to work it all out makes my head hurt). > > You don't need multiple stacked drivers to see this; just the case where > a single driver is running that splits multiple times is sufficient, if > you have enough threads submitting at the same time. That is true. dm splits one bio to multiple bios, so it could still deadlock. Mikulas > Bcache works around this with the trick I mentioned previously, where it > masks out _GFP_WAIT if current->bio_list != NULL, and punts to workqueue > if the allocation fails. > > This works but it'd have to be done in each stacking driver... it's not > a generic solution, and it's a pain in the ass. > > I came up with another idea the other day. Conceptually, it inverts my > previous workaround - the punting to workqueue is done in the allocation > code when necessary, for the bios that would be blocked. > > It's lightly tested, gonna rig up some kind of fault injection and test > it more thoroughly later. > > commit d61bbb074cc8f2e089eb57e2bee8e84500f390a8 > Author: Kent Overstreet <koverstreet@xxxxxxxxxx> > Date: Mon Aug 13 18:11:01 2012 -0700 > > block: Avoid deadlocks with bio allocation by stacking drivers > > Previously, if we ever try to allocate more than once from the same bio > set while running under generic_make_request(), we risk deadlock. > > This would happen if e.g. a bio ever needed to be split more than once, > and it's difficult to handle correctly in the drivers - so in practice > it's not. > > This patch fixes this issue by allocating a rescuer workqueue for each > bio_set, and punting queued bios to said rescuer when necessary: > > diff --git a/fs/bio.c b/fs/bio.c > index bc4265a..7b4f655 100644 > --- a/fs/bio.c > +++ b/fs/bio.c > @@ -281,6 +281,23 @@ void bio_reset(struct bio *bio) > } > EXPORT_SYMBOL(bio_reset); > > +static void bio_alloc_rescue(struct work_struct *work) > +{ > + struct bio_set *bs = container_of(work, struct bio_set, rescue_work); > + struct bio *bio; > + > + while (1) { > + spin_lock(&bs->rescue_lock); > + bio = bio_list_pop(&bs->rescue_list); > + spin_unlock(&bs->rescue_lock); > + > + if (!bio) > + break; > + > + generic_make_request(bio); > + } > +} > + > /** > * bio_alloc_bioset - allocate a bio for I/O > * @gfp_mask: the GFP_ mask given to the slab allocator > @@ -294,6 +311,7 @@ EXPORT_SYMBOL(bio_reset); > **/ > struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs) > { > + gfp_t saved_gfp = gfp_mask; > unsigned front_pad; > unsigned inline_vecs; > unsigned long idx = BIO_POOL_NONE; > @@ -308,16 +326,39 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs) > p = kmalloc(sizeof(struct bio) + > nr_iovecs * sizeof(struct bio_vec), > gfp_mask); > + > front_pad = 0; > inline_vecs = nr_iovecs; > } else { > - p = mempool_alloc(bs->bio_pool, gfp_mask); > + /* > + * If we're running under generic_make_request() > + * (current->bio_list != NULL), we risk deadlock if we sleep on > + * allocation and there's already bios on current->bio_list that > + * were allocated from the same bio_set; they won't be submitted > + * (and thus freed) as long as we're blocked here. > + * > + * To deal with this, we first try the allocation without using > + * the mempool; if that fails, we punt all the bios on > + * current->bio_list to a different thread and then retry the > + * allocation with the original gfp mask. > + */ > + > + if (current->bio_list && > + !bio_list_empty(current->bio_list) && > + (gfp_mask & __GFP_WAIT)) > + gfp_mask &= GFP_ATOMIC; > +retry: > + if (gfp_mask & __GFP_WAIT) > + p = mempool_alloc(bs->bio_pool, gfp_mask); > + else > + p = kmem_cache_alloc(bs->bio_slab, gfp_mask); > + > front_pad = bs->front_pad; > inline_vecs = BIO_INLINE_VECS; > } > > if (unlikely(!p)) > - return NULL; > + goto err; > > bio = p + front_pad; > bio_init(bio); > @@ -338,6 +379,19 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs) > > err_free: > mempool_free(p, bs->bio_pool); > +err: > + if (gfp_mask != saved_gfp) { > + gfp_mask = saved_gfp; > + > + spin_lock(&bs->rescue_lock); > + bio_list_merge(&bs->rescue_list, current->bio_list); > + bio_list_init(current->bio_list); > + spin_unlock(&bs->rescue_lock); > + > + queue_work(bs->rescue_workqueue, &bs->rescue_work); > + goto retry; > + } > + > return NULL; > } > EXPORT_SYMBOL(bio_alloc_bioset); > @@ -1546,6 +1600,9 @@ static void biovec_free_pools(struct bio_set *bs) > > void bioset_free(struct bio_set *bs) > { > + if (bs->rescue_workqueue) > + destroy_workqueue(bs->rescue_workqueue); > + > if (bs->bio_pool) > mempool_destroy(bs->bio_pool); > > @@ -1581,6 +1638,10 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad) > > bs->front_pad = front_pad; > > + spin_lock_init(&bs->rescue_lock); > + bio_list_init(&bs->rescue_list); > + INIT_WORK(&bs->rescue_work, bio_alloc_rescue); > + > bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad); > if (!bs->bio_slab) { > kfree(bs); > @@ -1591,9 +1652,14 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad) > if (!bs->bio_pool) > goto bad; > > - if (!biovec_create_pools(bs, pool_size)) > - return bs; > + if (biovec_create_pools(bs, pool_size)) > + goto bad; > + > + bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0); > + if (!bs->rescue_workqueue) > + goto bad; > > + return bs; > bad: > bioset_free(bs); > return NULL; > diff --git a/include/linux/bio.h b/include/linux/bio.h > index b22c22b..ba5b52e 100644 > --- a/include/linux/bio.h > +++ b/include/linux/bio.h > @@ -290,39 +290,6 @@ static inline int bio_associate_current(struct bio *bio) { return -ENOENT; } > static inline void bio_disassociate_task(struct bio *bio) { } > #endif /* CONFIG_BLK_CGROUP */ > > -/* > - * bio_set is used to allow other portions of the IO system to > - * allocate their own private memory pools for bio and iovec structures. > - * These memory pools in turn all allocate from the bio_slab > - * and the bvec_slabs[]. > - */ > -#define BIO_POOL_SIZE 2 > -#define BIOVEC_NR_POOLS 6 > -#define BIOVEC_MAX_IDX (BIOVEC_NR_POOLS - 1) > - > -struct bio_set { > - struct kmem_cache *bio_slab; > - unsigned int front_pad; > - > - mempool_t *bio_pool; > -#if defined(CONFIG_BLK_DEV_INTEGRITY) > - mempool_t *bio_integrity_pool; > -#endif > - mempool_t *bvec_pool; > -}; > - > -struct biovec_slab { > - int nr_vecs; > - char *name; > - struct kmem_cache *slab; > -}; > - > -/* > - * a small number of entries is fine, not going to be performance critical. > - * basically we just need to survive > - */ > -#define BIO_SPLIT_ENTRIES 2 > - > #ifdef CONFIG_HIGHMEM > /* > * remember never ever reenable interrupts between a bvec_kmap_irq and > @@ -497,6 +464,48 @@ static inline struct bio *bio_list_get(struct bio_list *bl) > return bio; > } > > +/* > + * bio_set is used to allow other portions of the IO system to > + * allocate their own private memory pools for bio and iovec structures. > + * These memory pools in turn all allocate from the bio_slab > + * and the bvec_slabs[]. > + */ > +#define BIO_POOL_SIZE 2 > +#define BIOVEC_NR_POOLS 6 > +#define BIOVEC_MAX_IDX (BIOVEC_NR_POOLS - 1) > + > +struct bio_set { > + struct kmem_cache *bio_slab; > + unsigned int front_pad; > + > + mempool_t *bio_pool; > +#if defined(CONFIG_BLK_DEV_INTEGRITY) > + mempool_t *bio_integrity_pool; > +#endif > + mempool_t *bvec_pool; > + > + /* > + * Deadlock avoidance for stacking block drivers: see comments in > + * bio_alloc_bioset() for details > + */ > + spinlock_t rescue_lock; > + struct bio_list rescue_list; > + struct work_struct rescue_work; > + struct workqueue_struct *rescue_workqueue; > +}; > + > +struct biovec_slab { > + int nr_vecs; > + char *name; > + struct kmem_cache *slab; > +}; > + > +/* > + * a small number of entries is fine, not going to be performance critical. > + * basically we just need to survive > + */ > +#define BIO_SPLIT_ENTRIES 2 > + > #if defined(CONFIG_BLK_DEV_INTEGRITY) > > #define bip_vec_idx(bip, idx) (&(bip->bip_vec[(idx)])) > -- To unsubscribe from this list: send the line "unsubscribe linux-bcache" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html