On Tuesday 28 August 2007 10:54, Evgeniy Polyakov wrote:
> On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips (phillips@xxxxxxxxx) wrote:
> > > We do not care about one cpu being able to increase its counter
> > > higher than the limit. Such inaccuracy (the maximum number of bios
> > > in flight can thus exceed the limit, by up to the number of
> > > CPUs - 1) is the price for removing the atomic operation. I thought
> > > I pointed it out in the original description, but might have
> > > forgotten, that if it becomes an issue, atomic operations can be
> > > introduced there. Any uber-precise measurements in the case when we
> > > are close to the edge will not give us any benefit at all, since we
> > > are already in the grey area.
> >
> > This is not just inaccurate, it is suicide. Keep leaking throttle
> > counts and eventually all of them will be gone. No more IO on that
> > block device!
>
> First, the number of increment and decrement operations is the same,
> so it will dance around the limit in both directions.

No. Please go and read the description of the race again. A count gets
irretrievably lost when the store of the first decrement is overwritten
by the store of the second. Data gets lost. Atomic operations exist to
prevent exactly that sort of thing. You either need to use them, or you
need a deep understanding of SMP read and write ordering in order to
preserve data integrity by some equivalent algorithm.
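To see the lost update concretely, here is a minimal standalone
demonstration (my own illustration, in plain C11 with pthreads rather
than kernel code; every name in it is invented). Two threads decrement
one counter with a plain read-modify-write and another with an atomic
operation; the plain counter typically comes up short of the expected
total because one CPU's store overwrites the other's, which is exactly
the leak at issue:

/* race_demo.c: gcc -O2 -pthread race_demo.c && ./a.out */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define LOOPS 1000000

static long plain;		/* unprotected counter: loses updates */
static atomic_long safe;	/* atomic counter: never loses updates */

static void *worker(void *unused)
{
	for (int i = 0; i < LOOPS; i++) {
		plain--;			/* load, decrement, store: racy */
		atomic_fetch_sub(&safe, 1);	/* indivisible decrement */
	}
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, worker, NULL);
	pthread_create(&t2, NULL, worker, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	printf("plain = %ld (expected %d), atomic = %ld\n",
	       plain, -2 * LOOPS, (long)atomic_load(&safe));
	return 0;
}

In the kernel, the atomic_t primitives (atomic_add, atomic_sub and
friends) play the role of atomic_fetch_sub here.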
> Let's solve problems in order of their appearance. If the bio
> structure is allowed to grow, then the whole patch set can be done
> better.

How about the patch below? It throttles any block driver by introducing
a throttle metric method, so that each block driver can keep track of
its own resource consumption in units of its choosing. As an
(important) example, it implements a simple metric for device mapper
devices. Other block devices will work as before, because they do not
define any metric. (A sketch of how another driver might opt in appears
after the patch.)

Short, sweet and untested, which is why I have not posted it until now.
This patch originally kept its accounting info in backing_dev_info, but
that structure seems to be in some flux, and it is just a part of
struct queue anyway, so I lifted the throttle accounting up into struct
queue. We should be able to report on the efficacy of this patch in
terms of deadlock prevention pretty soon.

--- 2.6.22.clean/block/ll_rw_blk.c	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/block/ll_rw_blk.c	2007-08-24 12:07:16.000000000 -0700
@@ -3237,6 +3237,15 @@ end_io:
  */
 void generic_make_request(struct bio *bio)
 {
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+
+	if (q && q->metric) {
+		int need = bio->bi_reserved = q->metric(bio);
+		bio->queue = q;
+		wait_event_interruptible(q->throttle_wait, atomic_read(&q->available) >= need);
+		atomic_sub(need, &q->available);
+	}
+
 	if (current->bio_tail) {
 		/* make_request is active */
 		*(current->bio_tail) = bio;
--- 2.6.22.clean/drivers/md/dm.c	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/drivers/md/dm.c	2007-08-24 12:14:23.000000000 -0700
@@ -880,6 +880,11 @@ static int dm_any_congested(void *conges
 	return r;
 }
 
+static unsigned dm_metric(struct bio *bio)
+{
+	return bio->bi_vcnt;
+}
+
 /*-----------------------------------------------------------------
  * An IDR is used to keep track of allocated minor numbers.
  *---------------------------------------------------------------*/
@@ -997,6 +1002,10 @@ static struct mapped_device *alloc_dev(i
 		goto bad1_free_minor;
 
 	md->queue->queuedata = md;
+	md->queue->metric = dm_metric;
+	atomic_set(&md->queue->available, md->queue->capacity = 1000);
+	init_waitqueue_head(&md->queue->throttle_wait);
+
 	md->queue->backing_dev_info.congested_fn = dm_any_congested;
 	md->queue->backing_dev_info.congested_data = md;
 	blk_queue_make_request(md->queue, dm_request);
--- 2.6.22.clean/fs/bio.c	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/fs/bio.c	2007-08-24 12:10:41.000000000 -0700
@@ -1025,7 +1025,12 @@ void bio_endio(struct bio *bio, unsigned
 		bytes_done = bio->bi_size;
 	}
 
-	bio->bi_size -= bytes_done;
+	if (!(bio->bi_size -= bytes_done) && bio->bi_reserved) {
+		struct request_queue *q = bio->queue;
+		atomic_add(bio->bi_reserved, &q->available);
+		bio->bi_reserved = 0; /* just in case */
+		wake_up(&q->throttle_wait);
+	}
 	bio->bi_sector += (bytes_done >> 9);
 
 	if (bio->bi_end_io)
--- 2.6.22.clean/include/linux/bio.h	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/include/linux/bio.h	2007-08-24 11:53:51.000000000 -0700
@@ -109,6 +109,9 @@ struct bio {
 	bio_end_io_t		*bi_end_io;
 	atomic_t		bi_cnt;		/* pin count */
 
+	struct request_queue	*queue;		/* for throttling */
+	unsigned		bi_reserved;	/* throttle metric */
+
 	void			*bi_private;
 	bio_destructor_t	*bi_destructor;	/* destructor */
--- 2.6.22.clean/include/linux/blkdev.h	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/include/linux/blkdev.h	2007-08-24 12:04:14.000000000 -0700
@@ -395,6 +395,10 @@ struct request_queue
 	struct work_struct	unplug_work;
 
 	struct backing_dev_info	backing_dev_info;
+	unsigned (*metric)(struct bio *bio);	/* bio throttle metric */
+	wait_queue_head_t	throttle_wait;
+	atomic_t		available;
+	unsigned		capacity;
 
 	/*
 	 * The queue owner gets to use this for whatever they like.
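To show the intended driver-side usage, here is a hypothetical sketch
of how another make_request-style driver could opt in, assuming the
patch above is applied. Nothing below is part of the patch: my_metric
and my_throttle_init are invented names, and charging per KiB of
payload rather than per bio_vec (as dm_metric does) is an arbitrary
choice of units.

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Hypothetical metric: one throttle unit per KiB of payload, minimum
 * one unit, instead of dm's one unit per bio_vec. */
static unsigned my_metric(struct bio *bio)
{
	return (bio->bi_size >> 10) + 1;
}

/* Called once when the driver sets up its queue. With a capacity of
 * 1000 units this bounds in-flight payload to roughly 1000 KiB. */
static void my_throttle_init(struct request_queue *q)
{
	q->metric = my_metric;
	atomic_set(&q->available, q->capacity = 1000);
	init_waitqueue_head(&q->throttle_wait);
}

Since generic_make_request() only throttles when q->metric is non-null,
drivers that never call anything like my_throttle_init keep their
current behavior.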