On Tue, 2010-08-10 at 10:44 -0400, Vivek Goyal wrote:
> On Tue, Aug 10, 2010 at 03:42:22PM +0200, Heinz Mauelshagen wrote:
> > 
> > This is a new device mapper "throttle" target which allows for
> > throttling reads and writes (ie. enforcing throughput limits) in units
> > of kilobytes per second.
> > 
> 
> Hi Heinz,
> 
> How about extending this stuff to handle cgroups also. So instead of
> having device wide throttling policy, we throttle cgroups. That will
> be a much more useful thing and will serve well the use case of throttling
> virtual machines in cgroup.

Hi Vivek,

needs a serious design discussion but I think we could leverage it to
allow for throttling of cgroups.

> 
> Yesterday I had raised the issue of cgroup IO bandwidth throttling at
> Linux Storage and Filesystem session. I thought that a device mapper
> target will be the easiest thing to do because I can make use of lots
> of existing infrastructure.
> 
> Christoph did not like it because of configuration concerns. He preferred
> something in block layer/request queue. It was also hinted that there
> were some ideas floating of better integration of device mapper
> infrastructure with request queue and this thing should go behind that.

Right, if a block layer change of that kind will be pending, we should
wait for it to settle.

> But the problem is I am not sure how long it is going to take before
> this new infrastructure becomes a reality and it will not be practical
> to wait for that.

Did any reliable plans come out of the discussion or will there be any
in the near future?

> 
> There is a possibility that we can put a hook in __make_request function
> and first take out all the bios and subject them to bandwidth limitation
> and then pass it to lower layers. But that will mean redoing lots of
> common infrastructure which has already been done. For example,
> 
> - What happens to queue congestion semantics.
>   - Request queue already has it based on requests and device mapper
>     seems to have its own congestion functions.

Yes, dm does.

> 
>   - If I go for taking the bio out on request queue and hold them
>     back then I am not sure how to define congestion semantics.
>     To keep congestion semantics simple, it would make sense to
>     create a new request queue (with the help of dm target), and
>     use that.

Yes, that's an obvious approach to stay with the same congestion
semantics.

> 
> - I have yet to think through it but I think I will be doing other common
>   operations like holding back requests in internal queues, dispatching
>   these later with the help of a kernel thread, allowing some to dispatch
>   immediately as these come in, putting processes to sleep and waking
>   them later if we are already holding too many bios etc.
> 
> To me it sounds that doing it is a lot simpler with the help of device
> mapper target. Though the not so nice part is the need of configuring
> another device mapper target on every block device we want to control.

Yes, we'd need identity mappings in the stack to be prepared.

Or we need some __generic_make_request() hack ala bcache to hijack the
request function on the fly.

> 
> Christoph, would it make sense to currently go ahead with device mapper
> target and later convert that to whenever request queue and device mapper
> fusion thing happens. Or, do you have other ideas which I have not been
> able to grasp....

Let's see what he wants to fill in.

Cheers,
Heinz

> 
> Thanks
> Vivek
> 
> > 
> > 
> > I've been using it for a while in testing configurations and think it's
> > valuable for many people requiring simulation of low bandwidth
> > interconnects or simulating different throughput characteristics on
> > distinct address segments of a device (eg. fast outer disk spindles vs.
> > slower inner ones).
> > 
> > Please read Documentation/device-mapper/throttle.txt for how to use it.
> > 
> > Note: this target can be combined with the "delay" target, which is
> > already upstream, in order to set io delays in addition to throttling,
> > again valuable for long distance transport simulations.
> > 
> > 
> > This target should stay separate rather than merged IMO, because it
> > basically serves testing purposes and hence should not complicate any
> > production mapping target. A potential merge with the "delay" target is
> > subject to discussion.
> > 
> > 
> > Signed-off-by: Heinz Mauelshagen <heinzm@xxxxxxxxxx>
> > 
> >  Documentation/device-mapper/throttle.txt |   68 ++++++
> >  drivers/md/Kconfig                       |    8 +
> >  drivers/md/Makefile                      |    1 +
> >  drivers/md/dm-throttle.c                 |  389 ++++++++++++++++++++++++++++++
> >  4 files changed, 466 insertions(+), 0 deletions(-)
> > 
> > diff --git a/Documentation/device-mapper/throttle.txt b/Documentation/device-mapper/throttle.txt
> > new file mode 100644
> > index 0000000..9deea6e
> > --- /dev/null
> > +++ b/Documentation/device-mapper/throttle.txt
> > @@ -0,0 +1,68 @@
> > +dm-throttle
> > +===========
> > +
> > +Device-Mapper's "throttle" target maps a linear range of the Device-Mapper
> > +device onto a linear range of another device providing the option to throttle
> > +read and write ios separately.
> > +
> > +This target provides the ability to simulate low bandwidth transports to
> > +devices or different throughput to separate address segments of a device.
> > +
> > +Parameters: <#variable params> <read kbs> <write kbs> <dev path> <offset>
> > +    <#variable params> number of variable parameters to set read and
> > +                       write throttling kilobytes per second limits.
> > +                       Range: 0 - 2 with
> > +                       0 = no throttling,
> > +                       1 and <read kbs> = read throttling only and
> > +                       2 and <read kbs> <write kbs> = read and write throttling.
> > +    <read kbs> read kilobytes per second limit
> > +    <write kbs> write kilobytes per second limit
> > +    <dev path>: Full pathname to the underlying block-device, or a
> > +                "major:minor" device-number.
> > +    <offset>: Starting sector within the device.
> > +
> > +Throttling read and write values can be adjusted through the constructor
> > +by reloading a mapping table with the respective parameters or without
> > +reloading through the message interface:
> > +
> > +dmsetup message <mapped device name> <offset> read_kbs <read kbs>
> > +dmsetup message <mapped device name> <offset> write_kbs <write kbs>
> > +
> > +The target provides status information via its status interface:
> > +
> > +dmsetup status <mapped device name>
> > +
> > +Output includes the target version, the actual read and write kilobytes
> > +per second limits used, how many read and write ios have been processed,
> > +deferred and accounted for.
> > +
> > +Status can be reset without reloading the mapping table via the message
> > +interface as well:
> > +
> > +dmsetup message <mapped device name> <offset> stats reset
> > +
> > +
> > +Example scripts
> > +===============
> > +[[
> > +#!/bin/sh
> > +# Create an identity mapping for a device
> > +# setting 1MB/s read and write throttling
> > +echo "0 `blockdev --getsize $1` throttle 2 1024 1024 $1 0" | \
> > +dmsetup create throttle_identity
> > +]]
> > +
> > +[[
> > +#!/bin/sh
> > +# Set different throughput to first and second half of a device
> > +let size=`blockdev --getsize $1`/2
> > +echo "0 $size throttle 2 10480 8192 $1 0
> > +$size $size throttle 2 2048 1024 $1 $size" | \
> > +dmsetup create throttle_segmented
> > +]]
> > +
> > +[[
> > +#!/bin/sh
> > +# Change read throughput on 2nd segment of previous segmented mapping
> > +dmsetup message throttle_segmented $size read_kbs 4096
> > +]]
> > diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
> > index 4a6feac..9c3cbe0 100644
> > --- a/drivers/md/Kconfig
> > +++ b/drivers/md/Kconfig
> > @@ -313,6 +313,14 @@ config DM_DELAY
> >  
> >  	  If unsure, say N.
> >  
> > +config DM_THROTTLE
> > +	tristate "Throttling target (EXPERIMENTAL)"
> > +	depends on BLK_DEV_DM && EXPERIMENTAL
> > +	---help---
> > +
> > +	A target that supports device throughput throttling
> > +	with bandwidth selection for reads and writes.
> > +
> >  config DM_UEVENT
> >  	bool "DM uevents (EXPERIMENTAL)"
> >  	depends on BLK_DEV_DM && EXPERIMENTAL
> > diff --git a/drivers/md/Makefile b/drivers/md/Makefile
> > index e355e7f..6ea2598 100644
> > --- a/drivers/md/Makefile
> > +++ b/drivers/md/Makefile
> > @@ -37,6 +37,7 @@ obj-$(CONFIG_BLK_DEV_MD) += md-mod.o
> >  obj-$(CONFIG_BLK_DEV_DM) += dm-mod.o
> >  obj-$(CONFIG_DM_CRYPT) += dm-crypt.o
> >  obj-$(CONFIG_DM_DELAY) += dm-delay.o
> > +obj-$(CONFIG_DM_THROTTLE) += dm-throttle.o
> >  obj-$(CONFIG_DM_MULTIPATH) += dm-multipath.o dm-round-robin.o
> >  obj-$(CONFIG_DM_MULTIPATH_QL) += dm-queue-length.o
> >  obj-$(CONFIG_DM_MULTIPATH_ST) += dm-service-time.o
> > diff --git a/drivers/md/dm-throttle.c b/drivers/md/dm-throttle.c
> > new file mode 100644
> > index 0000000..bc000d0
> > --- /dev/null
> > +++ b/drivers/md/dm-throttle.c
> > @@ -0,0 +1,389 @@
> > +/*
> > + * Copyright (C) 2010 Red Hat GmbH
> > + *
> > + * Module Author: Heinz Mauelshagen <heinzm@xxxxxxxxxx>
> > + *
> > + * This file is released under the GPL.
> > + *
> > + * Test target to stack on top of arbitrary other block
> > + * device to throttle io in units of kilobytes per second.
> > + *
> > + * Throttling is configurable separately for reads and writes
> > + * via the constructor and the message interfaces.
> > + */
> > +
> > +#include "dm.h"
> > +#include <linux/slab.h>
> > +
> > +static const char *version = "1.0";
> > +
> > +#define	DM_MSG_PREFIX	"dm-throttle"
> > +#define	TI_ERR_RET(str, ret) \
> > +	do { ti->error = DM_MSG_PREFIX ": " str; return ret; } while (0);
> > +#define	TI_ERR(str)	TI_ERR_RET(str, -EINVAL)
> > +
> > +/* Statistics for target status output (see throttle_status()). */
> > +struct stats {
> > +	atomic_t accounted[2];
> > +	atomic_t deferred_io[2];
> > +	atomic_t io[2];
> > +};
> > +
> > +/* Reset statistics variables. */
> > +static void stats_reset(struct stats *stats)
> > +{
> > +	int i = 2;
> > +
> > +	while (i--) {
> > +		atomic_set(&stats->accounted[i], 0);
> > +		atomic_set(&stats->deferred_io[i], 0);
> > +		atomic_set(&stats->io[i], 0);
> > +	}
> > +}
> > +
> > +/* Throttle context. */
> > +struct throttle_c {
> > +	/* Device to throttle. */
> > +	struct {
> > +		struct dm_dev *dev;
> > +		sector_t start;
> > +	} dev;
> > +
> > +	/* ctr parameters. */
> > +	struct params {
> > +		unsigned bs[2]; /* Bytes per second. */
> > +		unsigned kbs_ctr[2]; /* To save kb/s constructor args. */
> > +		unsigned params; /* # of variable parameters. */
> > +	} params;
> > +
> > +	struct account {
> > +		/* Accounting for reads and writes. */
> > +		struct ac_rw {
> > +			struct mutex mutex;
> > +
> > +			unsigned long end_jiffies;
> > +			unsigned size;
> > +		} rw[2];
> > +
> > +		unsigned long flags;
> > +	} account;
> > +
> > +	struct stats stats;
> > +};
> > +
> > +/* Return bytes/s value for kilobytes/s. */
> > +static inline unsigned to_bs(unsigned kbs)
> > +{
> > +	return kbs << 10;
> > +}
> > +
> > +static inline unsigned to_kbs(unsigned bs)
> > +{
> > +	return bs >> 10;
> > +}
> > +
> > +/* Reset account. */
> > +static void account_reset(int rw, struct throttle_c *tc)
> > +{
> > +	struct account *ac = &tc->account;
> > +	struct ac_rw *ac_rw = ac->rw + rw;
> > +
> > +	ac_rw->size = 0;
> > +	ac_rw->end_jiffies = jiffies + HZ;
> > +	clear_bit(rw, &ac->flags);
> > +	smp_wmb();
> > +}
> > +
> > +/* Decide about throttling (ie. deferring bios). */
> > +static int throttle(struct throttle_c *tc, struct bio *bio)
> > +{
> > +	int rw = (bio_data_dir(bio) == WRITE);
> > +	unsigned bps; /* Bytes per second. */
> > +
> > +	smp_rmb();
> > +	bps = tc->params.bs[rw];
> > +	if (bps) {
> > +		unsigned size;
> > +		struct account *ac = &tc->account;
> > +		struct ac_rw *ac_rw = ac->rw + rw;
> > +
> > +		if (time_after(jiffies, ac_rw->end_jiffies))
> > +			/* Measure time exceeded. */
> > +			account_reset(rw, tc);
> > +		else if (test_bit(rw, &ac->flags))
> > +			/* In case we're throttled already. */
> > +			return 1;
> > +
> > +		/* Account I/O size. */
> > +		size = ac_rw->size + bio->bi_size;
> > +		if (size > bps) {
> > +			/* Hit kilobytes per second threshold. */
> > +			set_bit(rw, &ac->flags);
> > +			return 1;
> > +		} else {
> > +			ac_rw->size = size;
> > +			smp_wmb();
> > +		}
> > +
> > +		atomic_inc(tc->stats.accounted + rw); /* Statistics. */
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Destruct a throttle mapping.
> > + */
> > +static void throttle_dtr(struct dm_target *ti)
> > +{
> > +	struct throttle_c *tc = ti->private;
> > +
> > +	if (tc->dev.dev)
> > +		dm_put_device(ti, tc->dev.dev);
> > +
> > +	kfree(tc);
> > +}
> > +
> > +/* Check @arg to be >= @min && <= @max. */
> > +static inline int range_ok(int arg, int min, int max)
> > +{
> > +	return !(arg < min || arg > max);
> > +}
> > +
> > +/* Return "write" or "read" string for @write */
> > +static const char *rw_str(int write)
> > +{
> > +	return write ? "write" : "read";
> > +}
> > +
> > +/*
> > + * Construct a throttle mapping:
> > + *
> > + * <start> <len> throttle \
> > + * #throttle_params <throttle_params> \
> > + * orig_dev_name orig_dev_start
> > + *
> > + * #throttle_params = 0 - 2
> > + * throttle_params = [read_kbs [write_kbs]]
> > + *
> > + */
> > +static int throttle_ctr(struct dm_target *ti, unsigned argc, char **argv)
> > +{
> > +	int i, kbs[] = { 0, 0 }, r, throttle_params;
> > +	unsigned long long tmp;
> > +	sector_t start;
> > +	struct throttle_c *tc;
> > +	struct params *params;
> > +
> > +	if (!range_ok(argc, 3, 5))
> > +		TI_ERR("Invalid argument count");
> > +
> > +	/* Get #throttle_params. */
> > +	if (sscanf(argv[0], "%d", &throttle_params) != 1 ||
> > +	    !range_ok(throttle_params, 0, 2))
> > +		TI_ERR("Invalid throttle parameter number argument");
> > +
> > +	/* Handle any variable throttle parameters. */
> > +	for (i = 0; i < throttle_params; i++) {
> > +		/* Get throttle read/write kilobytes per second. */
> > +		if (sscanf(argv[i + 1], "%d", kbs + i) != 1 || kbs[i] < 0) {
> > +			static char msg[60];
> > +
> > +			snprintf(msg, sizeof(msg),
> > +				 "Invalid throttle %s kilobytes per second",
> > +				 rw_str(i));
> > +			ti->error = msg;
> > +			return -EINVAL;
> > +		}
> > +	}
> > +
> > +	if (sscanf(argv[2 + throttle_params], "%llu", &tmp) != 1)
> > +		TI_ERR("Invalid throttle device offset");
> > +
> > +	start = tmp;
> > +
> > +	/* Allocate throttle context. */
> > +	tc = ti->private = kzalloc(sizeof(*tc), GFP_KERNEL);
> > +	if (!tc)
> > +		TI_ERR_RET("Cannot allocate throttle context", -ENOMEM);
> > +
> > +	/* Acquire throttle device. */
> > +	r = dm_get_device(ti, argv[1 + throttle_params],
> > +			  dm_table_get_mode(ti->table), &tc->dev.dev);
> > +	if (r) {
> > +		DMERR("Throttle device lookup failed");
> > +		goto err;
> > +	}
> > +
> > +	/* Check throttled device length. */
> > +	if (ti->len >
> > +	    i_size_read(tc->dev.dev->bdev->bd_inode) >> SECTOR_SHIFT) {
> > +		DMERR("Throttled device too small for mapping");
> > +		goto err;
> > +	}
> > +
> > +	tc->dev.start = start;
> > +	params = &tc->params;
> > +	params->params = throttle_params;
> > +
> > +	i = ARRAY_SIZE(kbs);
> > +	while (i--) {
> > +		params->kbs_ctr[i] = kbs[i];
> > +		params->bs[i] = to_bs(kbs[i]);
> > +		mutex_init(&tc->account.rw[i].mutex);
> > +	}
> > +
> > +	stats_reset(&tc->stats);
> > +	return 0;
> > +err:
> > +	throttle_dtr(ti);
> > +	return -EINVAL;
> > +}
> > +
> > +/* Map a throttle io. */
> > +static int throttle_map(struct dm_target *ti, struct bio *bio,
> > +			union map_info *map_context)
> > +{
> > +	int r, rw = (bio_data_dir(bio) == WRITE);
> > +	struct throttle_c *tc = ti->private;
> > +	struct ac_rw *ac_rw = tc->account.rw + rw;
> > +
> > +	mutex_lock(&ac_rw->mutex);
> > +	do {
> > +		r = throttle(tc, bio);
> > +		if (r) {
> > +			long end = ac_rw->end_jiffies, j = jiffies;
> > +
> > +			/* Wait till next second when KB/s reached. */
> > +			if (j < end)
> > +				schedule_timeout_uninterruptible(end - j);
> > +		}
> > +	} while (r);
> > +
> > +	mutex_unlock(&ac_rw->mutex);
> > +
> > +	/* Remap. */
> > +	bio->bi_bdev = tc->dev.dev->bdev;
> > +	bio->bi_sector = bio->bi_sector - ti->begin + tc->dev.start;
> > +
> > +	atomic_inc(&tc->stats.io[rw]); /* Statistics */
> > +	return 1; /* Done with the bio; let dm core submit it. */
> > +}
> > +
> > +/* Message method. */
> > +static int throttle_message(struct dm_target *ti, unsigned argc, char **argv)
> > +{
> > +	int kbs, rw;
> > +	struct throttle_c *tc = ti->private;
> > +
> > +	if (argc == 2) {
> > +		if (!strcmp(argv[0], "stats") &&
> > +		    !strcmp(argv[1], "reset")) {
> > +			/* Reset statistics. */
> > +			stats_reset(&tc->stats);
> > +			goto out;
> > +		} else if (!strcmp(argv[0], "read_kbs"))
> > +			/* Adjust read kilobytes per second. */
> > +			rw = 0;
> > +		else if (!strcmp(argv[0], "write_kbs"))
> > +			/* Adjust write kilobytes per second. */
> > +			rw = 1;
> > +		else
> > +			goto err;
> > +
> > +		/* Read r/w kbs parameter. */
> > +		if (sscanf(argv[1], "%d", &kbs) != 1 || kbs < 0) {
> > +			DMWARN("Unrecognised throttle %s_kbs parameter.",
> > +			       rw_str(rw));
> > +			return -EINVAL;
> > +		}
> > +
> > +		/* Update settings. */
> > +		mutex_lock(&tc->account.rw[rw].mutex);
> > +		tc->params.bs[rw] = to_bs(kbs);
> > +		account_reset(rw, tc);
> > +		mutex_unlock(&tc->account.rw[rw].mutex);
> > +out:
> > +		return 0;
> > +	}
> > +err:
> > +	DMWARN("Unrecognised throttle message received.");
> > +	return -EINVAL;
> > +}
> > +
> > +/* Status output method. */
> > +static int throttle_status(struct dm_target *ti, status_type_t type,
> > +			   char *result, unsigned maxlen)
> > +{
> > +	ssize_t sz = 0;
> > +	struct throttle_c *tc = ti->private;
> > +	struct stats *s = &tc->stats;
> > +	struct params *p = &tc->params;
> > +
> > +	switch (type) {
> > +	case STATUSTYPE_INFO:
> > +		DMEMIT("v=%s rkb=%u wkb=%u r=%u w=%u rd=%u wd=%u "
> > +		       "acr=%u acw=%u",
> > +		       version,
> > +		       to_kbs(p->bs[0]), to_kbs(p->bs[1]),
> > +		       atomic_read(s->io), atomic_read(s->io + 1),
> > +		       atomic_read(s->deferred_io),
> > +		       atomic_read(s->deferred_io + 1),
> > +		       atomic_read(s->accounted),
> > +		       atomic_read(s->accounted + 1));
> > +		break;
> > +
> > +	case STATUSTYPE_TABLE:
> > +		DMEMIT("%u", p->params);
> > +
> > +		if (p->params) {
> > +			DMEMIT(" %u", p->kbs_ctr[0]);
> > +
> > +			if (p->params > 1)
> > +				DMEMIT(" %u", p->kbs_ctr[1]);
> > +		}
> > +
> > +		DMEMIT(" %s %llu",
> > +		       tc->dev.dev->name,
> > +		       (unsigned long long) tc->dev.start);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static struct target_type throttle_target = {
> > +	.name		= "throttle",
> > +	.version	= {1, 0, 0},
> > +	.module		= THIS_MODULE,
> > +	.ctr		= throttle_ctr,
> > +	.dtr		= throttle_dtr,
> > +	.map		= throttle_map,
> > +	.message	= throttle_message,
> > +	.status		= throttle_status,
> > +};
> > +
> > +int __init dm_throttle_init(void)
> > +{
> > +	int r = dm_register_target(&throttle_target);
> > +
> > +	if (r)
> > +		DMERR("Failed to register %s [%d]", DM_MSG_PREFIX, r);
> > +	else
> > +		DMINFO("registered %s %s", DM_MSG_PREFIX, version);
> > +
> > +	return r;
> > +}
> > +
> > +void dm_throttle_exit(void)
> > +{
> > +	dm_unregister_target(&throttle_target);
> > +	DMINFO("unregistered %s %s", DM_MSG_PREFIX, version);
> > +}
> > +
> > +/* Module hooks */
> > +module_init(dm_throttle_init);
> > +module_exit(dm_throttle_exit);
> > +
> > +MODULE_DESCRIPTION(DM_NAME " throttle target");
> > +MODULE_AUTHOR("Heinz Mauelshagen <heinzm@xxxxxxxxxx>");
> > +MODULE_LICENSE("GPL");
> > 
> > 
> > --
> > dm-devel mailing list
> > dm-devel@xxxxxxxxxx
> > https://www.redhat.com/mailman/listinfo/dm-devel
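
For anyone who wants to play with the accounting done by throttle() and
account_reset() without loading a module, here is a rough userspace model
in plain sh. It is only a sketch under stated assumptions and not part of
the patch: time is passed in as whole seconds instead of jiffies, only one
direction is tracked, there is no locking, and the names limit, used,
window_end, throttled and throttle_check are made up for illustration.

```shell
#!/bin/sh
# Hypothetical userspace model of the one-second accounting window.
# Not kernel code; all variable and function names are illustrative.

limit=0        # byte budget per window (0 = no throttling)
used=0         # bytes accounted in the current window
window_end=0   # time (in seconds) at which the current window ends
throttled=0    # 1 once the budget was hit in the current window

# Start a new window at time $1 (mirrors account_reset()).
account_reset() {
	used=0
	window_end=$(( $1 + 1 ))
	throttled=0
}

# throttle_check NOW SIZE
# Returns 0 if an I/O of SIZE bytes may pass at time NOW,
# 1 if it has to be deferred (mirrors throttle()).
throttle_check() {
	if [ "$limit" -eq 0 ]; then
		return 0
	fi

	if [ "$1" -gt "$window_end" ]; then
		# Window expired: start a fresh one.
		account_reset "$1"
	elif [ "$throttled" -eq 1 ]; then
		# Budget already exhausted in this window.
		return 1
	fi

	if [ $(( used + $2 )) -gt "$limit" ]; then
		# This I/O would exceed the budget: defer it.
		throttled=1
		return 1
	fi

	used=$(( used + $2 ))
	return 0
}
```

With limit=1024 and a window started by "account_reset 0", two calls of
"throttle_check 0 512" pass and a third request in the same window is
deferred. It stays deferred until the clock has moved past window_end, at
which point the next call starts a new window and accounts the request
against a fresh budget.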