On Thu, 2010-01-07 at 18:18 +0800, 张宇 wrote:
> Is there any command-line example to explain how to use this patch?

Look at Documentation/device-mapper/replicator.txt and at the comments
above replicator_ctr() and _replicator_dev_ctr() in dm-repl.c for the
mapping table syntax and examples.

> I have compiled it and loaded the modules. What do the '<start> <length>'
> target parameters in both the replicator and replicator-dev targets
> mean?

The replicator target doesn't provide a direct mapping of its own, so
<length> is arbitrary and <start> shall be 0.

> How can I construct these targets?

See the documentation hints above.

> I haven't read the source code in detail till now, sorry.

You will now ;-)

Regards,
Heinz

>
> 2009/12/18 <heinzm@xxxxxxxxxx>
> From: Heinz Mauelshagen <heinzm@xxxxxxxxxx>
>
> The dm-registry module is a general purpose registry for modules.
>
> The remote replicator utilizes it to register its ringbuffer log and
> site link handlers in order to avoid duplicating registry code and logic.
>
> Signed-off-by: Heinz Mauelshagen <heinzm@xxxxxxxxxx>
> Reviewed-by: Jon Brassow <jbrassow@xxxxxxxxxx>
> Tested-by: Jon Brassow <jbrassow@xxxxxxxxxx>
> ---
>  Documentation/device-mapper/replicator.txt |  203 +++++++++++++++++++++++++
>  drivers/md/Kconfig                         |    8 +
>  drivers/md/Makefile                        |    1 +
>  drivers/md/dm-registry.c                   |  224 ++++++++++++++++++++++++++++
>  drivers/md/dm-registry.h                   |   38 +++++
>  5 files changed, 474 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/device-mapper/replicator.txt
>  create mode 100644 drivers/md/dm-registry.c
>  create mode 100644 drivers/md/dm-registry.h
>
> diff --git 2.6.33-rc1.orig/Documentation/device-mapper/replicator.txt 2.6.33-rc1/Documentation/device-mapper/replicator.txt
> new file mode 100644
> index 0000000..1d408a6
> --- /dev/null
> +++ 2.6.33-rc1/Documentation/device-mapper/replicator.txt
> @@ -0,0 +1,203 @@
> +dm-replicator
> +=============
> +
> +Device-mapper replicator is designed to enable redundant copies of
> +storage devices to be made - preferentially, to remote locations.
> +RAID1 (aka mirroring) is often used to maintain redundant copies of
> +storage for fault tolerance purposes. Unlike RAID1, which often
> +assumes similar device characteristics, dm-replicator is designed to
> +handle devices with different latency and bandwidth characteristics
> +which are often the result of the geographic disparity of multi-site
> +architectures. Simply put, you might choose RAID1 to protect from
> +a single device failure, but you would choose remote replication
> +via dm-replicator for protection against a site failure.
> +
> +dm-replicator works by first sending write requests to the "replicator
> +log". Not to be confused with the device-mapper dirty log, this
> +replicator log behaves similarly to that of a journal. Write requests
> +go to this log first and then are copied to all the replicate devices
> +at their various locations. Requests are cleared from the log once all
> +replicate devices confirm the data is received/copied. This architecture
> +allows dm-replicator to be flexible in terms of device characteristics.
> +If one device should fall behind the others - perhaps due to high latency -
> +the slack is picked up by the log. The user has a great deal of
> +flexibility in specifying to what degree a particular site is allowed to
> +fall behind - if at all.
> +
> +Device-Mapper's dm-replicator has two targets, "replicator" and
> +"replicator-dev". The "replicator" target is used to set up the
> +aforementioned log and allow the specification of site link properties.
> +Through the "replicator" target, the user might specify that writes
> +that are copied to the local site must happen synchronously (i.e. the
> +writes are complete only after they have passed through the log device
> +and have landed on the local site's disk).
> +They may also specify that
> +a remote link should asynchronously complete writes, but that the remote
> +link should never fall more than 100MB behind in terms of processing.
> +Again, the "replicator" target is used to define the replicator log and
> +the characteristics of each site link.
> +
> +The "replicator-dev" target is used to define the devices used and
> +associate them with a particular replicator log. You might think of
> +this stage in a similar way to setting up RAID1 (mirroring). You
> +define a set of devices which will be copies of each other, but
> +access the device through the mirror virtual device which takes care
> +of the copying. The user accessible replicator device is analogous
> +to the mirror virtual device, while the set of devices being copied
> +to are analogous to the mirror images (sometimes called 'legs').
> +When creating a replicator device via the "replicator-dev" target,
> +it must be associated with the replicator log (created with the
> +aforementioned "replicator" target). When each redundant device
> +is specified as part of the replicator device, it is associated with
> +a site link whose properties were defined when the "replicator"
> +target was created.
> +
> +The user can go farther than simply replicating one device. They
> +can continue to add replicator devices - associating them with a
> +particular replicator log. Writes that go through the replicator
> +log are guaranteed to have their write ordering preserved. So, if
> +you associate more than one replicator device to a particular
> +replicator log, you are preserving write ordering across multiple
> +devices. This might be useful if you had a database that spanned
> +multiple disks and write ordering must be preserved or any transaction
> +accounting scheme would be foiled.
> +(You can imagine this like
> +preserving write ordering across a number of mirrored devices, where
> +each mirror has images/legs in different geographic locations.)
> +
> +dm-replicator has a modular architecture. Future implementations for
> +the replicator log and site link modules are allowed. The current
> +replication log is a ringbuffer - utilized to store all writes being
> +subject to replication and enforce write ordering. The current site
> +link code is based on accessing block devices (iSCSI, FC, etc) and
> +does device recovery including (initial) resynchronization.
> +
> +
> +Picture of a 2 site configuration with 3 local devices (LDs) in a
> +primary site being resynchronized to 3 remote sites with 3 remote
> +devices (RDs) each via site links (SLINK) 1-2 with site link 0
> +as a special case to handle the local devices:
> +
> +                                   |
> +    Local (primary) site           |      Remote sites
> +    --------------------           |      ------------
> +                                   |
> +    D1  D2      Dn                 |
> +     |   |       |                 |
> +     +---+- ... -+                 |
> +         |                         |
> +      REPLOG-----------------+- SLINK1 ------------+
> +         |                   |     |               |
> +      SLINK0 (special case)  |     |               |
> +         |                   |     |               |
> +    +-----+   ...  +         |     |  +----+- ... -+
> +    |     |        |         |     |  |    |       |
> +   LD1   LD2      LDn        |     | RD1  RD2     RDn
> +                             |     |
> +                             +-- SLINK2------------+
> +                             |     |               |
> +                             |     |  +----+- ... -+
> +                             |     |  |    |       |
> +                             |     | RD1  RD2     RDn
> +                             |     |
> +                             |     |
> +                             |     |
> +                             +- SLINKm ------------+
> +                                   |               |
> +                                   |  +----+- ... -+
> +                                   |  |    |       |
> +                                   | RD1  RD2     RDn
> +
> +
> +
> +
> +The following are descriptions of the device-mapper tables used to
> +construct the "replicator" and "replicator-dev" targets.
> +
> +"replicator" target parameters:
> +-------------------------------
> +<start> <length> replicator \
> +        <replog_type> <#replog_params> <replog_params> \
> +        [<slink_type_0> <#slink_params_0> <slink_params_0>]{1..N}
> +
> +<replog_type>    = "ringbuffer" is currently the only available type
> +<#replog_params> = # of args following this one intended for the replog (2 or 4)
> +<replog_params>  = <dev_path> <dev_start> [auto/create/open <size>]
> +        <dev_path>  = device path of replication log (REPLOG) backing store
> +        <dev_start> = offset to REPLOG header
> +        create      = The replication log will be initialized if not active
> +                      and sized to "size". (If already active, the create
> +                      will fail.) Size is always in sectors.
> +        open        = The replication log must be initialized and valid or
> +                      the constructor will fail.
> +        auto        = If a valid replication log header is found on the
> +                      replication device, this will behave like 'open'.
> +                      Otherwise, this option behaves like 'create'.
> +
> +<slink_type>    = "blockdev" is currently the only available type
> +<#slink_params> = 1/2/4
> +<slink_params>  = <slink_nr> [<slink_policy> [<fall_behind> <N>]]
> +        <slink_nr>     = This is a unique number that is used to identify a
> +                         particular site/location. '0' is always used to
> +                         identify the local site, while increasing integers
> +                         are used to identify remote sites.
> +        <slink_policy> = The policy can be either 'sync' or 'async'.
> +                         'sync' means write requests will not return until
> +                         the data is on the storage device. 'async' allows
> +                         a device to "fall behind"; that is, outstanding
> +                         write requests are waiting in the replication log
> +                         to be processed for this site, but it is not delaying
> +                         the writes of other sites.
> +        <fall_behind>  = This field is used to specify how far the user is
> +                         willing to allow write requests to this specific site
> +                         to "fall behind" in processing before switching to
> +                         a 'sync' policy.
> +                         This "fall behind" threshold can
> +                         be specified in three ways: ios, size, or timeout.
> +                         'ios' is the number of pending I/Os allowed (e.g.
> +                         "ios 10000"). 'size' is the amount of pending data
> +                         allowed (e.g. "size 200m"). Size labels include:
> +                         s (sectors), k, m, g, t, p, and e. 'timeout' is
> +                         the amount of time allowed for writes to be
> +                         outstanding. Time labels include: s, m, h, and d.
> +
> +
> +"replicator-dev" target parameters:
> +-----------------------------------
> +<start> <length> replicator-dev \
> +        <replicator_device> <dev_nr> \
> +        [<slink_nr> <#dev_params> <dev_params> \
> +         <dlog_type> <#dlog_params> <dlog_params>]{1..N}
> +
> +<replicator_device> = device previously constructed via "replicator" target
> +<dev_nr>            = An integer that is used to 'tag' write requests as
> +                      belonging to a particular set of devices - specifically,
> +                      the devices that follow this argument (i.e. the site
> +                      link devices).
> +<slink_nr>          = This number identifies the site/location where the next
> +                      device to be specified comes from. It is exactly the
> +                      same number used to identify the site/location (and its
> +                      policies) in the "replicator" target. Interestingly,
> +                      while one might normally expect a "dev_type" argument
> +                      here, it can be deduced from the site link number and
> +                      the 'slink_type' given in the "replicator" target.
> +<#dev_params>       = '1' (The number of allowed parameters actually depends
> +                      on the 'slink_type' given in the "replicator" target.
> +                      Since our only option there is "blockdev", the only
> +                      allowable number here is currently '1'.)
> +<dev_params>        = 'dev_path' (Again, since "blockdev" is the only
> +                      'slink_type' available, the only allowable argument here
> +                      is the path to the device.)
> +<dlog_type>         = Not to be confused with the "replicator log", this is
> +                      the type of dirty log associated with this particular
> +                      device.
> +                      Dirty logs are used for synchronization, during
> +                      initialization or fall behind conditions, to bring a
> +                      device into a coherent state with its peers - analogous
> +                      to rebuilding a RAID1 (mirror) device. Available dirty
> +                      log types include: 'nolog', 'core', and 'disk'
> +<#dlog_params>      = The number of arguments required for a particular log
> +                      type - 'nolog' = 0, 'core' = 1/2, 'disk' = 2/3.
> +<dlog_params>       = 'nolog' => ~no arguments~
> +                      'core'  => <region_size> [sync | nosync]
> +                      'disk'  => <dlog_dev_path> <region_size> [sync | nosync]
> +        <region_size>   = This sets the granularity at which the dirty log
> +                          tracks which areas of the device are in sync.
> +        [sync | nosync] = Optionally specify whether the sync should be forced
> +                          or avoided initially.
> diff --git 2.6.33-rc1.orig/drivers/md/Kconfig 2.6.33-rc1/drivers/md/Kconfig
> index acb3a4e..62c9766 100644
> --- 2.6.33-rc1.orig/drivers/md/Kconfig
> +++ 2.6.33-rc1/drivers/md/Kconfig
> @@ -313,6 +313,14 @@ config DM_DELAY
>
>  	  If unsure, say N.
>
> +config DM_REPLICATOR
> +	tristate "Replication target (EXPERIMENTAL)"
> +	depends on BLK_DEV_DM && EXPERIMENTAL
> +	---help---
> +	  A target that supports replication of local devices to remote sites.
> +
> +	  If unsure, say N.
> +
>  config DM_UEVENT
>  	bool "DM uevents (EXPERIMENTAL)"
>  	depends on BLK_DEV_DM && EXPERIMENTAL
> diff --git 2.6.33-rc1.orig/drivers/md/Makefile 2.6.33-rc1/drivers/md/Makefile
> index e355e7f..be05b39 100644
> --- 2.6.33-rc1.orig/drivers/md/Makefile
> +++ 2.6.33-rc1/drivers/md/Makefile
> @@ -44,6 +44,7 @@ obj-$(CONFIG_DM_SNAPSHOT)	+= dm-snapshot.o
>  obj-$(CONFIG_DM_MIRROR)	+= dm-mirror.o dm-log.o dm-region-hash.o
>  obj-$(CONFIG_DM_LOG_USERSPACE) += dm-log-userspace.o
>  obj-$(CONFIG_DM_ZERO)		+= dm-zero.o
> +obj-$(CONFIG_DM_REPLICATOR)	+= dm-log.o dm-registry.o
>
>  quiet_cmd_unroll = UNROLL  $@
>        cmd_unroll = $(AWK) -f$(srctree)/$(src)/unroll.awk -vN=$(UNROLL) \
> diff --git 2.6.33-rc1.orig/drivers/md/dm-registry.c 2.6.33-rc1/drivers/md/dm-registry.c
> new file mode 100644
> index 0000000..fb8abbf
> --- /dev/null
> +++ 2.6.33-rc1/drivers/md/dm-registry.c
> @@ -0,0 +1,224 @@
> +/*
> + * Copyright (C) 2009 Red Hat, Inc. All rights reserved.
> + *
> + * Module Author: Heinz Mauelshagen (heinzm@xxxxxxxxxx)
> + *
> + * Generic registry for arbitrary structures
> + * (needs dm_registry_type structure upfront each registered structure).
> + *
> + * This file is released under the GPL.
> + *
> + * FIXME: use as registry for e.g. dirty log types as well.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/moduleparam.h>
> +
> +#include "dm-registry.h"
> +
> +#define	DM_MSG_PREFIX	"dm-registry"
> +
> +static const char *version = "0.001";
> +
> +/* Sizable class registry. */
> +static unsigned num_classes;
> +static struct list_head *_classes;
> +static rwlock_t *_locks;
> +
> +void *
> +dm_get_type(const char *type_name, enum dm_registry_class class)
> +{
> +	struct dm_registry_type *t;
> +
> +	read_lock(_locks + class);
> +	list_for_each_entry(t, _classes + class, list) {
> +		if (!strcmp(type_name, t->name)) {
> +			if (!t->use_count && !
> +			    try_module_get(t->module)) {
> +				read_unlock(_locks + class);
> +				return ERR_PTR(-ENOMEM);
> +			}
> +
> +			t->use_count++;
> +			read_unlock(_locks + class);
> +			return t;
> +		}
> +	}
> +
> +	read_unlock(_locks + class);
> +	return ERR_PTR(-ENOENT);
> +}
> +EXPORT_SYMBOL(dm_get_type);
> +
> +void
> +dm_put_type(void *type, enum dm_registry_class class)
> +{
> +	struct dm_registry_type *t = type;
> +
> +	read_lock(_locks + class);
> +	if (!--t->use_count)
> +		module_put(t->module);
> +
> +	read_unlock(_locks + class);
> +}
> +EXPORT_SYMBOL(dm_put_type);
> +
> +/* Add a type to the registry. */
> +int
> +dm_register_type(void *type, enum dm_registry_class class)
> +{
> +	struct dm_registry_type *t = type, *tt;
> +
> +	if (unlikely(class >= num_classes))
> +		return -EINVAL;
> +
> +	tt = dm_get_type(t->name, class);
> +	if (unlikely(!IS_ERR(tt))) {
> +		dm_put_type(t, class);
> +		return -EEXIST;
> +	}
> +
> +	write_lock(_locks + class);
> +	t->use_count = 0;
> +	list_add(&t->list, _classes + class);
> +	write_unlock(_locks + class);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(dm_register_type);
> +
> +/* Remove a type from the registry. */
> +int
> +dm_unregister_type(void *type, enum dm_registry_class class)
> +{
> +	struct dm_registry_type *t = type;
> +
> +	if (unlikely(class >= num_classes)) {
> +		DMERR("Attempt to unregister invalid class");
> +		return -EINVAL;
> +	}
> +
> +	write_lock(_locks + class);
> +
> +	if (unlikely(t->use_count)) {
> +		write_unlock(_locks + class);
> +		DMWARN("Attempt to unregister a type that is still in use");
> +		return -ETXTBSY;
> +	} else
> +		list_del(&t->list);
> +
> +	write_unlock(_locks + class);
> +	return 0;
> +}
> +EXPORT_SYMBOL(dm_unregister_type);
> +
> +/*
> + * Return kmalloc'ed NULL terminated pointer
> + * array of all type names of the given class.
> + *
> + * Caller has to kfree the array!.
> + */
> +const char **dm_types_list(enum dm_registry_class class)
> +{
> +	unsigned i = 0, count = 0;
> +	const char **r;
> +	struct dm_registry_type *t;
> +
> +	/* First count the registered types in the class. */
> +	read_lock(_locks + class);
> +	list_for_each_entry(t, _classes + class, list)
> +		count++;
> +	read_unlock(_locks + class);
> +
> +	/* None registered in this class. */
> +	if (!count)
> +		return NULL;
> +
> +	/* One member more for array NULL termination. */
> +	r = kzalloc((count + 1) * sizeof(*r), GFP_KERNEL);
> +	if (!r)
> +		return ERR_PTR(-ENOMEM);
> +
> +	/*
> +	 * Go with the counted ones.
> +	 * Any new added ones after we counted will be ignored!
> +	 */
> +	read_lock(_locks + class);
> +	list_for_each_entry(t, _classes + class, list) {
> +		r[i++] = t->name;
> +		if (!--count)
> +			break;
> +	}
> +	read_unlock(_locks + class);
> +
> +	return r;
> +}
> +EXPORT_SYMBOL(dm_types_list);
> +
> +int __init
> +dm_registry_init(void)
> +{
> +	unsigned n;
> +
> +	BUG_ON(_classes);
> +	BUG_ON(_locks);
> +
> +	/* Module parameter given ?
> +	 */
> +	if (!num_classes)
> +		num_classes = DM_REGISTRY_CLASS_END;
> +
> +	n = num_classes;
> +	_classes = kmalloc(n * sizeof(*_classes), GFP_KERNEL);
> +	if (!_classes) {
> +		DMERR("Failed to allocate classes registry");
> +		return -ENOMEM;
> +	}
> +
> +	_locks = kmalloc(n * sizeof(*_locks), GFP_KERNEL);
> +	if (!_locks) {
> +		DMERR("Failed to allocate classes locks");
> +		kfree(_classes);
> +		_classes = NULL;
> +		return -ENOMEM;
> +	}
> +
> +	while (n--) {
> +		INIT_LIST_HEAD(_classes + n);
> +		rwlock_init(_locks + n);
> +	}
> +
> +	DMINFO("initialized %s for max %u classes", version, num_classes);
> +	return 0;
> +}
> +
> +void __exit
> +dm_registry_exit(void)
> +{
> +	BUG_ON(!_classes);
> +	BUG_ON(!_locks);
> +
> +	kfree(_classes);
> +	_classes = NULL;
> +	kfree(_locks);
> +	_locks = NULL;
> +	DMINFO("exit %s", version);
> +}
> +
> +/* Module hooks */
> +module_init(dm_registry_init);
> +module_exit(dm_registry_exit);
> +module_param(num_classes, uint, 0);
> +MODULE_PARM_DESC(num_classes, "Maximum number of classes");
> +MODULE_DESCRIPTION(DM_NAME "device-mapper registry");
> +MODULE_AUTHOR("Heinz Mauelshagen <heinzm@xxxxxxxxxx>");
> +MODULE_LICENSE("GPL");
> +
> +#ifndef MODULE
> +static int __init num_classes_setup(char *str)
> +{
> +	num_classes = simple_strtol(str, NULL, 0);
> +	return num_classes ? 1 : 0;
> +}
> +
> +__setup("num_classes=", num_classes_setup);
> +#endif
> diff --git 2.6.33-rc1.orig/drivers/md/dm-registry.h 2.6.33-rc1/drivers/md/dm-registry.h
> new file mode 100644
> index 0000000..1cb0ce8
> --- /dev/null
> +++ 2.6.33-rc1/drivers/md/dm-registry.h
> @@ -0,0 +1,38 @@
> +/*
> + * Copyright (C) 2009 Red Hat, Inc. All rights reserved.
> + *
> + * Module Author: Heinz Mauelshagen (heinzm@xxxxxxxxxx)
> + *
> + * Generic registry for arbitrary structures.
> + * (needs dm_registry_type structure upfront each registered structure).
> + *
> + * This file is released under the GPL.
> + */
> +
> +#include "dm.h"
> +
> +#ifndef DM_REGISTRY_H
> +#define DM_REGISTRY_H
> +
> +enum dm_registry_class {
> +	DM_REPLOG = 0,
> +	DM_SLINK,
> +	DM_LOG,
> +	DM_REGION_HASH,
> +	DM_REGISTRY_CLASS_END,
> +};
> +
> +struct dm_registry_type {
> +	struct list_head list;	/* Linked list of types in this class. */
> +	const char *name;
> +	struct module *module;
> +	unsigned int use_count;
> +};
> +
> +void *dm_get_type(const char *type_name, enum dm_registry_class class);
> +void dm_put_type(void *type, enum dm_registry_class class);
> +int dm_register_type(void *type, enum dm_registry_class class);
> +int dm_unregister_type(void *type, enum dm_registry_class class);
> +const char **dm_types_list(enum dm_registry_class class);
> +
> +#endif
> --
> 1.6.2.5
>
> --
> dm-devel mailing list
> dm-devel@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/dm-devel

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel
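
For the command-line question at the top of this thread, here is a minimal
sketch of how the two tables might be assembled from the syntax described in
replicator.txt, with one local and one remote device. It is untested and
derived only from the parameter descriptions in the patch. The device paths,
mapping names, sizes, the region size of 1024 and the use of
/dev/mapper/<name> as <replicator_device> are all assumptions; the comments
above replicator_ctr() and _replicator_dev_ctr() in dm-repl.c remain the
authoritative reference. By default the script only prints the dmsetup
commands; set APPLY=1 to actually run them.

```shell
#!/bin/sh
# Sketch only: table strings assembled from the syntax in replicator.txt.
# Every path, name and size below is a hypothetical placeholder.

REPLOG_DEV=/dev/mapper/replog_store  # backing store for the ringbuffer log
LOCAL_DEV=/dev/mapper/local_data     # device at the local site (site link 0)
REMOTE_DEV=/dev/mapper/remote_data   # device at the remote site (site link 1)
REPLOG_SECTORS=2097152               # replication log size in sectors
DATA_SECTORS=4194304                 # length of the replicated device in sectors
REPL_NAME=replicator                 # name of the "replicator" mapping
DEV_NAME=replicated_data             # name of the user-visible device

# "replicator" target: <start> is 0 and <length> is arbitrary (1 here).
#   ringbuffer 4 <dev_path> <dev_start> auto <size>
#   blockdev 2 0 sync             site link 0 (local), synchronous
#   blockdev 4 1 async ios 10000  site link 1 may fall behind by 10000 I/Os
REPLICATOR_TABLE="0 1 replicator \
ringbuffer 4 $REPLOG_DEV 0 auto $REPLOG_SECTORS \
blockdev 2 0 sync \
blockdev 4 1 async ios 10000"

# "replicator-dev" target: device number 0, one device per site link.
#   0 1 <dev_path> nolog 0             local device, no dirty log
#   1 1 <dev_path> core 2 1024 sync    remote device, core dirty log, forced sync
DEV_TABLE="0 $DATA_SECTORS replicator-dev \
/dev/mapper/$REPL_NAME 0 \
0 1 $LOCAL_DEV nolog 0 \
1 1 $REMOTE_DEV core 2 1024 sync"

# Print the command unless APPLY=1 is set, so the sketch is safe to run.
run() {
    if [ "${APPLY:-0}" = 1 ]; then
        "$@"
    else
        echo "$@"
    fi
}

run dmsetup create "$REPL_NAME" --table "$REPLICATOR_TABLE"
run dmsetup create "$DEV_NAME" --table "$DEV_TABLE"
```

The "replicator" mapping is created first because the "replicator-dev" table
refers to it. Further "replicator-dev" mappings with a different <dev_nr>
could presumably be stacked on the same replicator log to keep write ordering
across several devices, as the documentation describes.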