On Sun, Nov 20, 2016 at 12:58 PM, Edward Shishkin <edward.shishkin@xxxxxxxxx> wrote:
> On 09/25/2016 12:47 AM, Edward Shishkin wrote:
>>
>> Logical Volumes
>>
>>
>> Reiser4 will support logical (compound) volumes. For now we have
>> implemented the simplest ones - mirrors. As a supplement to the
>> existing checksums, mirrors provide failover - an important feature
>> that reduces the number of cases in which your volume needs to be
>> repaired by fsck.
>>
>> A Reiser4 subvolume is a component of a logical volume. A subvolume
>> is always associated with a physical or logical (built by means of
>> RAID, LVM, etc.) block device. Every subvolume possesses:
>>
>> . volume ID;
>> . subvolume ID;
>> . mirror ID;
>> . number of replicas.
>>
>> The mirror ID is a serial number from 0 to 65535. The subvolume with
>> mirror ID 0 has a special name - the original. The other ones are
>> called replicas. We say "original A has a replica B" (or "B
>> replicates A", which is the same) iff A and B possess the same
>> subvolume ID. An original together with all its replicas is called
>> "mirrors".
>>
>> For subvolumes we have introduced a special disk format plugin,
>> "format41". In accordance with the Reiser4 development model, this
>> means forward incompatibility. We have introduced it intentionally,
>> for protection: users must not be able to RW-mount separate replicas
>> (without their originals).
>> The multi-device extension is backward compatible: all volumes of the
>> old format (format40) are supported as logical volumes composed of
>> only one (original) subvolume.
>>
>>
>> Registration and activation of subvolumes
>>
>>
>> For now every Reiser4 logical volume has only one original subvolume.
>> The number of replicas can be 0 or more. A logical volume can be
>> mounted with the usual mount command: simply specify any of its
>> subvolumes (the original or any of its replicas). The only condition
>> is that the original and all its replicas must be registered in the
>> system.
>> If the original or any of its replicas is not registered, then mount
>> will fail with a corresponding kernel message.
>>
>> Currently there is no tool to register a specified subvolume (TBD).
>> However, the mount command always tries to register the specified
>> device. The registration policy is "sticky": your device won't be
>> unregistered after umount, or after a failed mount. (You will be
>> able to unregister it forcibly with a special tool - TBD.)
>>
>> The registration procedure reads the master super-block of the
>> subvolume and puts the subvolume header on a special list of
>> registered subvolumes.
>>
>> Mounting a logical volume activates all its registered components.
>> The activation procedure reads the format super-block of the
>> subvolume and performs other actions, such as initialization of
>> space maps, transaction replay, etc., as specified by the
>> ->init_format() method of the respective disk format plugin. A
>> pointer to an activated subvolume is placed in a special table of
>> active subvolumes.
>>
>>
>> Mirror operations
>>
>>
>> So the original and its replicas actually represent RAID1 on the
>> filesystem level.
>>
>> COMMENT. We aren't engaged in the marketing fraud of collecting all
>> features of the block layer's RAID and LVM. Reiser4 mirrors
>> implement a failover that the block layer's RAID1 is not able to
>> provide.
>>
>> It will be possible to "upgrade" or "downgrade" a reiser4 array of
>> mirrors by attaching / detaching one or more replicas online with
>> special user-space tools (mirror.reiser4, TBD). With those tools it
>> will also be possible to swap the original with any of its replicas,
>> or to make a new original from any replica if the old one is lost
>> for some reason.
>>
>> Fsck will refuse to check/repair a replica. Fsck is supposed to work
>> only with original subvolumes. After an fsck-ed original is mounted,
>> the kernel will automatically run a special online background
>> procedure (scrub) in order to synchronize the repaired original with
>> all its replicas.
>>
>> Once in a while the user has to check the array of mirrors by
>> running scrub in the background mode.
>>
>> WARNING: Bear in mind once and forever: a replica is not a backup!!!
>>
>>
>> Technical Notes
>>
>>
>> 1. The Reiser4 Transaction Design document is transferred to logical
>> volumes without any modifications, but with a small addition: an
>> atom is now composed of per-subvolume components.
>>
>> 2. By design all mirrors differ only in their mirror IDs, which are
>> stored in the master super-block. The format super-blocks of mirrors
>> are identical. This approach provides the best performance and full
>> parallelism in issuing IO requests for mirrors. The downside is a
>> small compromise in design: the master super-block doesn't
>> participate in transactions. It means that the mirror operations of
>> upgrading/downgrading/swapping cannot spawn usual transactions,
>> which can be committed and (re)played using the existing transaction
>> manager. That is, mirror operations won't survive a system crash. If
>> a system crash happens during a mirror operation, then the mirror
>> structure should be checked/fixed offline by the mirror tools (the
>> kernel will refuse to mount an unchecked array of mirrors).
>> Fortunately, all critical mirror operations issue a small number of
>> IO requests, so the probability of their interruption is close to
>> zero.
>>
>> 3. We don't commit transactions on all mirrors, only on the original
>> subvolume (this is the single functional difference between the
>> original and its replicas). Transaction (re)play, of course, goes on
>> all mirrors using the wandering maps/blocks of the original
>> subvolume.
>>
>>
>> How to test the new features
>>
>>
>> Check out the branch "format41" of the upstream reiser4 and
>> reiser4progs git repos at https://github.com/edward6 Build and
>> install as usual.
>>
>> Mirrors can be created with the mkfs.reiser4 option -m. If this
>> option is specified, then the first listed device will be the
>> original, the other ones - replicas.
>> All devices of an array should have the same size. We will lift
>> this restriction later.
>>
>> IMPORTANT: when creating mirrors, specify the node41 plugin (with
>> checksum support). Otherwise, your mirrors won't be more useful than
>> the block layer's RAID1.
>>
>> Register all your mirrors by trying to "mount" them one by one in
>> any order. If you have N mirrors (i.e. one original and N-1
>> replicas), then the first N-1 mount commands will fail. Of course,
>> this is not too graceful, but it is a temporary solution. The N-th
>> "attempt" should succeed. Have fun. Unmount as usual.
>>
>>
>> Example
>>
>>
>> Suppose we have 2 partitions, /dev/sda7 and /dev/sda8, of equal
>> size. Let's create an array of 2 mirrors:
>>
>> # mkfs.reiser4 -my -o node=node41 /dev/sda7 /dev/sda8
>>
>> Take a look at the original subvolume:
>>
>> # debugfs.reiser4 /dev/sda7
>>
>> Take a look at the replica:
>>
>> # debugfs.reiser4 /dev/sda8
>>
>> Find the differences ;)
>>
>> Register the original subvolume:
>>
>> # mount /dev/sda7 /mnt
>> mount: wrong fs type, bad option, bad superblock blablabla....
>> # dmesg
>> reiser4[mount(20914)]: check_active_replicas
>> (fs/reiser4/init_volume.c:268)[edward-1750]:
>> WARNING: /dev/sda7 requires replicas, which are not registered.
>>
>> Register the replica and mount the array:
>>
>> # mount /dev/sda8 /mnt
>> # dmesg
>>
>> reiser4: registered subvolume (/dev/sda8)
>> reiser4 (sda8): found disk format 4.0.1.
>> reiser4 (/dev/sda7): using Hybrid Transaction Model.
>>
>> Let's copy the file /etc/services to our array of mirrors:
>>
>> # cp /etc/services /mnt/.
>>
>> Unmount the array:
>>
>> # umount /mnt
>>
>> Find the root block: it comes first in the tree dump:
>>
>> # debugfs.reiser4 -t /dev/sda7
>>
>> In our case the root block has block number 79.
>>
>> Let's now take a look at how our failover works.
>> The death-defying act: we erase the root block of the original
>> subvolume:
>>
>> # dd if=/dev/zero of=/dev/sda7 bs=4096 count=1 seek=79
>>
>> We know that the mount procedure loads the root block. Let's try to
>> mount our array with the corrupted root block:
>>
>> # mount /dev/sda8 /mnt
>>
>> Everything works.
>> Take a look at the kernel messages:
>>
>> # dmesg
>> reiser4[mount(21224)]: parse_node41
>> (fs/reiser4/plugin/node/node41.c:79)[edward-1645]:
>> WARNING: block 79 (/dev/sda7): bad checksum. Please, scrub the volume.
>>
>>
>> TODO
>>
>>
>> 1) Mirror tools (upgrade/downgrade a mirror array, swap the original
>> and a specified replica, convert a replica to an original,
>> visualization of mirror arrays, etc.);
>> 2) Scrub (online background checking and synchronization of mirrors);
>> 3) Checksumming the format super-block;
>> 4) Issuing discard requests for replicas on SSD devices.
>>
>> All items are very simple to implement. If anyone cares, then I'll
>> provide details.
>>
>>
>
>
> So the latest update is that we don't need online scrub: this feature
> is inherent to badly designed file systems.
>
> Instead we provide transparent (on-the-fly) failover. That is, in the
> case of an IO error (because of the death of a device, etc.), or if
> checksum verification fails (because of bitrot, etc.), reiser4
> immediately issues IO requests against the replica devices.
>
> Thus, the latest version of the TODO list includes the following items:
>
> 1. Implementation of Mirror Tools (upgrade/downgrade/synchronize a
>    mirror array, swap the original and a specified replica, convert a
>    replica to an original, visualization of mirror arrays, etc.);
>
> 2. Checksumming the format super-block and bitmap blocks;
>
> 3. Issuing discard requests for replicas on SSD devices;
>
> 4. Testing.
>
>    a) Testing the overall stability of format41:
>       Create a mirrored volume and perform the usual stress testing
>       with fsx, stress.sh, dbench, etc.
>
>    b) Testing the failover feature:
>       Create a mirrored volume and emulate data corruption and the
>       death of devices under some workload. To emulate data
>       corruption, use dd to fill metadata blocks with zeros. To
>       emulate the death of devices, simply create one or more mirrors
>       on USB sticks and remove them during heavy IO activity.
> Both test scenarios are implemented in xfstests, and reiser4 is supported.
>
> Thanks,
> Edward.
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe reiserfs-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
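
For anyone who wants to repeat the data-corruption half of scenario (b) by hand, the commands from the Example section of the quoted mail can be strung together roughly as below. This is only a sketch under stated assumptions: the variable names, the run() helper and the DRY_RUN switch are my own and are not part of reiser4progs. The device names and root block number 79 are simply the values from the example; on a real volume the root block has to be looked up with "debugfs.reiser4 -t" after the copy step. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of the corruption test from the quoted Example section.
# ASSUMPTIONS (not from the original mail): the variable names, the
# run() helper and the DRY_RUN switch are illustrative only.
# WARNING: with DRY_RUN=0 this destroys all data on both devices.

ORIGINAL=${ORIGINAL:-/dev/sda7}   # first device listed to mkfs: the original
REPLICA=${REPLICA:-/dev/sda8}     # second device: the replica
MNT=${MNT:-/mnt}
ROOT_BLOCK=${ROOT_BLOCK:-79}      # example value; check with debugfs.reiser4 -t
DRY_RUN=${DRY_RUN:-1}             # 1: print commands only, 0: execute them

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

failover_test() {
    # Create an array of two mirrors with checksummed nodes (node41).
    run mkfs.reiser4 -my -o node=node41 "$ORIGINAL" "$REPLICA"

    # Registration: with N mirrors the first N-1 mounts are expected to
    # fail, so the failure of this first attempt is ignored on purpose.
    run mount "$ORIGINAL" "$MNT" || true
    run mount "$REPLICA" "$MNT"

    # Put some data on the array, then unmount it.
    run cp /etc/services "$MNT/."
    run umount "$MNT"

    # Zero the root block of the original subvolume.
    run dd if=/dev/zero of="$ORIGINAL" bs=4096 count=1 seek="$ROOT_BLOCK"

    # The mount should still succeed; dmesg should report a bad checksum.
    run mount "$REPLICA" "$MNT"
    run dmesg
}

failover_test
```

Run as is, it prints the eight commands in order. Setting DRY_RUN=0 (as root, on scratch partitions only) executes them for real.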