On Mon, May 25, 2020 at 6:08 PM Edward Shishkin <edward.shishkin@xxxxxxxxx> wrote:
>
> Reiser5: Data Tiering. Burst Buffers
> Speed up synchronous modifications
>
>
> Dumping peaks of IO load to a proxy device
>
>
> Now you can add a small high-performance block device to your large
> logical volume composed of relatively slow commodity disks and get
> the impression that your whole volume has throughput as high as
> that of the "proxy" device!
>
> This is based on the simple observation that in real life IO load
> comes in peaks, and the idea is to dump those peaks to a
> high-performance "proxy" device. Usually you have enough time between
> peaks to flush the proxy device, that is, to migrate the "hot data"
> from the proxy device to slow media in background mode, so that your
> proxy device is always ready to accept a new portion of "peaks".
>
> This technique, also known as "Burst Buffers", initially appeared in
> the area of HPC. Despite this, it is also important for ordinary
> applications. In particular, it speeds up those that perform
> so-called "atomic updates".
>
>
> Speed up "atomic updates" in user-space
>
>
> There is a whole class of applications with high requirements for
> data integrity. Such applications (typically databases) want to be
> sure that any data modification either completes or does not happen
> at all, and never appears as partially applied. Some applications
> have weaker requirements: with some restrictions they also accept
> partially applied modifications.
>
> Atomic updates in user space are performed via a sequence of 3 steps.
> Suppose you need to modify data of some file "foo" in an atomic way.
> For this you need to:
>
> 1. write a new temporary file "foo.tmp" with modified data
> 2. issue fsync(2) against "foo.tmp"
> 3. rename "foo.tmp" to "foo".
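
For readers who have not seen the pattern, the three steps above can be sketched in user space roughly as follows. This is only an illustration, not code from Reiser5 or reiser4progs: the function name atomic_update and the ".tmp" suffix are assumptions made for the example, and it uses nothing beyond the standard Python os module.

```python
import os


def atomic_update(path: str, data: bytes) -> None:
    """Replace the contents of `path` with `data` via write-fsync-rename.

    Hypothetical helper for illustration; `path + ".tmp"` plays the role
    of "foo.tmp" from the description above.
    """
    tmp_path = path + ".tmp"

    # Step 1: write a new temporary file with the modified data.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # may write fewer bytes than asked
            view = view[written:]

        # Step 2: fsync(2) the temporary file so its blocks reach the disk.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Step 3: rename the temporary file over the original one.
    os.replace(tmp_path, path)
```

Steps 2 and 3 are the ones that wait on the storage device, which is why the speed of the device receiving the dirty data matters for this pattern.
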
>
> At step 1 the file system populates the page cache with new data.
> At step 2 the file system allocates disk addresses for all logical
> blocks of the file foo.tmp and writes that file to disk. At step 3
> all blocks containing old data get released.
>
> Note that steps 2 and 3 cause a substantial performance drop on slow
> media. The situation improves when all dirty data are written to a
> dedicated high-performance proxy disk, which is exactly what happens
> in a file system with Burst Buffers support.
>
>
> Speed up all synchronous modifications (TODO)
> Burst Buffers and transaction manager
>
>
> Not only dirty data pages, but also dirty meta-data pages can be
> dumped to the proxy device, so that step (3) above also won't
> contribute to the performance drop.
>
> Moreover, not only new logical data blocks can be dumped to the proxy
> disk. All dirty data pages, including ones that already have a
> location on the main (slow) storage, can also be relocated to the
> proxy disk, thus speeding up synchronous modification of files in
> _all_ cases (not only in atomic updates via the write-fsync-rename
> sequence described above).
>
> Indeed, recall that any modified page is always written to disk in
> the context of committing some transaction. Depending on the commit
> strategy (there are two: "relocate" and "overwrite"), for each such
> modified dirty page there are only 2 possibilities:
>
> a) to be written right away to a new location,
> b) to be written first to a temporary location (journal), then to be
>    written back to its permanent location.
>
> With Burst Buffers support, in case (a) the file system writes the
> dirty page right away to the proxy device. The user should then take
> care to migrate it back to the permanent storage (see section
> "Flushing proxy device" below).
> In case (b) the modified copy will be written to the proxy device
> (wandering logs), then at checkpoint time (playing a transaction) the
> reiser4 transaction manager will write it to the permanent location
> (on commodity disks). In this case the user doesn't need to worry
> about flushing the proxy device; however, the commit procedure takes
> more time, as the user also has to wait for "checkpoint completion".
>
> So from the standpoint of performance the "write-anywhere"
> transaction model (reiser4 mount option "txmod=wa") is preferable to
> the journalling model (txmod=journal), or even the hybrid model
> (txmod=hybrid).
>
>
> Predictable and non-predictable migration
> Meta-data migration
>
>
> As we already mentioned, not only dirty data pages, but also dirty
> meta-data pages can be dumped to the proxy device. Note, however,
> that non-predictable meta-data migration is not possible because of
> a chicken-and-egg problem. Indeed, non-predictable migration means
> that nobody knows on which device of your logical volume a stripe of
> data will be relocated in the future. Such migration requires
> recording the locations of data stripes. Now note that such records
> are always a part of meta-data. Hence, you are not able to migrate
> meta-data in a non-predictable way.
>
> However, it is perfectly possible to distribute/migrate meta-data in
> a predictable way (it will be supported in so-called "symmetric"
> logical volumes - currently not implemented). The classic example of
> predictable migration is RAID arrays (once you add a device to, or
> remove a device from, the array, all data blocks migrate in a
> predictable way during rebalancing). If relocation is predictable,
> then there is no need to record locations of data stripes - they can
> always be calculated.
>
> Thus, non-predictable migration is applicable to data only.
>
>
> Definition of data tiering.
> Using proxy device to store hot data (TODO)
>
>
> Now we can precisely define tiering as (meta-)data relocation in
> accordance with some strategy (automatic or user-defined), so that
> every relocated unit always gets a location on another
> device-component of the logical volume.
>
> During such relocation block number B1 on device D1 gets released,
> the first address component is changed to D2, the second component is
> changed to 0 (which indicates a not-yet-allocated block number), then
> the file system allocates block number B2 on device D2:
>
>              (D1, B1) -> (D2, 0) -> (D2, B2)
>
> Note that tiering is not defined for simple volumes (i.e. volumes
> consisting of only one device). Block relocation within one device is
> always the responsibility of the file system (to be precise, of its
> block allocator).
>
> Burst Buffers is just one of the strategies, in accordance with which
> all new logical blocks (optionally, all dirty pages) always get a
> location on a dedicated proxy device. As we have figured out, Burst
> Buffers is useful for HPC applications, as well as for ordinary
> applications executing fsync(2) frequently.
>
> There are other data tiering strategies, which can be useful for
> other classes of applications. All of them can be easily implemented
> in Reiser5.
>
> For example, you can use the proxy device to store hot data only.
> With such a strategy new logical blocks (which are always "cold")
> will always go to the main storage (in contrast with Burst Buffers,
> where new logical blocks first get written to the proxy disk). Once
> in a while you need to scan your volume in order to push colder data
> out of, and pull hotter data into, the proxy disk. Reiser5 contains a
> common interface for this. It is possible to maintain per-file, or
> even per-blocks-extent "temperature" of data (e.g. as a generation
> counter), but we still don't have more or less satisfactory
> algorithms to determine the "critical temperature" for pushing data
> in/out of the proxy disk.
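
As a reading aid for the tiering definition in this section, the address transition (D1, B1) -> (D2, 0) -> (D2, B2) can be modelled with a toy sketch. Everything here is hypothetical and made up for illustration (the Address and Device classes, the bump allocator, the relocate function); it is not the Reiser5 block allocator or its on-disk format. The only things taken from the text are the order of operations and the use of 0 as the "not allocated" block number.

```python
from dataclasses import dataclass

UNALLOCATED = 0  # per the text, block number 0 means "not allocated"


@dataclass
class Address:
    device: int  # first address component (device id)
    block: int   # second address component (block number)


class Device:
    """Toy device-component with a trivial bump allocator (hypothetical)."""

    def __init__(self, dev_id: int) -> None:
        self.dev_id = dev_id
        self.next_free = 1  # block 0 is reserved as the "unallocated" mark
        self.in_use = set()

    def allocate(self) -> int:
        block = self.next_free
        self.next_free += 1
        self.in_use.add(block)
        return block

    def release(self, block: int) -> None:
        self.in_use.discard(block)


def relocate(addr: Address, src: Device, dst: Device) -> list:
    """Move one unit from src to dst; return the sequence of addresses."""
    if src.dev_id == dst.dev_id:
        # Tiering is only defined across different device-components.
        raise ValueError("source and destination must be different devices")
    trace = [(addr.device, addr.block)]      # (D1, B1)
    src.release(addr.block)                  # B1 on D1 gets released
    addr.device, addr.block = dst.dev_id, UNALLOCATED
    trace.append((addr.device, addr.block))  # (D2, 0)
    addr.block = dst.allocate()              # allocate B2 on D2
    trace.append((addr.device, addr.block))  # (D2, B2)
    return trace
```

The same-device check mirrors the remark that tiering is not defined within a single device, where relocation is left to the block allocator.
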
>
>
> Getting started with proxy disk over logical volume
>
>
> Just follow the administration guide:
> https://reiser4.wiki.kernel.org/index.php/Proxy_Device_Administration
>
Re:
> WARNING: THE STUFF IS NOT STABLE! Don't store important data on
> Reiser5 logical volumes till beta-stability announcement.

Will you be releasing a reiser4 Software Format Release Number (SFRN)
4.0.2 patch for Linux kernel 5.6?

From personal experience, SFRN 4.0.2 is stable: all my data, local and
cloud virtual machine instances, as well as computing for the last
six-plus years, is in that format/environment.

I have not tried a Debian-based installation with this second iteration
of SFRN 5, and I have no use for the kernel/reiser4progs until they
play well with the Debian installer, Python, etc.

Best Professional Regards.

-- 
Jose R R
http://metztli.it
---------------------------------------------------------------------
Download Metztli Reiser4: Debian Buster w/ Linux 5.5.19 AMD64
---------------------------------------------------------------------
feats ZSTD compression https://sf.net/projects/metztli-reiser4/
---------------------------------------------------------------------
Official current Reiser4 resources: https://reiser4.wiki.kernel.org/