Reiser4 (meta)data checksums

1. Why protect (meta)data?

We want to be protected against hardware problems such as data rot in
memory and decay of storage media. We want to be sure that our data
structures are consistent, because working with corrupted data
structures is dangerous.

Strictly speaking, such protection is not the business of a file
system. It would be more logical to assume that it is the business of
the subsystems above and below it. To be precise, protection against
data rot in memory is the business of the memory controller, and
protection against decay of storage media is the business of the block
device controller/driver. However, these subsystems frequently don't
provide such protection, for various reasons. As a result the file
system suffers (becomes corrupted, inconsistent), and poor users start
to blame file system developers.

2. Why "inline" checksums?

Reiser4 stores the per-node checksum right in the node that we want to
protect. This is much more efficient than using dedicated data
structures for checksums, as we don't need to launch an expensive
search procedure every time we need to access a checksum. Using
dedicated data structures to store checksums is a design mistake.

3. When do we check/update per-node checksums?

Let's start with protection against storage media decay. If someone
wants protection against data rot in memory, then let me know.

Since we implement protection against storage media decay, it is
enough to check a checksum right after read IO completion and to
update it right before submitting a write IO request. We don't need to
update a checksum after every modification. So, updating checksums in
Reiser4 is a delayed action.

Reiser4 updates the per-node checksum at commit time, right before
writing the node to disk. At the moment of the checksum update, any
process modifying this node will be blocked on an attempt to acquire
exclusive access:

    longterm_lock_znode -> try_capture_block

Thus, the updated checksum won't be "spoiled" before hitting the disk.
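As a rough illustration of the delayed update described above, and of
the matching check after a read, here is a minimal user-space sketch in
C. It is not the node41 code: NODE_SIZE, CSUM_OFFSET and the function
names are hypothetical, the checksum field placement is an assumption,
and the bitwise loop is a plain software stand-in for whatever crc32c
implementation the kernel actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NODE_SIZE   4096u /* hypothetical formatted node size */
#define CSUM_OFFSET 8u    /* hypothetical offset of the 32-bit checksum field */

/* Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c_update(uint32_t crc, const uint8_t *buf, size_t len)
{
    while (len--) {
        crc ^= *buf++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & ((uint32_t)0 - (crc & 1u)));
    }
    return crc;
}

/* One-shot CRC32C with the usual initial value and final inversion. */
static uint32_t crc32c(const uint8_t *buf, size_t len)
{
    return crc32c_update(0xFFFFFFFFu, buf, len) ^ 0xFFFFFFFFu;
}

/* Checksum of the whole node, skipping the checksum field itself. */
static uint32_t node_csum(const uint8_t *node)
{
    uint32_t crc = 0xFFFFFFFFu;

    assert(node != NULL);
    crc = crc32c_update(crc, node, CSUM_OFFSET);
    crc = crc32c_update(crc, node + CSUM_OFFSET + 4,
                        NODE_SIZE - CSUM_OFFSET - 4);
    return crc ^ 0xFFFFFFFFu;
}

/* Called once, right before the node is submitted for writing. */
static void node_update_csum(uint8_t *node)
{
    uint32_t c = node_csum(node);

    /* Store the checksum inline, little-endian, inside the node. */
    node[CSUM_OFFSET + 0] = (uint8_t)(c & 0xFF);
    node[CSUM_OFFSET + 1] = (uint8_t)((c >> 8) & 0xFF);
    node[CSUM_OFFSET + 2] = (uint8_t)((c >> 16) & 0xFF);
    node[CSUM_OFFSET + 3] = (uint8_t)((c >> 24) & 0xFF);
}

/* Called right after read IO completion; returns 1 if the node is intact. */
static int node_verify_csum(const uint8_t *node)
{
    uint32_t stored = (uint32_t)node[CSUM_OFFSET + 0]
                    | ((uint32_t)node[CSUM_OFFSET + 1] << 8)
                    | ((uint32_t)node[CSUM_OFFSET + 2] << 16)
                    | ((uint32_t)node[CSUM_OFFSET + 3] << 24);

    return stored == node_csum(node);
}
```

Because the checksum lives in the node itself, both steps touch only
the buffer that is already in hand; no lookup in a separate structure
is needed.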
Checksum verification takes place right after read IO completion, in
the ->parse() method of the node plugin.

4. How we handle corruptions

If a node's checksum verification fails, then further work with such a
node is dangerous. Currently the user has 2 options for handling this
situation online:

1) kernel panic (default behavior);
2) remount the reiser4 partition as read-only (if the mount option
"onerror=remount-ro" was specified).

In both cases the user should repair the partition offline with fsck.

TODO: An online failover mode is planned. For this mode we need to
support mirror(s). Every in-memory replica gets updated at the moment
of the checksum update. At the end of transaction commit all replicas
have to be written to the mirror. If checksum verification fails, then
we issue a read IO request for the replica block on the mirror.

Comment. Mirrors can be internal (when we allocate replicas on the
same partition) or external (when we allocate replicas on a different
device).

5. Why use crc32c for checksums?

Modern CPUs have instructions (for example, the SSE4.2 crc32
instruction on x86) which allow a full 32-bit CRC step to be computed
in 3 cycles.

6. How to protect data?

Currently we don't support checksums for unformatted blocks, where the
bodies of large files are stored. If you want to protect your data
(not only metadata), then you have 3 options:

1) Make sure that reiser4 stores the bodies of your files in fragments
(i.e. "inline" data chunks). Fragments are always stored in formatted
nodes, which are protected by checksums. This is possible with the
mkfs option "formatting=tails" for files managed by the unix_file
plugin (if you don't use compression) or "compressMode=latt" for files
managed by the cryptcompress plugin (if you use compression).
NOTE. This option will lead to performance degradation (especially for
delete operations).

2) Protect your data by yourself. If a file system guarantees
consistency of metadata, then data protection can be successfully
implemented in user space.
Indeed, since the file body is uniquely determined by extent pointers,
which are guaranteed to be consistent, checking the consistency of the
file's body in user space is always a correct operation. So, feel free
to check your data in user space: we have provided the basis for this.

3) Implement checksums for unformatted nodes in reiser4. This option
requires a new format for extent pointers (which will include a 32-bit
field for the checksum) and, accordingly, a new item plugin
(extent-pointer-with-checksum, or so).

7. How to enable checksum support in reiser4

Specify the mkfs.reiser4 option "-o node=node41" when formatting your
partition, and mount as usual. We recommend using the mount option
"onerror=remount-ro", so that reiser4 won't panic on a failed checksum
verification.

8. Compatibility with other features

Checksums are compatible with all reiser4 features. Adding checksum
support is a great example of how reiser4 resists the problem of
creeping featurism. We just added a new node plugin, which manages
nodes of a new format (node41) with a 32-bit field for the checksum.
The new plugin mostly reuses methods of the old one (node40), as you
can see from the following patches:

http://marc.info/?l=reiserfs-devel&m=142359111509525&w=2
http://marc.info/?l=reiserfs-devel&m=142359112409527&w=2

9. TODO

A. Failover via mirroring (see section 4 for implementation hints).

B. Maintain checksums for the superblock and bitmap blocks.

Comment. We already have such support for bitmap blocks; however, it
uses adler32, and checksum update/verification is not invoked, for
historical reasons. I suggest replacing adler32 with crc32c and
triggering the update/verification.

Comment. For superblock protection we need to add a 32-bit field to
the disk superblock and update/verify it as in the case of formatted
nodes.
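To make TODO item A a bit more concrete, the read path hinted at in
section 4 (verify the primary block, and on failure read the replica
block from the mirror) could look roughly like the following C sketch.
This is only an assumption about how such a mode might behave, not
existing reiser4 code: the in-memory "devices", BLOCK_SIZE, all
function names, and the toy XOR checksum (a stand-in for crc32c, used
only to keep the example short) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16u /* tiny block, for illustration only */

/* Toy stand-in for crc32c: XOR of all bytes except the last one. */
static uint8_t toy_csum(const uint8_t *block)
{
    uint8_t x = 0;

    for (size_t i = 0; i + 1 < BLOCK_SIZE; i++)
        x ^= block[i];
    return x;
}

/* Store the toy checksum inline, in the last byte of the block. */
static void toy_seal(uint8_t *block)
{
    block[BLOCK_SIZE - 1] = toy_csum(block);
}

/* Returns 1 if the inline checksum matches the block contents. */
static int toy_verify(const uint8_t *block)
{
    return block[BLOCK_SIZE - 1] == toy_csum(block);
}

enum read_status { READ_PRIMARY_OK, READ_REPLICA_OK, READ_FAILED };

/*
 * "primary" and "mirror" are flat arrays of nr_blocks blocks each,
 * standing in for the partition and its mirror. The block is copied
 * into "out"; the return value says which copy (if any) was valid.
 */
static enum read_status read_block_failover(const uint8_t *primary,
                                            const uint8_t *mirror,
                                            size_t nr_blocks,
                                            size_t blocknr,
                                            uint8_t *out)
{
    assert(blocknr < nr_blocks);

    /* Normal path: read the primary copy and verify it. */
    memcpy(out, primary + blocknr * BLOCK_SIZE, BLOCK_SIZE);
    if (toy_verify(out))
        return READ_PRIMARY_OK;

    /* Verification failed: issue a read for the replica block. */
    memcpy(out, mirror + blocknr * BLOCK_SIZE, BLOCK_SIZE);
    if (toy_verify(out))
        return READ_REPLICA_OK;

    /* Both copies are bad: fall back to panic / remount-ro handling. */
    return READ_FAILED;
}
```

The sketch deliberately leaves out what happens after a successful
replica read (for example, whether the primary copy is rewritten),
since that is not decided here.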