On Fri, Sep 02, 2011 at 10:15:27AM -0400, Greg Freemyer wrote: > On Wed, Aug 31, 2011 at 8:30 PM, Darrick J. Wong <djwong@xxxxxxxxxx> wrote: > > Hi all, > > > > This patchset adds crc32c checksums to most of the ext4 metadata objects. A > > full design document is on the ext4 wiki[1] but I will summarize that document here. > > > > As much as we wish our storage hardware was totally reliable, it is still > > quite possible for data to be corrupted on disk, corrupted during transfer over > > a wire, or written to the wrong places. To protect against this sort of > > non-hostile corruption, it is desirable to store checksums of metadata objects > > on the filesystem to prevent broken metadata from shredding the filesystem. > > > > The crc32c polynomial was chosen for its improved error detection capabilities > > over crc32 and crc16, and because of its hardware acceleration on current and > > upcoming Intel and Sparc chips. > > > > Each type of metadata object has been retrofitted to store a checksum as follows: > > > > - The superblock stores a crc32c of itself. > > - Each inode stores crc32c(fs_uuid + inode_num + inode + slack_space_after_inode) > > - Block and inode bitmaps each get their own crc32c(fs_uuid + group_num + > > bitmap), stored in the block group descriptor. > > - Each extent tree block stores a crc32c(fs_uuid + inode_num + extent_entries) > > in unused space at the end of the block. > > - Each directory leaf block has an unused-looking directory entry big enough to > > store a crc32c(fs_uuid + inode_num + block) at the end of the block. > > - Each directory htree block is shortened to contain a crc32c(fs_uuid + > > inode_num + block) at the end of the block. > > - Extended attribute blocks store crc32c(fs_uuid + block_no + ea_block) in the > > header. > > - Journal commit blocks can be converted to use crc32c to checksum all blocks > > in the transaction, if journal_checksum is given. > > > > The first four patches in the kernel patchset fix existing bugs in ext4 that > > cause incorrect checkums to be written. I think Ted already took them, but > > with recent instability I'm resending them to be cautious. The subsequent 12 > > patches add the necessary code to support checksumming in ext4 and jbd2. > > > > I also have a set of three patches that provide a faster crc32c implementation > > based on Bob Pearson's earlier crc32 patchset. This will be sent under > > separate cover to the crypto list and to lkml/linux-ext4. > > > > The patchset for e2fsprogs will be sent under separate cover only to linux-ext4 > > as it is quite lengthy (~36 patches). > > > > As far as performance impact goes, I see nearly no change with a standard mail > > server ffsb simulation. On a test that involves only file creation and > > deletion and extent tree modifications, I see a drop of about 50 percent with > > the current kernel crc32c implementation; this improves to a drop of about 20 > > percent with the enclosed crc32c implementation. However, given that metadata > > is usually a small fraction of total IO, it doesn't seem like the cost of > > enabling this feature is unreasonable. > > > > There are of course unresolved issues: > > > > - What to do when the block group descriptor isn't big enough to hold 2 crc32s > > (which is the case with 32-bit ext4 filesystems, sadly). I'm not quite > > convinced that truncating a 32-bit checksum to 16-bits is a safe idea. Right > > now, one has to enable the 64bit feature to enable any bitmap checksums. > > I'm not sure how effective crc16 is at checksumming 32768-bit bitmaps. > > > > - Using the journal commit hooks to delay crc32c calculation until dirty > > buffers are actually being written to disk. > > > > - Can we get away with using a (hw accelerated) LE crc32c for jbd2, which > > stores its data in BE order? > > > > - Interaction with online resize code. Yongqiang seems to be in the process of > > rewriting this, so I haven't looked at it very closely yet. > > > > - If block group descriptors can now exceed 32 bytes (when 64bit filesystem > > support is enabled), should we use crc32c instead of crc16? From what I've > > read of the literature, crc16 is not very effective on datasets exceeding 256 > > bytes. > > > > Please have a look at the design document and patches, and please feel free to > > suggest any changes. I will be at LPC next week if anyone wishes to discuss, > > debate, or protest. > > > > --D > > > > [1] https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums > > Derrick, > > Brainstorming only: > > Another thing you might consider is to somehow tie into the data > integrity patches that went into the kernel a couple years ago. Those > are tied to specialized storage devices (typically scsi) that can > actually have the checksum live on the disk, but not in the normal > data area. ie. in the sector header / footer or some other out of > band area. > > At a minimum, it may make sense to use the same CRC which that API > does. Then you could calculate the CRC once and put it both in-band > in the inode and out-of-band in the dedicated integrity area of > supporting storage devices. If you have the necessary DIF/DIX hardware and kernel support then every block in the FS is already checksummed and you don't need metadata_csum at all. This patchset was really intended for setups where we don't have DIF/DIX. Furthermore, the nice thing about the in-filesystem checksum is that we bake in other things like the FS UUID and the inode number, which gives you a somewhat better assurance that the data block belongs to the fs and the file that the code think it belongs to. The DIX interface allows for a 32-bit block number and a 16-bit application tag ... which is unfortunately small given 64-bit block numbers and 32-bit inode numbers. I guess there's also an argument that from a layering perspective it's desirable to have a FS image whose integrity features remain intact even if you copy the image to a different device that doesn't support the hardware feature. As a side note, the crc-t10dif implementation is quite slow -- the hardware accelerated crc32c is 15x faster, and the sw implementation is usually 3-6x faster. I suspect somebody will want to fix that before DIF becomes more widespread... > That if the data is corrupted on the wire as an example, the > controller itself may be able to verify its a bad crc and ask for a > re-read without even involving the kernel. > > I believe supporting hardware is rare, but if the kernel is going to > have a data integrity API to support it at all, then code like this is > exactly the kind of code that should layer on top of it. The good news is that if you're really worried about integrity, metadata_csum and DIF/DIX aren't mutually exclusive features. Rejecting corrupted write commands at write time seems like a useful feature. :) --D -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html