On Feb 1, 2018, at 4:04 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Wed, Jan 31, 2018 at 07:03:16PM -0500, Theodore Ts'o wrote:
>> On Wed, Jan 31, 2018 at 12:41:13PM -0800, James Bottomley wrote:
>>>> Like fscrypto, where most of the code is in fs/crypto, most of the
>>>> fs-verity will be in fs/verity. There will be minimal hooks in a
>>>> particular file system, so if another file system wants to play, then
>>>> can do so relatively easily.
>>>
>>> OK, sounds good ... I notice, now I look, that fscrypt uses xattrs
>>> (albeit hidden under the covers of get/set_context), will dm-verity use
>>> the same trick or do people really need space in the inode?
>>
>> I assume you mean fs-verity above, and no, we aren't going to use
>> xattrs because the Merkle tree won't fit in the xattr. So the plan
>> was to put the fs-verity header, the PKCS7 signature, and the Merkle
>> tree after i_size (rounded to a blocksize boundary). Remember, the
>> fs-verity case we only worry about the read-only case.
>
> I think putting valid data beyond EOF is going to be problematic for
> many filesystems. Getting things like truncate right are hard enough
> without having to special case a bunch of new functionality that
> specifically allows IO access beyond EOF. Indeed, how does "truncate
> isize but leave special data behind" work and what's the userspace
> API to drive it? And how does it interact with all the page cache
> code that checks for page->index beyond EOF to detect a truncated
> page that should not be accessed?
>
> There's also further complications for filesystems like XFS e.g. how
> do we tell the difference between valid data beyond EOF and
> speculative allocation (done by delalloc) beyond EOF that contains
> no data and can be removed if it is not written to in a short while?
>
> This just seems like a horrible can of worms to me and is not
> something we should be building generic infrastructure around.
>
> Just how big do these merkle trees get, anyway?

The Merkle tree will have one checksum per "leaf block" of the filesystem
(though I'd recommend using a fixed-size checksum leaf block like 4KB so
that userspace doesn't need to care about the actual filesystem blocksize
on disk). After that, there is a tree of checksums from the leaf blocks
up to the root.

If the tree used a weak checksum like CRC32 (4 bytes/leaf) then the tree
size would be roughly 0.1% of the file size. If the tree uses a strong
checksum like SHA256 (32 bytes/leaf) then the overhead is roughly 0.8%.

Strictly speaking, the whole Merkle tree does not need to be stored on
disk. If the leaf checksums are stored (to allow random IO access with
data verification) along with the root node (to allow verification of the
rest of the leaf blocks), then the intermediate tree could be recomputed
with relatively low overhead (0.1% vs. checksumming the whole file at
open).

>> As I stated above, we need to put the Merkle tree after i_size anyway,
>> so the current plan doesn't use xattrs at all. Xattr storage space is
>> also precious (especially if you are trying to keep all of the xattrs
>
> No it's not. xattr space is specifically designed for uses like
> this, and if you have to take an extra IO to read it then that's the
> cost of storing large chunks of non-userdata data on a file. You've
> got to take extra IOs to read the merkle tree if it's stored beyond
> EOF anyway, so it doesn't matter if we take extra IOs to read it
> from an xattr....

Since the tree size depends on file size, the leaf checksums alone would
hit the 64KB xattr size limit at a file size of 64MB (CRC32) or 8MB
(SHA256), unless we also allow larger xattrs to userspace. An ext4
feature landed in 4.13 to allow on-disk xattrs larger than the previous
4KB (single block) limit (essentially any size xattr could be stored), so
that wouldn't be a problem if the userspace xattr API limit was removed.

Cheers, Andreas
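
For anyone who wants to check the size estimates above, here is a rough sketch of the arithmetic. It assumes 4KB leaf blocks, tree nodes packed into 4KB blocks, and a 64KB xattr value limit; the function names are made up for illustration and are not part of any fs-verity code.

```python
# Rough sketch of the Merkle tree size arithmetic discussed above.
# Assumptions (not from any real implementation): 4 KiB leaf blocks,
# tree nodes packed into 4 KiB blocks, 64 KiB xattr value limit.

LEAF_BLOCK = 4096        # bytes of file data covered by one leaf checksum
XATTR_LIMIT = 64 * 1024  # userspace xattr value size limit


def level_counts(file_size, digest_size, block=LEAF_BLOCK):
    """Number of checksums at each tree level, leaves first, root last."""
    fanout = block // digest_size            # checksums per tree block
    count = max(1, -(-file_size // block))   # ceil(file_size / block)
    counts = [count]
    while count > 1:
        count = -(-count // fanout)          # one parent per full block
        counts.append(count)
    return counts


def tree_bytes(file_size, digest_size, block=LEAF_BLOCK):
    """Total bytes of checksums if every level is stored."""
    return sum(level_counts(file_size, digest_size, block)) * digest_size


def leaf_bytes(file_size, digest_size, block=LEAF_BLOCK):
    """Bytes of checksums if only the leaf level is stored."""
    return level_counts(file_size, digest_size, block)[0] * digest_size


def max_file_for_xattr(digest_size, limit=XATTR_LIMIT, block=LEAF_BLOCK):
    """Largest file whose leaf checksums fit in a single xattr value."""
    return (limit // digest_size) * block


if __name__ == "__main__":
    gib = 1 << 30
    for name, digest in (("CRC32", 4), ("SHA256", 32)):
        overhead = tree_bytes(gib, digest) / gib
        limit_mib = max_file_for_xattr(digest) >> 20
        print(f"{name}: overhead {overhead:.4%}, xattr limit at {limit_mib} MiB")
```

For a 1GB file the overhead comes out close to the 0.1% and 0.8% figures quoted above, and the xattr limit is reached at 64MB and 8MB respectively. The upper tree levels add very little on top of the leaf checksums, which is why storing only the leaves plus the root costs almost the same as storing the whole tree.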