On Thu, Feb 01, 2018 at 04:43:37PM -0700, Andreas Dilger wrote:
> On Feb 1, 2018, at 4:04 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > On Wed, Jan 31, 2018 at 07:03:16PM -0500, Theodore Ts'o wrote:
> >> On Wed, Jan 31, 2018 at 12:41:13PM -0800, James Bottomley wrote:
> >>>> Like fscrypto, where most of the code is in fs/crypto, most of the
> >>>> fs-verity will be in fs/verity.  There will be minimal hooks in a
> >>>> particular file system, so if another file system wants to play, they
> >>>> can do so relatively easily.
> >>>
> >>> OK, sounds good ... I notice, now I look, that fscrypt uses xattrs
> >>> (albeit hidden under the covers of get/set_context), will dm-verity use
> >>> the same trick or do people really need space in the inode?
> >>
> >> I assume you mean fs-verity above, and no, we aren't going to use
> >> xattrs because the Merkle tree won't fit in the xattr.  So the plan
> >> was to put the fs-verity header, the PKCS7 signature, and the Merkle
> >> tree after i_size (rounded to a blocksize boundary).  Remember, in the
> >> fs-verity case we only worry about the read-only case.
> >
> > I think putting valid data beyond EOF is going to be problematic for
> > many filesystems. Getting things like truncate right is hard enough
> > without having to special case a bunch of new functionality that
> > specifically allows IO access beyond EOF. Indeed, how does "truncate
> > isize but leave special data behind" work and what's the userspace
> > API to drive it? And how does it interact with all the page cache
> > code that checks for page->index beyond EOF to detect a truncated
> > page that should not be accessed?
> >
> > There are also further complications for filesystems like XFS e.g. how
> > do we tell the difference between valid data beyond EOF and
> > speculative allocation (done by delalloc) beyond EOF that contains
> > no data and can be removed if it is not written to in a short while?
> >
> > This just seems like a horrible can of worms to me and is not
> > something we should be building generic infrastructure around.
> >
> > Just how big do these Merkle trees get, anyway?
>
> The Merkle tree will have one checksum per "leaf block" of the filesystem
> (though I'd recommend to use a fixed-size checksum leaf block like 4KB so
> that userspace doesn't need to care about the actual filesystem blocksize
> on disk).

....

> Since the tree size depends on file size, it would hit the 64KB xattr size
> limit at 64MB (CRC32) or 8MB (SHA256), unless we also allow larger xattrs
> to userspace.

So how many cases are there where we need to support >64MB files
for integrity measurement? I just checked my laptop, and there are
only a handful of binaries/library/data files shipped by the distro
that are over 64MB.

Perhaps there's a tradeoff that can be made here - store the full
Merkle tree if it fits in an xattr (which covers the vast majority
of system binaries and data files), otherwise calculate it on the
fly and cache it in memory?

> There was an ext4 feature landed in 4.13 to allow larger
> on-disk xattrs than the previous 4KB (single block) limit (essentially any
> size xattr could be stored), so that wouldn't be a problem if the userspace
> xattr API limit was removed.

We'd need to rev the on-disk format for XFS as the internal xattr
value length is held in a 16-bit field. Might be trivial to do
because we'd only need to modify the "remote attribute" header as we
already store large attributes out of line.

Of course, that still leaves all the userspace API to deal with. :(

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
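
For reference, the 64MB (CRC32) and 8MB (SHA256) figures quoted above can be
reproduced with a short sketch. It assumes one checksum per fixed 4KB leaf
block and a 64KB xattr value limit, as stated in the thread, and it counts
only the leaf checksums: interior tree nodes, the fs-verity header and the
PKCS7 signature are ignored. The function names are made up for illustration
and do not correspond to any kernel or fs-verity interface.

```python
# Sketch only: counts leaf checksums, one per fixed-size data block.
# Interior Merkle tree nodes, the fs-verity header and the PKCS7
# signature are NOT included, so real space needs would be larger.

LEAF_BLOCK_SIZE = 4 * 1024   # fixed 4KB leaf block suggested in the thread
XATTR_LIMIT = 64 * 1024      # 64KB xattr value limit quoted in the thread

# Bytes per leaf checksum for the two algorithms mentioned in the thread.
DIGEST_SIZES = {"crc32": 4, "sha256": 32}


def leaf_hash_bytes(file_size, digest_size, block_size=LEAF_BLOCK_SIZE):
    """Bytes needed to store one checksum per leaf block of the file."""
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * digest_size


def max_file_size(digest_size, limit=XATTR_LIMIT, block_size=LEAF_BLOCK_SIZE):
    """Largest file whose leaf checksums still fit within the limit."""
    return (limit // digest_size) * block_size


def fits_in_xattr(file_size, digest_size, limit=XATTR_LIMIT):
    """Hypothetical policy check: store in an xattr, or compute on the fly?"""
    return leaf_hash_bytes(file_size, digest_size) <= limit


if __name__ == "__main__":
    for name, size in DIGEST_SIZES.items():
        print(name, max_file_size(size) // 2**20, "MB")
```

Under the tradeoff suggested above, a check like fits_in_xattr() would pick
between storing the checksums in an xattr and calculating them on the fly and
caching them in memory.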