On Fri, May 3, 2019 at 12:23 AM Eugene Zemtsov <ezemtsov@xxxxxxxxxx> wrote:
>
> On Thu, May 2, 2019 at 6:26 AM Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> >
> > Why not CODA, though, with local fs as cache?
>
> On Thu, May 2, 2019 at 4:20 AM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> >
> > This sounds very useful.
> >
> > Why does it have to be a new special-purpose Linux virtual filesystem?
> > Why not FUSE, which is meant for this purpose?
> > Those are things that you should explain when you are proposing a new
> > filesystem, but I will answer for you - because a FUSE page fault will
> > incur high latency even after blocks are locally available in your
> > backend store. Right?
> >
> > How about fscache support for FUSE then?
> > You can even write your own fscache backend if the existing ones don't
> > fit your needs for some reason.
> >
> > Piling logic into the kernel is not the answer.
> > Adding the missing interfaces to the kernel is the answer.
>
> Thanks for the interest and feedback. What I dreaded most was silence.
>
> Probably I should have given a bit more details in the introductory email.
> Important features we’re aiming for:
>
> 1. An attempt to read a missing data block gives a userspace data loader a
> chance to fetch it. Once a block is loaded (in advance or after a page
> fault), it is saved into local backing storage, and subsequent reads of
> the same block are served directly by the kernel. [Implemented]
>
> 2. Block-level compression. It saves space on the device while still
> allowing very granular loading and mapping. Less granular compression
> would trigger loading of more data than absolutely necessary, and that’s
> the thing we want to avoid. [Implemented]
>
> 3. Block-level integrity verification. The signature scheme is similar to
> dm-verity or fs-verity. In other words, each file has a Merkle tree with
> crypto-digests of its 4KB blocks. The root digest is signed with RSASSA
> or ECDSA.
> Each time a data block is read, its digest is calculated and checked
> against the Merkle tree; if the check fails, the read operation fails as
> well. Ideally I’d like to use the fs-verity API for that.
> [Not implemented yet]
>
> 4. New files can be pushed into incremental-fs “externally” when an app
> needs a new resource or binary. This is needed for situations when a new
> resource or a new version of code becomes available, e.g. a user just
> changed the system language to Spanish, or a developer rolled out an app
> update. Things change over time, which means we can’t just incrementally
> load a precooked ext4 image and mount it via a loopback device.
> [Implemented]
>
> 5. No need to support writes or file resizing. This eliminates a lot of
> complexity.
>
> Not all of these features are implemented yet, but they will all be
> needed to achieve our goals:
> - Apps can be delivered incrementally without having to wait for extra
>   data. At the same time, given enough time, the app can be downloaded
>   fully, without having to keep a connection open after that.
> - An app’s integrity should be verifiable without having to read all its
>   blocks.
> - Local storage and battery need to be conserved.
> - App binaries and resources can change over time.
>   Such changes are triggered by external events.

Good summary. I understand the requirements better now.

I still have issues with this design, because it looks very Android
specific. For example, I know that lazy download is something actually
being heavily used by distributed computing (see cernvm-fs), so it's not a
requirement specific to Android. By bundling these features together into
a kernel module you are basically limiting the user base, and hence
possibly missing out on some of the advantages of having a more varied
user base.

I wonder how much of the performance problem with the FUSE prototype was
caused by the 4k reads / disabled readahead?
I know you require that for the data-loading part, but it would be trivial
to turn that behavior off once everything is in place. Does the prototype
do that? Have you tried doing that? Is the prototype in good enough shape
to perhaps move it to a public repository for review?

I'm also wondering about some of the features you describe above. Why a
new block fs? A normal fs (ext4) provides most of those things: you can
add files to it, etc. The one thing it doesn't provide is compression, and
that's because it's hard for the non-incremental case. So do we really
need a new disk format for this? Or can the missing compression feature
(perhaps with limits) be implemented in ext4/f2fs? In that case we could
even take that work off of FUSE and leave just the loading to the FUSE
part.

Cernvm-fs does something like that: a FUSE fs on the lower layer does the
lazy downloading, and already downloaded files are placed in an upper
layer of overlayfs for faster access. But it's possible that there's a
better way of doing that, not involving overlayfs at all.

Thanks,
Miklos
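[Editor's note: for readers unfamiliar with the per-block Merkle scheme Eugene
describes in feature 3, a rough sketch follows. This is an illustrative toy,
not the incremental-fs or fs-verity code; the hash choice (SHA-256), the
duplicate-last-node padding, and the function names are all assumptions made
for brevity.]

```python
import hashlib

BLOCK_SIZE = 4096  # incremental-fs verifies data at 4KB granularity

def block_digests(data: bytes) -> list:
    """Hash every 4KB block of the file individually (the Merkle leaves)."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
            for i in range(0, len(data), BLOCK_SIZE)]

def merkle_root(digests: list) -> bytes:
    """Fold leaf digests pairwise up to a single root digest."""
    level = digests
    while len(level) > 1:
        if len(level) % 2:                  # pad odd levels: duplicate last node
            level = level + [level[-1]]
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

def verify_block(block: bytes, index: int, leaves: list, root: bytes) -> bool:
    """On read, the block's digest must match the tree, and the tree must
    still hash up to the (signed, hence trusted) root digest."""
    return (hashlib.sha256(block).digest() == leaves[index]
            and merkle_root(leaves) == root)
```

In the real scheme only the root digest is signed (RSASSA/ECDSA), so
verifying a single block needs just its sibling path up the tree, i.e.
O(log n) hashes, rather than rehashing the whole file as this toy does.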