On Wed, Jan 15, 2020 at 11:09:03PM +0000, David Howells wrote:
> Andreas Dilger <adilger@xxxxxxxxx> wrote:
>
> > > It would also have to say that blocks of zeros shouldn't be optimised away.
> >
> > I don't necessarily see that as a requirement, so long as the filesystem
> > stores a "block" at that offset, but it could dedupe all zero-filled blocks
> > to the same "zero block".  That still allows saving storage space, while
> > keeping the semantics of "this block was written into the file" rather than
> > "there is a hole at this offset".
>
> Yeah, that's more what I was thinking of.  Provided I can find out that
> something is present, it should be fine.

I'm curious how this proposal handles an application punching a hole
through the cache.  Does that get cached, or does that operation have
to be synchronous with the server?  Or is it a moot point because no
server supports hole punching, so it gets replaced with equivalent zero
block data writes?

Zero blocks are stupidly common on typical user data corpuses, and a
naive block-oriented deduper can create monster extents with millions
or even billions of references if it doesn't have some special handling
for zero blocks.  Even if they don't trigger filesystem performance bugs
or hit RAM or other implementation limits, it's still bigger and slower
to use zero-filled data blocks than just using holes for zero blocks.

In the bees deduper for btrfs, zero blocks get replaced with holes
unconditionally in uncompressed extents, and in compressed extents if
the extent consists entirely of zeros (a long run of zero bytes is
compressed to a few bits by all supported compression algorithms, and
hole metadata is much larger than a few bits, so no gain is possible if
anything less than the entire compressed extent is eliminated).  That
behavior could be adjusted to support this use case, as a non-default
user option.
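
To make the zero-block rule above concrete, here is a minimal sketch of the decision it describes.  It is not taken from bees: the function names, the 4096-byte block size and the in-memory `bytes` input are all assumptions made for illustration, and a real deduper would work on extents through filesystem ioctls rather than on a buffer.

```python
BLOCK_SIZE = 4096  # assumed block size; real filesystems vary


def zero_block_runs(data: bytes, block_size: int = BLOCK_SIZE):
    """Return (start_block, block_count) for each run of all-zero blocks."""
    runs = []
    run_start = None
    nblocks = (len(data) + block_size - 1) // block_size
    for i in range(nblocks):
        block = data[i * block_size:(i + 1) * block_size]
        is_zero = not any(block)  # True when every byte in the block is 0
        if is_zero and run_start is None:
            run_start = i
        elif not is_zero and run_start is not None:
            runs.append((run_start, i - run_start))
            run_start = None
    if run_start is not None:
        runs.append((run_start, nblocks - run_start))
    return runs


def holes_to_punch(data: bytes, compressed: bool,
                   block_size: int = BLOCK_SIZE):
    """Pick the zero runs of one extent that are worth turning into holes.

    Uncompressed extent: every zero run.
    Compressed extent: only when the whole extent is zero, since a
    partial hole costs more metadata than the compressed zeros did.
    """
    runs = zero_block_runs(data, block_size)
    if not compressed:
        return runs
    nblocks = (len(data) + block_size - 1) // block_size
    if nblocks > 0 and runs == [(0, nblocks)]:
        return runs
    return []
```

A "keep zero blocks as data" user option of the kind suggested above would amount to having `holes_to_punch` return an empty list regardless of the extent's contents.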

For defrag a similar optimization is possible: read a long run of
consecutive zero data blocks, write a prealloc extent.  I don't know of
anyone doing that in real life, but it would play havoc with anything
trying to store information in FIEMAP data (or related ioctls like
GETFSMAP or TREE_SEARCH).

I think an explicit dirty-cache-data metadata structure is a good idea
despite implementation complexity.  It would eliminate dependencies on
non-portable filesystem behavior, and not abuse a facility that might
already be in active (ab)use by other existing things.

If you have a writeback cache, you need to properly control write
ordering with a purpose-built metadata structure, or fsync() will be
meaningless through your caching layer, and after a crash you'll upload
whatever confused, delalloc-reordered, torn-written steaming crap is on
the local disk to the backing store.

> David
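
The write-ordering point above can be sketched as a tiny append-only journal of dirty ranges.  This is only an illustration under stated assumptions, not a design from this thread: the record layout, the function names and the use of two separate files (one for cached data, one for the journal, the latter opened with O_APPEND) are all hypothetical.  The one property it demonstrates is that the data is fsync()ed before the record that marks it dirty, so a record that survives a crash never points at data that was not yet durable.

```python
import os
import struct
import zlib

# Assumed record layout: offset, length, CRC32 over those two fields.
RECORD = struct.Struct("<QQI")


def cached_write(cache_fd: int, journal_fd: int, offset: int, data: bytes):
    """Write data into the cache file, then mark the range dirty."""
    os.pwrite(cache_fd, data, offset)
    os.fsync(cache_fd)      # data must be durable before its record
    header = struct.pack("<QQ", offset, len(data))
    os.write(journal_fd, header + struct.pack("<I", zlib.crc32(header)))
    os.fsync(journal_fd)    # only now may the write be acknowledged


def replay(journal_fd: int):
    """After a crash, list the (offset, length) ranges safe to upload."""
    dirty = []
    pos = 0
    while True:
        rec = os.pread(journal_fd, RECORD.size, pos)
        if len(rec) < RECORD.size:
            break           # end of journal, or a torn final record
        offset, length, crc = RECORD.unpack(rec)
        if zlib.crc32(rec[:16]) != crc:
            break           # corrupt record: trust nothing after it
        dirty.append((offset, length))
        pos += RECORD.size
    return dirty
```

The sketch deliberately ignores overwrites of a range that is already marked dirty: an in-place rewrite that is interrupted by a crash could still tear data that an older record vouches for, so a real implementation would need copy-on-write or per-range data checksums on top of this.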