On Jan 15, 2020, at 6:31 AM, Christoph Hellwig <hch@xxxxxx> wrote:
>
> On Wed, Jan 15, 2020 at 09:10:44PM +0800, Qu Wenruo wrote:
>>> That allows userspace to distinguish fe_physical addresses that may be
>>> on different devices.  This isn't in the kernel yet, since it is mostly
>>> useful only for Btrfs and nobody has implemented it there.  I can give
>>> you details if working on this for Btrfs is of interest to you.
>>
>> IMHO it's not good enough.
>>
>> The concern is, one extent can exist on multiple devices (mirrors for
>> RAID1/RAID10/RAID1C2/RAID1C3, or stripes for RAID5/6).
>> I didn't see how it can be easily implemented even with extra fields.
>>
>> And even if we implement it, it can be too complex or bug prone to fill
>> per-device info.
>
> It's also completely bogus for the use cases to start with.  fiemap
> is a debug tool reporting the file system layout.  Using it for anything
> related to actual data storage and data integrity is a recipe for
> disaster.  As said the right thing for the use case would be something
> like the NFS READ_PLUS operation.  If we can't get that easily it can
> be emulated using lseek SEEK_DATA / SEEK_HOLE assuming no other thread
> could be writing to the file, or the raciness doesn't matter.

I don't think either of those will be any better than FIEMAP, if the
reason is that the underlying filesystem is filling in holes with actual
data blocks to optimize the IO pattern.  SEEK_HOLE would not find a hole
in the block allocation, and would happily return the block of zeroes to
the caller.  Also, it isn't clear whether SEEK_HOLE considers an
allocated but unwritten extent to be a hole or data.

I think what is needed here is an fadvise/ioctl that tells the filesystem
"don't allocate blocks unless actually written" for that file.
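
For reference, the SEEK_DATA / SEEK_HOLE emulation mentioned above could
look roughly like the sketch below.  It is only an illustration under the
stated assumption that no other thread is writing to the file; the helper
name data_ranges() is made up for this example.  It reports whatever the
filesystem chooses to call "data", so it inherits both concerns raised
here: zero-filled blocks that the filesystem allocated on its own show up
as data, and how unwritten extents are reported depends on the filesystem.

```python
import errno
import os


def data_ranges(fd):
    """Return [(start, end), ...] byte ranges the filesystem reports as data.

    Hypothetical helper for illustration.  Assumes no concurrent writers.
    """
    size = os.fstat(fd).st_size
    ranges = []
    offset = 0
    while offset < size:
        try:
            # Next offset >= 'offset' that the filesystem considers data.
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as exc:
            if exc.errno == errno.ENXIO:
                break  # only a hole remains up to EOF
            raise
        # End of this data run; EOF counts as an implicit hole.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        if end <= start:
            break  # defensive: file changed underneath us
        ranges.append((start, end))
        offset = end
    return ranges
```

On a filesystem that falls back to the generic implementation, this
returns a single range covering the whole file, so the result can only be
treated as an upper bound on where data exists.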

Storing anything in a separate data structure is a recipe for disaster,
since it will become inconsistent after a crash, or filesystem
corruption+e2fsck, and will unnecessarily bloat the on-disk metadata for
every file to hold redundant information.

I don't see COW/reflink/compression as being a problem in this case,
since what cachefiles cares about is whether there is _any_ data for a
given logical offset, not where/how the data is stored.

If FIEMAP were used for a btrfs backing filesystem, it would need the
"EXTENT_DATA_COMPRESSED" feature to be implemented as well, so that it
can distinguish the logical vs. physical allocations.  I don't think that
would be needed for SEEK_HOLE and SEEK_DATA, so long as they handle
unwritten extents properly (and are correctly implemented in the first
place; some filesystems fall back to always returning the next block for
SEEK_DATA).

Cheers, Andreas