Hello,

The following patches are the latest attempt at implementing a fiemap
ioctl, which can be used by userspace software to get extent information
for an inode in an efficient manner.

These patches are against Linus' latest tree. While the core vfs patch
seems to be approaching feature-completeness, most of the series should
still be considered as being incomplete. The fs patches in particular
need some more attention. I think there's enough here however, that it
makes sense to start posting to fsdevel for general comments.

Testing so far has been light, typically consisting of me running a
bare-bones ioctl wrapper program by hand:

http://www.kernel.org/pub/linux/kernel/people/mfasheh/fiemap/tests/

We definitely need some more rigorous testing software, which I believe
Eric is working on. Additionally, a port of the 'filefrag' application
still needs to be completed.

A lot has changed since the last fiemap patch was posted. Mostly, the
vfs<->fs api is more fleshed out, with suitable abstractions and helper
functions to aid implementation of ->fiemap. Some checks were added in
the vfs patch to catch things like overflow, fs limits checks, etc.
Automatic trimming of the request happens now so the fs doesn't have to
worry about ranges being larger than it can handle.

Some changes were also made to the user API with the goal of simplifying
things so that it was easier for client file systems to implement a
callback. My hope is that a simpler API means file systems will provide
->fiemap() quicker, and will be less likely to return results that are
wrong, or worse, slightly different from other implementations.

- Except for 'fm_flags', the various in/out fields on struct fiemap got
  turned into a single 'out' field - the number of mapped extents
  (fm_mapped_extents). This gives the kernel side dealing with struct
  fiemap fewer 'moving parts' to deal with.
- Extent flags were cleaned up, and some new ones got added.
- Instead of forcing the user to add up all extent lengths before a
  given one to figure out its logical offset, an 'fe_logical' field was
  added to fiemap_extent. This is a lot more obvious and straightforward
  in my opinion, and is well worth the tradeoff of a few bytes. It also
  obviates the need to describe holes, as their existence is easily
  implied now. Also, fm_start and fm_length no longer have to be 'out'
  variables, which goes back to the 1st listed change.
- Handling of incompatible flags was simplified to just return -EBADR
  and the set of not-understood flags in fm_flags.
- Documentation/filesystems/fiemap.txt has been added in the 1st patch.

Below this I will include the contents of fiemap.txt to make it more
convenient for folks to get details on the API.
	--Mark


Fiemap Ioctl
============

The fiemap ioctl is an efficient method for userspace to get file
extent mappings. Instead of block-by-block mapping (such as bmap),
fiemap returns a list of extents.


Request Basics
--------------

A fiemap request is encoded within struct fiemap:

struct fiemap {
        __u64   fm_start;          /* logical offset (inclusive) at
                                    * which to start mapping (in) */
        __u64   fm_length;         /* logical length of mapping which
                                    * userspace cares about (in) */
        __u32   fm_flags;          /* FIEMAP_FLAG_* flags for request (in) */
        __u32   fm_extent_count;   /* size of fm_extents array (in) */
        __u32   fm_mapped_extents; /* number of extents that were
                                    * mapped (out) */
        __u32   fm_reserved;
        struct fiemap_extent fm_extents[0];
};

fm_start and fm_length specify the logical range within the file
which the process would like mappings for. Extents returned mirror
those on disk - that is, the logical offset of the 1st returned extent
may start before fm_start, and the range covered by the last returned
extent may end after the end of the requested range (fm_start +
fm_length). All offsets and lengths are in bytes.

Certain flags to modify the way in which mappings are looked up can be
set in fm_flags.
If the kernel doesn't understand some particular flags, it
will return EBADR and the contents of fm_flags will contain the set of
flags which caused the error. If the kernel is compatible with all
flags passed, the contents of fm_flags will be unmodified. It is up to
userspace to determine whether rejection of a particular flag is fatal
to its operation. This scheme is intended to allow the fiemap
interface to grow in the future but without losing compatibility with
old software.

Currently, there are five flags which can be set in fm_flags:

* FIEMAP_FLAG_NUM_EXTENTS
If this flag is set, extent information will not be returned via the
fm_extents array and the value of fm_extent_count will be ignored.
Instead, the total number of extents covering the range will be
returned via fm_mapped_extents. This is useful for programs which only
want to count the number of extents in a file, but don't care about
the actual extent layout.

* FIEMAP_FLAG_SYNC
If this flag is set, the kernel will sync the file before mapping
extents.

* FIEMAP_FLAG_HSM_READ
If the extent is offline, retrieve it before mapping and do not flag
it as FIEMAP_EXTENT_SECONDARY. This flag has no effect if the file
system does not support HSM.

* FIEMAP_FLAG_XATTR
If this flag is set, the extents returned will describe the inode's
extended attribute lookup tree, instead of its data tree.

* FIEMAP_FLAG_LUN_ORDER
If the file system stripes file data, this will return contiguous
regions of physical allocation, sorted by LUN. Logical offsets may not
make sense if this flag is passed. If the file system does not support
multiple LUNs, this flag will be ignored.


Extent Mapping
--------------

Note that all of this is ignored if FIEMAP_FLAG_NUM_EXTENTS is set.

Extent information is returned within the embedded fm_extents array
which userspace must allocate along with the fiemap structure. The
total number of fiemap_extents available should be passed via
fm_extent_count.
The number of extents mapped by the kernel will be
returned via fm_mapped_extents. If the number of fiemap_extents
allocated is less than would be required to map the requested range,
the maximum number of extents that can be mapped in available memory
will be returned and fm_mapped_extents will be equal to
fm_extent_count. In that case, the last extent in the array will not
complete the requested range and will not have the FIEMAP_EXTENT_LAST
flag set (see the next section on extent flags).

Each extent is described by a single fiemap_extent structure as
returned in fm_extents.

struct fiemap_extent {
        __u64   fe_logical;  /* logical offset in bytes for the start of
                              * the extent */
        __u64   fe_physical; /* physical offset in bytes for the start
                              * of the extent */
        __u64   fe_length;   /* length in bytes for the extent */
        __u32   fe_flags;    /* returned FIEMAP_EXTENT_* flags for the extent */
        __u32   fe_lun;      /* logical device number for extent (starting at 0) */
};

All offsets and lengths are in bytes and mirror those on disk - it is
valid for an extent's logical offset to start before the request or
its logical length to extend past the request. Unless
FIEMAP_EXTENT_NOT_ALIGNED is returned, fe_logical, fe_physical and
fe_length will be aligned to the block size of the file system.

The fe_flags field contains flags which describe the extent returned.
A special flag, FIEMAP_EXTENT_LAST, is always set on the last extent in
the file so that the process making fiemap calls can determine when no
more extents are available.

Some flags are intentionally vague and will always be set in the
presence of other more specific flags. This way a program looking for
a general property does not have to know all existing and future flags
which imply that property.

For example, if FIEMAP_EXTENT_DATA_INLINE or FIEMAP_EXTENT_DATA_TAIL
is set, FIEMAP_EXTENT_NOT_ALIGNED will also be set. A program looking
for inline or tail-packed data can key on the specific flag.
Software
that simply wants to avoid operating on non-aligned extents, however,
can just key on FIEMAP_EXTENT_NOT_ALIGNED, and not have to worry about
all present and future flags which might imply unaligned data. Note
that the opposite is not true - it would be valid for
FIEMAP_EXTENT_NOT_ALIGNED to appear alone.

* FIEMAP_EXTENT_LAST
This is the last extent in the file. A mapping attempt past this
extent will return nothing.

* FIEMAP_EXTENT_UNKNOWN
The location of this extent is currently unknown. This may indicate
the data is stored on an inaccessible volume or that no storage has
been allocated for the file yet.

* FIEMAP_EXTENT_SECONDARY
  - This will also set FIEMAP_EXTENT_UNKNOWN.
The data for this extent is in secondary storage.

* FIEMAP_EXTENT_DELALLOC
  - This will also set FIEMAP_EXTENT_UNKNOWN.
Delayed allocation - while there is data for this extent, its
physical location has not been allocated yet.

* FIEMAP_EXTENT_NO_DIRECT
Direct access to the data in this extent is illegal or will have
undefined results.

* FIEMAP_EXTENT_NET
  - This will also set FIEMAP_EXTENT_NO_DIRECT.
The data for this extent is not stored in a locally-accessible device.

* FIEMAP_EXTENT_DATA_COMPRESSED
  - This will also set FIEMAP_EXTENT_NO_DIRECT.
The data in this extent has been compressed by the file system.

* FIEMAP_EXTENT_DATA_ENCRYPTED
  - This will also set FIEMAP_EXTENT_NO_DIRECT.
The data in this extent has been encrypted by the file system.

* FIEMAP_EXTENT_NOT_ALIGNED
Extent offsets and length are not guaranteed to be block aligned.

* FIEMAP_EXTENT_DATA_INLINE
  - This will also set FIEMAP_EXTENT_NOT_ALIGNED.
Data is located within a meta data block.

* FIEMAP_EXTENT_DATA_TAIL
  - This will also set FIEMAP_EXTENT_NOT_ALIGNED.
Data is packed into a block with data from other files.

* FIEMAP_EXTENT_UNWRITTEN
Unwritten extent - the extent is allocated but its data has not been
initialized.
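
To make the userspace side of the extent mapping rules above more
concrete, here is a rough sketch of how a caller might size a request
and interpret a reply. It is only an illustration, not part of the
patches: the two structures are copied from this document, the helper
names (fiemap_alloc, fiemap_hole_bytes, fiemap_next_start) are
hypothetical, and the numeric value of FIEMAP_EXTENT_LAST is a
placeholder because this text does not give flag values. The ioctl call
itself is left out.

```c
#include <stdint.h>
#include <stdlib.h>

/* Placeholder value: this text does not give the real bit. */
#define FIEMAP_EXTENT_LAST 0x00000001u

/* Layouts copied from the structures described in this document. */
struct fiemap_extent {
	uint64_t fe_logical;
	uint64_t fe_physical;
	uint64_t fe_length;
	uint32_t fe_flags;
	uint32_t fe_lun;
};

struct fiemap {
	uint64_t fm_start;
	uint64_t fm_length;
	uint32_t fm_flags;
	uint32_t fm_extent_count;
	uint32_t fm_mapped_extents;
	uint32_t fm_reserved;
	struct fiemap_extent fm_extents[];
};

/* Hypothetical helper: allocate a zeroed request with room for
 * 'count' extents in the embedded fm_extents array. */
static struct fiemap *fiemap_alloc(uint64_t start, uint64_t length,
				   uint32_t count)
{
	struct fiemap *fm;

	fm = calloc(1, sizeof(*fm) +
		       (size_t)count * sizeof(struct fiemap_extent));
	if (!fm)
		return NULL;
	fm->fm_start = start;
	fm->fm_length = length;
	fm->fm_extent_count = count;
	return fm;
}

/* Hypothetical helper: holes are implied by gaps between the end of
 * one returned extent and the fe_logical of the next. */
static uint64_t fiemap_hole_bytes(const struct fiemap *fm)
{
	uint64_t holes = 0;

	for (uint32_t i = 1; i < fm->fm_mapped_extents; i++) {
		const struct fiemap_extent *prev = &fm->fm_extents[i - 1];
		uint64_t prev_end = prev->fe_logical + prev->fe_length;

		if (fm->fm_extents[i].fe_logical > prev_end)
			holes += fm->fm_extents[i].fe_logical - prev_end;
	}
	return holes;
}

/* Hypothetical helper: returns 1 and stores the next fm_start when the
 * array was filled without reaching FIEMAP_EXTENT_LAST, so another
 * call may be needed; returns 0 when mapping is finished. */
static int fiemap_next_start(const struct fiemap *fm, uint64_t *next)
{
	const struct fiemap_extent *last;

	if (fm->fm_mapped_extents == 0)
		return 0;
	last = &fm->fm_extents[fm->fm_mapped_extents - 1];
	if (last->fe_flags & FIEMAP_EXTENT_LAST)
		return 0;
	if (fm->fm_mapped_extents < fm->fm_extent_count)
		return 0;	/* spare slots left: range was fully mapped */
	*next = last->fe_logical + last->fe_length;
	return 1;
}
```

A caller would loop: fill in the request, issue the ioctl, process
fm_extents, and repeat from the offset given by fiemap_next_start()
until it returns 0. When the array is exactly full the sketch cannot
tell whether the range was truncated, so it conservatively asks for
another call.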

VFS -> File System Implementation
---------------------------------

File systems wishing to support fiemap must implement a ->fiemap
callback (on struct inode_operations):

struct inode_operations {
        ...

        int (*fiemap) (struct inode *, struct fiemap_extent_info *,
                       u64 start, u64 len);
        ...
};

->fiemap is passed struct fiemap_extent_info which describes the
fiemap request:

struct fiemap_extent_info {
        unsigned int    fi_flags;          /* Flags as passed from user */
        unsigned int    fi_extents_mapped; /* Number of mapped extents */
        unsigned int    fi_extents_max;    /* Size of fiemap_extent array */
        char            *fi_extents_start; /* Start of fiemap_extent array */
};

It is intended that the file system should only need to access
fi_flags directly. Aside from checking fi_flags to modify callback
behavior, flags which the file system cannot handle can be written
into fieinfo->fi_flags. In this case, the file system *must* return
-EBADR so that ioctl_fiemap() can write them into the userspace
buffer.

For each extent in the request range, the file system should call
the helper function, fiemap_fill_next_extent():

int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
                            u64 phys, u64 len, u32 flags, u32 lun);

fiemap_fill_next_extent() will use the passed values to populate the
next free extent in the fm_extents array. 'General' extent flags will
automatically be set from specific flags on behalf of the calling file
system so that the userspace API is not broken.

fiemap_fill_next_extent() returns 0 on success, and 1 when the
user-supplied fm_extents array is full. If an error is encountered
while copying the extent to user memory, -EFAULT will be returned.

If the request has the FIEMAP_FLAG_NUM_EXTENTS flag set, then calling
this helper is not necessary and fi_extents_mapped can be set
directly.
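
The callback contract described above can be illustrated with a small
userspace model. This is a sketch under stated assumptions, not the
kernel code from the patches: the names model_extent_info,
model_general_flags, model_fill_next_extent, toy_extent and toy_fiemap
are all hypothetical, the flag bit values are placeholders, and the
extent array is ordinary memory, so the -EFAULT path cannot occur here.
It also assumes that 1 is returned once the extent just stored fills
the last slot, which this text does not spell out.

```c
#include <stdint.h>

/* Placeholder bit values: this text does not give the real ones. */
#define FIEMAP_EXTENT_UNKNOWN         0x0002u
#define FIEMAP_EXTENT_SECONDARY       0x0004u
#define FIEMAP_EXTENT_DELALLOC        0x0008u
#define FIEMAP_EXTENT_NO_DIRECT       0x0010u
#define FIEMAP_EXTENT_NET             0x0020u
#define FIEMAP_EXTENT_DATA_COMPRESSED 0x0040u
#define FIEMAP_EXTENT_DATA_ENCRYPTED  0x0080u
#define FIEMAP_EXTENT_NOT_ALIGNED     0x0100u
#define FIEMAP_EXTENT_DATA_INLINE     0x0200u
#define FIEMAP_EXTENT_DATA_TAIL       0x0400u

struct fiemap_extent {
	uint64_t fe_logical;
	uint64_t fe_physical;
	uint64_t fe_length;
	uint32_t fe_flags;
	uint32_t fe_lun;
};

/* Userspace stand-in for struct fiemap_extent_info. */
struct model_extent_info {
	unsigned int fi_flags;
	unsigned int fi_extents_mapped;
	unsigned int fi_extents_max;
	struct fiemap_extent *fi_extents_start;
};

/* Add the 'general' flag implied by each specific flag. */
static uint32_t model_general_flags(uint32_t flags)
{
	if (flags & (FIEMAP_EXTENT_SECONDARY | FIEMAP_EXTENT_DELALLOC))
		flags |= FIEMAP_EXTENT_UNKNOWN;
	if (flags & (FIEMAP_EXTENT_NET | FIEMAP_EXTENT_DATA_COMPRESSED |
		     FIEMAP_EXTENT_DATA_ENCRYPTED))
		flags |= FIEMAP_EXTENT_NO_DIRECT;
	if (flags & (FIEMAP_EXTENT_DATA_INLINE | FIEMAP_EXTENT_DATA_TAIL))
		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
	return flags;
}

/* Model of the helper: 0 on success, 1 when the array is full. */
static int model_fill_next_extent(struct model_extent_info *info,
				  uint64_t logical, uint64_t phys,
				  uint64_t len, uint32_t flags, uint32_t lun)
{
	struct fiemap_extent *dst;

	if (info->fi_extents_mapped >= info->fi_extents_max)
		return 1;
	dst = &info->fi_extents_start[info->fi_extents_mapped++];
	dst->fe_logical = logical;
	dst->fe_physical = phys;
	dst->fe_length = len;
	dst->fe_flags = model_general_flags(flags);
	dst->fe_lun = lun;
	return info->fi_extents_mapped == info->fi_extents_max ? 1 : 0;
}

/* Toy "file system": a sorted in-memory extent table. */
struct toy_extent {
	uint64_t logical, phys, len;
	uint32_t flags;
};

/* Toy ->fiemap: report every extent overlapping [start, start + len).
 * The range is assumed to be already trimmed by the caller. */
static int toy_fiemap(const struct toy_extent *tbl, unsigned int n,
		      struct model_extent_info *info,
		      uint64_t start, uint64_t len)
{
	for (unsigned int i = 0; i < n; i++) {
		int ret;

		if (tbl[i].logical + tbl[i].len <= start)
			continue;	/* entirely before the range */
		if (tbl[i].logical >= start + len)
			break;		/* entirely past the range */
		ret = model_fill_next_extent(info, tbl[i].logical,
					     tbl[i].phys, tbl[i].len,
					     tbl[i].flags, 0);
		if (ret < 0)
			return ret;	/* propagate errors */
		if (ret == 1)
			break;		/* array full: stop, not an error */
	}
	return 0;
}
```

The point of the sketch is the shape of the loop: the file system only
walks its own extents and hands each one to the helper, while slot
accounting and general-flag propagation stay in one place.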