Re: [PATCH v2 5/5] btrfs: fiemap: return extent physical size

Sweet Tea Dorminy <sweettea-kernel@xxxxxxxxxx> · Wed, 3 Apr 2024 03:18:19 -0400

On 4/3/24 01:52, Qu Wenruo wrote:

On 2024/4/3 16:02, Sweet Tea Dorminy wrote:
[...]

fileoff 0, logical len 4k, phy 13631488, phy len 64K
fileoff 4k, logical len 4k, phy 13697024, phy len 4k
fileoff 8k, logical len 56k, phy 13631488 + 8k, phy len 56k

[HOW TO CALCULATE WASTED SPACE IN USER SPACE]
My concern is about the first entry. It indicates that we have wasted
60K (phy len is 64K, while logical len is only 4K).

But that information is not correct, as in reality we only wasted 4K;
the remaining 56K is still referenced by file range [8K, 64K).

Do you mean that a user space program should maintain a mapping of each
utilized physical range, and when handling the reported file range
[8K, 64K), the user space program should find that the physical range
overlaps with an existing extent, and do the calculation accordingly?
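
A rough sketch of the bookkeeping being asked about, assuming the three
entries above arrive as (fileoff, logical len, phy, phy len) tuples and
that no physical byte is referenced by two file ranges. The tuple layout
and the union_length() helper are made up for illustration; they are not
part of any fiemap interface.

```python
K = 1024

# (fileoff, logical_len, phys, phys_len), copied from the example entries.
entries = [
    (0 * K, 4 * K, 13631488, 64 * K),
    (4 * K, 4 * K, 13697024, 4 * K),
    (8 * K, 56 * K, 13631488 + 8 * K, 56 * K),
]


def union_length(ranges):
    """Bytes covered by half-open (start, end) ranges, overlaps counted once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(ranges):
        if cur_end is None or start > cur_end:
            # Disjoint from the range being accumulated: flush it.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current range.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Per-entry subtraction blames the first entry for 60K of waste.
naive_waste = sum(plen - llen for _, llen, _, plen in entries)

# Merging the physical ranges first gives the real figure.
physical = union_length((p, p + plen) for _, _, p, plen in entries)
logical = sum(llen for _, llen, _, _ in entries)
wasted = physical - logical

print(naive_waste // K, physical // K, wasted // K)
```

Here the per-entry subtraction comes to 60K, while merging the physical
ranges first comes to 68K on disk for 64K of file data, i.e. 4K wasted.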

My goal is to give an unprivileged interface for tools like compsize 
to figure out how much space is used by a particular set of files. 
They report the total disk space referenced by the provided list of 
files, currently by doing a tree search (CAP_SYS_ADMIN) for all the 
extents pertaining to the requested files and deduplicating extents 
based on disk_bytenr.

It seems simplest to me for userspace for the kernel to emit the 
entire extent for each part of it referenced in a file, and let 
userspace deal with deduplicating extents. This is also most similar 
to the existing tree-search based interface. Reporting whole extents 
gives more flexibility for userspace to figure out how to report 
bookend extents, or shared extents, or ...
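
As a minimal sketch of that userspace deduplication, assuming a
hypothetical report in which every entry carries the whole extent it
references, and assuming no file range is counted twice (the tuple layout
and disk_usage() are illustrative only, not compsize's actual code):

```python
K = 1024

# Hypothetical whole-extent report for the example file:
# (fileoff, logical_len, extent_start, extent_len).
entries = [
    (0 * K, 4 * K, 13631488, 64 * K),
    (4 * K, 4 * K, 13697024, 4 * K),
    (8 * K, 56 * K, 13631488, 64 * K),
]


def disk_usage(entries):
    """Sum extent lengths, counting each extent start only once."""
    extents = {}
    for _fileoff, _logical_len, start, length in entries:
        extents[start] = length
    return sum(extents.values())


referenced = sum(logical_len for _, logical_len, _, _ in entries)
on_disk = disk_usage(entries)
unused = on_disk - referenced

print(on_disk // K, referenced // K, unused // K)
```

Keying on the extent start plays the role that disk_bytenr plays in the
tree-search approach: the 64K extent is counted once even though two file
ranges point into it.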

That's totally fine. No matter which solution you go with (reporting
exactly as the on-disk file extent, or taking the offset into
consideration), user space always needs to maintain some type of mapping
to calculate the space wasted by bookend extents.

It does seem a little weird that if you use fiemap to request only e.g.
the 4k-16k range in that example file, you'll get all 68k involved
reported, but I can't figure out a way to fix that without having the
kernel keep track of used parts of the extents as part of reporting,
which sounds expensive.

I do not think mapping 4k-16K is a common scenario either, but since you
mentioned it, at least we need a consistent way to emit a fiemap entry.

The tracking part can be done in the user space.

You're right that I'm being inconsistent, taking off extent_offset 
from the reported disk size when that isn't what I should be doing, so 
I fixed that in v3.

[COMPRESSION REPRESENTATION]
The biggest problem other than the complexity in user space is the 
handling of compressed extents.

Should we return the physical bytenr (disk_bytenr of file extent 
item) directly or with the extent offset added?
Either way it doesn't look consistent to me, compared to 
non-compressed extents.

As I understand it, the goal of reporting physical bytenr is to 
provide a number which we could theoretically then resolve into a disk 
location or few if we cared, but which doesn't necessarily have any 
physical meaning. To quote the fiemap documentation page: "It is 
always undefined to try to update the data in-place by writing to the 
indicated location without the assistance of the filesystem". So I 
think I'd prefer to always report the entire size of the entire extent 
being referenced.

The concern is what happens if we have a compressed file extent that is
reflinked to a different part of the file.

Then fiemap returns a different physical bytenr for each reference (since
the offset is added), and a user space tool has no idea they are the same
extent on disk.
Furthermore, if we emit physical + offset directly to user space
(which can be beyond the end of the compressed extent), then there may
also be another, uncompressed extent at that same physical + offset.

Would that lead to bad calculation in user space to determine how many 
bytes are really used?
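
To make the concern concrete, here is a small sketch with made-up
numbers: a hypothetical compressed extent holding 128K of data in 4K on
disk, referenced twice through reflink at different offsets into the
uncompressed data. None of these values come from a real filesystem.

```python
K = 1024

# Hypothetical compressed extent: 128K of data stored in 4K on disk.
DISK_BYTENR = 13631488
DISK_NUM_BYTES = 4 * K

# Offsets (into the uncompressed data) of two reflinked references.
ref_offsets = [0, 64 * K]

# Reporting disk_bytenr + offset: the references look like two extents.
with_offset = {DISK_BYTENR + off for off in ref_offsets}

# Reporting disk_bytenr alone: both references resolve to one extent.
without_offset = {DISK_BYTENR for _ in ref_offsets}

# Offsets that land past the end of the on-disk (compressed) extent.
beyond_extent = [off for off in ref_offsets if off >= DISK_NUM_BYTES]

print(len(with_offset), len(without_offset), beyond_extent)
```

With the offset added, a tool deduplicating on the reported physical
address sees two extents, and the second address lies outside the 4K that
the compressed extent actually occupies.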

[ALTERNATIVE FORMAT]
The other alternative would be following the btrfs ondisk format, 
providing a unique physical bytenr for any file extent, then the 
offset/referred length inside the uncompressed extent.

That would handle compressed and regular extents more consistently, and
be a little easier for a user space tool to handle (really just a tiny
bit easier, no range overlap check needed), but it is more complex to
represent, and I'm not sure if any other filesystem would be happy to
accept extra members they don't care about.

I really want to make sure that this interface reports the unused 
space in e.g. bookend extents well -- compsize has been an important
tool for me in this respect, e.g. a time when a 10g file was taking up 
110g of actual disk space. If we report the entire length of the 
entire extent, then when used on whole files one can establish the 
space referenced by that file but not used; similarly on multiple 
files. So while I like the simplicity of just reporting the used 
length, I don't think there's a way to make compsize unprivileged with 
that approach.

Why not? In user space we just need to maintain a mapping of each 
referred range.

Then we get the real actual disk space, meanwhile the fiemap report is
no different than "btrfs ins dump-tree" for file extents (we have all
the things we need: filepos, length (num_bytes), disk_bytenr,
disk_num_bytes, offset, and ram_bytes).

The fiemap output (in this changeset) has equivalents of filepos, 
length; disk_bytenr + offset, disk_num_bytes - offset -- we don't get 
ram_bytes and we get two computed values from the three dump-tree outputs.
If it were reporting whole extents, it'd be disk_bytenr, disk_num_bytes, 
and we'd be missing offset.
I think we'd need a third piece of information about physical space in 
order to convey all three equivalents of disk_bytenr, disk_num_bytes, 
offset. And without that third piece of information, we can't both match 
up disk extents and also know exactly what disk bytenr data is stored 
at, I think? But maybe you're proposing exactly that, having a third number?
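
A small sketch of why two numbers can't carry all three, using the
mapping described above for this changeset (physical = disk_bytenr +
offset, physical length = disk_num_bytes - offset). The fiemap_view()
helper is hypothetical, and the second triple is an invented standalone
extent chosen to collide with the third example entry.

```python
K = 1024


def fiemap_view(disk_bytenr, disk_num_bytes, offset):
    """Collapse the three dump-tree values into the two reported numbers."""
    return (disk_bytenr + offset, disk_num_bytes - offset)


# Third example entry: 8K into the 64K extent at 13631488.
bookend_ref = fiemap_view(13631488, 64 * K, 8 * K)

# Invented case: a separate 56K extent starting at 13631488 + 8K.
standalone = fiemap_view(13631488 + 8 * K, 56 * K, 0)

# Both triples collapse to the same pair, so the pair cannot be inverted.
print(bookend_ref == standalone)
```

Both triples produce the same (physical, physical length) pair, so a
consumer of that pair alone can't tell a reference into the middle of a
larger extent from a reference to the start of a smaller one.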