On 2021/7/30 2:00 PM, NeilBrown wrote:
On Fri, 30 Jul 2021, Qu Wenruo wrote:
On 2021/7/30 1:25 PM, Qu Wenruo wrote:
On 2021/7/30 10:36 AM, NeilBrown wrote:
I've been pondering all the excellent feedback, and what I have learnt
from examining the code in btrfs, and I have developed a different
perspective.
Great! Some new developers into the btrfs realm!
Maybe "subvol" is a poor choice of name because it conjures up
connections with the Volumes in LVM, and btrfs subvols are very different
things. Btrfs subvols are really just subtrees that can be treated as a
unit for operations like "clone" or "destroy".
As such, they don't really deserve separate st_dev numbers.
Maybe the different st_dev numbers were introduced as a "cheap" way to
extend the size of the inode-number space. Like many "cheap" things, it
has hidden costs.
I forgot another problem already caused by this st_dev method.
Btrfs uses st_dev to distinguish its per-subvolume inode namespaces, and
st_dev is allocated from the anonymous bdev pool. That pool has a limited
size (much smaller than the btrfs subvolume ID namespace), so it's
already causing problems: we can't allocate an anonymous bdev for every
subvolume, and subvolume/snapshot creation fails.
What sort of numbers do you see in practice? How many subvolumes and how
many inodes per subvolume?
Normally the number of "live"(*) subvolumes is below the minor dev number
range (1<<20), so it's not a big deal.
*: Live here means the subvolume has been accessed at least once. A
subvolume that exists but has never been accessed doesn't get an
anonymous bdev number allocated.
But (1<<20) is really small compared to what some real-world users have.
Thus we had some reports of this problem, and changed the point at which
such bdev numbers are allocated.
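
To put the (1<<20) figure in perspective, here is a rough sketch of the arithmetic. MINORBITS has the same value as in the kernel's include/linux/kdev_t.h; the helper names are hypothetical and only illustrate the gap between the minor number space and a 64-bit subvolume ID space, they are not btrfs code.

```c
#include <assert.h>
#include <stdint.h>

/* Same value as MINORBITS in the kernel's include/linux/kdev_t.h. */
#define MINORBITS 20

/* Upper bound on distinct minor numbers available under one major. */
static uint64_t minor_space(void)
{
	return UINT64_C(1) << MINORBITS;
}

/*
 * Hypothetical helper: can every one of `live_subvols` accessed
 * subvolumes hold its own anonymous device number at the same time?
 * This is an upper bound; other users share the same pool.
 */
static int minors_suffice(uint64_t live_subvols)
{
	return live_subvols <= minor_space();
}

/* Number of bits by which an ID space of `id_bits` bits exceeds the minor space. */
static unsigned int shortfall_bits(unsigned int id_bits)
{
	assert(id_bits >= MINORBITS && id_bits <= 64);
	return id_bits - MINORBITS;
}
```

With a 64-bit subvolume ID, shortfall_bits(64) is 44, so only a tiny fraction of the possible subvolumes can be live at once before allocation fails.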
If we allocated some number of bits to each, with over-allocation to
allow for growth, could we fit both into 64 bits?
I don't think it's even possible, as currently we use u32 for dev_t,
which is already way below the theoretical limit (U64_MAX - 512).
Thus AFAIK there is no real way to solve it right now.
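
For illustration only, the bit-allocation idea asked about above could look like the sketch below. pack_ino and the choice of split are hypothetical, not something btrfs implements; the point is that any fixed split has to reject values that are valid in the full 64-bit subvolume ID and inode number spaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical packing: the top `subvol_bits` bits hold the subvolume ID,
 * the remaining bits hold the per-subvolume inode number.
 * Returns false when either value does not fit in its share.
 */
static bool pack_ino(uint64_t subvol, uint64_t ino,
		     unsigned int subvol_bits, uint64_t *out)
{
	assert(subvol_bits > 0 && subvol_bits < 64);

	unsigned int ino_bits = 64 - subvol_bits;

	if (subvol >> subvol_bits)	/* subvolume ID too large */
		return false;
	if (ino >> ino_bits)		/* inode number too large */
		return false;

	*out = (subvol << ino_bits) | ino;
	return true;
}
```

With a 24/40 split, for example, subvolume 5 and inode 257 pack fine, but a subvolume ID of 1<<24 or an inode number of 1<<40 is rejected even though both are legal values on their own.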
Thanks,
Qu
NeilBrown
Thus it's really time to reconsider how we should export this info to
user space.
Thanks,
Qu