proposed draft for ext4 reflink


 



Hello,

I have been thinking about adding reflink support to the ext4 filesystem.
A reflink lets multiple files share the same data blocks, which is very useful for taking snapshots and doing backups. When the reflink command is called, a new file/inode is created, but the new file points to the same data blocks as the original file. When the new file's data needs to change, copy-on-write is triggered. Other filesystems such as btrfs and OCFS2 already support reflink, and it seems worthwhile to add this feature to ext4 as well.


Here is the first draft of the ext4 reflink design text. I am sending it out in the hope of getting more feedback and suggestions. Thanks!

Mingming


ext4 reflink overview
=====================

In the current ext4 filesystem, a data block can belong to only one inode at a time. Reflink breaks this rule: the same data block can be shared by multiple files. The key issue is how to avoid freeing blocks that are still used by other inodes (reflinked files). We need to track the usage of data blocks with reference counters.

Using a refcount to track block usage is fairly straightforward. When reflink is called, the refcounts of the shared data blocks are increased. A read needs no action, but if one inode starts to write to a shared data block, copy-on-write happens first: the refcount of the original data block is decreased and a new data block is allocated for the modified data. The refcount is also decreased whenever one of the inodes frees its data, and only when the refcount drops to zero can the filesystem safely reclaim the data block.
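
The lifecycle above can be sketched as a small in-memory model. This is only an illustration of the counting rules, not ext4 code; the class and method names (BlockRefcounts, allocate, reflink, put, write) are hypothetical, and a plain dict stands in for whatever on-disk structure ends up holding the counters.

```python
class BlockRefcounts:
    """Toy model: one refcount per physical block number."""

    def __init__(self):
        self.refcount = {}        # physical block -> number of owners
        self.free_blocks = set()  # blocks handed back to the allocator

    def allocate(self, block):
        # A freshly allocated block has exactly one owner.
        self.refcount[block] = 1
        self.free_blocks.discard(block)

    def reflink(self, blocks):
        # Sharing existing blocks with a new inode bumps each refcount.
        for block in blocks:
            self.refcount[block] += 1

    def put(self, block):
        # One owner drops the block; reclaim it only when nobody is left.
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free_blocks.add(block)

    def write(self, block, new_block):
        # Return the block the writer must use. A shared block triggers
        # copy-on-write into new_block; an unshared one is written in place.
        if self.refcount[block] > 1:
            self.put(block)
            self.allocate(new_block)
            return new_block
        return block
```

With block 100 shared by two inodes, the first write is redirected to a new block and leaves block 100 with a single owner; a later write by that owner goes in place.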

The key question is where to store the refcount for each shared data block. The current ext4 block bitmaps only record whether a block is free or in use, so the refcount has to be stored somewhere else.

There are multiple options. I started with option 1), and option 2) came up when the data checksumming feature was being planned. Option 2) sounds more straightforward; I list both so we can discuss which would be the better solution.

Option 1) Dynamic per-reflinked files refcount
==============================================

For each group of reflinked files, we could have a dynamically allocated shared refcount tree. This tree hangs off the inodes that share the same set of blocks and is indexed by physical block number. The files sharing blocks use this tree to look up the reference count of a block and determine whether the block is safe to free or needs a copy-on-write.

COW (Copy On Write) and refcount tree
-------------------------------------

We would need an extent-like structure to store each refcount record, {physical block number, len, refcount}, in the refcount tree.
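
As a rough sketch of what such a record could look like on disk: the field widths below (64-bit start block, 32-bit length, 32-bit refcount) are assumptions for illustration only, as is the RefcountRecord name. Little-endian is used because ext4 on-disk structures are little-endian.

```python
import struct
from dataclasses import dataclass

# Assumed layout, not a proposed format:
# 64-bit physical start block, 32-bit length in blocks, 32-bit refcount.
REC_FORMAT = "<QII"
REC_SIZE = struct.calcsize(REC_FORMAT)


@dataclass
class RefcountRecord:
    start: int     # first physical block covered by the record
    length: int    # number of contiguous blocks
    refcount: int  # number of inodes referencing these blocks

    def pack(self) -> bytes:
        return struct.pack(REC_FORMAT, self.start, self.length, self.refcount)

    @classmethod
    def unpack(cls, raw: bytes) -> "RefcountRecord":
        return cls(*struct.unpack(REC_FORMAT, raw))
```

With this assumed layout a record takes 16 bytes, so a 4k leaf block could hold up to 256 records before any header or checksum space is subtracted.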

When reflink is called, a new inode is created and the extent tree is copied from the original inode to the new inode. If the original inode already has a refcount tree, the refcount for each extent is increased. If not, a refcount tree is created and both inodes point to the same refcount tree. Every extent gets a refcount record that is inserted into the refcount tree. At the very beginning, the refcount records map 1:1 to the extent structures. This changes as the inodes start to write.

When an inode wants to overwrite a shared block, copy-on-write happens: a new block is allocated before the write, and the original extent data remains untouched. The original refcount record needs to be updated accordingly after COW. If the inode overwrites only part of an extent, the refcount record needs to be split, and the refcount is decreased only for the changed portion of the extent. The refcount for the portion still shared by the inodes remains the same.
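
The record split on a partial overwrite can be sketched as follows. This is an illustration under stated assumptions, not a proposed implementation: records are plain (start, len, refcount) tuples, cow_split is a hypothetical helper name, and the overwritten range is assumed to lie entirely inside one record.

```python
def cow_split(record, start, length):
    """Split one refcount record after one sharer does COW on
    the physical range [start, start + length).

    record is a (rec_start, rec_len, refcount) tuple. Returns the
    records that replace it, in ascending block order.
    """
    rec_start, rec_len, refcount = record
    rec_end = rec_start + rec_len
    end = start + length
    if length <= 0 or not (rec_start <= start and end <= rec_end):
        raise ValueError("COW range must lie inside the record")

    pieces = []
    if start > rec_start:
        # Head of the extent: still shared, refcount unchanged.
        pieces.append((rec_start, start - rec_start, refcount))
    if refcount - 1 > 0:
        # Overwritten portion: the writer dropped its reference.
        pieces.append((start, length, refcount - 1))
    if end < rec_end:
        # Tail of the extent: still shared, refcount unchanged.
        pieces.append((end, rec_end - end, refcount))
    return pieces
```

For example, if two inodes share an 8-block extent starting at physical block 1000 and one of them overwrites blocks 1002-1003, the single record (1000, 8, 2) becomes (1000, 2, 2), (1002, 2, 1) and (1004, 4, 2).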

In the worst case the refcount tree becomes very fragmented if an inode keeps rewriting after reflink. Imagine one inode rewriting 4k at a time after being reflinked by another inode. At some point we may need to allow COW in larger chunks, or even trigger a copy of the whole file data if fragmentation gets worse.
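
To get a feel for how bad this can get, here is a small counting sketch. It assumes one extent shared by two inodes, and that adjacent blocks with the same refcount are kept in a single record; the function name records_after_rewrites is hypothetical.

```python
def records_after_rewrites(extent_blocks, rewritten_offsets):
    """Count refcount records left for one extent shared by two inodes
    after one inode has done COW on the given block offsets.

    Assumption: adjacent blocks with equal refcount share one record.
    """
    counts = [2] * extent_blocks
    for offset in rewritten_offsets:
        counts[offset] = 1  # the writer dropped its reference here

    records = 1
    for i in range(1, extent_blocks):
        if counts[i] != counts[i - 1]:
            records += 1  # a refcount change starts a new record
    return records
```

For a 32768-block extent (128 MB with 4k blocks), rewriting every other block turns one record into 32768 records, one per block, which is the same granularity as the per-block scheme in option 2.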

The refcount tree could be a btree, which makes insert, search and similar operations easy. Since this tree is shared by the reflinked files, we would need a lock to guard operations on the tree.
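
A minimal sketch of the lookup side, with a sorted list searched by bisection standing in for the btree and a single lock guarding every operation. The RefcountTree class and its method names are hypothetical, and records are assumed not to overlap.

```python
import bisect
import threading


class RefcountTree:
    """Sorted-list stand-in for a btree keyed by physical start block."""

    def __init__(self):
        self._lock = threading.Lock()  # shared by all reflinked inodes
        self._starts = []              # sorted physical start blocks
        self._records = {}             # start -> (length, refcount)

    def insert(self, start, length, refcount):
        with self._lock:
            if start not in self._records:
                bisect.insort(self._starts, start)
            self._records[start] = (length, refcount)

    def lookup(self, block):
        """Return the refcount covering `block`, or 0 if no record does."""
        with self._lock:
            # Find the last record that starts at or before `block`.
            i = bisect.bisect_right(self._starts, block) - 1
            if i < 0:
                return 0
            start = self._starts[i]
            length, refcount = self._records[start]
            return refcount if block < start + length else 0
```

A result greater than 1 means a write needs COW and a free must only drop the reference; a result of 1 means the caller is the only owner.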

Since this is important metadata, we would want to add checksums to the refcount index blocks and to the leaf blocks where the refcount records are stored.
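
As an illustration only, a checksummed tree block could carry a 4-byte trailer computed over the rest of the block. The layout and the helper names seal_block and verify_block are assumptions, and zlib.crc32 is used here purely as a stand-in because it ships with Python; ext4's existing metadata checksums use crc32c.

```python
import struct
import zlib

BLOCK_SIZE = 4096
CSUM_SIZE = 4  # assumed: 32-bit checksum stored in the last 4 bytes


def seal_block(payload: bytes) -> bytes:
    """Pad the payload to a full block and append its checksum."""
    if len(payload) > BLOCK_SIZE - CSUM_SIZE:
        raise ValueError("payload does not fit in one block")
    body = payload.ljust(BLOCK_SIZE - CSUM_SIZE, b"\0")
    # zlib.crc32 is a stand-in for crc32c in this sketch.
    return body + struct.pack("<I", zlib.crc32(body))


def verify_block(block: bytes) -> bool:
    """Recompute the checksum over the body and compare with the trailer."""
    body = block[:-CSUM_SIZE]
    (stored,) = struct.unpack("<I", block[-CSUM_SIZE:])
    return zlib.crc32(body) == stored
```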


Link refcount tree to inodes
----------------------------

The root of the refcount tree is pointed to from the reflinked inodes. At reflink time, the address of the refcount tree root is recorded in the inodes. One way to store the location of the refcount tree is to use extended attributes. The extended attributes would have to be copied to the new reflinked file first. The location of the refcount root block would be stored as two 32-bit extended attributes. Alternatively, we could store the address of the refcount tree in the inode's extra space (extra_isize).
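
One reading of the "two 32-bit extended attributes" idea is that a 64-bit physical block number is split into a low and a high half. A sketch under that assumption; the attribute names refcount_root_lo and refcount_root_hi are made up for illustration, and a dict stands in for the inode's extended attributes.

```python
import struct


def encode_root(block: int) -> dict:
    """Split a 64-bit root block number into two 32-bit attribute values."""
    if not 0 <= block < 2**64:
        raise ValueError("block number must fit in 64 bits")
    return {
        # Hypothetical attribute names, little-endian 32-bit values.
        "refcount_root_lo": struct.pack("<I", block & 0xFFFFFFFF),
        "refcount_root_hi": struct.pack("<I", block >> 32),
    }


def decode_root(attrs: dict) -> int:
    """Rebuild the root block number from the two attribute values."""
    (lo,) = struct.unpack("<I", attrs["refcount_root_lo"])
    (hi,) = struct.unpack("<I", attrs["refcount_root_hi"])
    return (hi << 32) | lo
```

Copying both attributes to the new inode at reflink time is then enough for the two files to find the same tree.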

Liu zheng's project quota proposal is also looking for space in the ext4 inode, to store the project id. Expanding the inode's extra_isize affects all files, so this is not optimal.

I haven’t thought much about what we need to do on the e2fsprogs side, but it would require teaching fsck to understand the refcount tree.





Option 2) Static filesystem-wide per-block refcount
===================================================

The most straightforward way is to create a refcount record for every data block in the filesystem. Similar to the data checksumming feature proposed earlier, we could have a per-block metadata record, stored along with the other blockgroup metadata, that holds the refcount, back reference and data checksum. Each blockgroup would hold the per-block records for the blocks in that blockgroup.

This works well if the data checksumming feature goes in this direction (adding per-data-block metadata), since we could just add two bytes for the block refcount. Getting the block refcount takes only O(1) time if the per-block metadata is allocated statically, so there is very little impact on performance.
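
The O(1) lookup is just index arithmetic. A sketch, assuming 4k blocks with 32768 blocks per group, a first data block of 0, and a flat array of two-byte counters per blockgroup; where that array would live inside the group is left open, and refcount_location is a hypothetical name.

```python
BLOCKS_PER_GROUP = 32768  # assumed: 4k blocks, one bitmap block per group
REFCOUNT_BYTES = 2        # two bytes per block, as suggested above


def refcount_location(block: int):
    """Return (blockgroup, byte offset into that group's refcount array)."""
    group, index = divmod(block, BLOCKS_PER_GROUP)
    return group, index * REFCOUNT_BYTES


def refcount_array_bytes() -> int:
    """Size of one blockgroup's refcount array."""
    return BLOCKS_PER_GROUP * REFCOUNT_BYTES
```

Under these assumptions each blockgroup needs a 64 KB array, that is 16 blocks out of 32768.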

The downside is the extra space cost for blocks that are not shared. Unlike data checksumming, the refcount only matters for blocks that are shared between reflinked inodes. Secondly, it would not be as efficient as a per-extent refcount, since we would need to track refcounts per block instead of at the larger extent granularity.
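
The space cost is easy to estimate. A small sketch, assuming 4k blocks and the two-byte counter mentioned above; the function name is hypothetical.

```python
def refcount_table_bytes(fs_bytes, block_size=4096, refcount_bytes=2):
    """Total space used by static per-block refcounts for a filesystem."""
    return (fs_bytes // block_size) * refcount_bytes


def overhead_fraction(block_size=4096, refcount_bytes=2):
    """Fraction of filesystem space spent on the refcounts."""
    return refcount_bytes / block_size
```

For a 1 TB filesystem this comes to 512 MB, about 0.05% of the space, and it is paid whether or not any file is ever reflinked.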



Overall, this is just a draft to show my thoughts on implementing reflink for the ext4 filesystem. I am sure there are lots of other things that I have missed or haven't thought through. I am looking for suggestions, criticism and discussion, and hopefully this is a good start.

Mingming



