proposed draft for ext4 reflink

mingming cao <mingming.cao@xxxxxxxxxx> · Fri, 09 May 2014 16:40:48 -0700

Hello,

I have been thinking about adding reflink support to the ext4 filesystem.
Reflink lets multiple files share the same data blocks, which is very
useful for taking snapshots and doing backups.  When the reflink command
is called, a new file/inode is created, but the new file points to the
same data blocks as the original file.  When the new file's data needs
to change, copy on write is triggered.  Other filesystems such as btrfs
and OCFS2 already support reflink, and it seems interesting to add this
feature to ext4 as well.

Here is the first draft of the ext4 reflink design text. I am sending it
out in the hope of getting more feedback and suggestions. Thanks!

Mingming

ext4 reflink overview
=====================

In the current ext4 filesystem, a data block can belong to only one
inode at a time. Reflink breaks this rule: the same data block can be
shared by multiple files.  The key issue is how to avoid freeing blocks
that are still used by other inodes (reflinked files).  We need to keep
track of the usage of data blocks using reference counters.

Using a refcount to track block usage is pretty straightforward.  When
reflink is called, the refcounts of the shared data blocks are
increased. Reads need no action, but if one inode starts to write to a
shared data block, copy on write happens first: the refcount of the
original data block is decreased and a new data block is allocated for
the modified data. The refcount is also decreased whenever one of the
inodes frees its data, and only when the refcount drops to zero can the
filesystem safely reclaim the data block.
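
The lifecycle above can be sketched as a toy model in Python. This is
only an illustration of the counting rules, not ext4 code: the class
BlockStore, its bump allocator, and all method names are hypothetical.

```python
from collections import Counter


class BlockStore:
    """Toy model: a per-block reference counter plus a trivial allocator."""

    def __init__(self):
        self.refcount = Counter()  # physical block -> number of owners
        self.next_free = 0         # hypothetical bump allocator

    def alloc(self):
        """Allocate a fresh block owned by exactly one file."""
        blk = self.next_free
        self.next_free += 1
        self.refcount[blk] = 1
        return blk

    def reflink(self, src_blocks):
        """Share every block of the source file with a new file."""
        for blk in src_blocks:
            self.refcount[blk] += 1
        return list(src_blocks)

    def write(self, blocks, idx):
        """Write to logical block idx, copying first if it is shared."""
        old = blocks[idx]
        if self.refcount[old] > 1:
            self.refcount[old] -= 1      # drop our reference to the shared block
            blocks[idx] = self.alloc()   # COW: private block for the new data
        return blocks[idx]

    def free(self, blk):
        """Drop one reference; return True if the block can be reclaimed."""
        self.refcount[blk] -= 1
        if self.refcount[blk] == 0:
            del self.refcount[blk]
            return True
        return False


store = BlockStore()
a = [store.alloc(), store.alloc()]  # file A owns blocks 0 and 1
b = store.reflink(a)                # file B shares both blocks
store.write(b, 0)                   # B writes: it gets a private block
```

After the write, file A still maps its original two blocks, file B maps
one private block and one shared block, and only the shared block has a
refcount above one.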

The key question is where to store the refcount for each shared data
block. The current ext4 block bitmaps only record whether a block is
free or not, so we need to store the refcount somewhere else.

There are multiple options. I started with option 1), and option 2)
came up when the data checksumming feature was being planned.  Option 2)
sounds more straightforward, and I will list both so we can discuss
which would be the better solution.

Option 1) Dynamic per-reflinked files refcount
==============================================

For each group of reflinked files, we could have a dynamically allocated
shared refcount tree. The tree hangs off the inodes that share the same
set of blocks and is indexed by physical block number. The files sharing
blocks use this tree to look up the reference count and determine
whether a block is safe to free or needs a copy on write.

COW (Copy On Write) and refcount tree
-------------------------------------

We would need an extent-like structure to store a refcount record
{physical block number, len, refcount} in the refcount tree.

When reflink is called, a new inode is created, and the extent tree is
copied from the original inode to the new inode.  If the original inode
already has a refcount tree, the refcount for each extent is increased.
If not, a refcount tree is created and both inodes point to the same
refcount tree.  Every extent gets a refcount record, which is inserted
into the refcount tree.  At the very beginning, the refcount records map
1:1 to the extent structures. This changes as the inodes start to write.
When an inode wants to overwrite a shared block, copy on write happens:
a new block is allocated before the write, and the original extent data
remains untouched.  The original refcount record needs to be updated
accordingly after COW.  If the inode overwrites only part of an extent,
the refcount record needs to be split and the refcount decreased for the
changed portion of the extent.  The refcount for the portion still
shared by the inodes remains the same.
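
The record split on a partial overwrite can be sketched as follows. This
is a minimal Python illustration under stated assumptions, not a proposed
on-disk format: the names RefRec and cow_decrement are hypothetical, the
field called len in the text is spelled length here, and dropping a piece
whose refcount reaches zero is one possible policy among several.

```python
from dataclasses import dataclass


@dataclass
class RefRec:
    pblk: int      # first physical block of the run
    length: int    # number of blocks in the run
    refcount: int  # number of inodes sharing the run


def cow_decrement(rec, start, count):
    """Split rec around [start, start + count) and drop one reference there.

    Returns the resulting records in physical-block order. A piece whose
    refcount would reach zero is omitted, since nothing references it.
    """
    end = start + count
    rec_end = rec.pblk + rec.length
    if not (rec.pblk <= start and count > 0 and end <= rec_end):
        raise ValueError("range is not inside the record")

    out = []
    if start > rec.pblk:  # untouched head keeps the old refcount
        out.append(RefRec(rec.pblk, start - rec.pblk, rec.refcount))
    if rec.refcount > 1:  # overwritten middle loses one reference
        out.append(RefRec(start, count, rec.refcount - 1))
    if end < rec_end:     # untouched tail keeps the old refcount
        out.append(RefRec(end, rec_end - end, rec.refcount))
    return out


# Two inodes share an 8-block extent; one of them overwrites 2 blocks.
rec = RefRec(pblk=1000, length=8, refcount=2)
pieces = cow_decrement(rec, start=1002, count=2)
```

One record becomes three here, which is the source of the fragmentation
concern: each small overwrite in the middle of a shared extent can add up
to two records to the tree.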

In the worst case, the refcount tree becomes very fragmented if an inode
keeps rewriting after reflink. Imagine one inode rewriting its data 4k
at a time after being reflinked by another inode. At a certain point, we
may need to allow larger chunks of COW, or even trigger a copy of the
whole file data if fragmentation gets worse.

The refcount tree could be a btree, which makes insert, search, and
similar operations easy. Since the tree is shared by the reflinked
files, we would need a lock to guard operations on it.

Since this is important metadata, we would want to add checksums for the
refcount index and leaf blocks where the refcount records are stored.

Link refcount tree to inodes
----------------------------

The root of the refcount tree is pointed to from the inodes that are
reflinked.  At the time of the reflink, the address of the refcount tree
root is linked from the inodes. One way to store the location of the
refcount tree is to use extended attributes. The extended attributes
have to be copied to the new reflinked file first. The location of the
refcount tree root block is stored as two extended attributes (32 bits
each). We could also store the address of the refcount tree in the
inode's extra space (extra_isize).
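
One way to read "two extended attributes (32 bits)" is as the low and
high halves of a 64-bit physical block number. That reading is an
assumption, and the helper names below are hypothetical. A minimal Python
sketch of the split and of a little-endian encoding (ext4 stores on-disk
integers little-endian):

```python
import struct


def split_block_addr(pblk):
    """Return (lo, hi) 32-bit halves of a 64-bit physical block number."""
    if not 0 <= pblk < 1 << 64:
        raise ValueError("block number does not fit in 64 bits")
    return pblk & 0xFFFFFFFF, pblk >> 32


def join_block_addr(lo, hi):
    """Rebuild the 64-bit block number from its two halves."""
    return (hi << 32) | lo


def encode_half(half):
    """Encode one 32-bit half as a little-endian attribute value."""
    return struct.pack("<I", half)


def decode_half(value):
    """Decode a 4-byte little-endian attribute value."""
    return struct.unpack("<I", value)[0]


root = 0x123456789                # hypothetical refcount tree root block
lo, hi = split_block_addr(root)   # halves to store in the two attributes
restored = join_block_addr(decode_half(encode_half(lo)),
                           decode_half(encode_half(hi)))
```

Whatever the final encoding is, both halves have to be copied together
at reflink time so the new inode never sees a torn root address.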

Liu Zheng's project quota proposal also looks for space in the ext4
inode to store a project id. Expanding the inode's extra_isize impacts
all files, so this is not optimal.

I haven’t thought much about what we need to do on the e2fsprogs side,
but it would require teaching fsck to understand the refcount tree.

Option 2) Static filesystem-wide per-block refcount
===================================================

The most straightforward way is to create a refcount record for every
data block in the filesystem.  Similar to the data checksumming feature
proposed earlier, each blockgroup could keep, along with its other
metadata, a per-block metadata record for the blocks in that blockgroup,
storing refcounts, back references, and data checksums.

This works well if the data checksumming feature goes in this direction
(adding per-data-block metadata), since we could just add two bytes for
the block refcount. Getting the block refcount takes only O(1) time if
the per-block metadata is allocated statically, so there is very little
impact on performance.
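
The O(1) claim follows from simple index arithmetic when the records are
laid out statically per blockgroup. A rough Python sketch, assuming 32768
blocks per group (the usual value with 4k blocks) and a record that holds
only the two-byte refcount; a real per-block record would also carry the
checksum and back reference, and the function name is hypothetical:

```python
BLOCKS_PER_GROUP = 32768  # assumed: 8 * 4096 bits in one block bitmap
RECORD_SIZE = 2           # assumed: refcount only, two bytes per block


def refcount_location(pblk, first_data_block=0):
    """Map a physical block number to (group, byte offset in that group's table)."""
    group, index = divmod(pblk - first_data_block, BLOCKS_PER_GROUP)
    return group, index * RECORD_SIZE


# Block 100000 lives in group 3, at index 1696 within the group.
location = refcount_location(100000)
```

No tree walk or search is involved: one division gives the blockgroup
and one multiplication gives the offset of the record.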

The downside is the extra space cost for blocks that are not shared.
Unlike the data checksumming feature, the refcount only matters for
blocks shared among reflinked inodes. Secondly, it would not be as
efficient as a per-extent refcount, since we would need to track
refcounts per block instead of at the larger extent granularity.
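
To put a number on the space cost, here is a small Python calculation.
It assumes 4k blocks and a two-byte refcount per block, and ignores any
other per-block metadata; the function name is hypothetical.

```python
def refcount_overhead_bytes(fs_bytes, block_size=4096, record_size=2):
    """Space taken by one refcount record per block in the filesystem."""
    return (fs_bytes // block_size) * record_size


one_tib = 1 << 40
overhead = refcount_overhead_bytes(one_tib)  # bytes of refcounts for 1 TiB
overhead_mib = overhead // (1 << 20)         # same figure in MiB
fraction = overhead / one_tib                # share of the filesystem
```

Under these assumptions a 1 TiB filesystem spends 512 MiB on refcounts,
about 0.05% of its capacity, whether or not any block is ever shared.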

Overall, this is just a draft to show my thoughts about implementing
reflink for the ext4 filesystem. I am sure there are lots of other
things that I might have missed or haven't thought through. I am looking
for suggestions, criticism, and discussion, and hopefully this can be a
good start.

Mingming