Hello,
I have been thinking about adding reflink support to the ext4 filesystem.
Reflink lets multiple files share the same data blocks. This is very
useful for taking snapshots and doing backups. When the reflink command
is called, a new file/inode is created, but the new file points to the
same data blocks as the original file. When the new file's data needs
to change, copy on write is triggered. Other filesystems such as btrfs
and OCFS2 already have reflink support, and it seems interesting to add
this feature to ext4 as well.
Here is the first draft of the ext4 reflink design text. I am sending it
out in the hope of getting more feedback and suggestions. Thanks!
Mingming
ext4 reflink overview
=====================
In the current ext4 filesystem, a data block can be used by only one
inode at a time. Reflink breaks this rule: the same data block can be
shared by multiple files. The key issue is how to avoid freeing blocks
that are still used by other inodes (reflinked files). We need to keep
track of the usage of data blocks using reference counters.
Using a refcount to track block usage is pretty straightforward. When
reflink is called, the refcounts of the shared data blocks are
increased. Reads need no extra action, but if an inode starts to write
to a shared data block, copy on write happens first: a new data block is
allocated for the modified data and the refcount of the original data
block is decreased accordingly. The refcount is also decreased whenever
one of the inodes frees its data, and only when the refcount drops to
zero can the filesystem safely reclaim the data block.
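The lifecycle described above can be sketched as a toy in-memory model.
This is only an illustration, not kernel code: the names BlockPool,
reflink and write_block are made up for this example, and a file is
modelled as a plain list of physical block numbers.

```python
class BlockPool:
    """Toy allocator that keeps one refcount per allocated block."""

    def __init__(self):
        self.next_block = 0
        self.refcount = {}    # physical block -> number of owners
        self.free_list = []   # blocks reclaimed once their refcount hit 0

    def alloc(self):
        if self.free_list:
            blk = self.free_list.pop()
        else:
            blk = self.next_block
            self.next_block += 1
        self.refcount[blk] = 1
        return blk

    def put(self, blk):
        """Drop one reference; reclaim the block only when none are left."""
        self.refcount[blk] -= 1
        if self.refcount[blk] == 0:
            del self.refcount[blk]
            self.free_list.append(blk)


def reflink(pool, src_blocks):
    """Create a second file sharing every block of src_blocks."""
    for blk in src_blocks:
        pool.refcount[blk] += 1
    return list(src_blocks)


def write_block(pool, blocks, index):
    """Return the block to write into, doing copy on write if it is shared."""
    old = blocks[index]
    if pool.refcount[old] > 1:
        blocks[index] = pool.alloc()   # new private block for the writer
        pool.put(old)                  # the original loses one reference
    return blocks[index]
```

In this model a read never touches the refcounts, a write to a shared
block moves the writer to a fresh block, and a block only reaches the
free list after its last owner calls put().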
The key question is where to store the refcount for each shared data
block. The current ext4 block bitmaps only record whether a block is
free or not, so we need to store the refcount somewhere else.
There are multiple options. I started with option 1), and option 2)
came up when the data checksumming feature was being planned. Option 2)
sounds more straightforward. I list both so we can discuss which would
be the better solution.
Option 1) Dynamic per-reflinked-file refcount
=============================================
For each group of reflinked files, we could have a dynamically allocated
shared refcount tree. The tree hangs off the inodes that share the same
set of blocks and is indexed by physical block number. The files sharing
blocks use this tree to look up the reference counter and determine
whether a block is safe to free or needs a copy-on-write.
COW (Copy On Write) and refcount tree
-------------------------------------
We would need an extent-like structure to store a {physical block
number, len, refcount} record in the refcount tree.
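A minimal sketch of such a record follows. The field widths (64-bit
start, 32-bit length, 32-bit refcount) and the little-endian packing are
assumptions made for illustration only; the draft does not fix an
on-disk layout.

```python
import struct
from dataclasses import dataclass

# Hypothetical layout: 64-bit physical start, 32-bit length in blocks,
# 32-bit refcount, little-endian. These widths are an assumption.
RECORD_FMT = "<QII"
RECORD_SIZE = struct.calcsize(RECORD_FMT)


@dataclass
class RefcountRecord:
    pstart: int     # first physical block covered by the record
    length: int     # number of contiguous blocks
    refcount: int   # number of inodes referencing these blocks

    def covers(self, pblk):
        return self.pstart <= pblk < self.pstart + self.length

    def pack(self):
        return struct.pack(RECORD_FMT, self.pstart, self.length, self.refcount)

    @classmethod
    def unpack(cls, raw):
        return cls(*struct.unpack(RECORD_FMT, raw))
```

With these widths a record takes 16 bytes, so a 4k leaf block could hold
roughly 250 of them after leaving room for a header.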
When reflink is called, a new inode is created and the extent tree is
copied from the original inode to the new inode. If the original inode
already has a refcount tree, the refcount for each extent is increased.
If not, a refcount tree is created and both inodes point to the same
refcount tree. Every extent gets a refcount record, which is inserted
into the refcount tree. At the very beginning, the refcount records map
1:1 to the extent structures.
This changes as the inodes start to write. When an inode wants to
overwrite a shared block, copy on write happens: a new block is
allocated before the write and the original extent data remains
untouched. The original refcount record needs to be updated accordingly
after COW. If the inode overwrites only part of an extent, the refcount
record needs to be split and the refcount decreased for the changed
portion of the extent. The refcount for the portion still shared by the
inodes remains the same.
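The record split on a partial overwrite can be sketched as below. This
is a simplified model that assumes the refcount records are kept as a
sorted list of non-overlapping (pstart, length, refcount) tuples; the
function name drop_reference is made up for the example.

```python
def drop_reference(records, start, length):
    """Drop one reference on physical blocks [start, start + length).

    records: sorted, non-overlapping (pstart, length, refcount) tuples.
    Returns (new_records, freed), where freed lists the (pstart, length)
    ranges whose refcount reached zero.
    """
    end = start + length
    out, freed = [], []
    for rstart, rlen, rc in records:
        rend = rstart + rlen
        lo, hi = max(rstart, start), min(rend, end)
        if lo >= hi:                       # no overlap: record unchanged
            out.append((rstart, rlen, rc))
            continue
        if rstart < lo:                    # untouched head keeps its count
            out.append((rstart, lo - rstart, rc))
        if rc > 1:                         # overlapped part loses one ref
            out.append((lo, hi - lo, rc - 1))
        else:                              # last reference gone
            freed.append((lo, hi - lo))
        if hi < rend:                      # untouched tail keeps its count
            out.append((hi, rend - hi, rc))
    return out, freed
```

For example, a 16-block extent shared by two inodes, with one inode
overwriting 4 blocks in the middle, ends up as three records: the head
and tail still at refcount 2 and the middle at refcount 1.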
In the worst case the refcount tree becomes very fragmented if an inode
keeps rewriting after reflink. Imagine one inode rewriting its data 4k
at a time after being reflinked by another inode. At some point we may
need to allow COW in larger chunks, or even trigger a copy of the whole
file data if fragmentation gets worse.
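To get a feel for the trade-off, here is a rough simulation. It assumes
a single shared extent, single-block writes, and a hypothetical policy
that rounds each COW up to an aligned chunk of chunk_blocks blocks. None
of these policy details are part of the draft.

```python
def simulate_cow(extent_blocks, written, chunk_blocks):
    """Return (refcount_records, copied_blocks) for one shared extent.

    extent_blocks: length of the shared extent in blocks (must be > 0).
    written:       offsets of the blocks one inode overwrites.
    chunk_blocks:  COW granularity; 1 means copy only the block written.
    """
    cowed = [False] * extent_blocks
    for blk in written:
        first = (blk // chunk_blocks) * chunk_blocks
        for b in range(first, min(first + chunk_blocks, extent_blocks)):
            cowed[b] = True
    # Each run of consecutive blocks in the same state needs one record.
    records = 1 + sum(
        1 for i in range(1, extent_blocks) if cowed[i] != cowed[i - 1]
    )
    return records, sum(cowed)
```

With a 256-block (1 MiB) extent and every other 4k block rewritten,
per-block COW leaves 256 refcount records and copies 128 blocks, while a
16-block COW chunk leaves a single record but copies all 256 blocks.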
The refcount tree could be a btree, which makes it easy to insert,
search, and perform other operations. Since the tree is shared by the
reflinked files, we would need a lock to guard access to it.
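A small sketch of the lookup path follows, using a sorted list with
binary search as a stand-in for the btree and a single lock around every
operation. The class and method names are hypothetical, and a real
implementation would likely want finer-grained locking.

```python
import bisect
import threading


class RefcountTree:
    """Sorted-list stand-in for a btree keyed by physical block number."""

    def __init__(self):
        self._lock = threading.Lock()   # guards both lists below
        self._starts = []               # sorted record start blocks
        self._records = []              # parallel (pstart, length, refcount)

    def insert(self, pstart, length, refcount):
        with self._lock:
            i = bisect.bisect_left(self._starts, pstart)
            self._starts.insert(i, pstart)
            self._records.insert(i, (pstart, length, refcount))

    def lookup(self, pblk):
        """Return the refcount covering pblk, or None if it is untracked."""
        with self._lock:
            i = bisect.bisect_right(self._starts, pblk) - 1
            if i >= 0:
                pstart, length, refcount = self._records[i]
                if pblk < pstart + length:
                    return refcount
            return None

    def needs_cow(self, pblk):
        refcount = self.lookup(pblk)
        return refcount is not None and refcount > 1
```

A writer would call needs_cow() on the physical block it is about to
overwrite, and the free path would use lookup() to decide whether the
block can be released.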
Since this is important metadata, we would want to add checksums to the
refcount index blocks and to the leaf blocks where the refcount records
are stored.
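As an illustration of checksummed leaf blocks, the sketch below appends
a checksum to a packed list of records and verifies it on read. The leaf
layout (a 32-bit record count, 16-byte records, a trailing 32-bit
checksum) is hypothetical, and zlib.crc32 is used only because it is in
the Python standard library; it stands in for whatever checksum ext4
metadata would actually use.

```python
import struct
import zlib

RECORD_FMT = "<QII"   # hypothetical: pstart, length, refcount
RECORD_SIZE = struct.calcsize(RECORD_FMT)


def pack_leaf(records):
    """Serialize records and append a checksum over the whole body."""
    body = struct.pack("<I", len(records))
    body += b"".join(struct.pack(RECORD_FMT, *rec) for rec in records)
    return body + struct.pack("<I", zlib.crc32(body))


def unpack_leaf(raw):
    """Verify the trailing checksum, then decode the records."""
    body = raw[:-4]
    (stored,) = struct.unpack("<I", raw[-4:])
    if zlib.crc32(body) != stored:
        raise ValueError("refcount leaf checksum mismatch")
    (count,) = struct.unpack_from("<I", body, 0)
    return [
        struct.unpack_from(RECORD_FMT, body, 4 + i * RECORD_SIZE)
        for i in range(count)
    ]
```

A reader that gets ValueError here knows the leaf cannot be trusted, so
it must not free or share blocks based on its contents.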
Link refcount tree to inodes
----------------------------
The root of the refcount tree is pointed to from the inodes that are
reflinked. At reflink time, the address of the refcount tree root is
linked from the inodes. One way to store the location of the refcount
tree is to use extended attributes. The extended attributes have to be
copied to the new reflinked file first. The location of the refcount
root block is stored as two extended attributes (32 bits each). We could
also store the address of the refcount tree in the inode's extra_isize
area.
Liu Zheng's project quota proposal is also looking for space in the ext4
inode to store a project ID. Expanding the inode's extra_isize area
affects all files, so this is not optimal.
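One possible reading of the extended-attribute option is that the 64-bit
physical block number of the refcount tree root is split into two 32-bit
halves, one per attribute. The sketch below shows only that split. The
attribute names are invented for the example and the interpretation
itself is an assumption.

```python
import struct

# Hypothetical attribute names, not defined anywhere in ext4.
XATTR_LO = "trusted.refcount_root_lo"
XATTR_HI = "trusted.refcount_root_hi"


def encode_root(pblk):
    """Split a 64-bit block number into two 32-bit attribute values."""
    if not 0 <= pblk < 1 << 64:
        raise ValueError("block number out of range")
    return {
        XATTR_LO: struct.pack("<I", pblk & 0xFFFFFFFF),
        XATTR_HI: struct.pack("<I", pblk >> 32),
    }


def decode_root(xattrs):
    """Rebuild the block number from the two attribute values."""
    (lo,) = struct.unpack("<I", xattrs[XATTR_LO])
    (hi,) = struct.unpack("<I", xattrs[XATTR_HI])
    return (hi << 32) | lo
```

At reflink time both attributes would be copied to the new inode, so
that both files resolve to the same refcount tree root.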
I haven't thought much about what we need to do on the e2fsprogs side,
but this would require teaching fsck to understand the refcount tree.
Option 2) Static filesystem-wide per-block refcount
===================================================
The most straightforward way is to create a refcount record for every
data block in the filesystem. Similar to the data checksumming feature
proposed earlier, we could have a per-block metadata record, kept along
with the other block group metadata, to store refcounts, back
references, and data checksums.
The per-block metadata would cover the blocks in that block group.
This works well if the data checksumming feature goes in this direction
(adding per-data-block metadata), as we could just add two bytes for the
block refcount. Getting the block refcount takes only O(1) time if the
per-block metadata is allocated statically, so there is very little
impact on performance.
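The O(1) lookup amounts to simple index arithmetic. The sketch below
assumes one statically sized table of 16-bit counters per block group
and the default of 32768 blocks per group for 4k blocks. The table
itself and the class name are hypothetical, and the first data block is
taken to be block 0.

```python
from array import array

BLOCKS_PER_GROUP = 32768   # ext4 default with 4 KiB blocks


class StaticRefcounts:
    """One table of 16-bit counters per block group (illustrative only)."""

    def __init__(self, ngroups):
        self.tables = [
            array("H", [0]) * BLOCKS_PER_GROUP for _ in range(ngroups)
        ]

    @staticmethod
    def locate(pblk):
        """Map a physical block to (group number, index within group)."""
        return divmod(pblk, BLOCKS_PER_GROUP)

    def get(self, pblk):
        group, index = self.locate(pblk)
        return self.tables[group][index]

    def inc(self, pblk):
        group, index = self.locate(pblk)
        self.tables[group][index] += 1   # OverflowError past 65535
        return self.tables[group][index]

    def dec(self, pblk):
        group, index = self.locate(pblk)
        if self.tables[group][index] == 0:
            raise ValueError("refcount underflow")
        self.tables[group][index] -= 1
        return self.tables[group][index]
```

No search is involved: the group and the slot are both computed directly
from the physical block number.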
The downside is the extra space cost for blocks that are not shared.
Unlike data checksumming, the refcount only matters for blocks shared
among reflinked inodes. Secondly, it would not be as efficient as a
per-extent refcount, since we would need to track refcounts per block
instead of at the larger extent granularity.
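The space cost is easy to estimate. The helper below assumes the two
bytes per block mentioned above and a 4k block size; the function name
is made up for the example.

```python
def refcount_overhead(fs_bytes, block_size=4096, bytes_per_block=2):
    """Bytes needed for a static per-block refcount table."""
    nblocks = fs_bytes // block_size
    return nblocks * bytes_per_block


one_tib = 1 << 40
overhead = refcount_overhead(one_tib)
print(overhead // (1 << 20), "MiB")           # table size for 1 TiB
print(f"{overhead / one_tib:.4%} of capacity")
```

For a 1 TiB filesystem this comes to 512 MiB, about 0.05% of capacity,
and it is paid whether or not any file is ever reflinked.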
Overall, this is just a draft to show my thoughts on implementing
reflink for the ext4 filesystem. I am sure there are lots of other
things that I have missed or haven't thought through. I am looking for
suggestions, criticism and discussion, and hopefully this can be a good
start.
Mingming