[LSF/MM/BPF TOPIC] : SSDFS - flexible architecture for FDP, ZNS SSD, and SMR HDD

Hello,

I would like to share the current status of SSDFS stabilization and
new feature implementation. I would also like to discuss how
kernel-space file systems can adopt FDP (Flexible Data Placement)
technology, using the SSDFS architecture as an example.

[DISCUSSION]

SSDFS is based on the segment concept and has multiple types of segments
(superblock, mapping table, segment bitmap, b-tree nodes, user data).
The first point of FDP employment is to use hints to place different segment
types into different reclaim units. It means that data and metadata of
different types (and different “hotness”) can be placed into different
reclaim units, which provides more efficient NAND flash management
on the SSD side.
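
For illustration only, here is a minimal sketch (hypothetical names and
hotness classes, not the actual SSDFS code) of how a segment type could be
translated into a coarse placement hint that a block layer or NVMe driver
might later map to an FDP reclaim unit handle:

    /*
     * Hypothetical sketch: map a segment type to a coarse placement hint.
     * The hotness classification below is an assumption for illustration.
     */
    enum ssdfs_seg_type_example {
            SEG_TYPE_SUPERBLOCK,    /* superblock segment */
            SEG_TYPE_MAPTBL,        /* mapping table segment */
            SEG_TYPE_SEGBMAP,       /* segment bitmap segment */
            SEG_TYPE_BTREE_NODE,    /* b-tree node segment */
            SEG_TYPE_USER_DATA,     /* user data segment */
    };

    enum placement_hint_example {
            HINT_NONE,
            HINT_HOT,               /* frequently rewritten metadata */
            HINT_WARM,              /* moderately rewritten metadata */
            HINT_COLD,              /* rarely rewritten data */
    };

    static inline enum placement_hint_example
    seg_type_to_hint(enum ssdfs_seg_type_example type)
    {
            switch (type) {
            case SEG_TYPE_SUPERBLOCK:
            case SEG_TYPE_MAPTBL:
            case SEG_TYPE_SEGBMAP:
                    return HINT_HOT;
            case SEG_TYPE_BTREE_NODE:
                    return HINT_WARM;
            case SEG_TYPE_USER_DATA:
                    return HINT_COLD;
            default:
                    return HINT_NONE;
            }
    }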

Another important point is that FDP can guarantee lower write
amplification and predictable reclaiming on the SSD side.
SSDFS provides a way to define the erase block size.
For a ZNS SSD, the mkfs tool uses the zone size that the storage device
exposes. For a conventional SSD, however, the erase block size is defined
by the user. Technically speaking, this size could be smaller or bigger
than the real erase block inside the SSD. Also, the FTL could use a tricky
mapping scheme that combines LBAs in a way that makes file system activity
inefficient even when the erase block or segment concept is used. First of
all, a reclaim unit guarantees that erase blocks or segments on the file
system side will match erase blocks (reclaim units) on the SSD side. Also,
the end-user can use various sizes of logical erase blocks, but the logical
erase blocks of the same segment type will be placed into the same reclaim
unit. FDP can guarantee that the LBAs of the same logical erase block go
into the same reclaim unit and are not distributed among various physical
erase blocks. An important difference from ZNS SSDs is that the end-user
can define the size of the logical erase block in a flexible manner. The
flexibility to use various logical erase block sizes makes the file system
more efficient, because different workloads could require different logical
erase block sizes.
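
To make the arithmetic concrete, here is a small user-space sketch (the
helper names, LBA size, and reclaim unit handle identifiers are
assumptions, not SSDFS or NVMe API) showing how a user-defined logical
erase block size translates into one contiguous LBA range tagged with a
single reclaim unit handle per segment type:

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Illustrative only: a user-defined logical erase block (LEB) size is
     * chosen at mkfs time; all LBAs of one LEB share the same reclaim unit
     * handle (RUH), so the device keeps them in one reclaim unit.
     */
    #define LBA_SIZE 4096U          /* bytes per LBA (assumed) */

    struct leb_extent {
            uint64_t start_lba;     /* first LBA of the logical erase block */
            uint32_t lba_count;     /* number of LBAs in the logical erase block */
            uint16_t ruh_id;        /* RUH shared by the whole LEB */
    };

    static struct leb_extent leb_to_extent(uint64_t leb_index,
                                           uint32_t leb_size_bytes,
                                           uint16_t seg_type_ruh)
    {
            struct leb_extent ext;

            ext.lba_count = leb_size_bytes / LBA_SIZE;
            ext.start_lba = leb_index * ext.lba_count;
            ext.ruh_id = seg_type_ruh;      /* one RUH per segment type */
            return ext;
    }

    int main(void)
    {
            /* e.g. 8 MiB logical erase blocks, segment type mapped to RUH 2 */
            struct leb_extent ext = leb_to_extent(10, 8U << 20, 2);

            printf("LEB 10: LBAs [%llu, %llu], RUH %u\n",
                   (unsigned long long)ext.start_lba,
                   (unsigned long long)(ext.start_lba + ext.lba_count - 1),
                   ext.ruh_id);
            return 0;
    }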

Another interesting feature of FDP is the reclaim group, which can combine
multiple reclaim units. SSDFS uses segments that can contain several
logical erase blocks. Technically speaking, different logical erase
blocks of the same segment can be located in different reclaim groups.
This means that different NAND dies can process requests for different
logical erase blocks in parallel. So, potentially, it can work as a
performance improvement feature.
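
As a rough illustration (a purely hypothetical policy, not SSDFS code), a
round-robin choice of reclaim group per logical erase block of a segment
could look like this:

    #include <stdint.h>

    /*
     * Illustrative sketch: pick a reclaim group per logical erase block of
     * a segment in a round-robin fashion, so writes to different LEBs of
     * the same segment can land on different NAND dies and run in parallel.
     */
    struct fdp_topology {
            uint16_t nr_reclaim_groups;     /* groups exposed by the device */
            uint16_t ruhs_per_group;        /* reclaim unit handles per group */
    };

    static inline uint16_t
    pick_reclaim_group(const struct fdp_topology *topo,
                       uint32_t segment_id, uint16_t leb_in_segment)
    {
            /*
             * Rotate the starting group per segment to avoid always starting
             * at group 0, then round-robin over the segment's LEBs.
             */
            return (segment_id + leb_in_segment) % topo->nr_reclaim_groups;
    }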

Technically speaking, any file system can place different types of
metadata into different reclaim units. However, user data is a slightly
trickier case. Potentially, file system logic can track the “hotness” or
update frequency of some user data and try to direct the different
types of user data into different reclaim units. From another point of view,
we have folders in the file system namespace. If an application can place
different types of data into different folders, then, technically speaking,
the file system logic can place the content of different folders into
different reclaim units. But the application needs to follow some
“discipline” and store different types of user data (of different
“hotness”, for example) in different folders. SSDFS can easily have
several types of user data segments (cold, warm, and hot data, for example).
The main problem is how to define the type of data during the write operation.
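
One existing user-space mechanism that could express such a per-file
“discipline” today is the write-lifetime hint (fcntl F_SET_RW_HINT with
the RWH_WRITE_LIFE_* values). Whether and how a file system or the block
layer maps these lifetime hints to FDP reclaim units is exactly the open
question; the sketch below only shows the application side, and the file
paths are illustrative:

    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    /* fallback definitions in case the libc headers are older */
    #ifndef F_SET_RW_HINT
    #define F_SET_RW_HINT           1036    /* F_LINUX_SPECIFIC_BASE + 12 */
    #endif
    #ifndef RWH_WRITE_LIFE_SHORT
    #define RWH_WRITE_LIFE_SHORT    2
    #define RWH_WRITE_LIFE_EXTREME  5
    #endif

    /*
     * Tag a file with an expected data lifetime; a file system (or the
     * block layer) could translate this into a placement decision, e.g.
     * directing short-lived (hot) data into a dedicated reclaim unit.
     */
    static int set_lifetime_hint(const char *path, uint64_t hint)
    {
            int fd = open(path, O_RDWR);

            if (fd < 0)
                    return -1;
            if (fcntl(fd, F_SET_RW_HINT, &hint) < 0) {
                    close(fd);
                    return -1;
            }
            close(fd);
            return 0;
    }

    int main(void)
    {
            uint64_t hot = RWH_WRITE_LIFE_SHORT;     /* frequently rewritten */
            uint64_t cold = RWH_WRITE_LIFE_EXTREME;  /* write-once style data */

            /* illustrative paths: "hot" and "cold" folders by convention */
            set_lifetime_hint("/mnt/ssdfs/hot/db.wal", hot);
            set_lifetime_hint("/mnt/ssdfs/cold/archive.bin", cold);
            return 0;
    }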

However, SSDFS is a log-structured file system, and every log can contain
up to three area types in the log's payload (main area, journal area, diff
area). The main area keeps the initial state of logical blocks (cold data).
Any updates to the initial state of logical blocks in the main area can be
stored as deltas or diffs in the diff area (hot data). The journal area is
the space for a compaction scheme that combines multiple small files or
compressed logical blocks into one LBA ("NAND flash page"). As a result,
the journal area can be considered warm data. Finally, all these areas
can be stored as contiguous LBA sequences inside the log's payload,
and for every area type it is possible to provide a hint that places
the LBA range into a particular reclaim unit. So, as far as I can see,
this Diff-On-Write approach can provide a better NAND flash management
scheme even for user data.
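
A simplified sketch (illustrative types, not the SSDFS on-disk layout) of
how each area of a log's payload could carry its own contiguous LBA range
and reclaim unit hint:

    #include <stdint.h>

    /*
     * Illustrative only: each area of a log's payload occupies a contiguous
     * LBA range and carries its own placement hint, so cold (main), warm
     * (journal), and hot (diff) data can land in different reclaim units.
     */
    enum log_area_type {
            LOG_AREA_MAIN,          /* initial state of logical blocks: cold */
            LOG_AREA_JOURNAL,       /* compacted small files/blocks: warm */
            LOG_AREA_DIFF,          /* deltas against the main area: hot */
            LOG_AREA_MAX
    };

    struct log_area_extent {
            uint64_t start_lba;     /* first LBA of the area in the payload */
            uint32_t lba_count;     /* contiguous length of the area */
            uint16_t ruh_id;        /* reclaim unit handle for this area */
    };

    struct log_payload_desc {
            struct log_area_extent areas[LOG_AREA_MAX];
    };

    /*
     * Assign one reclaim unit handle per area type; which RUH means
     * "hot", "warm", or "cold" is a policy decision.
     */
    static void assign_area_hints(struct log_payload_desc *log,
                                  const uint16_t ruh_by_area[LOG_AREA_MAX])
    {
            for (int i = 0; i < LOG_AREA_MAX; i++)
                    log->areas[i].ruh_id = ruh_by_area[i];
    }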

So, I would like to discuss:
(1) How can kernel-space file systems adopt FDP technology?
(2) How can FDP technology improve the efficiency and reliability of
kernel-space file systems?

[CURRENT STATUS]

(1) The issue with the read-intensive nature has been fixed:
    - compression of the offset translation table has been added
    - the offset translation table is stored in every log
(2) SSDFS has been completely reworked to use memory folios
(3) 8K/16K/32K support is much more stable
    (but still with some issues)
(4) Multiple erase blocks in segment support is more stable
    (but still with some issues)
(5) Erase block inflation model has been implemented
    (patch is under testing)
(6) Erase block based deduplication has been introduced
(7) recoverfs tool has been implemented
    (not all features implemented yet)

[CURRENT ISSUES]

(1) ZNS support is still not fully stable;
(2) b-tree operations have issues for some use-cases;
(3) Delta-encoding support is not stable;
(4) The fsck tool is not implemented yet.

[PATCHSET]

Current state of the patchset for review:
https://github.com/dubeyko/ssdfs-driver/tree/master/patchset/linux-kernel-6.7.0

SSDFS is an open-source, kernel-space LFS file system designed to:
(1) eliminate GC overhead, (2) prolong SSD lifetime, (3) natively support
a strict append-only mode (ZNS SSD + SMR HDD compatible), (4) guarantee
strong reliability, and (5) guarantee stable performance.

Benchmarking results show that SSDFS is capable of:
(1) generating fewer write I/O requests compared with:
    1.4x - 116x (ext4),
    14x - 42x (xfs),
    6.2x - 9.8x (btrfs),
    1.5x - 41x (f2fs),
    0.6x - 22x (nilfs2);
(2) decreasing the write amplification factor compared with:
    1.3x - 116x (ext4),
    14x - 42x (xfs),
    6x - 9x (btrfs),
    1.5x - 50x (f2fs),
    1.2x - 20x (nilfs2);
(3) prolonging SSD lifetime compared with:
    1.4x - 7.8x (ext4),
    15x - 60x (xfs),
    6x - 12x (btrfs),
    1.5x - 7x (f2fs),
    1x - 4.6x (nilfs2).

[REFERENCES]
[1] SSDFS tools: https://github.com/dubeyko/ssdfs-tools.git
[2] SSDFS driver: https://github.com/dubeyko/ssdfs-driver.git
[3] Linux kernel with SSDFS support: https://github.com/dubeyko/linux.git
[4] SSDFS (paper): https://arxiv.org/abs/1907.11825
[5] Linux Plumbers 2022: https://www.youtube.com/watch?v=sBGddJBHsIo



