On Mon, Feb 27, 2023 at 02:59:08PM -0800, Viacheslav A.Dubeyko wrote:
> > On Feb 27, 2023, at 5:53 AM, Stefan Hajnoczi <stefanha@xxxxxxxxxx> wrote:
> > These comparisons include file systems that don't support zoned devices
> > natively, maybe that's why IOPS comparisons cannot be made?
> > 
> 
> Performance comparison can be made for conventional SSD devices.
> Of course, ZNS SSD has some peculiarities (limited number of open/active
> zones, zone size, write pointer, strict append-only mode) and it requires
> a fair comparison, because these peculiarities/restrictions can help as
> much as they can make life more difficult. However, even if we compare
> file systems on the same type of storage device, various configuration
> options (logical block size, erase block size, segment size, and so on) or
> a particular workload can significantly change a file system's behavior.
> It is never easy to state that one file system is faster than another.

I incorrectly assumed ssdfs was only for zoned devices.

> >> (3) decrease the write amplification factor compared with:
> >> 1.3x - 116x (ext4),
> >> 14x - 42x (xfs),
> >> 6x - 9x (btrfs),
> >> 1.5x - 50x (f2fs),
> >> 1.2x - 20x (nilfs2);
> >> (4) prolong SSD lifetime compared with:
> > 
> > Is this measuring how many times blocks are erased? I guess this
> > measurement includes the background I/O from ssdfs migration and moving?
> > 
> 
> So, first of all, I need to explain the testing methodology. Testing included:
> (1) create file (empty, 64 bytes, 16K, 100K), (2) update file, (3) delete file.
> Every test-case was executed as a sequence of multiple mount/unmount cycles.
> For example, the total number of file creation operations was 1000 or
> 10000, but one mount cycle included 10, 100, or 1000 file creation, file
> update, or file delete operations. Finally, the file system must flush all
> dirty metadata and user data during the unmount operation.
> 
> The blktrace tool registers the LBA and size of every I/O request. These
> data are the basis for estimating how many erase blocks have been involved
> in the operations. SSDFS volumes were created using 128KB, 512KB, and 8MB
> erase block sizes, so I used these erase block sizes for the estimation.
> Generally speaking, we can estimate the total number of erase blocks that
> were involved in file system operations for a particular use-case by
> calculating the number of bytes of all I/O requests and dividing by the
> erase block size. If a file system uses in-place updates, then it is
> possible to estimate how many times the same erase block (we know the LBA
> numbers) has been completely re-written. For example, if an erase block
> (starting from LBA #32) received 1310720 bytes of write I/O requests, then
> that 128KB erase block has been re-written 10x. It means that the FTL needs
> to store all these data into 10 x 128KB erase blocks in the background, or
> execute around 9 erase operations to keep the actual state of the data in
> one 128KB erase block. So, this is the estimation of the FTL GC
> responsibility.
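Just to check that I follow the model, here is a rough sketch of that
per-erase-block estimation as I understand it. The blktrace parsing is
omitted and the sector size, function name, and record format are my own
assumptions, not your actual scripts:

    # Rough model of the per-erase-block rewrite estimation described above.
    # Input: a list of (lba, size_in_bytes) write records parsed from blktrace.
    # All names here are illustrative; this is not the actual tooling used.

    from collections import defaultdict

    LOGICAL_BLOCK_SIZE = 512          # bytes per LBA sector (assumption)
    ERASE_BLOCK_SIZE = 128 * 1024     # 128KB erase block

    def estimate_ftl_gc_erases(write_records):
        """Count background erases implied by rewriting each erase block."""
        bytes_per_erase_block = defaultdict(int)
        for lba, size in write_records:
            eb_index = (lba * LOGICAL_BLOCK_SIZE) // ERASE_BLOCK_SIZE
            bytes_per_erase_block[eb_index] += size

        total_extra_erases = 0
        for written in bytes_per_erase_block.values():
            rewrites = written // ERASE_BLOCK_SIZE      # full rewrites of this erase block
            total_extra_erases += max(rewrites - 1, 0)  # ~9 erases for a 10x rewritten block
        return total_extra_erases

    # Example from the explanation: the erase block at LBA #32 receiving
    # 1310720 bytes of writes -> 10 full rewrites -> ~9 background erases.
    print(estimate_ftl_gc_erases([(32, 1310720)]))   # prints 9
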
> However, if we would like to estimate the total number of erase
> operations, then we need to take into account:
> 
> E_total = E(FTL GC) + E(TRIM) + E(FS GC) + E(read disturbance) + E(retention)
> 
> The estimation of erase operations caused by the retention issue is tricky,
> and it shows a negligibly small number for such short testing, so we can
> ignore it. However, the retention issue is an important factor in
> decreasing SSD lifetime. I executed the estimation of this factor and made
> a comparison for various file systems. But this factor deeply depends on
> time, workload, and payload size, so it is really hard to share any stable
> and reasonable numbers for it. Especially, it heavily depends on the FTL
> implementation.
> 
> It is possible to estimate read disturbance but, again, it heavily depends
> on the NAND flash type, organization, and FTL algorithms. Also, this
> estimation shows really small numbers that can be ignored for short
> testing. I have made this estimation and I can see that, currently, SSDFS
> has a read-intensive nature because of the offset translation table
> distribution policy. I am testing the fix and I hope to remove this issue.
> 
> SSDFS has an efficient TRIM/erase policy, so I can see TRIM/erase
> operations even for such "short" test-cases. As far as I can see, no other
> file system issues discard operations for the same test-cases. I included
> TRIM/erase operations in the calculation of the total number of erase
> operations.
> 
> Estimation of GC operations on the FS side (F2FS, NILFS2) is the most
> speculative one. I have made an estimation of the number of erase
> operations that FS GC can generate. However, as far as I can see, even
> without taking the FS GC erase operations into account, SSDFS looks better
> compared with F2FS and NILFS2. I need to add here that SSDFS uses a
> migration scheme and doesn't need classical GC. But even for such "short"
> test-cases the migration scheme shows a really efficient TRIM/erase policy.
> 
> So, the write amplification factor was estimated on the basis of a
> comparison of write I/O requests, and SSD lifetime prolongation has been
> estimated and compared by using the model that I explained above. I hope I
> explained it clearly enough. Feel free to ask additional questions if I
> missed something.
> 
> The measurement includes all operations (foreground and background) that
> the file system initiates because of using the mount/unmount model.
> However, the migration scheme requires additional explanation. Generally
> speaking, the migration scheme doesn't generate additional I/O requests.
> On the contrary, the migration scheme decreases the number of I/O requests.
> It could be tricky to follow. SSDFS uses compression, delta-encoding, a
> compaction scheme, and migration stimulation. It means that regular file
> system update operations are the main vehicle of the migration scheme.
> Let's imagine that an application updates a 4KB logical block. SSDFS tries
> to compress (or delta-encode) this piece of data. Let's say compression
> gives us a 1KB compressed piece of data (4KB uncompressed size). It means
> that we can place 1KB into a 4KB memory page and we have 3KB of free space.
> So, the migration logic checks whether the exhausted (completely full) old
> erase block that received the update operation has other valid logical
> block(s). If it has such valid logical blocks, then we can compress these
> logical blocks and store them in the free space of the 4KB memory page. So,
> we can finally store 4 compressed logical blocks (1KB in size each), for
> example, in one 4KB memory page. It means that SSDFS issues one I/O request
> for 4 logical blocks instead of 4 requests. I simplified the explanation,
> but the idea remains the same. I hope I clarified the point. Feel free to
> ask additional questions if I missed something.
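If I read the migration example correctly, the packing works roughly like
the sketch below: the updated block plus other valid blocks from the
exhausted erase block are compressed until a 4KB page is full, so several
logical blocks go out in one write. The compression call, page handling,
and all names are mine, not the actual SSDFS code:

    # Purely illustrative packing of compressed logical blocks into one 4KB page.
    import zlib

    PAGE_SIZE = 4096  # 4KB memory page / logical block size

    def pack_into_page(updated_block, other_valid_blocks):
        """Return (page_payload, packed_count): how many 4KB logical blocks
        ended up in one 4KB page write instead of one write each."""
        page = bytearray()
        packed = 0
        for block in [updated_block] + list(other_valid_blocks):
            compressed = zlib.compress(block)      # stand-in for compression/delta-encoding
            if len(page) + len(compressed) > PAGE_SIZE:
                break                              # no more free space in this page
            page += compressed
            packed += 1
        return bytes(page), packed

    # Four highly compressible 4KB blocks -> one page write instead of four.
    blocks = [bytes([i]) * PAGE_SIZE for i in range(4)]
    payload, count = pack_into_page(blocks[0], blocks[1:])
    print(count, len(payload) <= PAGE_SIZE)        # e.g. "4 True"
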
Thanks for these explanations, that clarifies things!

Stefan

Attachment:
signature.asc
Description: PGP signature