On Mon, Feb 27, 2023 at 02:59:08PM -0800, Viacheslav A.Dubeyko wrote:
> > On Feb 27, 2023, at 5:53 AM, Stefan Hajnoczi <stefanha@xxxxxxxxxx> wrote:
> > These comparisons include file systems that don't support zoned devices
> > natively, maybe that's why IOPS comparisons cannot be made?
> > 
> 
> Performance comparison can be made for conventional SSD devices.
> Of course, ZNS SSD has some peculiarities (limited number of open/active
> zones, zone size, write pointer, strict append-only mode) and it requires
> a fair comparison, because these peculiarities/restrictions can help as
> much as they can make life more difficult. However, even if we compare
> file systems on the same type of storage device, various configuration
> options (logical block size, erase block size, segment size, and so on) or
> a particular workload can significantly change a file system's behavior.
> It is never easy to state that one file system is faster than another.

I incorrectly assumed ssdfs was only for zoned devices.

> >> (3) decrease the write amplification factor compared with:
> >> 1.3x - 116x (ext4),
> >> 14x - 42x (xfs),
> >> 6x - 9x (btrfs),
> >> 1.5x - 50x (f2fs),
> >> 1.2x - 20x (nilfs2);
> >> (4) prolong SSD lifetime compared with:
> > 
> > Is this measuring how many times blocks are erased? I guess this
> > measurement includes the background I/O from ssdfs migration and moving?
> > 
> 
> So, first of all, I need to explain the testing methodology. Testing included:
> (1) create file (empty, 64 bytes, 16K, 100K), (2) update file, (3) delete file.
> Every test-case was executed as a sequence of multiple mount/unmount cycles.
> For example, the total number of file creation operations was 1000 or
> 10000, but one mount cycle included 10, 100, or 1000 file creation, file
> update, or file delete operations. Finally, the file system must flush all
> dirty metadata and user data during the unmount operation.
> 
> The blktrace tool registers the LBA and size of every I/O request. These
> data are the basis for estimating how many erase blocks have been involved
> in the operations. SSDFS volumes were created using 128KB, 512KB, and 8MB
> erase block sizes, so I used these erase block sizes for the estimation.
> Generally speaking, we can estimate the total number of erase blocks that
> were involved in file system operations for a particular use-case by
> calculating the number of bytes of all I/O requests and dividing by the
> erase block size. If a file system uses in-place updates, then it is
> possible to estimate how many times the same erase block (we know the LBA
> numbers) has been completely re-written. For example, if an erase block
> (starting from LBA #32) received 1310720 bytes of write I/O requests, then
> that 128KB erase block has been re-written 10x. It means that the FTL needs
> to store all these data into 10 x 128KB erase blocks in the background, or
> execute around 9 erase operations to keep the actual state of the data in
> one 128KB erase block. So, this is the estimation of the FTL GC
> responsibility.
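Just to check that I follow the model, here is a rough sketch of that
per-erase-block estimation as I understand it. The blktrace parsing is
omitted and the sector size, function name, and record format are my own
assumptions, not your actual scripts:

    # Rough model of the per-erase-block rewrite estimation described above.
    # Input: a list of (lba, size_in_bytes) write records parsed from blktrace.
    # All names here are illustrative; this is not the actual tooling used.

    from collections import defaultdict

    LOGICAL_BLOCK_SIZE = 512          # bytes per LBA sector (assumption)
    ERASE_BLOCK_SIZE = 128 * 1024     # 128KB erase block

    def estimate_ftl_gc_erases(write_records):
        """Count background erases implied by rewriting each erase block."""
        bytes_per_erase_block = defaultdict(int)
        for lba, size in write_records:
            eb_index = (lba * LOGICAL_BLOCK_SIZE) // ERASE_BLOCK_SIZE
            bytes_per_erase_block[eb_index] += size

        total_extra_erases = 0
        for written in bytes_per_erase_block.values():
            rewrites = written // ERASE_BLOCK_SIZE      # full rewrites of this erase block
            total_extra_erases += max(rewrites - 1, 0)  # ~9 erases for a 10x rewritten block
        return total_extra_erases

    # Example from the explanation: the erase block at LBA #32 receiving
    # 1310720 bytes of writes -> 10 full rewrites -> ~9 background erases.
    print(estimate_ftl_gc_erases([(32, 1310720)]))   # prints 9
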
> However, if we would like to estimate the total number of erase
> operations, then we need to take into account:
> 
> E_total = E(FTL GC) + E(TRIM) + E(FS GC) + E(read disturbance) + E(retention)
> 
> The estimation of erase operations caused by the retention issue is tricky,
> and it shows a negligibly small number for such short testing, so we can
> ignore it. However, the retention issue is an important factor in
> decreasing SSD lifetime. I executed the estimation of this factor and made
> a comparison for various file systems. But this factor deeply depends on
> time, workload, and payload size, so it is really hard to share any stable
> and reasonable numbers for it. Especially, it heavily depends on the FTL
> implementation.
> 
> It is possible to estimate read disturbance but, again, it heavily depends
> on the NAND flash type, organization, and FTL algorithms. Also, this
> estimation shows really small numbers that can be ignored for short
> testing. I have made this estimation and I can see that, currently, SSDFS
> has a read-intensive nature because of the offset translation table
> distribution policy. I am testing the fix and I hope to remove this issue.
> 
> SSDFS has an efficient TRIM/erase policy, so I can see TRIM/erase
> operations even for such "short" test-cases. As far as I can see, no other
> file system issues discard operations for the same test-cases. I included
> TRIM/erase operations in the calculation of the total number of erase
> operations.
> 
> Estimation of GC operations on the FS side (F2FS, NILFS2) is the most
> speculative one. I have made an estimation of the number of erase
> operations that FS GC can generate. However, as far as I can see, even
> without taking the FS GC erase operations into account, SSDFS looks better
> compared with F2FS and NILFS2. I need to add here that SSDFS uses a
> migration scheme and doesn't need classical GC. But even for such "short"
> test-cases the migration scheme shows a really efficient TRIM/erase policy.
> 
> So, the write amplification factor was estimated on the basis of a
> comparison of write I/O requests, and SSD lifetime prolongation has been
> estimated and compared by using the model that I explained above. I hope I
> explained it clearly enough. Feel free to ask additional questions if I
> missed something.
> 
> The measurement includes all operations (foreground and background) that
> the file system initiates because of using the mount/unmount model.
> However, the migration scheme requires additional explanation. Generally
> speaking, the migration scheme doesn't generate additional I/O requests.
> On the contrary, the migration scheme decreases the number of I/O requests.
> It could be tricky to follow. SSDFS uses compression, delta-encoding, a
> compaction scheme, and migration stimulation. It means that regular file
> system update operations are the main vehicle of the migration scheme.
> Let's imagine that an application updates a 4KB logical block. SSDFS tries
> to compress (or delta-encode) this piece of data. Let's say compression
> gives us a 1KB compressed piece of data (4KB uncompressed size). It means
> that we can place 1KB into a 4KB memory page and we have 3KB of free space.
> So, the migration logic checks whether the exhausted (completely full) old
> erase block that received the update operation has other valid logical
> block(s). If it has such valid logical blocks, then we can compress these
> logical blocks and store them in the free space of the 4KB memory page. So,
> we can finally store 4 compressed logical blocks (1KB in size each), for
> example, in one 4KB memory page. It means that SSDFS issues one I/O request
> for 4 logical blocks instead of 4 requests. I simplified the explanation,
> but the idea remains the same. I hope I clarified the point. Feel free to
> ask additional questions if I missed something.
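If I read the migration example correctly, the packing works roughly like
the sketch below: the updated block plus other valid blocks from the
exhausted erase block are compressed until a 4KB page is full, so several
logical blocks go out in one write. The compression call, page handling,
and all names are mine, not the actual SSDFS code:

    # Purely illustrative packing of compressed logical blocks into one 4KB page.
    import zlib

    PAGE_SIZE = 4096  # 4KB memory page / logical block size

    def pack_into_page(updated_block, other_valid_blocks):
        """Return (page_payload, packed_count): how many 4KB logical blocks
        ended up in one 4KB page write instead of one write each."""
        page = bytearray()
        packed = 0
        for block in [updated_block] + list(other_valid_blocks):
            compressed = zlib.compress(block)      # stand-in for compression/delta-encoding
            if len(page) + len(compressed) > PAGE_SIZE:
                break                              # no more free space in this page
            page += compressed
            packed += 1
        return bytes(page), packed

    # Four highly compressible 4KB blocks -> one page write instead of four.
    blocks = [bytes([i]) * PAGE_SIZE for i in range(4)]
    payload, count = pack_into_page(blocks[0], blocks[1:])
    print(count, len(payload) <= PAGE_SIZE)        # e.g. "4 True"
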
Thanks for these explanations, that clarifies things!

Stefan

Attachment:
signature.asc
Description: PGP signature