On 2019-11-26 11:40 p.m., Vyacheslav Dubeyko wrote: > As far as I know, usually, a folder contains dozens or hundreds > files/folders in average. There are many research works that had showed > this fact. Do you mean some special use-case when folder could contain > the billion files? Could you share some research work that describes > some practical use-case with billion files per folder? You are entirely correct that the vast majority of directories contain only a handful of files. That is my case (1). A few directories on a typical server can go into the tens of thousands of files. There was a time when we could not handle those efficiently, and now thanks to HTree we can. Some directories go into the millions, ask the Lustre people about that. If you could have a directory with a billion files then somebody will have a use for it. For example, you may be able to avoid a database for a particular application and just use the file system instead. Now, scaling to a billion files is just one of several things that Shardmap does better than HTree. More immediately, Shardmap implements readdir simply, accurately and efficiently, unlike HTree. See here for some discussion: https://lwn.net/Articles/544520/ "Widening ext4's readdir() cookie" See the recommendation that is sometimes offered to work around HTree's issues with processing files in hash order. Basically, read the entire directory into memory, sort by inode number, then process in that order. As an application writer do you really want to do this, or would you prefer that the filesystem just take care of for you so the normal, simple and readable code is also the most efficient code? > If you are talking about improving the performance then do you mean > some special open-source implementation? I mean the same kind of kernel filesystem implementation that HTree currently has. Open source of course, GPLv2 to be specific. >> For delete, Shardmap avoids write multiplication by appending tombstone >> entries to index shards, thereby addressing a well known HTree delete >> performance issue. > > Do you mean Copy-On-Write policy here or some special technique? The technique Shardmap uses to reduce write amplication under heavy load is somewhat similar to the technique used by Google's Bigtable to achieve a similar result for data files. (However, the resemblance to Bigtable ends there.) Each update to a Shardmap index is done twice: once in a highly optimized hash table shard in cache, then again by appending an entry to the tail of the shard's media "fifo". Media writes are therefore mostly linear. I say mostly, because if there is a large number of shards then a single commit may need to update the tail block of each one, which still adds up to vastly fewer blocks than the BTree case, where it is easy to construct cases where every index block must be updated on every commit, a nasty example of n**2 performance overhead. > How could be good Shardmap for the SSD use-case? Do you mean that we > could reduce write amplification issue for NAND flash case? Correct. Reducing write amplification is particularly important for flash based storage. It also has a noticeable beneficial effect on efficiency under many common and not so common loads. > Let's imagine that it needs to implement the Shardmap approach. Could > you estimate the implementation and stabilization time? How expensive > and long-term efforts could it be? Shardmap is already implemented and stable, though it does need wider usage and testing. Code is available here: https://github.com/danielbot/Shardmap Shardmap needs to be ported to kernel, already planned and in progress for Tux3. Now I am proposing that the Ext4 team should consider porting Shardmap to Ext4, or at least enter into a serious discussion of the logistics. Added Darrick to cc, as he is already fairly familiar with this subject, once was an Ext4 developer, and perhaps still is should the need arise. By the way, is there a reason that Ted's MIT address bounced on my original post? Regards, Daniel