Re: [RFC] Thing 1: Shardmap fox Ext4

Daniel Phillips <daniel@xxxxxxxxx> · Wed, 27 Nov 2019 00:28:26 -0800

On 2019-11-26 11:40 p.m., Vyacheslav Dubeyko wrote:
> As far as I know, usually, a folder contains dozens or hundreds
> files/folders in average. There are many research works that had showed
> this fact. Do you mean some special use-case when folder could contain
> the billion files? Could you share some research work that describes
> some practical use-case with billion files per folder?

You are entirely correct that the vast majority of directories contain
only a handful of files. That is my case (1). A few directories on a
typical server can go into the tens of thousands of files. There was
a time when we could not handle those efficiently, and now thanks to
HTree we can. Some directories go into the millions, ask the Lustre
people about that. If you could have a directory with a billion files
then somebody will have a use for it. For example, you may be able to
avoid a database for a particular application and just use the file
system instead.

Now, scaling to a billion files is just one of several things that
Shardmap does better than HTree. More immediately, Shardmap implements
readdir simply, accurately and efficiently, unlike HTree. See here for
some discussion:

   https://lwn.net/Articles/544520/
   "Widening ext4's readdir() cookie"

See the recommendation that is sometimes offered to work around
HTree's issues with processing files in hash order. Basically, read
the entire directory into memory, sort by inode number, then process
in that order. As an application writer do you really want to do this,
or would you prefer that the filesystem just take care of for you so
the normal, simple and readable code is also the most efficient code?

> If you are talking about improving the performance then do you mean
> some special open-source implementation?

I mean the same kind of kernel filesystem implementation that HTree
currently has. Open source of course, GPLv2 to be specific.

>> For delete, Shardmap avoids write multiplication by appending tombstone
>> entries to index shards, thereby addressing a well known HTree delete
>> performance issue.
> 
> Do you mean Copy-On-Write policy here or some special technique?

The technique Shardmap uses to reduce write amplication under heavy
load is somewhat similar to the technique used by Google's Bigtable to
achieve a similar result for data files. (However, the resemblance to
Bigtable ends there.)

Each update to a Shardmap index is done twice: once in a highly
optimized hash table shard in cache, then again by appending an
entry to the tail of the shard's media "fifo". Media writes are
therefore mostly linear. I say mostly, because if there is a large
number of shards then a single commit may need to update the tail
block of each one, which still adds up to vastly fewer blocks than
the BTree case, where it is easy to construct cases where every
index block must be updated on every commit, a nasty example of
n**2 performance overhead.

> How could be good Shardmap for the SSD use-case? Do you mean that we
> could reduce write amplification issue for NAND flash case?

Correct. Reducing write amplification is particularly important for
flash based storage. It also has a noticeable beneficial effect on
efficiency under many common and not so common loads.

> Let's imagine that it needs to implement the Shardmap approach. Could
> you estimate the implementation and stabilization time? How expensive
> and long-term efforts could it be?

Shardmap is already implemented and stable, though it does need wider
usage and testing. Code is available here:

   https://github.com/danielbot/Shardmap

Shardmap needs to be ported to kernel, already planned and in progress
for Tux3. Now I am proposing that the Ext4 team should consider porting
Shardmap to Ext4, or at least enter into a serious discussion of the
logistics.

Added Darrick to cc, as he is already fairly familiar with this subject,
once was an Ext4 developer, and perhaps still is should the need arise.
By the way, is there a reason that Ted's MIT address bounced on my
original post?

Regards,

Daniel