On Tue, Feb 16, 2016 at 09:39:28AM +0100, David Casier wrote:
> "With this model, filestore rearrange the tree very
> frequently : + 40 I/O every 32 objects link/unlink."
> It is the consequence of parameters :
> filestore_merge_threshold = 2
> filestore_split_multiple = 1
>
> Not of ext4 customization.

It's a function of the directory structure you are using to work
around the scalability deficiencies of the ext4 directory structure.
i.e. the root cause is that you are working around an ext4 problem.

> The large amount of objects in FileStore require indirect access and
> more IOPS for every directory.
>
> If root of inode B+tree is a simple block, we have the same problem
> with XFS

Only if you use the same 32-entries-per-directory constraint. Get rid
of that constraint and start thinking about storing tens of thousands
of files per directory instead. i.e. let the directory structure
handle IO optimisation as the number of entries grows, rather than
imposing artificial limits that prevent it from working efficiently.

Put simply, XFS is more efficient in terms of the average physical IO
per random inode lookup with shallow, wide directory structures than
it will be with a narrow, deep setup that is optimised to work around
the shortcomings of ext3/ext4.

When you use deep directory structures to index millions of files, you
have to assume that any random lookup will require directory inode IO.
When you use wide, shallow directories you can almost guarantee that
the directory inodes will remain cached in memory because they are so
frequently traversed. Hence we never need to do IO for directory
inodes in a wide, shallow config, and so that IO can be ignored.

So let's assume, for ease of maths, that we have 40 byte dirent
structures (~24 byte file names). That means a single 4k directory
block can index approximately 60-70 entries. More than this, and XFS
switches to a more scalable multi-block ("leaf", then "node") format.

When XFS moves to a multi-block structure, the first block of the
directory is converted to a name hash btree that allows finding any
directory entry in one further IO. The hash index is made up of 8 byte
entries, so for a 4k block it can index 500 entries in a single IO.
IOWs, a random, cold cache lookup across 500 directory entries can be
done in 2 IOs. Now let's add a second level to that hash btree - we
have 500 hash index leaf blocks that can be reached in 2 IOs, so now
we can reach 25,000 entries in 3 IOs. And in 4 IOs we can reach 2.5
million entries.

It should be noted that the length of the directory entries doesn't
affect this lookup scalability because the index is based on 4 byte
name hashes. Hence it has the same scalability characteristics
regardless of the name lengths; it is only affected by changes in
directory block size.

If we consider your current "1 IO per directory" config using a 32
entry structure, it's 1024 entries in 2 IOs, 32768 in 3 IOs, and with
4 IOs it's 1 million entries. This assumes we can fit 32 entries in
the inode core, which we should be able to do for the nodes of the
tree, but the leaves with the file entries probably carry full object
names and so are likely to be in block format. I've ignored this and
assumed the leaf directories pointing to the objects are also inline.

IOWs, by the time we get to needing 4 IOs to reach the file store leaf
directories (i.e. > ~30,000 files in the object store), a single XFS
directory is going to have the same or better IO efficiency than your
fixed configuration.
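If you want to play with those numbers yourself, here's a rough
back-of-the-envelope script (plain python, helper names made up for
illustration - nothing here is ceph or XFS code) that models the fixed
32-entries-per-directory hierarchy: entries reachable in N IOs is just
the fanout raised to N, and the directory count for a given file count
is the sum of the tree levels. It reproduces the 32/1k/32k/1m
progression above and the narrow-tree directory counts in the tables
below.

#!/usr/bin/env python3
# Back-of-the-envelope numbers for the fixed 32-entries-per-directory
# hierarchy described above. Purely illustrative.
import math

FANOUT = 32   # entries per directory in the fixed filestore layout

def entries_reachable(io_count, fanout=FANOUT):
    # one IO per tree level, so reach is simply fanout^io_count
    return fanout ** io_count

def directories_needed(file_count, fanout=FANOUT):
    # leaf directories plus all the interior levels above them
    total = 0
    level = math.ceil(file_count / fanout)
    while level > 1:
        total += level
        level = math.ceil(level / fanout)
    return total + 1    # +1 for the root of the hierarchy

for ios in (1, 2, 3, 4):
    print(ios, "IOs:", entries_reachable(ios), "entries reachable")
# -> 32, 1024 (~1k), 32768 (~32k), 1048576 (~1m)

for files in (1000, 100 * 1000, 1000 * 1000, 100 * 1000 * 1000):
    print(files, "files:", directories_needed(files), "directories")
# -> roughly 32, 3200, 32k and 3.2m directories respectively

Change FANOUT to see how quickly the narrow tree blows out the
directory count compared to letting the filesystem index wide
directories for you.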
And we can make XFS even better - with an 8k directory block size,
2 IOs reach 1000 entries, 3 IOs reach a million entries, and 4 IOs
reach a billion entries.

So, in summary, the number of entries that can be indexed in a given
number of IOs:

    IO count        1     2     3     4
    32 entry wide   32    1k    32k   1m
    4k dir block    70    500   25k   2.5m
    8k dir block    150   1k    1m    1000m

And the number of directories required for a given number of files if
we limit XFS directories to 3 internal IOs:

    file count      1k    10k   100k   1m    10m    100m
    32 entry wide   32    320   3200   32k   320k   3.2m
    4k dir block    1     1     5      50    500    5k
    8k dir block    1     1     1      1     11     101

So, as you can see, once you make the directory structure shallow and
wide, you can reach many more entries in the same number of IOs, and
there is a much lower inode/dentry cache footprint when you do so.

IOWs, on XFS you design the hierarchy to provide the necessary
lookup/modification concurrency, as IO scalability as file counts rise
is already efficiently handled by the filesystem's directory
structure. Doing this means the file store does not need to rebalance
every 32 create/unlink operations. Nor do you need to be concerned
about maintaining a working set of directory inodes in cache under
memory pressure - the directory entries become the hottest items in
the cache and so will never get reclaimed.

Cheers,

Dave.
-- 
Dave Chinner
dchinner@xxxxxxxxxx
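P.S. For completeness, the "every 32 objects" rebalance figure quoted
at the top falls out of the filestore split rule - assuming I'm
reading the config reference right, a subdir splits once it holds more
than filestore_split_multiple * abs(filestore_merge_threshold) * 16
objects. A trivial sketch (illustrative values only, not real ceph
code):

#!/usr/bin/env python3
# Where the "+ 40 I/O every 32 objects" trigger comes from, assuming
# the split rule described in the filestore config reference:
#   a subdir splits when it holds more than
#   filestore_split_multiple * abs(filestore_merge_threshold) * 16
# objects. Values below are illustrative, not recommendations.

def split_threshold(split_multiple, merge_threshold):
    return split_multiple * abs(merge_threshold) * 16

# the configuration quoted at the top of this mail:
print(split_threshold(split_multiple=1, merge_threshold=2))    # -> 32

# raising the thresholds lets directories grow into the range XFS
# indexes efficiently before any split happens, e.g.:
print(split_threshold(split_multiple=8, merge_threshold=40))   # -> 5120

i.e. with split_multiple = 1 and merge_threshold = 2 you've told the
file store to split at 32 objects; raising those values is what lets
the directories grow wide enough for XFS to do the indexing work for
you.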