Re: Fwd: Fwd: [newstore (again)] how disable double write WAL


 



There's a long standing bugzilla entry for this:

https://bugzilla.redhat.com/show_bug.cgi?id=1219974

See Kefu and Sam's comments about scrubbing. That's basically the only blocker AFAIK.

Mark

On 02/19/2016 05:28 AM, Blair Bethwaite wrote:
Interesting observations Dave. Given XFS is Ceph's current production
standard it makes me wonder why the default filestore configs split
leaf directories at only 320 objects. We've seen first hand that it
doesn't take long before this starts hurting performance in a big way.

Cheers,

On 19 February 2016 at 16:26, Dave Chinner <dchinner@xxxxxxxxxx> wrote:
On Tue, Feb 16, 2016 at 09:39:28AM +0100, David Casier wrote:
         "With this model, filestore rearrange the tree very
         frequently : + 40 I/O every 32 objects link/unlink."
It is a consequence of these parameters:
filestore_merge_threshold = 2
filestore_split_multiple = 1

Not of ext4 customization.
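
As a rough illustration of how these two settings turn into a per-directory object limit, here is a minimal sketch. It assumes the split rule given in the Ceph filestore configuration documentation (split_multiple * abs(merge_threshold) * 16); the function name is made up for this example.

```python
def filestore_split_threshold(merge_threshold: int, split_multiple: int) -> int:
    """Objects a leaf directory may hold before filestore splits it.

    Assumes the documented rule:
        split_multiple * abs(merge_threshold) * 16
    """
    return split_multiple * abs(merge_threshold) * 16


# The settings quoted above: a split every 32 objects.
print(filestore_split_threshold(merge_threshold=2, split_multiple=1))

# merge_threshold=10, split_multiple=2 gives the 320-object limit
# mentioned earlier in the thread.
print(filestore_split_threshold(merge_threshold=10, split_multiple=2))
```

Under that assumption, the "32 objects" and "320 objects" figures in this thread are the same rule evaluated with different parameter values.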

It's a function of the directory structure you are using to work
around the scalability deficiencies of the ext4 directory structure.
i.e. the root cause is that you are working around an ext4 problem.

The large number of objects in FileStore requires indirect access and
more IOPS for every directory.

If the root of the inode B+tree is a simple block, we have the same problem with XFS.

Only if you use the same 32-entries per directory constraint. Get
rid of that constraint, start thinking about storing tens of
thousands of files per directory instead. i.e. let the directory
structure handle IO optimisation as the number of entries grows, not
impose artificial limits that prevent them from working efficiently.

Put simply, XFS is more efficient in terms of the average physical
IO per random inode lookup with shallow, wide directory structures
than it will be with a narrow, deep setup that is optimised to work
around the shortcomings of ext3/ext4.

When you use deep directory structures to index millions of files,
you have to assume that any random lookup will require directory
inode IO. When you use wide, shallow directories you can almost
guarantee that the directory inodes will remain cached in memory
because they are so frequently traversed. Hence we never need to do
IO for directory inodes in a wide, shallow config, and so that IO
can be ignored.

So let's assume, for ease of maths, we have 40 byte dirent
structures (~24 byte file names). That means a single 4k directory
block can index approximately 60-70 entries. More than this, and XFS
switches to a more scalable multi-block ("leaf", then "node") format.

When XFS moves to a multi-block structure, the first block of the
directory is converted to a name hash btree that allows finding any
directory entry in one further IO.  The hash index is made up of 8
byte entries, so for a 4k block it can index 500 entries in a single
IO.  IOWs, a random, cold cache lookup across 500 directory entries
can be done in 2 IOs.

Now let's add a second level to that hash btree - we have 500 hash
index leaf blocks that can be reached in 2 IOs, so now we can reach
25,000 entries in 3 IOs. And in 4 IOs we can reach 2.5 million
entries.

It should be noted that the length of the directory entries doesn't
affect this lookup scalability because the index is based on 4 byte
name hashes. Hence it has the same scalability characteristics
regardless of the name lengths; it is only affected by changes in
directory block size.

If we consider your current "1 IO per directory" config using a 32
entry structure, it's 1024 entries in 2 IOs, 32768 in 3 IOs and with
4 IOs it's 1 million entries. This is assuming we can fit 32 entries
in the inode core, which we should be able to do for the nodes of
the tree, but the leaves with the file entries are probably going to
have full object names and so are likely to be in block format. I've
ignored this and assumed the leaf directories pointing to the objects
are also inline.
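
The arithmetic for the 32-entry layout can be sketched as below. This is only an illustration of the model described above (one IO per directory level, 32 entries per directory, everything inline); the names are hypothetical.

```python
FANOUT = 32  # entries per directory in the fixed-width layout


def wide32_reach(ios: int, fanout: int = FANOUT) -> int:
    """Entries reachable when each IO descends one directory level."""
    if ios < 1:
        raise ValueError("need at least one IO")
    return fanout ** ios


for ios in range(1, 5):
    print(f"{ios} IO(s): {wide32_reach(ios)} entries")
```

The four values printed are 32, 1024, 32768 and 1048576, i.e. the 32 / 1k / 32k / 1m progression quoted above.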

IOWs, by the time we get to needing 4 IOs to reach the file store
leaf directories (i.e. > ~30,000 files in the object store), a
single XFS directory is going to have the same or better IO efficiency
than your fixed configuration.

And we can make XFS even better - with an 8k directory block size, 2
IOs reach 1000 entries, 3 IOs reach a million entries, and 4 IOs
reach a billion entries.
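
A back-of-the-envelope sketch of the 8k case follows. It assumes an idealised uniform fan-out: every index block is completely filled with 8 byte hash entries, block headers and free space are ignored, and the final IO reads the directory data block. The function names are invented for this example, and the figures in this message remain the author's own estimates; his multi-level 4k numbers are more conservative than this uniform model would give.

```python
HASH_ENTRY_BYTES = 8  # size of one name-hash index entry, per the text


def index_entries_per_block(block_size: int) -> int:
    """Hash entries that fit in one directory block (headers ignored)."""
    return block_size // HASH_ENTRY_BYTES


def hashed_dir_reach(ios: int, block_size: int) -> int:
    """Entries reachable in `ios` IOs: (ios - 1) index levels plus
    one final IO for the data block holding the entry."""
    if ios < 2:
        raise ValueError("a multi-block directory needs at least 2 IOs")
    return index_entries_per_block(block_size) ** (ios - 1)


for ios in (2, 3, 4):
    print(f"8k block, {ios} IOs: {hashed_dir_reach(ios, 8192)} entries")
```

With 8192 byte blocks this gives 1024, 1048576 and 1073741824, which round to the thousand / million / billion figures above.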

So, in summary, the number of entries that can be indexed in a
given number of IOs:

IO count                1       2       3       4
32 entry wide           32      1k      32k     1m
4k dir block            70      500     25k     2.5m
8k dir block            150     1k      1m      1000m

And the number of directories required for a given number of
files if we limit XFS directories to 3 internal IOs:

file count              1k      10k     100k    1m      10m     100m
32 entry wide           32      320     3200    32k     320k    3.2m
4k dir block            1       1       5       50      500     5k
8k dir block            1       1       1       1       11      101

So, as you can see, once you make the directory structure shallow
and wide, you can reach many more entries in the same number of IOs
and there is a much lower inode/dentry cache footprint when you do so.
IOWs, on XFS you design the hierarchy to provide the necessary
lookup/modification concurrency, as IO scalability as file counts
rise is already efficiently handled by the filesystem's directory
structure.

Doing this means the file store does not need to rebalance every 32
create/unlink operations. Nor do you need to be concerned about
maintaining a working set of directory inodes in cache under memory
pressure - the directory entries become the hottest items in the
cache and so will never get reclaimed.

Cheers,

Dave.
--
Dave Chinner
dchinner@xxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html





