Re: Filestore directory splitting (ZFS/FreeBSD)

On 04/18/2017 08:32 AM, Willem Jan Withagen wrote:
On 18-4-2017 15:27, Sage Weil wrote:
On Tue, 18 Apr 2017, Willem Jan Withagen wrote:
Hi,

I'm running some larger tests with ceph-fuse on FreeBSD, and I think I've
started to run into directory splitting.
I just rsynced my full src and ports trees into it, which is something like
> 1.000.000 small files.
At least, I saw a rather large number of directories with relatively few
files in each.

I know from UFS that large directories used to be a problem, although
some of those issues on FreeBSD were fixed long ago.

So can anybody explain the rationale behind this process and give a bit
of a feeling for why we start splitting at 320 files?

It actually has little to do with the underlying file system's ability to
handle large directories.  Ceph needs to do object enumeration in sorted
order (based on the [g]hobject_t sort order), while readdir returns entries
in a semi-random order that depends on how the fs is implemented.  We keep
directories smallish so that we can list the whole directory, sort in
memory, and then return the correct result.
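As a rough illustration (a simplified sketch, not the actual FileStore code;
the real implementation sorts by ghobject_t rather than by file name), the
per-directory work on every enumeration looks roughly like this:

  // Simplified sketch, not the actual FileStore code: readdir() gives no
  // ordering guarantee, so the whole directory is read into memory and
  // sorted before results can be returned in a well-defined order.
  #include <algorithm>
  #include <filesystem>
  #include <string>
  #include <vector>

  std::vector<std::string> list_sorted(const std::filesystem::path& dir)
  {
    std::vector<std::string> names;
    for (const auto& entry : std::filesystem::directory_iterator(dir))
      names.push_back(entry.path().filename().string());
    std::sort(names.begin(), names.end());  // in-memory sort of the full listing
    return names;
  }

Keeping each directory to a few hundred entries bounds the memory and time
that full list-and-sort pass takes every time it is repeated.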

OK, I see.
So it has more to do with the efficiency of readdir and friends...?

And then the odd question: ;-)

Why would this (still) be so extensively configurable?
I mean, once it has been "sorted out", it could be made more or less
fixed: either by making it a header-file constant, or by marking it as
such in the comments in common_opt.h.
And perhaps even removing it from the documentation.

I was also triggered by some of the remarks on the user list that
splitting was an expensive process that had an impact on performance.

It's still a bit of an ongoing debate where those split values should be tuned, and frankly whether splitting should happen at all. That was part of the motivation for trying to "pre-split" directories when you expect to end up with lots of objects per PG. There's a lot of nuance here regarding directory fragmentation, cached dentries and inodes, the effect of a large dentry/inode cache on other things (syncfs), and selinux behaviour (or similar behaviour, i.e. which actions trigger xattr security lookups). If we could make the things that need to scan through the directories asynchronous, that would probably let us push the directory limits much higher, but that would be a fairly major change for filestore, and we are trying to keep it stable until bluestore is ready.
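For reference, the 320 figure Willem asked about falls straight out of the
defaults: the documented split point is filestore_split_multiple *
abs(filestore_merge_threshold) * 16, i.e. 2 * 10 * 16 = 320. A minimal sketch
of that heuristic (illustrative only, not the actual HashIndex code):

  // Sketch of the documented split heuristic, not the actual HashIndex code:
  // a subdirectory is split once it holds more than
  // filestore_split_multiple * abs(filestore_merge_threshold) * 16 objects.
  #include <cstdlib>

  bool should_split(int num_objects,
                    int filestore_split_multiple = 2,    // default
                    int filestore_merge_threshold = 10)  // default
  {
    const int split_threshold =
      filestore_split_multiple * std::abs(filestore_merge_threshold) * 16;
    return num_objects > split_threshold;  // 2 * 10 * 16 = 320 with the defaults
  }

Pre-splitting, as mentioned above, essentially performs those splits up front
at pool creation (via the pool's expected_num_objects) so they don't happen in
the middle of client I/O.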

Speaking of which, bluestore doesn't suffer from this particular issue, so eventually this whole discussion becomes obsolete. We have other performance challenges in bluestore that we are still working on (mostly metadata/rocksdb related).

Mark


--WjW

FreeBSD's filestore runs on ZFS, and from what I've seen thus far, ZFS has
different (better) behaviour with large directories.
(I once ended up with > 1.000.000 security-cam pictures in one
directory, and that was still sort of workable.)

So has there been any testing to quantify the settings? And how would I
be able to determine whether ZFS deserves better/larger settings?
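One crude way I could imagine getting a feel for that (just a sketch of an
experiment, not something filestore itself does): fill a test directory on ZFS
with increasing numbers of small files and time the full list-and-sort pass
that filestore effectively repeats on enumeration, e.g.:

  // Crude timing sketch; the assumption here is that list-and-sort cost is
  // a reasonable proxy for how filestore copes with a directory of that size.
  #include <algorithm>
  #include <chrono>
  #include <cstdio>
  #include <filesystem>
  #include <string>
  #include <vector>

  int main(int argc, char** argv)
  {
    if (argc < 2) {
      std::fprintf(stderr, "usage: %s <directory>\n", argv[0]);
      return 1;
    }
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::string> names;
    for (const auto& e : std::filesystem::directory_iterator(argv[1]))
      names.push_back(e.path().filename().string());
    std::sort(names.begin(), names.end());
    const double ms = std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - start).count();
    std::printf("%zu entries listed and sorted in %.2f ms\n", names.size(), ms);
    return 0;
  }

If the per-entry cost on ZFS stays roughly flat well past a few hundred
entries, bumping filestore_split_multiple and re-running the rsync test would
be the obvious next experiment.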

--WjW
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


