Re: filestore_split_multiple hardcoded maximum?

Mark Nelson <mnelson@xxxxxxxxxx> · Thu, 8 Dec 2016 11:25:56 -0600

I don't want to retype it all, but you guys might be interested in the 
discussion under section 3 of this post here:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012987.html

Basically, the gist of it is:

1) Make sure SELinux isn't doing security xattr lookups for link/unlink 
operations (this makes splitting incredibly painful!).  You may need to 
disable SELinux.

2) XFS places the files of a given directory in the same AG (allocation
group, i.e. a portion of the disk), but a subdirectory may end up in a
different AG than its parent directory.  As the split depth grows, so
does fragmentation, because files from the parent directories move into
new subdirectories that have a different AG.

3) Increasing the split threshold is an option, but more files in a
single directory will cause readdir to slow down.  The effect is fairly
minimal even at ~10k files relative to the other costs involved.

4) The bigger issue is that high split thresholds require more work to
happen for every split, but this is somewhat offset as splits tend to
happen over a larger time range, because the inherent randomness in PG
data distribution is amplified.  Still, when compounded with point 1
above, large splits can be debilitating when they happen.

5) Pre-splitting PGs is, I think, the right answer.  It should greatly 
delay the onset of directory fragmentation, avoid a lot of early 
linking/relinking, and in some cases (like RBD) potentially avoid any 
additional splits altogether.  The cost is increased inode cache misses 
when there aren't many objects in the cluster yet.  This could make 
benchmarks on fresh clusters slower, but yield better behavior as the 
cluster grows.

Mark

On 12/08/2016 05:23 AM, Frédéric Nass wrote:
Hi David,

I'm surprised your message hasn't received any replies yet. I guess it
depends on how many files your OSDs have to store on the filesystem,
which depends essentially on the use case.

We're having similar issues with a 144-OSD cluster running 2 pools. Each
one holds 100 M objects. One is replicated x3 (256 PGs) and the other is
EC k=5, m=4 (512 PGs).
That's 300 M + 900 M = 1.2 B files stored on XFS filesystems.
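
The file count above can be reproduced with a short sketch. It assumes one
file per object on each replica, and k + m chunk files per object in the
erasure-coded pool. The function names are made up for illustration and are
not part of any Ceph tool.

```python
def files_replicated(objects: int, size: int) -> int:
    """One file per object on each of `size` replicas."""
    return objects * size


def files_erasure_coded(objects: int, k: int, m: int) -> int:
    """One chunk file per object for each of the k data and m coding chunks."""
    return objects * (k + m)


objects_per_pool = 100_000_000  # 100 M objects in each pool (from the message)

replicated = files_replicated(objects_per_pool, size=3)
erasure_coded = files_erasure_coded(objects_per_pool, k=5, m=4)
total = replicated + erasure_coded

print(f"{replicated:,} + {erasure_coded:,} = {total:,} files")
# Average over the 144 OSDs mentioned above; real OSDs will deviate from this.
print(f"about {total / 144:,.0f} files per OSD")
```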

We're observing that our PG subfolders only hold around 120 files each
when they should hold around 320 (we're using default split / merge
values).
All objects were created when the cluster was running Hammer. We're now
running Jewel (RHCS 2.0 actually).

We ran some tests on a Jewel backup infrastructure. Split happens at
around 320 files per directory, as expected.
We have no idea why we're not seeing 320 files per PG subfolder on our
production cluster pools.

Everything we read suggests raising the filestore_merge_threshold and
filestore_split_multiple values to 40 / 8:

https://www.redhat.com/en/files/resources/en-rhst-cephstorage-supermicro-INC0270868_v2_0715.pdf
https://bugzilla.redhat.com/show_bug.cgi?id=1219974
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041179.html
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012987.html

We now need to merge directories (when you need to split apparently :-)

We will do so by increasing filestore_merge_threshold in steps of 10,
up to maybe 120, and then lowering it back to 40.
Between each step we'll run 'rados bench' (in cleanup mode) on both
pools to generate enough delete operations to trigger merge operations
on each PG.
By running 'rados bench' at night, our clients won't be impacted much
by blocked requests.
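
The stepping plan described above can be written down as a simple schedule.
This sketch only lists the values to apply; it does not talk to the cluster.
It assumes the starting point is a merge threshold of 10 (the message says
default values are in use), and the function name is made up for
illustration.

```python
def merge_threshold_schedule(start=10, step=10, peak=120, final=40):
    """Values to apply in order: ramp up from start to peak, then drop to final."""
    # `start` is the current value, so the first value to apply is start + step.
    ramp = list(range(start + step, peak + 1, step))
    return ramp + [final]


schedule = merge_threshold_schedule()
for value in schedule:
    # Between steps, the plan is a nightly 'rados bench' run in cleanup mode.
    print(f"filestore_merge_threshold = {value}")
```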

Running this on your cluster would also provoke splits when rados bench
writes to the pools.

Also, note that you can set merge and split values for a specific OSD in
ceph.conf ([osd.123]) so you can see how the OSD reorganizes the PG
tree when running a 'rados bench'.
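
A per-OSD override of this kind is just a section named after the OSD. The
sketch below renders such a section as text so it can be reviewed before
being pasted into ceph.conf. The OSD id 123 comes from the paragraph above,
the 40 / 8 values are the ones suggested earlier in this message, and the
helper name is hypothetical.

```python
def render_osd_override(osd_id: int, merge_threshold: int, split_multiple: int) -> str:
    """Build a ceph.conf section that applies only to one OSD."""
    lines = [
        f"[osd.{osd_id}]",
        f"filestore_merge_threshold = {merge_threshold}",
        f"filestore_split_multiple = {split_multiple}",
    ]
    return "\n".join(lines) + "\n"


snippet = render_osd_override(123, merge_threshold=40, split_multiple=8)
print(snippet)
```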

Regarding the OSDs flapping, does this happen when scrubbing? You may
be hitting the Jewel scrubbing bug Sage reported about 3 weeks ago (look
for 'stalls caused by scrub on jewel').
It's fixed in 10.2.4 and waiting for QA to make it into RHCS >= 2.0.

We are impacted by this bug because we have a lot of objects (200k) per
PG with, I think, bad split / merge values. Lowering vfs_cache_pressure
to 1 might also help to avoid the flapping.

Regards,

Frederic Nass,
Université de Lorraine.

----- On 27 Sep 16, at 0:42, David Turner <david.turner@xxxxxxxxxxxxxxxx>
wrote:

    We are running on Hammer 0.94.7 and have had very bad experiences
    with PG folders splitting a sub-directory further.  OSDs being
    marked out, hundreds of blocked requests, etc.  We have modified our
    settings and watched the behavior match the ceph documentation for
    splitting, but right now the subfolders are splitting outside of
    what the documentation says they should.

    filestore_split_multiple * abs(filestore_merge_threshold) * 16

    Our filestore_merge_threshold is set to 40.  When we had our
    filestore_split_multiple set to 8, we were splitting subfolders when
    a subfolder had (8 * 40 * 16 = ) 5120 objects in the directory.  In
    a different cluster we had to push that back again with elevated
    settings and the subfolders split when they had (16 * 40 * 16 = )
    10240 objects.

    We have another cluster that we're working with that is splitting at
    a value that seems to be a hardcoded maximum.  The settings are (32
    * 40 * 16 = ) 20480 objects before it should split, but it seems to
    be splitting subfolders at 12800 objects.
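
As a quick check of the arithmetic above, here is a minimal sketch of the
documented formula. The helper names are hypothetical; only the formula and
the numbers come from this thread. The second function simply inverts the
formula to show which multiple an observed split point would correspond to.
That is arithmetic only, not an explanation of the cause.

```python
def split_threshold(split_multiple: int, merge_threshold: int) -> int:
    """Objects in a subfolder before it splits, per the documented formula."""
    return split_multiple * abs(merge_threshold) * 16


def implied_split_multiple(observed_objects: int, merge_threshold: int) -> float:
    """Invert the formula: which multiple would give the observed split point?"""
    return observed_objects / (abs(merge_threshold) * 16)


# The three settings described in the message, all with merge threshold 40.
for multiple in (8, 16, 32):
    print(multiple, split_threshold(multiple, 40))

# The cluster that splits at 12800 objects instead of the expected 20480.
print(implied_split_multiple(12800, 40))
```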

    Normally I would expect this number to be a power of 2, but we
    recently found another hardcoded maximum: the object map only
    allows RBDs with a maximum of 256,000,000 objects in them.  12800
    matches that pattern of a power of 2 followed by a run of zeros,
    which is why it looks like a hardcoded maximum.

    Has anyone else encountered what seems to be a hardcoded maximum
    here?  Are we missing a setting elsewhere that is capping us, or
    diminishing our value?  Much more to the point, though, is there any
    way to mitigate how painful it is to split subfolders in PGs?  So
    far it seems like the only way we can do it is to push the setting
    up and later drop it back down during a week in which we plan to
    have our cluster plagued with blocked requests, all while cranking
    up our osd_heartbeat_grace so that we don't have flapping OSDs.

    A little more about our setup is that we have 32x 4TB HGST drives
    with 4x 200GB Intel DC3710 journals (8 drives per journal), dual
    hyper-threaded octa-core Xeon (32 virtual cores), 192GB memory, 10Gb
    redundant network... per storage node.

    David Turner | Cloud Operations Engineer | StorageCraft Technology
    Corporation

    _______________________________________________
    ceph-users mailing list
    ceph-users@xxxxxxxxxxxxxx
    http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
