Coincidentally, we've been suffering from split-induced slow requests on one of our clusters for the past week.
I wanted to add that it isn't at all obvious when slow requests are being caused by filestore splitting. (Even with the filestore/osd logs turned up to 10, or even 20, all you see is that an object write is taking >30s, which seems totally absurd.) Only after a lot of head scratching did I notice this thread and realize it could be the splitting -- sure enough, our PGs were crossing the 5120-object threshold, one by one, at a rate of around 5-10 PGs per hour.
I've just sent this PR for comments:
IMHO, this (or something similar) would help operators a bunch in identifying when this is happening.
Thanks!
Dan
On Fri, Dec 9, 2016 at 7:27 PM, David Turner <david.turner@xxxxxxxxxxxxxxxx> wrote:
Our 32k PGs each have about 25-30k objects (25-30GB per PG). When we first contracted with Red Hat support, they recommended a setting of about 4,000 files per directory before splitting into subfolders. With that setting, an osd_heartbeat_grace (how long an OSD can be unreachable before it is reported down to the MONs) of 60 was needed to keep OSDs from flapping during subfolder splitting.
We would raise the setting to make it through a holiday weekend or a period where we needed higher performance, with the plan to lower it again afterward. When we went to lower it, it was too painful to get through, and now we're at what looks like a hardcoded maximum of 12,800 objects per subfolder before a split is forced. At this object count we have to use an osd_heartbeat_grace of 240 to avoid flapping OSDs during subfolder splitting.
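For reference, the split point can be computed from the settings using the formula quoted further down in this thread (filestore_split_multiple * abs(filestore_merge_threshold) * 16, where 16 is the number of hash buckets per directory level). A small illustrative Python snippet:

```python
# Illustrative: how many objects a subfolder holds before filestore splits it,
# per the formula filestore_split_multiple * abs(filestore_merge_threshold) * 16.
def split_threshold(split_multiple, merge_threshold):
    return split_multiple * abs(merge_threshold) * 16

print(split_threshold(8, 40))    # 5120
print(split_threshold(16, 40))   # 10240
print(split_threshold(32, 40))   # 20480 by the formula -- yet we observe splits at 12,800
```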
Unless you NEED to merge your subfolders, you can set your filestore merge threshold to a negative number and it will never merge. The equation that decides when to split takes the absolute value of the merge threshold, so making it negative disables merging without changing the splitting behavior.
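As a sketch (the values here are illustrative, not a recommendation), that looks like this in ceph.conf:

```ini
[osd]
# negative merge threshold: merging disabled, splitting unaffected
filestore merge threshold = -40
# split still triggers at 8 * |-40| * 16 = 5120 objects per subfolder
filestore split multiple = 8
```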
The OSDs flapping is unrelated to the 10.2.3 bug. We're currently on 0.94.7 and have had this problem since Firefly. The flapping happens because the OSD is so busy splitting the subfolder that it stops responding to other requests; that's why raising osd_heartbeat_grace gets us through the splitting.
1) We do not have SELinux installed on our Ubuntu servers.
2) We monitor and manage our fragmentation and haven't seen much of an issue since we increased our alloc_size in the mount options for XFS.
"5) pre-splitting PGs is I think the right answer." Pre-splitting PGs is counter-intuitive. It's a good theory, but an ineffective practice. When a PG backfills to a new OSD it builds the directory structure according to the current settings of how deep the folder structure should be. So if you lose a drive or add storage, all of the PGs that move are no longer pre-split to where you think they are. We have seen multiple times where PGs are different depths on different OSDs. It is not a PG state as to how deep it's folder structure is, but a local state per copy of the PG on each OSD.
Ultimately we're looking to Bluestore to be our Knight in Shining Armor to come and save us from all of this, but in the meantime, I have a couple ideas for how to keep our clusters usable.
We add storage regularly without our cluster becoming completely unusable. Building on that, I am testing a procedure on some OSDs: weight them to 0, backfill all of the data off, restart them with new split/merge thresholds, and backfill data back onto them. This rebuilds the PGs on those OSDs with the current settings and gets us away from the 12,800-object cap we're stuck at now. The next round will weight the next set of drives to 0 while we start backfilling onto the previous drives with the new settings. I have some very efficient weighting techniques that keep the cluster balanced while doing this, but it still took 2 days to finish backfilling off of the 32 drives. Cluster performance was fairly poor during this, and I can only do 3 of our 30 nodes at a time, which is a long time of running in a degraded state.
The modification to the ceph-objectstore-tool in 10.2.4 and 0.94.10 looks very promising to help us manage this. Doing the splits offline would work out quite well for us. We're testing our QA environment with 10.2.3 and are putting some of that testing on hold until 10.2.4 is fixed.
David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943
If you are not the intended recipient of this message or received it erroneously, please notify the sender and delete it, together with any attachments, and be advised that any dissemination or copying of this message is prohibited.
________________________________________
From: ceph-users [ceph-users-bounces@lists.ceph.com ] on behalf of Mark Nelson [mnelson@xxxxxxxxxx]
Sent: Thursday, December 08, 2016 10:25 AM
To: ceph-users@xxxxxxxxxxxxxx
Subject: Re: filestore_split_multiple hardcoded maximum?
I don't want to retype it all, but you guys might be interested in the
discussion under section 3 of this post here:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012987.html
basically the gist of it is:
1) Make sure SELinux isn't doing security xattr lookups for link/unlink
operations (this makes splitting incredibly painful!). You may need to
disable SELinux.
2) xfs sticks files in a given directory in the same AG (i.e. portion of
the disk), but a subdirectory may end up in a different AG than its
parent directory. As the split depth grows, so does fragmentation, due
to files from the parent directories moving into new subdirectories
that have a different AG.
3) Increasing the split threshold is an option, but more files in a
single directory will cause readdir to slow down. The effect is fairly
minimal even at ~10k files relative to the other costs involved.
4) The bigger issue is that high split thresholds require more work per
split, but this is somewhat offset because splits tend to happen over a
larger time range, due to the inherent randomness in pg data
distribution being amplified. Still, when compounded with point 1
above, large splits can be debilitating when they happen.
5) pre-splitting PGs is I think the right answer. It should greatly
delay the onset of directory fragmentation, avoid a lot of early
linking/relinking, and in some cases (like RBD) potentially avoid any
additional splits altogether. The cost is increased inode cache misses
when there aren't many objects in the cluster yet. This could make
benchmarks on fresh clusters slower, but yield better behavior as the
cluster grows.
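To get a feel for why pre-splitting helps, here is a rough back-of-the-envelope model (hypothetical numbers; it assumes 16 hash buckets per directory level and the split-threshold formula discussed in this thread, and ignores uneven hash distribution):

```python
def split_threshold(split_multiple, merge_threshold):
    # Objects a subfolder holds before filestore splits it further.
    return split_multiple * abs(merge_threshold) * 16

def depth_needed(objects_per_pg, split_multiple=2, merge_threshold=10):
    """Rough estimate of the directory depth a PG eventually reaches:
    each split level fans objects out across 16 subdirectories."""
    threshold = split_threshold(split_multiple, merge_threshold)
    depth = 0
    while objects_per_pg / (16 ** depth) > threshold:
        depth += 1
    return depth

# With the defaults (2 / 10 -> 320 objects per dir), a PG holding 200k
# objects ends up several directory levels deep. Pre-splitting creates
# those levels at pool creation instead of under live client load.
print(depth_needed(200_000))  # prints 3
```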
Mark
On 12/08/2016 05:23 AM, Frédéric Nass wrote:
> Hi David,
>
> I'm surprised your message didn't get any echo yet. I guess it depends
> on how many files your OSDs get to store on the filesystem, which
> depends essentially on the use case.
>
> We're having similar issues with a 144-OSD cluster running 2 pools. Each
> one holds 100 M objects. One is replication x3 (256 PGs) and the other is
> EC k=5, m=4 (512 PGs).
> That's 300 M + 900 M = 1.2 B files stored on the XFS filesystems.
>
> We're observing that our PG subfolders only hold around 120 files each
> when they should hold around 320 (we're using the default split / merge
> values).
> All objects were created when the cluster was running Hammer. We're now
> running Jewel (RHCS 2.0 actually).
>
> We ran some tests on a Jewel backup infrastructure. Split happens at
> around 320 files per directory, as expected.
> We have no idea why we're not seeing 320 files per PG subfolder on our
> production cluster pools.
>
> Everything we read suggests to raise the filestore_merge_threshold and
> filestore_split_multiple values to 40 / 8 :
>
> https://www.redhat.com/en/files/resources/en-rhst-cephstorage-supermicro-INC0270868_v2_0715.pdf
> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-July/041179.html
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012987.html
>
> We now need to merge directories (just when you apparently need to split :-)
>
> We will do so by increasing filestore_merge_threshold in steps of 10
> units, up to maybe 120, and then lowering it back to 40.
> Between steps we'll run 'rados bench' (in cleanup mode) on both
> pools to generate enough delete operations to trigger merge operations
> on each PG.
> By running the 'rados bench' at night, our clients won't be much
> impacted by blocked requests.
>
> Running this on your cluster would also provoke splits when rados bench
> writes to the pools.
>
> Also, note that you can set merge and split values for a specific OSD in
> ceph.conf ([osd.123]), so you can see how that OSD reorganizes its PG
> trees when running a 'rados bench'.
>
> Regarding the OSDs flapping, does this happen when scrubbing? You may be
> hitting the Jewel scrubbing bug Sage reported about 3 weeks ago (look
> for 'stalls caused by scrub on jewel').
> It's fixed in 10.2.4 and is waiting for QA to make it into RHCS >= 2.0.
>
> We are impacted by this bug because we have a lot of objects (200k) per
> PG with, I think, bad split / merge values. Lowering vfs_cache_pressure
> to 1 might also help to avoid the flapping.
>
> Regards,
>
> Frederic Nass,
> Université de Lorraine.
>
> ----- Le 27 Sep 16, à 0:42, David Turner <david.turner@xxxxxxxxxxxxxxxx>
> a écrit :
>
> We are running on Hammer 0.94.7 and have had very bad experiences
> with PG folders splitting a sub-directory further. OSDs being
> marked out, hundreds of blocked requests, etc. We have modified our
> settings and watched the behavior match the ceph documentation for
> splitting, but right now the subfolders are splitting outside of
> what the documentation says they should.
>
> filestore_split_multiple * abs(filestore_merge_threshold) * 16
>
> Our filestore_merge_threshold is set to 40. When we had our
> filestore_split_multiple set to 8, we were splitting subfolders when
> a subfolder had (8 * 40 * 16 = ) 5120 objects in the directory. In
> a different cluster we had to push that back again with elevated
> settings and the subfolders split when they had (16 * 40 * 16 = )
> 10240 objects.
>
> We have another cluster that we're working with that is splitting at
> a value that seems to be a hardcoded maximum. The settings are (32
> * 40 * 16 = ) 20480 objects before it should split, but it seems to
> be splitting subfolders at 12800 objects.
>
> Normally I would expect this number to be a power of 2, but we
> recently found another hardcoded maximum of the object map only
> allowing RBDs with a maximum of 256,000,000 objects. The
> 12,800 matches that same pattern of a power of 2 followed by a
> string of zeros, suggesting a hardcoded maximum.
>
> Has anyone else encountered what seems to be a hardcoded maximum
> here? Are we missing a setting elsewhere that is capping us, or
> diminishing our value? Much more to the point, though, is there any
> way to mitigate how painful it is to split subfolders in PGs? So
> far it seems like the only way we can do it is to push up the
> setting to later drop it back down during a week that we plan to
> have our cluster plagued with blocked requests all while cranking
> our osd_heartbeat_grace so that we don't have flapping osds.
>
> A little more about our setup is that we have 32x 4TB HGST drives
> with 4x 200GB Intel DC3710 journals (8 drives per journal), dual
> hyper-threaded octa-core Xeon (32 virtual cores), 192GB memory, 10Gb
> redundant network... per storage node.
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com