Re: Welcome to ceph-large



Very interesting so far, thanks for posting your experiences.

I'm running a cluster with about 800 OSDs, still on Hammer 0.94.9.

The main usage is radosgw, so I'm pretty concerned by the reports of broken bucket access after upgrades to Jewel, as well as by the broken Ubuntu packages that prevent osd restarts, and so on.

I haven't had time to test yet, so I cannot say for sure how bad the latest version of the Debian packages for Jewel is.


On Oct 21, 2016, at 4:20 PM, David Turner <david.turner@xxxxxxxxxxxxxxxx> wrote:

I'm from the company that created the first issue with the osd map cache, but we had to upgrade asap to 0.94.7 before it got released in 0.94.8 due to a problem with not being able to scrub or snap trim half a dozen pgs in one of our clusters (a fix that came in 0.94.7).  We haven't upgraded to 0.94.9 yet.

We're still stuck with our workaround to fix the map cache issue in pre-0.94.8 by restarting every osd in our clusters before the map cache gets so large that the osds start flapping when they attempt to read through their meta directory.  The longest we can go is 16 days before we start dropping osds left and right with over 400GB of maps on every osd.  We head it off at ~200GB of maps on each osd by restarting all 4,808 OSDs on 170 storage nodes spread between 8 production clusters every week.  In our largest cluster (1494 osds) the osd maps get up to a total of ~300TB every week.

As such, we have a very efficient script that we'd be willing to share that gets through the 1,494 osds on 60 nodes in under 3 hours.  I won't post it in general in case someone that doesn't understand what it's doing tries to use it in an environment I didn't anticipate and it makes things so much worse.
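For anyone fighting the same map growth, here's a rough sketch of what such a rolling restart can look like. The init command, paths, and health check are assumptions (Hammer with sysvinit on the storage node); this is not David's actual script.

```shell
# Hedged sketch of a rolling OSD restart to flush the osd map cache.
# Assumes Hammer packages with sysvinit; adjust the restart command
# for upstart or systemd. Run on each storage node in turn.
for dir in /var/lib/ceph/osd/ceph-*; do
  id=${dir##*ceph-}                  # extract the numeric OSD id from the path
  du -sh "$dir/current/meta"         # see how many maps have piled up on disk
  /etc/init.d/ceph restart osd.$id   # restart this one OSD
  # wait for the cluster to settle before touching the next OSD
  until ceph health | grep -q HEALTH_OK; do sleep 10; done
done
```

The health gate is the important part: restarting the next OSD before peering finishes is how you turn a maintenance job into an outage.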

Back on point, we're skipping the upgrade to 0.94.9 and going straight to Jewel.  We're currently regression testing it in our QA environment and will hopefully be pushing it live in a couple weeks.  Things we have seen so far that we know will be happening are...

1)  Before you begin, make sure that your crush tunables profile is not on legacy or default and is at least firefly.  If you have to change your tunables profile, then you will have a sizable data shift of backfilling before you can continue with the upgrade.  Our QA environment had been recently reinstalled and was on the default tunables and after we upgraded the mons we were in a warning state to upgrade our tunables before continuing.
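For reference, checking and bumping the tunables looks roughly like this; the profile change is what triggers the backfill mentioned above, so plan a window for it:

```shell
ceph osd crush show-tunables      # inspect the current tunables profile
ceph osd crush tunables firefly   # switch profiles; expect a large backfill
ceph -s                           # watch the resulting data movement
```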

2)  We first tried upgrading our clients and then the cluster.  This was a terrible idea which broke creating RBDs, cloning RBDs, and probably many other things.  The Jewel clients expect that they're interacting with a Jewel cluster.  We redid the upgrade by upgrading the mons, then osds, waiting a full day for testing with the Hammer clients, and finally upgraded the clients to Jewel.  This worked flawlessly and we had no issues.  Creating RBDs, cloning, snapshots, deleting, etc all worked without issue.
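A quick way to confirm each stage is fully done before moving to the next is to ask the daemons themselves for their versions (the mon names and osd id below are hypothetical):

```shell
# Confirm every mon is on the new version before touching the OSDs,
# and every OSD before upgrading clients. Mon names are hypothetical.
for m in a b c; do ceph tell mon.$m version; done
ceph tell osd.0 version    # spot-check; loop over all osd ids in practice
```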

3)  We don't want the upgrade process to take months by chown'ing all of the osds during the upgrade so we made sure to use the workaround config option:

   setuser match path = /var/lib/ceph/$type/$cluster-$id

That works perfectly well when placed in the [global] section and will allow us to use the same config file on every host while slowly chown'ing everything to the ceph user.
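With that option in [global], the per-OSD chown can then be done one daemon at a time, something like the sketch below. The osd id and the upstart-style stop/start commands are assumptions; adapt them to your init system.

```shell
# Chown one OSD at a time while the rest of the cluster keeps serving I/O.
id=12                                           # hypothetical OSD id
stop ceph-osd id=$id                            # upstart on Ubuntu; adjust as needed
chown -R ceph:ceph /var/lib/ceph/osd/ceph-$id   # can take a while on a full disk
start ceph-osd id=$id
until ceph health | grep -q HEALTH_OK; do sleep 10; done
```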

3.a)  A sub point to this is that when we installed the Jewel packages our `df` output was broken saying that it couldn't read the mount points of every osd and we couldn't get the osds started on Jewel without restarting the entire node.  This was because the /var/lib/ceph/ directory changed permissions to 750 when the Jewel packages were installed.  We set that back to 755 and were able to upgrade the osds without restarting the node.
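The fix is a one-liner once you know to look for it:

```shell
ls -ld /var/lib/ceph     # shows drwxr-x--- (750) after the Jewel install
chmod 755 /var/lib/ceph  # restore 755 so df and osd starts work again
```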

4)  When upgrading from Hammer or Infernalis, the docs say to set the sortbitwise flag to enable the new object enumeration API, which is also required for BlueStore.  Setting this flag caused our cluster to peer every PG at once.  If you don't have enough RAM in your storage nodes, this will be detrimental for you as you could get into an OOM killer death spiral.  Luckily for us large cluster operators, this can be set after the upgrade is complete, during a maintenance window of your choosing.  As long as you don't need the new object enumeration API or the BlueStore backend you can wait on this, but you should definitely do it sooner rather than later so you aren't forced to enable it mid-upgrade when a later release you need requires the flag.
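When you do pick a window for it, the flag is a single command; expect every PG to re-peer at once:

```shell
ceph osd set sortbitwise     # every PG peers at once; do this in a window
ceph -s                      # watch peering drain back to active+clean
# ceph osd unset sortbitwise  # rollback if peering overwhelms your nodes
```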

This is what we saw while upgrading a miniature version of our production clusters and hopefully we don't run into anything worse when we upgrade production.  I hope these tips are helpful and am very interested to hear anyone else's experience.

David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943


From: Ceph-large [ceph-large-bounces@xxxxxxxxxxxxxx] on behalf of Stillwell, Bryan J [Bryan.Stillwell@xxxxxxxxxxx]
Sent: Friday, October 21, 2016 12:57 PM
To: ceph-large@xxxxxxxxxxxxxx
Subject: Welcome to ceph-large

Thanks everyone for joining the list!

As I mentioned in the ceph-users post, this list is for people running
large Ceph clusters (>500 OSDs) to discuss issues and experiences that you
don't run into with small clusters.  Please try and keep conversations
which are related to Ceph, but not specific to running a large cluster on
the ceph-users list.

Personally I run two different Ceph clusters that currently have over
1,300 OSDs each.  Recently I've run into two different bugs which I
believe most of us have either run into or will run into.

The first issue was related to excessive OSD maps being kept for every OSD
which resulted in quite a bit of each OSD's storage being wasted (I saw up
to 20%).  By default 500 OSD maps are stored per OSD, but I saw up to
200,000 OSD maps on some OSDs.  This was made worse by the size of the
clusters since the size of each OSD map was ~1 MB.  This was fixed in the
0.94.8 release, but that brings me to the next bug we ran into...
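A couple of ways to see how many maps an OSD is actually holding (the osd id and paths below are illustrative):

```shell
# oldest_map / newest_map in the status output show the retained epoch range
ceph daemon osd.12 status
# or measure the on-disk footprint of the maps directly (FileStore layout)
du -sh /var/lib/ceph/osd/ceph-12/current/meta
```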

When we attempted to upgrade our clusters from 0.94.6 to the 0.94.9
release we saw a huge number of slow requests at the start of the upgrade.
This was caused by a change to the OSD map encoding (introduced in
0.94.7) which caused all the 0.94.6 OSDs to request full OSD maps from the
0.94.9 mon nodes instead of incremental ones.  This flooded the outgoing
network connection on all the mon nodes any time there was an update to
the OSD map for a couple minutes at a time.  This caused all sorts of
problems and even resulted in an incomplete pg at one point.

The solution was to upgrade all the OSDs to 0.94.9 first and then upgrade
the mon nodes.

Now that we're on 0.94.9 things are working pretty well, but next up is to
look into upgrading them to Jewel.  Has anyone gone through that process
and willing to share their experiences?


Ceph-large mailing list
