Re: Welcome to ceph-large



Very interesting so far, thanks for posting your experiences.

I'm running a cluster with about 800 OSDs, still on Hammer 0.94.9.

The main usage is radosgw, so I'm pretty concerned by the reports of broken bucket access after upgrades to Jewel, as well as by the broken Ubuntu packages that prevent osd restarts, and so on.

I haven't had time to test yet, so I cannot say for sure how bad the latest version of the Debian packages for Jewel is.


On Oct 21, 2016, at 4:20 PM, David Turner <david.turner@xxxxxxxxxxxxxxxx> wrote:

I'm from the company that created the first issue with the osd map cache, but we had to upgrade asap to 0.94.7 before it got released in 0.94.8 due to a problem with not being able to scrub or snap trim half a dozen pgs in one of our clusters (a fix that came in 0.94.7).  We haven't upgraded to 0.94.9 yet.

We're still stuck with our workaround to fix the map cache issue in pre-0.94.8 by restarting every osd in our clusters before the map cache gets so large that the osds start flapping when they attempt to read through their meta directory.  The longest we can go is 16 days before we start dropping osds left and right with over 400GB of maps on every osd.  We head it off at ~200GB of maps on each osd by restarting all 4,808 OSDs on 170 storage nodes spread between 8 production clusters every week.  In our largest cluster (1494 osds) the osd maps get up to a total of ~300TB every week.

As such, we have a very efficient script that we'd be willing to share that gets through the 1,494 osds on 60 nodes in under 3 hours.  I won't post it in general in case someone that doesn't understand what it's doing tries to use it in an environment I didn't anticipate and it makes things so much worse.
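For anyone fighting the same map growth, here's a rough sketch of what such a rolling restart can look like. The init command, paths, and health check are assumptions (Hammer with sysvinit on the storage node); this is not David's actual script.

```shell
# Hedged sketch of a rolling OSD restart to flush the osd map cache.
# Assumes Hammer packages with sysvinit; adjust the restart command
# for upstart or systemd. Run on each storage node in turn.
for dir in /var/lib/ceph/osd/ceph-*; do
  id=${dir##*ceph-}                  # extract the numeric OSD id from the path
  du -sh "$dir/current/meta"         # see how many maps have piled up on disk
  /etc/init.d/ceph restart osd.$id   # restart this one OSD
  # wait for the cluster to settle before touching the next OSD
  until ceph health | grep -q HEALTH_OK; do sleep 10; done
done
```

The health gate is the important part: restarting the next OSD before peering finishes is how you turn a maintenance job into an outage.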

Back on point, we're skipping the upgrade to 0.94.9 and going straight to Jewel.  We're currently regression testing it in our QA environment and will hopefully be pushing it live in a couple weeks.  Things we have seen so far that we know will be happening are...

1)  Before you begin, make sure that your crush tunables profile is not on legacy or default and is at least firefly.  If you have to change your tunables profile, then you will have a sizable data shift of backfilling before you can continue with the upgrade.  Our QA environment had been recently reinstalled and was on the default tunables and after we upgraded the mons we were in a warning state to upgrade our tunables before continuing.
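For reference, checking and bumping the tunables looks roughly like this; the profile change is what triggers the backfill mentioned above, so plan a window for it:

```shell
ceph osd crush show-tunables      # inspect the current tunables profile
ceph osd crush tunables firefly   # switch profiles; expect a large backfill
ceph -s                           # watch the resulting data movement
```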

2)  We first tried upgrading our clients and then the cluster.  This was a terrible idea which broke creating RBDs, cloning RBDs, and probably many other things.  The Jewel clients expect that they're interacting with a Jewel cluster.  We redid the upgrade by upgrading the mons, then osds, waiting a full day for testing with the Hammer clients, and finally upgraded the clients to Jewel.  This worked flawlessly and we had no issues.  Creating RBDs, cloning, snapshots, deleting, etc all worked without issue.
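A quick way to confirm each stage is fully done before moving to the next is to ask the daemons themselves for their versions (the mon names and osd id below are hypothetical):

```shell
# Confirm every mon is on the new version before touching the OSDs,
# and every OSD before upgrading clients. Mon names are hypothetical.
for m in a b c; do ceph tell mon.$m version; done
ceph tell osd.0 version    # spot-check; loop over all osd ids in practice
```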

3)  We don't want the upgrade process to take months by chown'ing all of the osds during the upgrade so we made sure to use the workaround config option:

   setuser match path = /var/lib/ceph/$type/$cluster-$id

That works perfectly well when placed in the [global] section and will allow us to use the same config file on every host while slowly chown'ing everything to the ceph user.
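With that option in [global], the per-OSD chown can then be done one daemon at a time, something like the sketch below. The osd id and the upstart-style stop/start commands are assumptions; adapt them to your init system.

```shell
# Chown one OSD at a time while the rest of the cluster keeps serving I/O.
id=12                                           # hypothetical OSD id
stop ceph-osd id=$id                            # upstart on Ubuntu; adjust as needed
chown -R ceph:ceph /var/lib/ceph/osd/ceph-$id   # can take a while on a full disk
start ceph-osd id=$id
until ceph health | grep -q HEALTH_OK; do sleep 10; done
```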

3.a)  A sub point to this is that when we installed the Jewel packages our `df` output was broken saying that it couldn't read the mount points of every osd and we couldn't get the osds started on Jewel without restarting the entire node.  This was because the /var/lib/ceph/ directory changed permissions to 750 when the Jewel packages were installed.  We set that back to 755 and were able to upgrade the osds without restarting the node.
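The fix is a one-liner once you know to look for it:

```shell
ls -ld /var/lib/ceph     # shows drwxr-x--- (750) after the Jewel install
chmod 755 /var/lib/ceph  # restore 755 so df and osd starts work again
```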

4)  When upgrading from Hammer or Infernalis, the docs say to set the sortbitwise flag to enable the new object enumeration API, which is also required for BlueStore.  Setting this flag caused our cluster to peer every PG at once.  If you don't have enough RAM in your storage nodes, this will be detrimental for you as you could get into an OOM killer death spiral.  Luckily for us large cluster operators, this can be set after the upgrade is complete, during a maintenance window of your choosing.  As long as you don't need the new object enumeration API or the BlueStore backend you can wait on this, but you should definitely do it sooner rather than later so you aren't forced to enable it mid-upgrade when a later release you need requires the flag.
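When you do pick a window for it, the flag is a single command; expect every PG to re-peer at once:

```shell
ceph osd set sortbitwise     # every PG peers at once; do this in a window
ceph -s                      # watch peering drain back to active+clean
# ceph osd unset sortbitwise  # rollback if peering overwhelms your nodes
```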

This is what we saw while upgrading a miniature version of our production clusters and hopefully we don't run into anything worse when we upgrade production.  I hope these tips are helpful and am very interested to hear anyone else's experience.

David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943


From: Ceph-large [ceph-large-bounces@xxxxxxxxxxxxxx] on behalf of Stillwell, Bryan J [Bryan.Stillwell@xxxxxxxxxxx]
Sent: Friday, October 21, 2016 12:57 PM
To: ceph-large@xxxxxxxxxxxxxx
Subject: Welcome to ceph-large

Thanks everyone for joining the list!

As I mentioned in the ceph-users post, this list is for people running
large Ceph clusters (>500 OSDs) to discuss issues and experiences that you
don't run into with small clusters.  Please try and keep conversations
which are related to Ceph, but not specific to running a large cluster on
the ceph-users list.

Personally I run two different Ceph clusters that currently have over
1,300 OSDs each.  Recently I've run into two different bugs which I
believe most of us have either run into or will run into.

The first issue was related to excessive OSD maps being kept for every OSD
which resulted in quite a bit of each OSD's storage being wasted (I saw up
to 20%).  By default 500 OSD maps are stored per OSD, but I saw up to
200,000 OSD maps on some OSDs.  This was made worse by the size of the
clusters since the size of each OSD map was ~1 MB.  This was fixed in the
0.94.8 release, but that brings me to the next bug we ran into...
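A couple of ways to see how many maps an OSD is actually holding (the osd id and paths below are illustrative):

```shell
# oldest_map / newest_map in the status output show the retained epoch range
ceph daemon osd.12 status
# or measure the on-disk footprint of the maps directly (FileStore layout)
du -sh /var/lib/ceph/osd/ceph-12/current/meta
```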

When we attempted to upgrade our clusters from 0.94.6 to the 0.94.9
release we saw a huge number of slow requests at the start of the upgrade.
This was caused by a change to the OSD map encoding (introduced in
0.94.7) which caused all the 0.94.6 OSDs to request full OSD maps from the
0.94.9 mon nodes instead of incremental ones.  This flooded the outgoing
network connection on all the mon nodes any time there was an update to
the OSD map for a couple minutes at a time.  This caused all sorts of
problems and even resulted in an incomplete pg at one point.

The solution was to upgrade all the OSDs to 0.94.9 first and then upgrade
the mon nodes.

Now that we're on 0.94.9 things are working pretty well, but next up is to
look into upgrading them to Jewel.  Has anyone gone through that process
and willing to share their experiences?


Ceph-large mailing list
