On 10/21/16, 2:20 PM, "David Turner" <david.turner@xxxxxxxxxxxxxxxx> wrote:

>I'm from the company that filed the first issue about the osd map cache, but we had to upgrade ASAP to 0.94.7, before the map cache fix was released in 0.94.8, due to a problem with not being able to scrub or snap trim half a dozen pgs in one of our clusters (a fix for that came in 0.94.7). We haven't upgraded to 0.94.9 yet. We're still stuck with our pre-0.94.8 workaround for the map cache issue: restarting every osd in our clusters before the map cache gets so large that the osds start flapping when they attempt to read through their meta directory. The longest we can go is 16 days before we start dropping osds left and right with over 400GB of maps on every osd.
>
>We head it off at ~200GB of maps on each osd by restarting all 4,808 OSDs on 170 storage nodes spread between 8 production clusters every week. In our largest cluster (1,494 osds) the osd maps grow to a total of ~300TB every week. As such, we have a very efficient script, which we'd be willing to share, that gets through the 1,494 osds on 60 nodes in under 3 hours. I won't post it publicly in case someone who doesn't understand what it's doing tries to use it in an environment I didn't anticipate and makes things much worse.

There is a workaround that mostly works for the excessive OSD maps:

osd_pg_epoch_persisted_max_stale=10
osd_map_cache_size=20
osd_map_max_advance=10
osd_mon_report_interval_min=10
osd_pg_stat_report_interval_max=10
paxos_service_trim_min=10
mon_min_osdmap_epochs=20

We used this until we were able to upgrade to 0.94.9. Maybe it's something you can try.

>Back on point, we're skipping the upgrade to 0.94.9 and going straight to Jewel. We're currently regression testing it in our QA environment and will hopefully be pushing it live in a couple of weeks. Things we have seen so far that we know will be happening are...
>
>1) Before you begin, make sure that your crush tunables profile is not on legacy or default and is at least firefly. If you have to change your tunables profile, you will have a sizable data shift of backfilling before you can continue with the upgrade. Our QA environment had been recently reinstalled and was on the default tunables, and after we upgraded the mons we were in a warning state telling us to upgrade our tunables before continuing.
>
>2) We first tried upgrading our clients and then the cluster. This was a terrible idea which broke creating RBDs, cloning RBDs, and probably many other things. The Jewel clients expect that they're interacting with a Jewel cluster. We redid the upgrade by upgrading the mons, then the osds, waiting a full day for testing with the Hammer clients, and finally upgrading the clients to Jewel. This worked flawlessly and we had no issues. Creating RBDs, cloning, snapshots, deleting, etc. all worked without issue.
>
>3) We didn't want the upgrade process to take months by chown'ing all of the osds during the upgrade, so we made sure to use the workaround config option:
>
>    setuser match path = /var/lib/ceph/$type/$cluster-$id
>
>That works perfectly well when placed in the [global] section and will allow us to use the same config file on every host while slowly chown'ing everything to the ceph user.
>
>3.a) A sub-point to this is that when we installed the Jewel packages our `df` output was broken, saying it couldn't read the mount points of every osd, and we couldn't get the osds started on Jewel without restarting the entire node. This was because the permissions on the /var/lib/ceph/ directory changed to 750 when the Jewel packages were installed. We set that back to 755 and were able to upgrade the osds without restarting the node.
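We're planning the same trick for our upgrade, so for anyone else following along, here's roughly what the transition looks like end to end. This is only a sketch: it assumes systemd-managed daemons and the default cluster name "ceph", and osd.12 below is purely an example id, so adjust for your own environment.

    # ceph.conf, [global] section (same file on every host during the transition)
    [global]
    setuser match path = /var/lib/ceph/$type/$cluster-$id

    # per 3.a above, the Jewel packages may tighten /var/lib/ceph to 750; 755 keeps df and the osd mounts happy
    chmod 755 /var/lib/ceph

    # later, at whatever pace suits you, hand each daemon's directory to the ceph user, e.g. for one osd:
    systemctl stop ceph-osd@12
    chown -R ceph:ceph /var/lib/ceph/osd/ceph-12
    systemctl start ceph-osd@12

With the match path setting in place, a daemon whose directory is still owned by root keeps running as root, and it switches to the ceph user once its directory has been chown'd, so the ownership change can trail the upgrade by as long as you need.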
>
>4) When upgrading from Hammer or Infernalis, the documentation says to set the sortbitwise flag, which enables the new object enumeration API and is also required for BlueStore. Setting this flag caused our cluster to peer every PG at once. If you don't have enough RAM in your storage nodes, this will be detrimental for you, as you could get into an OOM-killer death spiral. Luckily for us large-cluster operators, this can be set after the upgrade is complete, during a maintenance window of your choosing. As long as you don't need the new object enumeration API or the BlueStore backend you can wait on this, but you should definitely do it sooner rather than later so you aren't forced into it when a later upgrade that you really need requires the flag to be enabled.
>
>This is what we saw while upgrading a miniature version of our production clusters, and hopefully we don't run into anything worse when we upgrade production. I hope these tips are helpful and am very interested to hear anyone else's experience.

Thanks for posting this! I hope to have a similar report after we upgrade to Jewel!

Bryan

_______________________________________________
Ceph-large mailing list
Ceph-large@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-large-ceph.com