On 10/21/16, 2:20 PM, "David Turner" <david.turner@xxxxxxxxxxxxxxxx> wrote:

>I'm from the company that filed the first issue about the osd map cache, but we had to upgrade ASAP to 0.94.7, before the map cache fix was released in 0.94.8, due to a problem with not being able to scrub or snap trim half a dozen pgs in one of our clusters (a fix for that came in 0.94.7). We haven't upgraded to 0.94.9 yet. We're still stuck with our pre-0.94.8 workaround for the map cache issue: restarting every osd in our clusters before the map cache gets so large that the osds start flapping when they attempt to read through their meta directory. The longest we can go is 16 days before we start dropping osds left and right with over 400GB of maps on every osd.
>
>We head it off at ~200GB of maps on each osd by restarting all 4,808 OSDs on 170 storage nodes spread between 8 production clusters every week. In our largest cluster (1,494 osds) the osd maps grow to a total of ~300TB every week. As such, we have a very efficient script, which we'd be willing to share, that gets through the 1,494 osds on 60 nodes in under 3 hours. I won't post it publicly in case someone who doesn't understand what it's doing tries to use it in an environment I didn't anticipate and makes things much worse.

There is a workaround that mostly works for the excessive OSD maps:

osd_pg_epoch_persisted_max_stale=10
osd_map_cache_size=20
osd_map_max_advance=10
osd_mon_report_interval_min=10
osd_pg_stat_report_interval_max=10
paxos_service_trim_min=10
mon_min_osdmap_epochs=20

We used this until we were able to upgrade to 0.94.9. Maybe it's something you can try.

>Back on point, we're skipping the upgrade to 0.94.9 and going straight to Jewel. We're currently regression testing it in our QA environment and will hopefully be pushing it live in a couple of weeks. Things we have seen so far that we know will be happening are...
>
>1) Before you begin, make sure that your crush tunables profile is not on legacy or default and is at least firefly. If you have to change your tunables profile, you will have a sizable data shift of backfilling before you can continue with the upgrade. Our QA environment had been recently reinstalled and was on the default tunables, and after we upgraded the mons we were in a warning state telling us to upgrade our tunables before continuing.
>
>2) We first tried upgrading our clients and then the cluster. This was a terrible idea which broke creating RBDs, cloning RBDs, and probably many other things. The Jewel clients expect that they're interacting with a Jewel cluster. We redid the upgrade by upgrading the mons, then the osds, waiting a full day for testing with the Hammer clients, and finally upgrading the clients to Jewel. This worked flawlessly and we had no issues. Creating RBDs, cloning, snapshots, deleting, etc. all worked without issue.
>
>3) We didn't want the upgrade process to take months by chown'ing all of the osds during the upgrade, so we made sure to use the workaround config option:
>
>    setuser match path = /var/lib/ceph/$type/$cluster-$id
>
>That works perfectly well when placed in the [global] section and will allow us to use the same config file on every host while slowly chown'ing everything to the ceph user.
>
>3.a) A sub-point to this is that when we installed the Jewel packages our `df` output was broken, saying it couldn't read the mount points of every osd, and we couldn't get the osds started on Jewel without restarting the entire node. This was because the permissions on the /var/lib/ceph/ directory changed to 750 when the Jewel packages were installed. We set that back to 755 and were able to upgrade the osds without restarting the node.
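We're planning the same trick for our upgrade, so for anyone else following along, here's roughly what the transition looks like end to end. This is only a sketch: it assumes systemd-managed daemons and the default cluster name "ceph", and osd.12 below is purely an example id, so adjust for your own environment.

    # ceph.conf, [global] section (same file on every host during the transition)
    [global]
    setuser match path = /var/lib/ceph/$type/$cluster-$id

    # per 3.a above, the Jewel packages may tighten /var/lib/ceph to 750; 755 keeps df and the osd mounts happy
    chmod 755 /var/lib/ceph

    # later, at whatever pace suits you, hand each daemon's directory to the ceph user, e.g. for one osd:
    systemctl stop ceph-osd@12
    chown -R ceph:ceph /var/lib/ceph/osd/ceph-12
    systemctl start ceph-osd@12

With the match path setting in place, a daemon whose directory is still owned by root keeps running as root, and it switches to the ceph user once its directory has been chown'd, so the ownership change can trail the upgrade by as long as you need.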
>
>4) When upgrading from Hammer or Infernalis, the documentation says to set the sortbitwise flag, which enables the new object enumeration API and is also required for BlueStore. Setting this flag caused our cluster to peer every PG at once. If you don't have enough RAM in your storage nodes, this will be detrimental for you, as you could get into an OOM-killer death spiral. Luckily for us large-cluster operators, this can be set after the upgrade is complete, during a maintenance window of your choosing. As long as you don't need the new object enumeration API or the BlueStore backend you can wait on this, but you should definitely do it sooner rather than later so you aren't forced into it when a later upgrade that you really need requires the flag to be enabled.
>
>This is what we saw while upgrading a miniature version of our production clusters, and hopefully we don't run into anything worse when we upgrade production. I hope these tips are helpful and am very interested to hear anyone else's experience.

Thanks for posting this! I hope to have a similar report after we upgrade to Jewel!

Bryan

_______________________________________________
Ceph-large mailing list
Ceph-large@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-large-ceph.com