An update for the list archive, and in case people have similar issues in the future.

My cluster took about 18 hours after re-setting noup for all of the OSDs to get to the current epoch. In the end there were 5 that took a few hours longer than the others. Other small issues came up during the process, such as ceph logs filling up /var and memory/swap filling up, which probably caused it all to take longer than it should have. Simply restarting the OSDs when memory/swap was filling up allowed them to catch up faster. The daemons probably generated a bit under 1 TB of logs throughout the whole process, so /var got expanded.

Once the OSDs all had the current epoch I unset noup and let the cluster peer and activate PGs. This took another ~6 hours and was likely slowed by some of the oldest, undersized OSD servers not having enough CPU/memory to handle it. Throughout the peering/activating I periodically and briefly unset nodown as a way to see if any OSDs were having problems, and then addressed those. In the end everything came back, the cluster is healthy, and there are no remaining PG problems. How the reweight triggered a problem this severe is still unknown.

A couple of takeaways:

- CPU and memory may not be highly utilized in daily operations, but they are very important for large recovery operations. Having a bit more memory and cores would probably have saved hours of recovery time and may have prevented my problem altogether.
- Slowing the map changes by quickly setting nodown,noout,noup when everything is already down will help as well (a rough command sketch follows below).
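For anyone hitting this in the future, the flag juggling above roughly maps to the commands below. This is only a sketch, not the exact session from my recovery; osd.12 is just an example id, and the daemon status output fields can vary a bit between releases.

    # Freeze map churn while everything is already down
    ceph osd set noup
    ceph osd set nodown
    ceph osd set noout

    # Watch the OSDs catch up to the current osdmap epoch
    ceph status                    # cluster-wide view, including the osdmap epoch
    ceph daemon osd.12 status      # run on that OSD's host; shows oldest_map / newest_map

    # Once every OSD reports the current epoch, let them come up and peer
    ceph osd unset noup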
Sage, thanks again for your input and advice.

Kevin

On 11/04/2017 11:54 PM, Sage Weil wrote:
> On Sat, 4 Nov 2017, Kevin Hrpcek wrote:
>> Hey Sage,
>>
>> Thanks for getting back to me this late on a weekend.
>>
>>> Do you know why the OSDs were going down? Are there any crash dumps in the osd logs, or is the OOM killer getting them?
>>
>> That's a part I can't nail down yet. OSDs didn't crash; after the reweight-by-utilization, OSDs on some of our earlier gen servers started spinning at 100% CPU and were overwhelmed. Admittedly these early gen osd servers are undersized on CPU, which is probably why they got overwhelmed, but it hasn't escalated like this before. Heartbeats among the cluster's OSDs started failing on those OSDs first, and then the 100% CPU problem seemed to snowball to all hosts. I'm still trying to figure out why the relatively small reweighting caused this problem.
>>
>>> The usual strategy here is to set 'noup' and get all of the OSDs to catch up on osdmaps (you can check progress via the above status command). Once they are all caught up, unset noup and let them all peer at once.
>>
>> I tried having noup set for a few hours earlier to see if stopping the moving osdmap target would help, but I eventually unset it while doing more troubleshooting. I'll set it again and let it go overnight. Patience is probably needed with a cluster this size. I saw this similar situation and was trying your previous solution: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/040030.html
>>
>>> The problem that has come up here in the past is when the cluster has been unhealthy for a long time and the past intervals use too much memory. I don't see anything in your description about memory usage, though. If that does rear its head there's a patch we can apply to kraken to work around it (this is fixed in luminous).
>>
>> Memory usage doesn't seem too bad, a little tight on some of those early gen servers, but I haven't seen OOM killing things off yet. I think I saw mention of that patch and luminous handling this type of situation better while googling the issue... larger osdmap increments or something similar, if I recall correctly. My cluster is a few weeks away from a luminous upgrade.
>
> That's good. You might also try setting nobackfill and norecover just to keep the load off the cluster while it's peering.
>
> s
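For the archive, Sage's nobackfill/norecover suggestion maps to commands along these lines (a sketch only; the 120 in the reweight example is just the documented default overload threshold, not necessarily what was used here):

    # Keep backfill/recovery load off the cluster while it peers
    ceph osd set nobackfill
    ceph osd set norecover

    # ...and re-enable them once PGs are active again
    ceph osd unset nobackfill
    ceph osd unset norecover

    # The reweight that kicked this off was of this form (120 = overload threshold in percent)
    ceph osd reweight-by-utilization 120

    # Newer releases also have a dry-run variant to preview the weight changes
    ceph osd test-reweight-by-utilization 120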
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com