Such great detail in this post, David. This will come in very handy for people in the future.

On Thu, Jul 21, 2016 at 8:24 PM, David Turner <david.turner@xxxxxxxxxxxxxxxx> wrote:
> The mon store is important, and since your cluster isn't healthy, the mons need to hold onto it to make sure that when things come back up, the mon can replay everything for them. Once you fix the 2 down and peering PGs, the mon store will fix itself in no time at all. Ceph is rightly refusing to compact that database until your cluster is healthy.
>
> It seems like there are a couple of things that might help your setup. First, I see something very easy to resolve, and that's the blocked requests. Try running the following command:
>
> ceph osd down 71
>
> That command will tell the cluster that osd.71 is down without restarting the actual osd daemon. osd.71 will come back and tell the mons it's actually up, but in the meantime the operations blocked on osd.71 will go to a secondary to get the response and clear up.
>
> Second, osd.53 looks to be causing the never-ending peering. A couple of questions to check things here: what is your osd_max_backfills set to? That is directly related to how fast osd.53 will fill back up. Something you might do to speed that up is to inject a higher setting for osd.53 alone, and not the rest of the cluster:
>
> ceph tell osd.53 injectargs '--osd_max_backfills=20'
>
> If this is the problem and the cluster is just waiting for osd.53 to finish backfilling, then this will get you there faster. I'm unfamiliar with the strategy you used to rebuild the data for osd.53. I would have removed the osd from the cluster and added it back in with the same weight. That way the osd would start right away and you would see the PGs backfilling onto it, as opposed to it sitting in a perpetual "booting" state.
>
> To remove the osd with minimal impact to the cluster, the following commands should get you there:
>
> ceph osd tree | grep 'osd.53 '
> ceph osd set nobackfill
> ceph osd set norecover
> # on the host with osd.53, stop the daemon
> ceph osd down 53
> ceph osd out 53
> ceph osd crush remove osd.53
> ceph auth rm osd.53
> ceph osd rm 53
>
> At this point osd.53 is completely removed from the cluster and you have the original weight of the osd to set it to when you bring it back in. The down and peering PGs should now be resolved. Now completely re-format the osd and add it back into the cluster. Make sure to do whatever you need for dmcrypt, journals, etc. that is specific to your environment. Once the osd is back in the cluster, up and in, reweight it to what it was before you removed it and unset norecover and nobackfill:
>
> ceph osd crush reweight osd.53 {{ weight_from_tree_command }}
> ceph osd unset nobackfill
> ceph osd unset norecover
>
> At this point everything is back to the way it was and the osd should start receiving data. The only data movement should be refilling osd.53 with the data it used to have; everything else should stay the same. Increasing the backfills for this osd will help it fill up faster, but client IO will be slower if you do. The mon stores will remain "too big" until backfilling onto osd.53 finishes, but once the data stops moving around and all of your osds are up and in, the mon stores will compact in no time.
>
> I hope this helps. Ask questions if you have any, and never run a command on your cluster that you don't understand.
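>
> If you want to confirm that the injected value took effect and keep an eye on the backfill, something along these lines should do it. This is only a rough sketch: the "ceph daemon" command has to run on the host that carries osd.53 and assumes the default admin socket location, and the grep is just a quick filter for pgs in a backfill state.
>
> # on the osd.53 host: show the runtime value of osd_max_backfills
> ceph daemon osd.53 config get osd_max_backfills
>
> # from any admin node: overall recovery progress and pgs still backfilling
> ceph -s
> ceph pg dump pgs_brief | grep -i backfill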
>
> David Turner
> ________________________________
> From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of Salwasser, Zac [zsalwass@xxxxxxxxxx]
> Sent: Thursday, July 21, 2016 12:54 PM
> To: ceph-users@xxxxxxxxxxxxxx
> Cc: Heller, Chris
> Subject: Uncompactable Monitor Store at 69GB -- Re: Cluster in warn state, not sure what to do next.
>
> Rephrasing for brevity – I have a monitor store that is 69GB and won’t compact any further on restart or with ‘tell compact’. Has anyone dealt with this before?
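>
> For reference, the compaction attempts looked roughly like the following; the mon id is one of ours, and the config option goes in the [mon] section of ceph.conf before restarting a monitor. Neither made a difference here, so treat this only as an illustration of what was tried.
>
> # ask a running monitor to compact its store
> ceph tell mon.a65-121-158-160 compact
>
> # or compact the store on the next monitor restart (ceph.conf)
> [mon]
>     mon compact on start = true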
>
> From: "Salwasser, Zac" <zsalwass@xxxxxxxxxx>
> Date: Thursday, July 21, 2016 at 1:18 PM
> To: "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
> Cc: "Salwasser, Zac" <zsalwass@xxxxxxxxxx>, "Heller, Chris" <cheller@xxxxxxxxxx>
> Subject: Cluster in warn state, not sure what to do next.
>
> Hi,
>
> I have a cluster that has been in an unhealthy state for a month or so. We eventually realized the OSDs were flapping because the user did not have access to enough file handles, but it took us a while to work that out, and we appear to have done a lot of damage to the state of the monitor store in the meantime.
>
> I’ve been trying to tackle one issue at a time, starting with the size of the monitor store. Compaction, either on restart or as a ‘tell’ operation, does not shrink the monitor store below its present size. Having no luck getting the monitor store to shrink, I switched gears to troubleshooting down placement groups. There are two remaining that I cannot fix, and they both claim to be blocked from peering by the same osd (osd.53).
>
> Two days ago, I removed the osd data for osd.53 and restarted it after a ‘mkfs’ operation. It has been in the “booting” state ever since, although there is now 72GB of data in the osd data partition for osd.53, indicating that some sort of partial “backfilling” has taken place. Watching the host file system indicates that any data coming into that partition at this point is only trickling in.
>
> Here is the output of “ceph health detail”. I’m wondering if anyone would be willing to engage with me to at least get me unstuck. I am on #ceph as salwasser.
>
> * * *
>
> HEALTH_WARN 2 pgs down; 2 pgs peering; 2 pgs stuck inactive; 2 pgs stuck unclean; 15 requests are blocked > 32 sec; 1 osds have slow requests; mds0: Behind on trimming (367/30); mds-1: Behind on trimming (364/30); mon.a65-121-158-160 store is getting too big! 74468 MB >= 15360 MB; mon.a65-121-158-161 store is getting too big! 73881 MB >= 15360 MB; mon.a65-121-158-195 store is getting too big! 64963 MB >= 15360 MB; mon.a65-121-158-196 store is getting too big! 64023 MB >= 15360 MB; mon.a65-121-158-197 store is getting too big! 63632 MB >= 15360 MB
> pg 4.285 is stuck inactive since forever, current state down+peering, last acting [28,122,114]
> pg 1.716 is stuck inactive for 969017.268003, current state down+peering, last acting [71,213,55]
> pg 4.285 is stuck unclean since forever, current state down+peering, last acting [28,122,114]
> pg 1.716 is stuck unclean for 969351.417382, current state down+peering, last acting [71,213,55]
> pg 1.716 is down+peering, acting [71,213,55]
> pg 4.285 is down+peering, acting [28,122,114]
> 5 ops are blocked > 4194.3 sec
> 10 ops are blocked > 2097.15 sec
> 5 ops are blocked > 4194.3 sec on osd.71
> 10 ops are blocked > 2097.15 sec on osd.71
> 1 osds have slow requests
> mds0: Behind on trimming (367/30)(max_segments: 30, num_segments: 367)
> mds-1: Behind on trimming (364/30)(max_segments: 30, num_segments: 364)
> mon.a65-121-158-160 store is getting too big! 74468 MB >= 15360 MB -- 53% avail
> mon.a65-121-158-161 store is getting too big! 73881 MB >= 15360 MB -- 73% avail
> mon.a65-121-158-195 store is getting too big! 64963 MB >= 15360 MB -- 81% avail
> mon.a65-121-158-196 store is getting too big! 64023 MB >= 15360 MB -- 81% avail
> mon.a65-121-158-197 store is getting too big! 63632 MB >= 15360 MB -- 81% avail

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com