Re: Uncompactable Monitor Store at 69GB -- Re: Cluster in warn state, not sure what to do next.

Hi,
The mons keep all maps going back to the last time the cluster had
HEALTH_OK, which is why the mon leveldb stores are so large in your case.
(I see Greg responded with the same info.) Focus on getting the
cluster healthy; the mon store sizes should then resolve themselves.
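If you want to see how far back the retained maps go, the osdmap epoch range
in the report output gives a rough idea. A minimal check, assuming the usual
JSON field names from ceph report (adjust if your release differs):

    # first/last osdmap epochs still held by the mons; a large gap means
    # many full maps are being kept until the cluster reaches HEALTH_OK again
    ceph report 2>/dev/null | grep -E '"osdmap_(first|last)_committed"'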
-- 
Dan


On Thu, Jul 21, 2016 at 8:54 PM, Salwasser, Zac <zsalwass@xxxxxxxxxx> wrote:
> Rephrasing for brevity – I have a monitor store that is 69GB and won’t
> compact any further on restart or with ‘tell compact’.  Has anyone dealt
> with this before?
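For context, the two compaction paths referred to above look roughly like this
(the mon name is just an example taken from the health output further down):

    # one-off compaction of a running mon
    ceph tell mon.a65-121-158-160 compact

    # or compact on every mon restart, via ceph.conf on the mon hosts
    [mon]
    mon compact on start = true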
>
>
>
>
>
>
>
> From: "Salwasser, Zac" <zsalwass@xxxxxxxxxx>
> Date: Thursday, July 21, 2016 at 1:18 PM
> To: "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
> Cc: "Salwasser, Zac" <zsalwass@xxxxxxxxxx>, "Heller, Chris"
> <cheller@xxxxxxxxxx>
> Subject: Cluster in warn state, not sure what to do next.
>
>
>
> Hi,
>
>
>
> I have a cluster that has been in an unhealthy state for a month or so.  We
> eventually realized the OSDs were flapping because their user did not have a
> high enough open file handle limit, but it took us a while to figure that
> out, and we appear to have done a lot of damage to the state of the monitor
> store in the meantime.
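For anyone hitting the same flapping symptom, the usual fix is to raise the
OSD file-descriptor limit. A rough sketch, with illustrative values:

    # check the limit of a running ceph-osd process
    grep 'open files' /proc/$(pidof ceph-osd | awk '{print $1}')/limits

    # sysvinit/upstart deployments: set it in ceph.conf on the OSD hosts
    [global]
    max open files = 131072

    # systemd deployments: raise LimitNOFILE for the ceph-osd unit instead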
>
>
>
> I’ve been trying to tackle one issue at a time, starting with the size of
> the monitor store.  Compaction, whether compact-on-restart or a ‘tell’
> operation, does not shrink the monitor store any further.  Having had no
> luck getting the monitor store to shrink, I switched gears to
> troubleshooting the down placement groups.  There are two remaining that I
> cannot fix, and they both claim to be blocked from peering by the same OSD
> (osd.53).
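One way to confirm what is blocking those PGs from peering (PG ids taken from
the health output below; exact field names can vary by release):

    ceph pg 1.716 query | less
    ceph pg 4.285 query | less
    # in the recovery_state section, look for entries such as
    # "peering is blocked due to down osds" and down_osds_we_would_probe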
>
>
>
> Two days ago, I removed the osd data for osd.53 and restarted it after a
> ‘mkfs’ operation.  It has been in the “booting” state ever since, although
> there is now 72GB of data in the osd data partition for osd.53, which
> suggests some sort of partial “backfilling” has taken place.  Watching the
> host file system shows that data is now only trickling into that partition.
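To see whether osd.53 ever registered as up and why it is stuck booting,
something along these lines may help (the daemon command has to run on the
osd.53 host):

    # the cluster's view of osd.53
    ceph osd dump | grep '^osd.53 '

    # the daemon's own view, via its admin socket
    ceph daemon osd.53 status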
>
>
>
> Here is the output of “ceph health detail”.  I’m wondering if anyone would
> be willing to engage with me to at least get me unstuck.  I am on #ceph as
> salwasser.
>
>
>
> * * *
>
> HEALTH_WARN 2 pgs down; 2 pgs peering; 2 pgs stuck inactive; 2 pgs stuck
> unclean; 15 requests are blocked > 32 sec; 1 osds have slow requests; mds0:
> Behind on trimming (367/30); mds-1: Behind on trimming (364/30);
> mon.a65-121-158-160 store is getting too big! 74468 MB >= 15360 MB;
> mon.a65-121-158-161 store is getting too big! 73881 MB >= 15360 MB;
> mon.a65-121-158-195 store is getting too big! 64963 MB >= 15360 MB;
> mon.a65-121-158-196 store is getting too big! 64023 MB >= 15360 MB;
> mon.a65-121-158-197 store is getting too big! 63632 MB >= 15360 MB
>
> pg 4.285 is stuck inactive since forever, current state down+peering, last
> acting [28,122,114]
>
> pg 1.716 is stuck inactive for 969017.268003, current state down+peering,
> last acting [71,213,55]
>
> pg 4.285 is stuck unclean since forever, current state down+peering, last
> acting [28,122,114]
>
> pg 1.716 is stuck unclean for 969351.417382, current state down+peering,
> last acting [71,213,55]
>
> pg 1.716 is down+peering, acting [71,213,55]
>
> pg 4.285 is down+peering, acting [28,122,114]
>
> 5 ops are blocked > 4194.3 sec
>
> 10 ops are blocked > 2097.15 sec
>
> 5 ops are blocked > 4194.3 sec on osd.71
>
> 10 ops are blocked > 2097.15 sec on osd.71
>
> 1 osds have slow requests
>
> mds0: Behind on trimming (367/30)(max_segments: 30, num_segments: 367)
>
> mds-1: Behind on trimming (364/30)(max_segments: 30, num_segments: 364)
>
> mon.a65-121-158-160 store is getting too big! 74468 MB >= 15360 MB -- 53%
> avail
>
> mon.a65-121-158-161 store is getting too big! 73881 MB >= 15360 MB -- 73%
> avail
>
> mon.a65-121-158-195 store is getting too big! 64963 MB >= 15360 MB -- 81%
> avail
>
> mon.a65-121-158-196 store is getting too big! 64023 MB >= 15360 MB -- 81%
> avail
>
> mon.a65-121-158-197 store is getting too big! 63632 MB >= 15360 MB -- 81%
> avail
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



