Re: Advice needed: stuck cluster halfway upgraded, comms issues and MON space usage

Hi Dan:

Thanks for the reply - at present, our mons and mgrs are off [because of
the unsustainable filesystem usage]. We'll try bringing them up again for
long enough to get a "ceph status" out of them (though last time the mgr
was unable to actually talk to anything), and will reply with the output
at that point.

(And thanks for the link to the bug tracker - I guess this mismatch of
expectations is why the devs are so keen to move to containerised
deployments, where different types of daemon are not co-located and so
they don't need to worry as much about when it's okay to restart a
service on a package update. Disappointing that the issue seems to have
gone stale after two years...)

Sam



On Mon, 22 Mar 2021 at 12:11, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:

> Hi Sam,
>
> The daemons restart (for *some* releases) because of this:
> https://tracker.ceph.com/issues/21672
> In short, if the selinux module changes, and if you have selinux
> enabled, then midway through yum update, there will be a systemctl
> restart ceph.target issued.
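> (As a rough pre-check before updating -- this is a sketch, adjust for
> your environment -- you can see whether that path would fire on a
> given host with something like:
>
>     getenforce                # is selinux actually enforcing here?
>     semodule -l | grep ceph   # currently loaded ceph policy module
>     rpm -q ceph-selinux       # installed ceph selinux policy package
>
> If selinux is disabled, or the module shipped by the new package is
> unchanged, the mid-update "systemctl restart ceph.target" should not
> be triggered.)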
>
> For the rest -- I think you should focus on getting the PGs all
> active+clean as soon as possible, because the degraded and remapped
> states are what leads to mon / osdmap growth.
> This kind of scenario is why we wrote this tool:
>
> https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py
> It will use pg-upmap-items to force the PGs to the OSDs where they are
> currently residing.
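> (From memory -- check the usage notes at the top of the script before
> running anything -- the typical pattern is to review the generated
> commands first and only then pipe them to a shell:
>
>     # upmap needs luminous-or-newer clients cluster-wide
>     ceph osd set-require-min-compat-client luminous
>
>     # dry run: prints the 'ceph osd pg-upmap-items ...' commands
>     ./upmap-remapped.py
>
>     # apply them once you are happy with the output
>     ./upmap-remapped.py | sh -x
>
> Once applied, the remapped PGs should go active+clean, and the
> balancer can remove the upmap entries gradually later on.)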
>
> But there is some clarification needed before you go ahead with that.
> Could you share `ceph status`, `ceph health detail`?
>
> Cheers, Dan
>
>
> On Mon, Mar 22, 2021 at 12:05 PM Sam Skipsey <aoanla@xxxxxxxxx> wrote:
> >
> > Hi everyone:
> >
> > I posted to the list on Friday morning (UK time), but apparently my email
> > is still in moderation (I have an email from the list bot telling me that
> > it's held for moderation but no updates).
> >
> > Since this is a bit urgent - we have ~3PB of storage offline - I'm
> posting
> > again.
> >
> > To save retyping the whole thing, I will direct you to a copy of the
> email
> > I wrote on Friday:
> >
> > http://aoanla.pythonanywhere.com/Logs/EmailToCephUsers.txt
> >
> > (Since that was sent, we did successfully add big SSDs to the MON hosts
> so
> > they don't fill up their disks with store.db s).
> >
> > I would appreciate any advice - assuming this also doesn't get stuck in
> > moderation queues.
> >
> > --
> > Sam Skipsey (he/him, they/them)
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>


-- 
Sam Skipsey (he/him, they/them)
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


