Re: can't get healthy cluster to trim osdmaps (13.2.8)

OK, replying to myself :-)

I wasn't very smart about decoding the output of "ceph-kvstore-tool get ...",
so I added a dump of creating_pgs.pgs into the get_trim_to function.
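
(For the record, the kvstore route looks roughly like this; the
osd_pg_creating/creating prefix and key, the ceph-dencoder type name, and
the mon name "ceph-a" are my assumptions from reading the mimic sources:)

    # run with the mon stopped, or against a copy of its store.db
    ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-a/store.db \
        get osd_pg_creating creating out /tmp/creating.bin
    # decode the raw blob and dump it as JSON
    ceph-dencoder type creating_pgs_t import /tmp/creating.bin decode dump_json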

Now I have the list of PGs which seem to be stuck in the creating state
in the monitors' DB. If I query them, they're active+clean, as I wrote.
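
(Just to illustrate the query; the pgid is a hypothetical one taken from
the decoded creating_pgs list:)

    ceph pg 2.7 query | jq -r '.state'
    # -> active+clean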

I suppose I could remove them using ceph-kvstore-tool, right?

However, I'd rather ask before I proceed:

Is it safe to remove them from the DB if they all seem to have been created already?

How do I do it? Stop all monitors, use the tool, and start them again?
(I've moved all services to another cluster, so this won't cause any outage.)
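
(Concretely, I'm imagining something like this on each mon host, with
store.db backed up first; the rm of the whole osd_pg_creating/creating
key is exactly the step I'm unsure about, and the mon name "ceph-a" is
just an example:)

    systemctl stop ceph-mon.target                # stop the local mon(s)
    cp -a /var/lib/ceph/mon/ceph-a/store.db /root/store.db.bak
    # drop the stale creating-pgs record from the mon's KV store
    ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-a/store.db \
        rm osd_pg_creating creating
    systemctl start ceph-mon.target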

I'd be very grateful for guidance here.

thanks in advance

BR

nik


On Mon, Mar 23, 2020 at 11:29:53AM +0100, Nikola Ciprich wrote:
> OK, so after some debugging, I've pinned the problem down to
> OSDMonitor::get_trim_to:
> 
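>     // get_trim_to returns 0 ("don't trim") as long as the mon still
>     // considers any PG to be in the creating state: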
>     std::lock_guard<std::mutex> l(creating_pgs_lock);
>     if (!creating_pgs.pgs.empty()) {
>       return 0;
>     }
> 
> Apparently creating_pgs.pgs.empty() is not true. Do I understand it
> correctly that the cluster thinks the list of creating PGs is not empty?
> 
> All PGs are in the active+clean state, so maybe there's something
> malformed in the DB? How can I check?
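> 
> (I suppose checking means inspecting the mon's key/value store directly,
> something like the following with the mon stopped, or on a copy of
> store.db; the osd_pg_creating prefix and the mon name "ceph-a" are my
> assumptions:)
> 
>     ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-a/store.db \
>         list osd_pg_creating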
> 
> I tried dumping the list of creating_pgs according to
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030297.html
> but to no avail.
> 
> On Tue, Mar 17, 2020 at 12:25:29PM +0100, Nikola Ciprich wrote:
> > Hello dear cephers,
> > 
> > Lately, there's been some discussion about slow requests hanging
> > in "wait for new map" status. At least in my case, it's being caused
> > by osdmaps not being properly trimmed. I tried all possible steps
> > to force osdmap pruning (restarting mons, restarting everything,
> > poking the crushmap), to no avail. All OSDs still keep min osdmap
> > version 1, while the newest is 4734. Otherwise the cluster is healthy:
> > no down OSDs, network communication works flawlessly, all seems to be
> > fine. I just can't get the old osdmaps to go away. It's a very small
> > cluster and I've moved all production traffic elsewhere, so I'm free
> > to investigate and debug, but I'm out of ideas on what to try or
> > where to look.
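> > 
> > (For reference, this is how I'm reading the osdmap range; the report
> > and daemon field names are my assumption of where the range shows up,
> > and osd.0 is just an example:)
> > 
> >     # the mon's view of the committed osdmap range
> >     ceph report 2>/dev/null | jq '.osdmap_first_committed, .osdmap_last_committed'
> >     # one OSD's view; run on the host where osd.0 lives
> >     ceph daemon osd.0 status | jq '.oldest_map, .newest_map'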
> > 
> > Does anybody have any ideas, please?
> > 
> > The cluster is running 13.2.8
> > 
> > I'd be very grateful for any tips
> > 
> > with best regards
> > 
> > nikola ciprich
> > 

-- 
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:    +420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis@xxxxxxxxxxx
-------------------------------------
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


