Weird behaviour of mon_osd_down_out_subtree_limit=host

“Friday fun”… not!

We set mon_osd_down_out_subtree_limit=host some time ago. Now we needed to take down all the OSDs on one host and, as expected, none of them got marked out (noout was _not_ set). All the PGs showed as stuck degraded.
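For reference, this is roughly how we have it set on the mons (the mon id “a” below is just a placeholder and the exact syntax may differ by release):

    # ceph.conf on the monitors -- don't auto-mark-out OSDs when a
    # whole subtree of type "host" (or bigger) is down
    [mon]
        mon osd down out subtree limit = host

    # check the running value via the admin socket
    ceph daemon mon.a config get mon_osd_down_out_subtree_limit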

Then we brought 3 of the OSDs on that host back up and took them down again because of slow-request madness.

Since then there’s been some weirdness I don’t have an explanation for (the commands I used to check are below the list):

1) There are 8 active+remapped PGs (hosted on completely different hosts from the one we were working on). Why?

2) How does mon_osd_down_out_subtree_limit actually work? How does it tell that the whole host is down? If I start just one OSD, is the host still considered down? Will it “out” all the other OSDs on that host?
It doesn’t look like it, because I just started one OSD and it didn’t mark out all the others.

3) After starting that one OSD, some backfills are occurring even though I set “nobackfill”.

4) The one OSD I started on this host now consumes 6.5 GB of memory (RSS). All the other OSDs in the cluster consume ~1.2-1.5 GB. No idea why…
(and it’s the vanilla tcmalloc build)
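These are the commands I’ve been using to look at 1)-3) — nothing fancy, and the grep patterns are just what I happened to use:

    # 1) list the remapped PGs and their up/acting sets
    ceph pg dump pgs_brief | grep remapped

    # 2) up/down state of every OSD, grouped per host
    ceph osd tree

    # 3) confirm the nobackfill flag really is set cluster-wide
    ceph osd dump | grep flags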

Doh…

Any ideas welcome. I can’t even start all the OSDs if they start consuming this amount of memory.
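The only thing I can think of trying for the memory issue is the tcmalloc heap commands, something along the lines of (osd.12 is just an example id):

    # tcmalloc heap statistics of the bloated OSD
    ceph tell osd.12 heap stats

    # ask tcmalloc to return freed-but-retained pages to the OS
    ceph tell osd.12 heap release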


Jan
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



