“Friday fun”… not!

We set mon_osd_down_out_subtree_limit=host some time ago. Now we needed to take down all the OSDs on one host and, as expected, nothing happened (noout was _not_ set). All the affected PGs showed as stuck degraded. Then we took 3 OSDs on that host up and back down again because of slow-request madness. Since then there’s some weirdness I don’t have an explanation for:

1) There are 8 active+remapped PGs, hosted on completely different hosts from the one we were working on. Why?

2) How does mon_osd_down_out_subtree_limit even work? How does it tell that the whole host is down? If I start just one OSD, is the host still considered down? Will it then “out” all the other OSDs? It doesn’t look like it, because I just started one OSD and it didn’t out all the others.

3) After starting that one OSD, some backfills are occurring, even though I set “nobackfill”.

4) The one OSD I started on this host now consumes 6.5 GB of memory (RSS). All the other OSDs in the cluster consume ~1.2-1.5 GB. No idea why… (and it’s the vanilla tcmalloc version). Doh…

Any ideas welcome. I can’t even start all the OSDs if they are going to consume this amount of memory. I’ve appended the commands I’m using to poke at this below my signature.

Jan
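
For reference (point 2), this is how the option is set on our mons, and a sanity check of what a running mon actually uses (the mon id is usually the short hostname, adjust as needed):

    # ceph.conf on the monitors
    [mon]
    mon osd down out subtree limit = host

    # via the admin socket on a mon host, confirm the running value
    $ ceph daemon mon.$(hostname -s) config show | grep mon_osd_down_out_subtree_limit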
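
This is roughly how I’m looking at the 8 remapped PGs from point 1 (the pgid 3.1f below is just an example, not one of ours):

    # list remapped PGs with their up/acting sets
    $ ceph pg dump pgs_brief | grep remapped

    # for a single PG, compare where CRUSH wants it (up) vs. where it is (acting)
    $ ceph pg map 3.1f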
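
For point 3, this is roughly how I’m checking the flag against what the PGs are actually doing:

    $ ceph osd set nobackfill
    # the flag does show as set cluster-wide
    $ ceph osd dump | grep flags
    # ...and yet PGs show up as backfilling
    $ ceph pg dump pgs_brief | grep backfill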
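
And for the memory question (point 4), this is what I’m using to compare the fat OSD with the others (osd.42 stands in for the real id; the heap commands need the tcmalloc build, which we have):

    # tcmalloc's view of the OSD heap, via the admin socket
    $ ceph tell osd.42 heap stats
    # ask tcmalloc to hand free pages back to the kernel
    $ ceph tell osd.42 heap release
    # RSS as the kernel sees it, for all OSDs on the host
    $ ps -C ceph-osd -o pid,rss,args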