On Sun, 30 Jun 2019, Bryan Henderson wrote:
> > I'm not sure why the monitor did not mark it _out_ after 600 seconds
> > (default)
>
> Well, that part I understand. The monitor didn't mark the OSD out because the
> monitor still considered the OSD up. No reason to mark an up OSD out.
>
> I think the monitor should have marked the OSD down upon not hearing from it
> for 15 minutes ("mon osd report interval"), then out 10 minutes after that
> ("mon osd down out interval").

Yes--if it didn't, that's a bug. Any logs would be helpful.

I'm a bit confused about what happened here, though: that 600 second
interval is only important if *every* OSD in the system is down. If you
reboot the data center, why didn't *any* OSD daemons start? (And even if
none did, having ceph -s report all OSDs down instead of up isn't going
to change anything except whether your pager is going off, right?)

sage

>
> And that's worst case. Though details of how OSDs watch each other are vague,
> I suspect an existing OSD was supposed to detect the dead OSDs and report that
> to the monitor, which would believe it within about a minute and mark the OSDs
> down. ("osd heartbeat interval", "mon osd min down reports", "mon osd min down
> reporters", "osd reporter subtree level").
>
> --
> Bryan Henderson                                   San Jose, California
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
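
For reference, the worst-case timing Bryan describes in the quoted text can
be sketched as below. This is only an illustration: the two durations are
the figures quoted in this thread (15 minutes and 10 minutes), not values
checked against any particular Ceph release, the function name
worst_case_timeline is hypothetical, and the sketch ignores the faster path
where peer OSDs report a failure to the monitor.

```python
# Durations as quoted in the thread (assumptions, not verified defaults).
REPORT_INTERVAL_S = 15 * 60    # "mon osd report interval": silence before "down"
DOWN_OUT_INTERVAL_S = 10 * 60  # "mon osd down out interval": "down" before "out"


def worst_case_timeline(last_report_s=0,
                        report_interval_s=REPORT_INTERVAL_S,
                        down_out_interval_s=DOWN_OUT_INTERVAL_S):
    """Return (marked_down_s, marked_out_s), in seconds, for an OSD that
    goes silent at last_report_s and is never reported by its peers."""
    marked_down_s = last_report_s + report_interval_s
    marked_out_s = marked_down_s + down_out_interval_s
    return marked_down_s, marked_out_s


down_at, out_at = worst_case_timeline()
print(f"marked down after {down_at // 60} min, out after {out_at // 60} min")
```

With the figures from the thread, the OSD would be marked down 15 minutes
after its last report and out 25 minutes after it.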