On Sat, Jun 29, 2019 at 8:13 PM Bryan Henderson <bryanh@xxxxxxxxxxxxxxxx> wrote:
>
> > I'm not sure why the monitor did not mark it _out_ after 600 seconds
> > (default)
>
> Well, that part I understand. The monitor didn't mark the OSD out because the
> monitor still considered the OSD up. No reason to mark an up OSD out.
>
> I think the monitor should have marked the OSD down upon not hearing from it
> for 15 minutes ("mon osd report interval"), then out 10 minutes after that
> ("mon osd down out interval").

It sounds like you had the whole cluster off and turned it on, and
those servers didn't come up. This is why.

The methods of detecting an OSD as down are
1) OSD heartbeat peers. That's as Robert describes (by default).
2) When an OSD is connected to a monitor, they heartbeat each other at
very long intervals, and the monitor flags the OSD down if it
disappears and isn't connected to a different monitor.

In your case, the OSD wasn't connected to any monitor, and it hadn't
set up any heartbeat peers. Normally, after a restart, some daemon
that used to have a connection to the OSD is still running and flags
it as dead. But if *all* the daemons in the cluster lose their soft
state, that can't happen.
-Greg

>
> And that's worst case. Though details of how OSDs watch each other are vague,
> I suspect an existing OSD was supposed to detect the dead OSDs and report that
> to the monitor, which would believe it within about a minute and mark the OSDs
> down. ("osd heartbeat interval", "mon osd min down reports", "mon osd min down
> reporters", "osd reporter subtree level").
>
> --
> Bryan Henderson                                   San Jose, California
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
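
As a rough illustration of the two detection methods described in the reply, here is a toy model in Python. It is not Ceph code: the `Osd` class, its fields, and `someone_can_flag_down` are made up for this example, and it assumes (following the reply) that nobody holds a heartbeat-peer entry or monitor session for an OSD that never came back after a full-cluster restart.

```python
# Toy model only, not Ceph code. All names are hypothetical.
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Osd:
    osd_id: int
    running: bool
    # Soft state held in memory; lost when the daemon restarts.
    heartbeat_peers: Set[int] = field(default_factory=set)
    # True if some monitor had a live session with this OSD before it died.
    mon_had_session: bool = False


def someone_can_flag_down(target: Osd, osds: List[Osd]) -> bool:
    """True if a surviving daemon is positioned to notice `target` is dead."""
    if target.running:
        return False  # nothing to detect
    # Method 1: a running OSD lists the target as a heartbeat peer.
    peer_notices = any(
        o.running and target.osd_id in o.heartbeat_peers for o in osds
    )
    # Method 2: a monitor had a session with the target and saw it vanish.
    return peer_notices or target.mon_had_session


# Single-OSD failure: osd.0 keeps running and still has osd.1 as a peer.
survivor = Osd(0, running=True, heartbeat_peers={1})
failed = Osd(1, running=False, mon_had_session=True)
single_failure = someone_can_flag_down(failed, [survivor, failed])

# Cold start: every daemon lost its soft state and osd.1 never came back,
# so nobody holds a peer entry or monitor session for it.
booted = Osd(0, running=True)
never_started = Osd(1, running=False)
cold_start = someone_can_flag_down(never_started, [booted, never_started])

print(single_failure, cold_start)
```

Run as written, this prints `True False`: the single failure is noticed by a surviving peer, while in the cold-start case no daemon is in a position to report the missing OSD, so it stays marked up.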