Thank you all for the clarification and suggestions. Here is a small experience report of what happened during the network maintenance; maybe it is useful for others too:

As previously written, the Ceph cluster is stretched across two data centers and consists of 39 storage nodes with a total of 525 OSDs, plus 5 monitor nodes.

The problem: due to a network maintenance, the connection between the two data centers would be down for approximately 8-15 seconds, which would affect the Ceph cluster's public network.

Before the maintenance we set the following flags: "noout, nobackfill, norebalance, noscrub, nodeep-scrub". (The exact commands are sketched in the P.S. at the bottom of this mail.)

During the maintenance: the network between the two data centers was down for a total of 12 seconds. At first everything seemed to work fine. Some OSDs were marked as down but came back quickly, and the monitor nodes started a new election. But then more and more OSDs were wrongly marked as down; in reality their processes were up and they had network connectivity. Moreover, two monitors in one data center could not rejoin the quorum anymore.

The network team quickly figured out what the problem was: a wrong MTU size. After they fixed it, the two monitor nodes rejoined the quorum and nearly all OSDs came back up. Only 36 OSDs remained down, and checking them revealed that they really were down. After a total of 40 minutes the cluster reached a healthy state again. No data loss.

Best,
Martin

On Thu, Oct 4, 2018 at 11:09 AM Paul Emmerich <paul.emmerich@xxxxxxxx> wrote:
>
> Mons are also on a 30s timeout.
> Even a short loss of quorum isn't noticeable for ongoing IO.
>
> Paul
>
>
> Am 04.10.2018 um 11:03 schrieb Martin Palma <martin@xxxxxxxx>:
> >
> > Also monitor election? That is our biggest fear, since the monitor
> > nodes will not see each other for that timespan...
> >> On Thu, Oct 4, 2018 at 10:21 AM Paul Emmerich <paul.emmerich@xxxxxxxx> wrote:
> >>
> >> 10 seconds is far below any relevant timeout value (generally 20-30
> >> seconds), so you will be fine without any special configuration.
> >>
> >> Paul
> >>
> >> Am 04.10.2018 um 09:38 schrieb Konstantin Shalygin <k0ste@xxxxxxxx>:
> >>
> >>>> What can we do to best handle this scenario and have minimal or no
> >>>> impact on Ceph?
> >>>>
> >>>> We plan to set "noout", "nobackfill", "norebalance", "noscrub",
> >>>> "nodeep-scrub"; are there any other suggestions?
> >>>
> >>> ceph osd set noout
> >>>
> >>> ceph osd pause
> >>>
> >>> k
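P.S. For completeness, here is roughly what the flag handling and the checks looked like on our side. This is only a sketch: the OSD id, the remote host and the 9000-byte jumbo-frame assumption in the MTU test are placeholders, not our actual values.

    # before the maintenance window: stop automatic recovery and scrubbing
    ceph osd set noout
    ceph osd set nobackfill
    ceph osd set norebalance
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # during/after the maintenance: check cluster and monitor quorum state
    ceph -s
    ceph quorum_status
    ceph osd tree down          # list OSDs currently marked down (Luminous or newer)

    # verify that an OSD marked down is really down (osd.12 is a placeholder)
    systemctl status ceph-osd@12

    # sanity-check the path MTU between the data centers
    # (8972 = 9000 bytes minus 28 bytes of IP/ICMP headers; adjust to your MTU)
    ping -M do -s 8972 <remote-host>

    # once everything is back: remove the flags again
    ceph osd unset noout
    ceph osd unset nobackfill
    ceph osd unset norebalance
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub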