Hey Andras,
Three mons is possibly too few for such a large cluster.
We've had lots of good, stable experience with 5-mon clusters.
I've never tried 7, so I can't say if that would lead to other
problems (e.g. leader/peon sync scalability).
That said, our 10000-osd bigbang tests managed with only 3
mons, and I assume that outside of this full system reboot
scenario your 3 cope well enough. You should probably add 2
more, but I wouldn't expect that alone to solve this problem
in the future.
Instead, with a slightly tuned procedure and a bit of osd
log grepping, I think you could've booted this cluster in
less than 4 hours, even with just those 3 mons.
As you know, each osd's boot process requires downloading
all known osdmaps. If all osds are booting
together, and the mons are saturated, the osds can become
sluggish when responding to their peers, which could lead to
the flapping scenario you saw. Flapping leads to new osdmap
epochs that then need to be distributed, worsening the issue.
It's good that you used nodown and noout, because without
these the boot time would've been even longer. Next time also
set noup and noin to further reduce the osdmap churn.
One other thing: there's a debug_osd level -- 10 or 20, I
forget exactly -- that you can set to watch the maps sync up
on each osd. Grep the osd logs for some variations on "map"
and "epoch".
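
As a rough complement to the log grepping, one could also ask each osd
directly how far its maps have synced. The sketch below is only an
illustration under assumptions: it relies on "ceph osd dump --format json"
reporting an "epoch" field and on "ceph daemon osd.N status" (run on the
osd's own host, via the admin socket) reporting "newest_map". The helper
names are made up for this example.

```python
import json
import subprocess


def osd_map_lag(status: dict, current_epoch: int) -> int:
    """Return how many osdmap epochs an osd still has to fetch (0 = caught up)."""
    return max(0, current_epoch - int(status["newest_map"]))


def current_osdmap_epoch() -> int:
    # Ask the mons for the cluster's current osdmap epoch.
    out = subprocess.check_output(["ceph", "osd", "dump", "--format", "json"])
    return int(json.loads(out)["epoch"])


def local_osd_status(osd_id: int) -> dict:
    # Query one osd through its admin socket; this only works on the host
    # where that osd daemon is running.
    out = subprocess.check_output(["ceph", "daemon", f"osd.{osd_id}", "status"])
    return json.loads(out)
```

On an osd host, osd_map_lag(local_osd_status(12), current_osdmap_epoch())
would give the remaining gap for a hypothetical osd.12. When every osd
reports 0, the maps have synced.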
In short, here's what I would've done:
0. boot the mons, waiting until they have a full quorum.
1. set nodown, noup, noin, noout <-- with these, there
should be zero new osdmaps generated while the osds boot.
2. start booting osds. set the necessary debug_osd level to
see the osdmap sync progress in the ceph-osd logs.
3. if the mons are oversaturated, boot progressively --
one rack at a time, for example.
4. once all osds have caught up to the current osdmap,
unset noup. The osds should then all "boot" (as far as the
mons are concerned) and be marked up. (This might be sluggish
on a 3400-osd cluster, perhaps taking a few tens of seconds.)
The pgs should be active+clean at this point.
5. unset nodown, noin, noout, which should change nothing
provided all went well.
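
The flag handling in steps 1, 4 and 5 could be scripted roughly like this.
It is a minimal sketch, not a tested tool: it only builds the standard
"ceph osd set <flag>" / "ceph osd unset <flag>" commands and prints them
unless dry_run is turned off. The function and variable names are
invented for the example, and the waiting between steps (quorum, map
sync) is left to the operator.

```python
import subprocess

BOOT_FLAGS = ["nodown", "noup", "noin", "noout"]


def flag_commands(action, flags):
    """Build one `ceph osd set|unset <flag>` command line per flag."""
    if action not in ("set", "unset"):
        raise ValueError("action must be 'set' or 'unset'")
    return [["ceph", "osd", action, flag] for flag in flags]


def run(commands, dry_run=True):
    # Print each command; only execute it when dry_run is False.
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.check_call(cmd)


# Step 1: after the mons have quorum, before starting any osds.
before_boot = flag_commands("set", BOOT_FLAGS)
# Step 4: once all osds have caught up to the current osdmap.
after_sync = flag_commands("unset", ["noup"])
# Step 5: once the osds are up and the pgs are active+clean.
finish = flag_commands("unset", ["nodown", "noin", "noout"])

run(before_boot)  # dry run: prints the four "ceph osd set ..." lines
```

Calling run(after_sync, dry_run=False) and then run(finish, dry_run=False)
at the right moments would apply steps 4 and 5 for real.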
Hope that helps for next time!
Dan