On Wed, Mar 31, 2021 at 6:46 PM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
>
> Hello Yongseok,
>
> On Wed, Mar 31, 2021 at 1:13 AM Yongseok Oh <yongseok.oh@xxxxxxxxxxxx> wrote:
> > ...
> > A few things I have analyzed
> > - Rejoining process consumes a considerable amount of time. That's a known issue. (Sometimes respawning MDS happened. Increasing mds_heartbeat_grace doesn't help.)
>
> Please turn up logging to:
>
> debug_mds = 5
>
> to get an idea what the MDS is doing when respawn occurs.

If it helps, here is a log with 2/5 from a recent failover which took 3.5 minutes: https://termbin.com/b022

This is 14.2.11 with the optimized recall/cache tuning.

Indeed rejoin is always the longest step -- even with cephfs_metadata on SSDs.
These MDSs had the cache limit set to 8GB, and you can see that the rejoining MDS needed 56GB while booting.

I haven't had a chance to test the rejoin/openfiletables optimizations yet (https://github.com/ceph/ceph/pull/37383).
But I had understood that this is intended to decrease that rejoin memory usage -- will it also speed things up?

-- dan