Quoting Frank Schilder (frans@xxxxxx):
> Dear Yan and Stefan,
>
> it happened again and there were only very few ops in the queue. I
> pulled the ops list and the cache. Please find a zip file here:
> "https://files.dtu.dk/u/w6nnVOsp51nRqedU/mds-stuck-dirfrag.zip?l" .
> It's a bit more than 100MB.
>
> The active MDS failed over to the standby after or during the dump
> cache operation. Is this expected? As a result, the cluster is healthy
> and I can't do further diagnostics. In case you need more information,
> we have to wait until next time.
>
> Some further observations:
>
> There was no load on the system. I am starting to suspect that this is
> not a load-induced event. It is also not caused by excessive atime
> updates; the FS is mounted with relatime. Could it have to do with the
> large level-2 network (ca. 550 client servers in the same broadcast
> domain)? I include our kernel tuning profile below, just in case. The
> cluster networks (back and front) are isolated VLANs, no gateways, no
> routing.

I am pretty sure you hit bug #26982: https://tracker.ceph.com/issues/26982
"mds: crash when dumping ops in flight".

So, if you need a reason to update to 13.2.5, there you have it. Sorry
that I did not realize beforehand that you could hit this bug, as you're
running 13.2.2.

So I would update to 13.2.5 and try again.

Gr. Stefan

--
| BIT BV  http://www.bit.nl/  Kamer van Koophandel 09090351
| GPG: 0xD14839C6  +31 318 648 688 / info@xxxxxx
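
The upgrade advice above comes down to comparing the running release against 13.2.5. A minimal Python sketch of that check follows. It is only an illustration and not part of Ceph: the helper names (`parse_version`, `needs_upgrade`) and the sample version string are hypothetical, and the comparison assumes a 13.2.x (Mimic) release, since this thread does not say whether other release series are affected.

```python
import re

# Release named in the reply as the upgrade target for bug #26982.
FIXED_IN = (13, 2, 5)


def parse_version(text):
    """Return the first X.Y.Z triple found in `text` as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no X.Y.Z version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def needs_upgrade(version_text, fixed_in=FIXED_IN):
    """Return True if the version is older than `fixed_in`.

    Assumption: the caller passes a 13.2.x version; nothing in the thread
    tells us how other release series relate to this bug.
    """
    return parse_version(version_text) < fixed_in


if __name__ == "__main__":
    # Hypothetical sample string, shaped like typical version output.
    running = "ceph version 13.2.2 mimic (stable)"
    if needs_upgrade(running):
        print("older than 13.2.5: update before dumping ops in flight again")
    else:
        print("13.2.5 or newer")
```

Tuples of integers are compared instead of raw strings so that a release such as 13.2.10 sorts after 13.2.5.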