Re: mimic: MDS standby-replay causing blocked ops (MDS bug?)

Stefan Kooman <stefan@xxxxxx> · Sat, 18 May 2019 14:14:35 +0200

Quoting Frank Schilder (frans@xxxxxx):
> Dear Yan and Stefan,
> 
> it happened again and there were only very few ops in the queue. I
> pulled the ops list and the cache. Please find a zip file here:
> "https://files.dtu.dk/u/w6nnVOsp51nRqedU/mds-stuck-dirfrag.zip?l" .
> It's a bit more than 100MB.
> 
> The active MDS failed over to the standby after or during the dump
> cache operation. Is this expected? As a result, the cluster is healthy
> and I can't do further diagnostics. In case you need more information,
> we have to wait until next time.
> 
> Some further observations:
> 
> There was no load on the system. I am starting to suspect that this is
> not a load-induced event. It is also not caused by excessive atime
> updates; the FS is mounted with relatime. Could it have to do with the
> large layer-2 network (ca. 550 client servers in the same broadcast
> domain)? I include our kernel tuning profile below, just in case. The
> cluster networks (back and front) are isolated VLANs, no gateways, no
> routing.

I am pretty sure you hit bug #26982: https://tracker.ceph.com/issues/26982

"mds: crash when dumping ops in flight".

So, if you need a reason to update to 13.2.5, there you have it. Sorry
that I did not realize beforehand that you could hit this bug, as you're
running 13.2.2.

So I would update to 13.2.5 and try again.
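
Before retrying the dump, it may help to confirm that no MDS is still on
an older release. The following is only a rough sketch: it assumes you
feed it the JSON printed by "ceph versions", and the helper names
(parse_version, outdated_mds) and the sample data are made up for
illustration, not part of any Ceph tooling.

```python
import json
import re

# Release suggested in this thread; treated here as the minimum wanted version.
FIXED_IN = (13, 2, 5)


def parse_version(key):
    """Extract (major, minor, patch) from a key such as
    'ceph version 13.2.2 (...) mimic (stable)'."""
    match = re.search(r"ceph version (\d+)\.(\d+)\.(\d+)", key)
    if match is None:
        raise ValueError(f"unrecognised version string: {key!r}")
    return tuple(int(part) for part in match.groups())


def outdated_mds(versions_json, minimum=FIXED_IN):
    """Return {version_string: daemon_count} for MDS daemons older than `minimum`."""
    mds = json.loads(versions_json).get("mds", {})
    return {key: count for key, count in mds.items()
            if parse_version(key) < minimum}


# Made-up sample in the shape "ceph versions" prints; the hashes are placeholders.
sample = json.dumps({
    "mds": {
        "ceph version 13.2.2 (0000000) mimic (stable)": 2,
        "ceph version 13.2.5 (1111111) mimic (stable)": 1,
    }
})

print(outdated_mds(sample))
```

An empty result would mean every MDS reports 13.2.5 or newer.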

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info@xxxxxx