snapshotted cephfs deleting files 'no space left on device'

Magnus HAGDORN <Magnus.Hagdorn@xxxxxxxx> · Thu, 14 Oct 2021 07:28:38 +0000

Hi all,
we have hit the problem where a directory tree containing over a
million entries was deleted on a snapshotted cephfs. The cluster
reports as mostly healthy, apart from some slow MDS responses, but the
filesystem has become unusable. The MDS reports:

ceph daemon mds.`hostname -s` perf dump | grep stray
        "num_strays": 211378,
        "num_strays_delayed": 0,
        "num_strays_enqueuing": 0,
        "strays_created": 2489960,
        "strays_enqueued": 2344793,
        "strays_reintegrated": 64668,
        "strays_migrated": 2562,
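
To judge whether simply waiting is working, the stray counters can be
sampled twice and compared. What follows is only a rough Python sketch,
assuming the full perf dump has been saved as JSON. The helper names
stray_counters and drain_eta_seconds are hypothetical, the "mds_cache"
section name in the demo is an assumption, and the second sample value is
invented purely for illustration.

```python
import json


def stray_counters(perf_dump):
    """Collect every numeric counter whose name contains 'stray'.

    perf_dump is the parsed JSON from `ceph daemon mds.<name> perf dump`.
    All top-level sections are searched, so the code does not rely on
    knowing which section holds the stray counters.
    """
    found = {}
    for section in perf_dump.values():
        if isinstance(section, dict):
            for name, value in section.items():
                if "stray" in name and isinstance(value, (int, float)):
                    found[name] = value
    return found


def drain_eta_seconds(earlier, later, interval_seconds):
    """Estimate seconds until num_strays reaches zero.

    Returns None when the count is not shrinking, because no sensible
    estimate exists in that case. This is a linear extrapolation only.
    """
    drained = earlier["num_strays"] - later["num_strays"]
    if drained <= 0 or interval_seconds <= 0:
        return None
    rate = drained / interval_seconds  # strays purged per second
    return later["num_strays"] / rate


if __name__ == "__main__":
    # First sample uses the num_strays value reported above.
    first = stray_counters(json.loads(
        '{"mds_cache": {"num_strays": 211378, "strays_created": 2489960}}'
    ))
    # Second sample is hypothetical: pretend 10 minutes have passed.
    second = stray_counters({"mds_cache": {"num_strays": 201378}})
    eta = drain_eta_seconds(first, second, 600)
    print("not draining" if eta is None else f"about {eta / 3600:.1f} h left")
```

A None result means num_strays did not fall between the two samples, which
would suggest that waiting alone is not helping.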

We have deleted a bunch of snapshots and the snaptrim has completed.

Possibly we made matters worse by reducing the number of active MDS
daemons from 2 to 1. The second MDS has been in the stopping state since
yesterday.

I presume we could just wait and the problem would resolve itself
eventually. However, is there a way to speed up the recovery process?
The cephfs is currently online. Would it help to shut it down? Is there
some setting that we could temporarily change to deal with the strays?
Do we need to remove all snapshots?

The cluster is running Nautilus. I was aware of this problem but was
assured that these large directories would not get deleted. I believe
newer versions of cephfs have dealt with this issue. Is that correct?

Suggestions are greatly appreciated.

Cheers
magnus