On Thu, Aug 16, 2018 at 10:15 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> Do note that while this works and is unlikely to break anything, it's
> not entirely ideal. The MDS was trying to probe the size and mtime of
> any files which were opened by clients that have since disappeared. By
> removing that list of open files, it can't do that any more, so you
> may have some inaccurate metadata about individual file sizes or
> mtimes.

Understood, and thank you for the additional details. However, when the
difference is having a working filesystem, or having a filesystem
permanently down because the ceph-mds rejoin is impossible to complete,
I'll accept the risk involved. I'd prefer to see the rejoin process able
to proceed without chewing up memory until the machine deadlocks on
itself, but I don't yet know enough about the internals of the rejoin
process to even attempt to comment on how that could be done. Ideally,
it seems like flushing the current recovery/rejoin status periodically
and monitoring memory usage during recovery would help to fix the
problem. From what I could see, ceph-mds just continued to allocate
memory as it processed every open handle, and never released any of it
until it was killed.

jonathan

-- 
Jonathan Woytek
http://www.dryrose.com
KB3HOZ
PGP: 462C 5F50 144D 6B09 3B65 FCE8 C1DC DEC4 E8B6 AABC
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com