Re: MDS stuck in 'rejoin' after network fragmentation caused OSD flapping

On Thu, Aug 16, 2018 at 10:15 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> Do note that while this works and is unlikely to break anything, it's
> not entirely ideal. The MDS was trying to probe the size and mtime of
> any files which were opened by clients that have since disappeared. By
> removing that list of open files, it can't do that any more, so you
> may have some inaccurate metadata about individual file sizes or
> mtimes.

Understood, and thank you for the additional details. However, when
the choice is between a working filesystem and one that is
permanently down because the ceph-mds rejoin cannot complete, I'll
accept the risk involved. I'd prefer to see the rejoin process
proceed without consuming memory until the machine deadlocks, but I
don't yet know enough about the internals of the rejoin process to
comment on how that could be done. Ideally, it seems that periodically
flushing the current recovery/rejoin state and monitoring memory usage
during recovery would help fix the problem. From what I could see,
ceph-mds just kept allocating memory as it processed every open
handle, and never released any of it until it was killed.

jonathan
-- 
Jonathan Woytek
http://www.dryrose.com
KB3HOZ
PGP:  462C 5F50 144D 6B09 3B65  FCE8 C1DC DEC4 E8B6 AABC
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


