Re: MDS stuck in 'rejoin' after network fragmentation caused OSD flapping


 



I second that you do not have nearly enough RAM in these servers, and I don't think you have at least 72 CPU cores either, which means you again fall short of the minimum recommendation for the number of OSDs you have, let alone everything else.  I would suggest you start by moving your MDS daemons off of these nodes, as they'll be the most resource-hungry and problematic of the remaining services.  It would also probably make sense to move the mon and mgr daemons to the new host as well.
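To make the sizing arithmetic concrete, here is a rough sketch in Python. Everything numeric in it is an assumption for illustration: the count of 72 OSDs is inferred from the 72-core figure at one core per OSD, the 4 GB of RAM per OSD is a placeholder for whatever your hardware recommendation says, and the installed RAM and core counts are made up (the RAM is simply set to a third of the computed requirement, per Christian's note below). The function name is hypothetical too. Plug in your own numbers.

```python
def sizing_shortfall(num_osds, ram_gb_per_osd, cores_per_osd,
                     have_ram_gb, have_cores):
    """Compare installed RAM/cores against a per-OSD recommendation.

    All per-OSD figures are caller-supplied assumptions; this only
    does the multiplication and reports the gap for the OSDs alone
    (MON, MGR and MDS daemons would need headroom on top of this).
    """
    need_ram_gb = num_osds * ram_gb_per_osd
    need_cores = num_osds * cores_per_osd
    return {
        "need_ram_gb": need_ram_gb,
        "need_cores": need_cores,
        # Fraction of the recommended RAM that is actually installed.
        "ram_fraction": have_ram_gb / need_ram_gb,
        # Never negative: 0 means the recommendation is met.
        "ram_short_gb": max(0, need_ram_gb - have_ram_gb),
        "cores_short": max(0, need_cores - have_cores),
    }


# Illustrative values only (see the assumptions above).
report = sizing_shortfall(
    num_osds=72,          # assumed: one core per OSD implies 72 OSDs
    ram_gb_per_osd=4,     # assumed placeholder recommendation
    cores_per_osd=1,      # assumed placeholder recommendation
    have_ram_gb=96,       # hypothetical: a third of the requirement
    have_cores=32,        # hypothetical
)
for key, value in report.items():
    print(f"{key}: {value}")
```

Any shortfall this reports covers the OSDs only, which is why moving the MDS, mon and mgr daemons elsewhere is the first thing I'd do.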

On Sun, Aug 19, 2018, 8:01 AM Christian Wuerdig <christian.wuerdig@xxxxxxxxx> wrote:
It should be added though that you're running at only 1/3 of the
recommended RAM for the OSD setup alone - not to mention that
you also co-host MON, MGR and MDS daemons on there. The next time you
run into an issue - in particular with OSD recovery - you may be in a
pickle again, and then it might not be so easy to get going.
On Fri, 17 Aug 2018 at 02:48, Jonathan Woytek <woytek@xxxxxxxxxxx> wrote:
>
> On Thu, Aug 16, 2018 at 10:15 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > Do note that while this works and is unlikely to break anything, it's
> > not entirely ideal. The MDS was trying to probe the size and mtime of
> > any files which were opened by clients that have since disappeared. By
> > removing that list of open files, it can't do that any more, so you
> > may have some inaccurate metadata about individual file sizes or
> > mtimes.
>
> Understood, and thank you for the additional details. However, when
> the difference is having a working filesystem, or having a filesystem
> permanently down because the ceph-mds rejoin is impossible to
> complete, I'll accept the risk involved. I'd prefer to see the rejoin
> process able to proceed without chewing up memory until the machine
> deadlocks on itself, but I don't yet know enough about the internals
> of the rejoin process to even attempt to comment on how that could be
> done. Ideally, it seems like flushing the current recovery/rejoin
> status periodically and monitoring memory usage during recovery would
> help to fix the problem. From what I could see, ceph-mds just
> continued to allocate memory as it processed every open handle, and
> never released any of it until it was killed.
>
> jonathan
> --
> Jonathan Woytek
> http://www.dryrose.com
> KB3HOZ
> PGP:  462C 5F50 144D 6B09 3B65  FCE8 C1DC DEC4 E8B6 AABC
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
