On Wed, May 2, 2018 at 7:19 AM, Sean Sullivan <lookcrabs@xxxxxxxxx> wrote:
> Forgot to reply to all:
>
> Sure thing!
>
> I couldn't install the ceph-mds-dbg packages without upgrading, so I just
> finished upgrading the cluster to 12.2.5. The issue still persists in
> 12.2.5.
>
> From here I'm not really sure how to generate the backtrace, so I hope I
> did it right. For others on Ubuntu, this is what I did:
>
> * First, raise debug_mds to 20 and debug_ms to 1:
>   ceph tell mds.* injectargs '--debug-mds 20 --debug-ms 1'
>
> * Install the debug packages (ceph-mds-dbg in my case).
>
> * I also added these debug options to /etc/ceph/ceph.conf in case the
>   daemons restart.
>
> * Now allow the pids to dump core (stolen partly from the Red Hat docs and
>   partly from Ubuntu's):
>   echo -e 'DefaultLimitCORE=infinity\nPrivateTmp=true' | tee -a /etc/systemd/system.conf
>   sysctl fs.suid_dumpable=2
>   sysctl kernel.core_pattern=/tmp/core
>   systemctl daemon-reload
>   systemctl restart ceph-mds@$(hostname -s)
>
> * A crash was created in /var/crash by apport, but gdb can't read it
>   directly, so I used apport-unpack and then ran gdb on what is inside
>   (the core dump should be in /tmp/core):
>   apport-unpack /var/crash/$(ls /var/crash/*mds*) /root/crash_dump/
>   cd /root/crash_dump/
>   gdb $(cat ExecutablePath) CoreDump -ex 'thr a a bt' | tee /root/ceph_mds_$(hostname -s)_backtrace
>
> * This left me with the attached backtraces, which I think are wrong as I
>   see a lot of ?? even though gdb says
>   /usr/lib/debug/.build-id/1d/23dc5ef4fec1dacebba2c6445f05c8fe6b8a7c.debug
>   was loaded.
>
> kh10-8 mds backtrace -- https://pastebin.com/bwqZGcfD
> kh09-8 mds backtrace -- https://pastebin.com/vvGiXYVY

Try running ceph-mds inside gdb. It should be easy to locate the bug once
we have a correct coredump file.

Regards
Yan, Zheng

> The log files are pretty large (one 4.1 GB and the other 200 MB):
>
> kh10-8 (200 MB) mds log --
> https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh10-8.log
> kh09-8 (4.1 GB) mds log --
> https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh09-8.log
>
> On Tue, May 1, 2018 at 12:09 AM, Patrick Donnelly <pdonnell@xxxxxxxxxx>
> wrote:
>>
>> Hello Sean,
>>
>> On Mon, Apr 30, 2018 at 2:32 PM, Sean Sullivan <lookcrabs@xxxxxxxxx>
>> wrote:
>> > I was creating a new user and mount point. On another hardware node I
>> > mounted CephFS as admin to mount as root. I created /aufstest and then
>> > unmounted. From there it seems that both of my mds nodes crashed for
>> > some reason and I can't start them any more.
>> >
>> > https://pastebin.com/1ZgkL9fa -- my mds log
>> >
>> > I have never had this happen in my tests, so now I have live data here.
>> > If anyone can lend a hand or point me in the right direction while
>> > troubleshooting, that would be a godsend!
>>
>> Thanks for keeping the list apprised of your efforts. Since this is so
>> easily reproduced for you, I would suggest that you next get higher
>> debug logs (debug_mds=20/debug_ms=1) from the MDS. And, since this is
>> a segmentation fault, a backtrace with debug symbols from gdb would
>> also be helpful.
>>
>> --
>> Patrick Donnelly
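
For readers following Sean's steps: a minimal sketch of what the persistent
debug settings in /etc/ceph/ceph.conf could look like. The [mds] section and
the option syntax below are assumptions based on the injectargs line in the
thread, not a copy of Sean's actual config:

  # /etc/ceph/ceph.conf (assumed snippet; keeps the debug levels across MDS restarts)
  [mds]
      debug mds = 20
      debug ms = 1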
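
And a minimal sketch of Yan's suggestion to run ceph-mds directly inside gdb
on an Ubuntu/systemd host. The daemon flags mirror the stock ceph-mds@.service
unit, but the cluster name, daemon id, and binary path are assumptions; adjust
them to match your deployment:

  # stop the systemd-managed MDS first (assumed unit name)
  systemctl stop ceph-mds@$(hostname -s)

  # run the daemon in the foreground under gdb
  gdb --args /usr/bin/ceph-mds -f --cluster ceph --id $(hostname -s) \
      --setuser ceph --setgroup ceph
  (gdb) run
  ... wait for the segfault ...
  (gdb) thread apply all bt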