Most of this is over my head, but the last line of the logs on both MDS servers shows something similar to:
0> 2018-05-01 15:37:46.871932 7fd10163b700 -1 *** Caught signal (Segmentation fault) **
in thread 7fd10163b700 thread_name:mds_rank_progr
When I search for this in the ceph-users and ceph-devel mailing lists, the only mention I can find is from 12.0.3:
https://marc.info/?l=ceph-devel&m=149726392820648&w=2 -- ceph-devel
I don't see any mention of journal.cc in my logs, however, so I hope they are not related. I also have not experienced any major loss in my cluster as of yet, and cephfs-journal-tool shows my journals as healthy. To trigger this bug I created a CephFS directory and user called aufstest. Here is the part of the log with the crash mentioning aufstest:
https://pastebin.com/EL5ALLuE
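For what it's worth, the journal health check I mean is just something along these lines (run on a node with the client admin keyring; on newer releases you may need to pass --rank explicitly):

cephfs-journal-tool journal inspect
cephfs-journal-tool header get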
I created a new bug ticket on tracker.ceph.com with all of the current info, as I believe this isn't a problem with my setup specifically and anyone else trying this will hit the same issue.
https://tracker.ceph.com/issues/23972
I hope this is the correct path. If anyone can guide me in the right direction for troubleshooting this further I would be grateful.
On Tue, May 1, 2018 at 6:19 PM, Sean Sullivan <lookcrabs@xxxxxxxxx> wrote:
Forgot to reply to all:
Sure thing!
I couldn't install the ceph-mds-dbg packages without upgrading. I just finished upgrading the cluster to 12.2.5. The issue still persists in 12.2.5.
From here I'm not really sure how to generate the backtrace, so I hope I did it right. For others on Ubuntu, this is what I did:
* First, raise debug_mds to 20 and debug_ms to 1:
ceph tell mds.* injectargs '--debug-mds 20 --debug-ms 1'
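To double-check that the new values actually took effect, something like this should work (assuming the MDS id matches the short hostname, as it does on my nodes):

ceph daemon mds.$(hostname -s) config get debug_mds
ceph daemon mds.$(hostname -s) config get debug_ms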
* install the debug packages
ceph-mds-dbg in my case
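On Ubuntu that was roughly the following (assuming the upstream ceph.com apt repo, which is where the -dbg packages live):

apt-get update
apt-get install ceph-mds-dbg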
* I also added these options to /etc/ceph/ceph.conf in case the daemons restart.
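For reference, the lines I mean are just the debug settings under the [mds] section, something like:

[mds]
    debug mds = 20
    debug ms = 1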
* Now allow pids to dump core (stolen partly from the Red Hat docs and partly from Ubuntu's):

echo -e 'DefaultLimitCORE=infinity\nPrivateTmp=true' | tee -a /etc/systemd/system.conf
sysctl fs.suid_dumpable=2
sysctl kernel.core_pattern=/tmp/core
systemctl daemon-reload
systemctl restart ceph-mds@$(hostname -s)

* A crash was created in /var/crash by apport, but gdb can't read it directly. I used apport-unpack and then ran gdb on what is inside:

apport-unpack $(ls /var/crash/*mds*) /root/crash_dump/
cd /root/crash_dump/
gdb $(cat ExecutablePath) CoreDump -ex 'thr a a bt' | tee /root/ceph_mds_$(hostname -s)_backtrace

* This left me with the attached backtraces (which I think are wrong, as I see a lot of ?? even though gdb says /usr/lib/debug/.build-id/1d/23dc5ef4fec1dacebba2c6445f05c8fe6b8a7c.debug was loaded):
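If anyone wants to double-check whether the debug symbols actually match the running binary, comparing build ids should work, something like:

readelf -n /usr/bin/ceph-mds | grep 'Build ID'
ls /usr/lib/debug/.build-id/1d/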
kh10-8 mds backtrace -- https://pastebin.com/bwqZGcfD
kh09-8 mds backtrace -- https://pastebin.com/vvGiXYVY
The log files are pretty large (one 4.1G and the other 200MB)
kh10-8 (200MB) mds log -- https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh10-8.log
kh09-8 (4.1GB) mds log -- https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh09-8.log

On Tue, May 1, 2018 at 12:09 AM, Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:

Hello Sean,
On Mon, Apr 30, 2018 at 2:32 PM, Sean Sullivan <lookcrabs@xxxxxxxxx> wrote:
> I was creating a new user and mount point. On another hardware node I
> mounted CephFS as admin to mount as root. I created /aufstest and then
> unmounted. From there it seems that both of my mds nodes crashed for some
> reason and I can't start them any more.
>
> https://pastebin.com/1ZgkL9fa -- my mds log
>
> I have never had this happen in my tests so now I have live data here. If
> anyone can lend a hand or point me in the right direction while
> troubleshooting that would be a godsend!
Thanks for keeping the list apprised of your efforts. Since this is so
easily reproduced for you, I would suggest that you next get higher
debug logs (debug_mds=20/debug_ms=1) from the MDS. And, since this is
a segmentation fault, a backtrace with debug symbols from gdb would
also be helpful.
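For example, something along these lines (with the core file written wherever your core_pattern points; the path below is just a placeholder):

ceph tell mds.* injectargs '--debug-mds 20 --debug-ms 1'
gdb /usr/bin/ceph-mds /path/to/core -batch -ex 'thread apply all bt' > mds_backtrace.txt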
--
Patrick Donnelly