Re: Crashed MDS (segfault)

On Thu, Oct 17, 2019 at 10:19 PM Gustavo Tonini <gustavotonini@xxxxxxxxx> wrote:
>
> No. The cluster was just rebalancing.
>
> The journal seems damaged:
>
> ceph@deployer:~$ cephfs-journal-tool --rank=fs_padrao:0 journal inspect
> 2019-10-16 17:46:29.596 7fcd34cbf700 -1 NetHandler create_socket couldn't create socket (97) Address family not supported by protocol

A corrupted journal shouldn't cause an error like this; it looks more
like a network issue. Please double-check the network configuration of your cluster.
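
For reference, error 97 (EAFNOSUPPORT) from NetHandler usually means the messenger tried to open a socket for an address family the host doesn't have enabled, most often IPv6. A quick check on the node where you ran cephfs-journal-tool, assuming the default config path, might be:

  cat /proc/sys/net/ipv6/conf/all/disable_ipv6   # 1 means IPv6 is disabled on this host
  grep -iE 'ms.?bind' /etc/ceph/ceph.conf        # look for ms_bind_ipv6 or other bind overrides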

> Overall journal integrity: DAMAGED
> Corrupt regions:
> 0x1c5e4d904ab-1c5e4d9ddbc
> ceph@deployer:~$
>
> Could a journal reset help with this?
>
> I could snapshot all FS pools and export the journal beforehand, to guarantee a rollback to this state if something goes wrong with the journal reset.
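
If you do go down that path, a minimal sketch of the backup-first sequence (pool and file names below are placeholders; recover_dentries is worth running before the reset so salvageable events are written back rather than discarded):

  cephfs-journal-tool --rank=fs_padrao:0 journal export /root/mds0-journal-backup.bin  # raw copy of the journal
  ceph osd pool mksnap <metadata-pool> pre-journal-reset   # repeat per CephFS pool; may be refused if the pool already uses self-managed (CephFS) snapshots
  cephfs-journal-tool --rank=fs_padrao:0 event recover_dentries summary                # replay recoverable events into the backing store
  cephfs-journal-tool --rank=fs_padrao:0 journal reset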
>
> On Thu, Oct 17, 2019, 09:07 Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>>
>> On Tue, Oct 15, 2019 at 12:03 PM Gustavo Tonini <gustavotonini@xxxxxxxxx> wrote:
>> >
>> > Dear ceph users,
>> > we're experiencing a segfault during MDS startup (replay process), which is making our FS inaccessible.
>> >
>> > MDS log messages:
>> >
>> > Oct 15 03:41:39.894584 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 7f3c08f49700  1 -- 192.168.8.195:6800/3181891717 <== osd.26 192.168.8.209:6821/2419345 3 ==== osd_op_reply(21 1.00000000 [getxattr] v0'0 uv0 ondisk = -61 ((61) No data available)) v8 ==== 154+0+0 (3715233608 0 0) 0x2776340 con 0x18bd500
>> > Oct 15 03:41:39.894584 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 7f3c00589700 10 MDSIOContextBase::complete: 18C_IO_Inode_Fetched
>> > Oct 15 03:41:39.894658 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 7f3c00589700 10 mds.0.cache.ino(0x100) _fetched got 0 and 544
>> > Oct 15 03:41:39.894658 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 7f3c00589700 10 mds.0.cache.ino(0x100)  magic is 'ceph fs volume v011' (expecting 'ceph fs volume v011')
>> > Oct 15 03:41:39.894735 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 7f3c00589700 10  mds.0.cache.snaprealm(0x100 seq 1 0x1799c00) open_parents [1,head]
>> > Oct 15 03:41:39.894735 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 7f3c00589700 10 mds.0.cache.ino(0x100) _fetched [inode 0x100 [...2,head] ~mds0/ auth v275131 snaprealm=0x1799c00 f(v0 1=1+0) n(v76166 rc2020-07-17 15:29:27.000000 b41838692297 -3184=-3168+-16)/n() (iversion lock) 0x18bf800]
>> > Oct 15 03:41:39.894821 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 7f3c00589700 10 MDSIOContextBase::complete: 18C_IO_Inode_Fetched
>> > Oct 15 03:41:39.894821 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 7f3c00589700 10 mds.0.cache.ino(0x1) _fetched got 0 and 482
>> > Oct 15 03:41:39.894891 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 7f3c00589700 10 mds.0.cache.ino(0x1)  magic is 'ceph fs volume v011' (expecting 'ceph fs volume v011')
>> > Oct 15 03:41:39.894958 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.205 7f3c00589700 -1 *** Caught signal (Segmentation fault) **#012 in thread 7f3c00589700 thread_name:fn_anonymous#012#012 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)#012 1: (()+0x11390) [0x7f3c0e48a390]#012 2: (operator<<(std::ostream&, SnapRealm const&)+0x42) [0x72cb92]#012 3: (SnapRealm::merge_to(SnapRealm*)+0x308) [0x72f488]#012 4: (CInode::decode_snap_blob(ceph::buffer::list&)+0x53) [0x6e1f63]#012 5: (CInode::decode_store(ceph::buffer::list::iterator&)+0x76) [0x702b86]#012 6: (CInode::_fetched(ceph::buffer::list&, ceph::buffer::list&, Context*)+0x1b2) [0x702da2]#012 7: (MDSIOContextBase::complete(int)+0x119) [0x74fcc9]#012 8: (Finisher::finisher_thread_entry()+0x12e) [0x7f3c0ebffece]#012 9: (()+0x76ba) [0x7f3c0e4806ba]#012 10: (clone()+0x6d) [0x7f3c0dca941d]#012 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>> > Oct 15 03:41:39.895400 mds1 ceph-mds: --- logging levels ---
>> > Oct 15 03:41:39.895473 mds1 ceph-mds:    0/ 5 none
>> > Oct 15 03:41:39.895473 mds1 ceph-mds:    0/ 1 lockdep
>> >
>>
>> Looks like the snap info for the root inode is corrupted. Did you do
>> any unusual operations before this happened?
>>
>>
>> >
>> > Cluster status information:
>> >
>> >   cluster:
>> >     id:     b8205875-e56f-4280-9e52-6aab9c758586
>> >     health: HEALTH_WARN
>> >             1 filesystem is degraded
>> >             1 nearfull osd(s)
>> >             11 pool(s) nearfull
>> >
>> >   services:
>> >     mon: 3 daemons, quorum mon1,mon2,mon3
>> >     mgr: mon1(active), standbys: mon2, mon3
>> >     mds: fs_padrao-1/1/1 up  {0=mds1=up:replay(laggy or crashed)}
>> >     osd: 90 osds: 90 up, 90 in
>> >
>> >   data:
>> >     pools:   11 pools, 1984 pgs
>> >     objects: 75.99 M objects, 285 TiB
>> >     usage:   457 TiB used, 181 TiB / 639 TiB avail
>> >     pgs:     1896 active+clean
>> >              87   active+clean+scrubbing+deep+repair
>> >              1    active+clean+scrubbing
>> >
>> >   io:
>> >     client:   89 KiB/s wr, 0 op/s rd, 3 op/s wr
>> >
>> > Has anyone seen anything like this?
>> >
>> > Regards,
>> > Gustavo.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


