Re: MDS Read-Only state in production CephFS

On Tue, Mar 28, 2017 at 7:12 PM, Brady Deetz <bdeetz@xxxxxxxxx> wrote:
> Thank you very much. I've located the directory whose layout points at
> that pool. I've also dug around for a way to re-create a pool with the same
> ID as the deleted one, but for fairly obvious reasons, that option doesn't
> seem to exist.
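>
> (For reference, I tracked it down by checking the layout xattrs from a
> client mount, roughly like this -- the path below is just an example:)
>
>   $ getfattr -n ceph.dir.layout.pool /mnt/cephfs/some/dir   # prints that dir's data pool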

So there's a candidate fix on a branch called wip-19401-jewel; you can
see the builds here:
https://shaman.ceph.com/repos/ceph/wip-19401-jewel/df5ca2d8e3f930ddae5708c50c6495c03b3dc078/
-- click through to one of those builds and follow "repo url" to get to the
built artifacts.

Hopefully you're running one of CentOS 7, Ubuntu Xenial, or Ubuntu
Trusty, so one of those builds will work for you (use the "default"
variants rather than the "notcmalloc" variants). You should only need to
pick out the ceph-mds package rather than upgrading everything.
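
For example, on CentOS 7 something along these lines should work (the repo
URL is a placeholder for whatever the "repo url" link gives you, and the
daemon id is an assumption -- adjust both to match your setup):

  $ sudo curl -o /etc/yum.repos.d/wip-19401-jewel.repo https://.../repo   # from the "repo url" link
  $ sudo yum update ceph-mds                  # pulls just the patched MDS package
  $ sudo systemctl restart ceph-mds@mds0      # restart onto the new binary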

Cheers,
John


> On Tue, Mar 28, 2017 at 1:08 PM, John Spray <jspray@xxxxxxxxxx> wrote:
>>
>> On Tue, Mar 28, 2017 at 6:45 PM, Brady Deetz <bdeetz@xxxxxxxxx> wrote:
>> > If I follow the recommendations of this doc, do you suspect we will
>> > recover?
>> >
>> > http://docs.ceph.com/docs/jewel/cephfs/disaster-recovery/
>>
>> You might, but it's overkill and introduces its own risks -- your
>> metadata isn't really corrupt; you're just hitting a bug in the
>> running code where it's overreacting. I'm writing a patch now.
>>
>> John
>>
>>
>>
>>
>> > On Tue, Mar 28, 2017 at 12:37 PM, Brady Deetz <bdeetz@xxxxxxxxx> wrote:
>> >>
>> >> I did do that. We were experimenting with an EC-backed pool on the fs. It
>> >> was stuck in an incomplete+creating state overnight for only 128 PGs, so I
>> >> deleted the pool this morning. At the time of deletion, the only issue was
>> >> the stuck 128 PGs.
>> >>
>> >> On Tue, Mar 28, 2017 at 12:29 PM, John Spray <jspray@xxxxxxxxxx> wrote:
>> >>>
>> >>> Did you at some point add a new data pool to the filesystem, and then
>> >>> remove the pool?  With a little investigation I've found that the MDS
>> >>> currently doesn't handle that properly:
>> >>> http://tracker.ceph.com/issues/19401
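>> >>>
>> >>> (Roughly, the sequence that triggers it is the one below -- the pool,
>> >>> fs and path names are just examples:)
>> >>>
>> >>>   $ ceph osd pool create extradata 128 128                          # the extra data pool
>> >>>   $ ceph fs add_data_pool cephfs extradata                          # attach it to the fs
>> >>>   $ setfattr -n ceph.dir.layout.pool -v extradata /mnt/cephfs/dir   # point a dir at it
>> >>>   $ ceph fs rm_data_pool cephfs extradata                           # then later, detach...
>> >>>   $ ceph osd pool delete extradata extradata --yes-i-really-really-mean-it   # ...and delete it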
>> >>>
>> >>> John
>> >>>
>> >>> On Tue, Mar 28, 2017 at 6:11 PM, John Spray <jspray@xxxxxxxxxx> wrote:
>> >>> > On Tue, Mar 28, 2017 at 5:54 PM, Brady Deetz <bdeetz@xxxxxxxxx> wrote:
>> >>> >> Running Jewel 10.2.5 on my production CephFS cluster, and I came in to
>> >>> >> this ceph status:
>> >>> >>
>> >>> >> [ceph-admin@mds1 brady]$ ceph status
>> >>> >>     cluster 6f91f60c-7bc0-4aaa-a136-4a90851fbe10
>> >>> >>      health HEALTH_WARN
>> >>> >>             mds0: Behind on trimming (2718/30)
>> >>> >>             mds0: MDS in read-only mode
>> >>> >>      monmap e17: 5 mons at
>> >>> >>
>> >>> >>
>> >>> >> {mon0=10.124.103.60:6789/0,mon1=10.124.103.61:6789/0,mon2=10.124.103.62:6789/0,osd2=10.124.103.72:6789/0,osd3=10.124.103.73:6789/0}
>> >>> >>             election epoch 378, quorum 0,1,2,3,4
>> >>> >> mon0,mon1,mon2,osd2,osd3
>> >>> >>       fsmap e6817: 1/1/1 up {0=mds0=up:active}, 1 up:standby
>> >>> >>      osdmap e172126: 235 osds: 235 up, 235 in
>> >>> >>             flags sortbitwise,require_jewel_osds
>> >>> >>       pgmap v18008949: 5696 pgs, 2 pools, 291 TB data, 112 Mobjects
>> >>> >>             874 TB used, 407 TB / 1282 TB avail
>> >>> >>                 5670 active+clean
>> >>> >>                   13 active+clean+scrubbing+deep
>> >>> >>                   13 active+clean+scrubbing
>> >>> >>   client io 760 B/s rd, 0 op/s rd, 0 op/s wr
>> >>> >>
>> >>> >> I've tried rebooting both MDS servers. I've started a rolling reboot
>> >>> >> across all of my OSD nodes, but each node takes about 10 minutes to
>> >>> >> fully rejoin, so it's going to take a while. Any recommendations other
>> >>> >> than rebooting?
>> >>> >
>> >>> > As it says in the log, your MDSs are going read-only because of errors
>> >>> > writing to the OSDs:
>> >>> > 2017-03-28 08:04:12.379747 7f25ed0af700 -1 log_channel(cluster) log [ERR] : failed to store backtrace on ino 10003a398a6 object, pool 20, errno -2
>> >>> >
>> >>> > These messages are also scary and indicate that something has gone
>> >>> > seriously wrong, either with the storage of the metadata or internally
>> >>> > with the MDS:
>> >>> > 2017-03-28 08:04:12.251543 7f25ef2b5700 -1 log_channel(cluster) log [ERR] : bad/negative dir size on 608 f(v9 m2017-03-28 07:56:45.803267 -223=-221+-2)
>> >>> > 2017-03-28 08:04:12.251564 7f25ef2b5700 -1 log_channel(cluster) log [ERR] : unmatched fragstat on 608, inode has f(v10 m2017-03-28 07:56:45.803267 -223=-221+-2), dirfrags have f(v0 m2017-03-28 07:56:45.803267)
>> >>> >
>> >>> > The case that I know of that causes ENOENT on object writes is when
>> >>> > the pool no longer exists.  You can set "debug objecter = 10" on the
>> >>> > MDS and look for a message like "check_op_pool_dne tid <something>
>> >>> > concluding pool <pool> dne".
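>> >>> >
>> >>> > (Something like this should do it on the active MDS host -- the daemon
>> >>> > name "mds0" and the default log path are assumptions on my part:)
>> >>> >
>> >>> >   $ ceph daemon mds.mds0 config set debug_objecter 10
>> >>> >   $ grep check_op_pool_dne /var/log/ceph/ceph-mds.mds0.log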
>> >>> >
>> >>> > Otherwise, go look at the OSD logs from the timestamp where the
>> >>> > failed
>> >>> > write is happening to see if there's anything there.
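>> >>> >
>> >>> > (On an OSD host, assuming the default log location, roughly:)
>> >>> >
>> >>> >   $ grep '2017-03-28 08:04' /var/log/ceph/ceph-osd.*.log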
>> >>> >
>> >>> > John
>> >>> >
>> >>> >
>> >>> >
>> >>> >>
>> >>> >> Attached are my mds logs during the failure.
>> >>> >>
>> >>> >> Any ideas?
>> >>> >>
>> >>> >> _______________________________________________
>> >>> >> ceph-users mailing list
>> >>> >> ceph-users@xxxxxxxxxxxxxx
>> >>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>> >>
>> >>
>> >>
>> >
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


