Re: MDS Read-Only state in production CephFS

Brady Deetz <bdeetz@xxxxxxxxx> · Tue, 28 Mar 2017 18:59:22 -0500

That worked for us!
Thank you very much for throwing that together in such a short time. 

How can I buy you a beer? Bitcoin? 

On Mar 28, 2017 4:13 PM, "John Spray" <jspray@xxxxxxxxxx> wrote:
On Tue, Mar 28, 2017 at 8:44 PM, Brady Deetz <bdeetz@xxxxxxxxx> wrote:

> Thanks John. Since we're on 10.2.5, the mds package has a dependency on
> 10.2.6
>
> Do you feel it is safe to perform a cluster upgrade to 10.2.6 in this state?

Yes, shouldn't be an issue to upgrade the whole system to 10.2.6 while
you're at it.  Just make a mental note that the "10.2.6-1.gdf5ca2d" is
a different 10.2.6 than the official release.

I forget how picky the dependencies are.  If they demand the *exact*
same version (including the trailing -1.gdf5ca2d) then I would just
use the candidate fix version for all the packages on the node where
you're running the MDS.

John

> [root@mds0 ceph-admin]# rpm -Uvh ceph-mds-10.2.6-1.gdf5ca2d.el7.x86_64.rpm
> error: Failed dependencies:
>         ceph-base = 1:10.2.6-1.gdf5ca2d.el7 is needed by ceph-mds-1:10.2.6-1.gdf5ca2d.el7.x86_64
>         ceph-mds = 1:10.2.5-0.el7 is needed by (installed) ceph-1:10.2.5-0.el7.x86_64
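
The two "Failed dependencies" lines above are exact-match (`=`) requirements on the full epoch:version-release string. A minimal Python sketch of why neither the installed packages nor a plain 10.2.6 would satisfy the candidate build follows. The helper names `parse_evr` and `satisfies_exact` are made up for illustration, the installed ceph-base version is assumed to match the installed ceph package, and the "official" release string is hypothetical. This is not how rpm itself is implemented.

```python
from typing import NamedTuple


class EVR(NamedTuple):
    epoch: int
    version: str
    release: str


def parse_evr(text: str) -> EVR:
    """Split an RPM 'epoch:version-release' string into its three parts."""
    epoch_part, sep, rest = text.partition(":")
    if not sep:
        # No epoch given; RPM treats a missing epoch as 0.
        epoch_part, rest = "0", text
    version, _, release = rest.rpartition("-")
    if not version:
        # No '-' present: the whole remainder is the version.
        version, release = release, ""
    return EVR(int(epoch_part), version, release)


def satisfies_exact(required: str, installed: str) -> bool:
    """True only if an '=' dependency on `required` is met by `installed`."""
    return parse_evr(required) == parse_evr(installed)


candidate = "1:10.2.6-1.gdf5ca2d.el7"  # required by the candidate ceph-mds
installed = "1:10.2.5-0.el7"           # assumed: ceph-base matches installed ceph
official = "1:10.2.6-0.el7"            # hypothetical official release string

print(satisfies_exact(candidate, installed))
print(satisfies_exact(candidate, official))
```

Both comparisons fail because the release field differs, which is consistent with the advice to take every package on the MDS node from the same candidate build.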

>

>

>

> On Tue, Mar 28, 2017 at 2:37 PM, John Spray <jspray@xxxxxxxxxx> wrote:

>>

>> On Tue, Mar 28, 2017 at 7:12 PM, Brady Deetz <bdeetz@xxxxxxxxx> wrote:

>> > Thank you very much. I've located the directory whose layout points at
>> > that pool. I've dug around to attempt to create a pool with the same ID
>> > as the deleted one, but for fairly obvious reasons, that option doesn't
>> > seem to exist.
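
For anyone else hunting for directories whose layout references a particular pool, here is a rough Python sketch. It relies on the CephFS virtual xattr `ceph.dir.layout.pool`, read through `os.getxattr` (Linux only). The function name `dirs_using_pool` and the injectable `read_xattr` parameter are made up for illustration, and what string the xattr returns for an already-deleted pool (name or numeric ID) is an assumption to check on your own cluster.

```python
import os
from typing import Callable, Iterator, Optional

LAYOUT_POOL_XATTR = "ceph.dir.layout.pool"


def dirs_using_pool(
    root: str,
    pool: str,
    read_xattr: Optional[Callable[[str, str], bytes]] = None,
) -> Iterator[str]:
    """Yield directories under `root` whose explicit layout names `pool`."""
    if read_xattr is None:
        read_xattr = os.getxattr  # Linux only
    for dirpath, _dirnames, _filenames in os.walk(root):
        try:
            value = read_xattr(dirpath, LAYOUT_POOL_XATTR)
        except OSError:
            # No explicit layout set here; the directory inherits one.
            continue
        if value.decode().strip() == pool:
            yield dirpath
```

Walking a large filesystem this way is slow and puts load on the MDS, so it is better pointed at a subtree you already suspect than at the root.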

>>

>> So there's a candidate fix on a branch called wip-19401-jewel, you can
>> see builds here:
>>
>> https://shaman.ceph.com/repos/ceph/wip-19401-jewel/df5ca2d8e3f930ddae5708c50c6495c03b3dc078/
>> -- click through to one of those and do "repo url" to get to some
>> built artifacts.
>>
>> Hopefully you're running one of centos 7, ubuntu xenial or ubuntu
>> trusty, and therefore one of those builds will work for you (use the
>> "default" variants rather than the "notcmalloc" variants) -- you
>> should only need to pick out the ceph-mds package rather than
>> upgrading everything.

>>

>> Cheers,

>> John

>>

>>

>> > On Tue, Mar 28, 2017 at 1:08 PM, John Spray <jspray@xxxxxxxxxx> wrote:

>> >>

>> >> On Tue, Mar 28, 2017 at 6:45 PM, Brady Deetz <bdeetz@xxxxxxxxx> wrote:

>> >> > If I follow the recommendations of this doc, do you suspect we will
>> >> > recover?
>> >> >
>> >> > http://docs.ceph.com/docs/jewel/cephfs/disaster-recovery/
>> >>
>> >> You might, but it's overkill and introduces its own risks -- your
>> >> metadata isn't really corrupt, you're just hitting a bug in the
>> >> running code where it's overreacting.  I'm writing a patch now.

>> >>

>> >> John

>> >>

>> >>

>> >>

>> >>

>> >> > On Tue, Mar 28, 2017 at 12:37 PM, Brady Deetz <bdeetz@xxxxxxxxx> wrote:
>> >> >>
>> >> >> I did do that. We were experimenting with an EC-backed pool on the fs.
>> >> >> It was stuck in an incomplete+creating state overnight for only 128
>> >> >> pgs, so I deleted the pool this morning. At the time of deletion, the
>> >> >> only issue was the stuck 128 pgs.

>> >> >>

>> >> >> On Tue, Mar 28, 2017 at 12:29 PM, John Spray <jspray@xxxxxxxxxx> wrote:
>> >> >>>
>> >> >>> Did you at some point add a new data pool to the filesystem, and then
>> >> >>> remove the pool?  With a little investigation I've found that the MDS
>> >> >>> currently doesn't handle that properly:
>> >> >>> http://tracker.ceph.com/issues/19401

>> >> >>>

>> >> >>> John

>> >> >>>

>> >> >>> On Tue, Mar 28, 2017 at 6:11 PM, John Spray <jspray@xxxxxxxxxx> wrote:
>> >> >>> > On Tue, Mar 28, 2017 at 5:54 PM, Brady Deetz <bdeetz@xxxxxxxxx> wrote:
>> >> >>> >> Running Jewel 10.2.5 on my production cephfs cluster and came in to
>> >> >>> >> this ceph status:

>> >> >>> >>

>> >> >>> >> [ceph-admin@mds1 brady]$ ceph status
>> >> >>> >>     cluster 6f91f60c-7bc0-4aaa-a136-4a90851fbe10
>> >> >>> >>      health HEALTH_WARN
>> >> >>> >>             mds0: Behind on trimming (2718/30)
>> >> >>> >>             mds0: MDS in read-only mode
>> >> >>> >>      monmap e17: 5 mons at {mon0=10.124.103.60:6789/0,mon1=10.124.103.61:6789/0,mon2=10.124.103.62:6789/0,osd2=10.124.103.72:6789/0,osd3=10.124.103.73:6789/0}
>> >> >>> >>             election epoch 378, quorum 0,1,2,3,4 mon0,mon1,mon2,osd2,osd3
>> >> >>> >>       fsmap e6817: 1/1/1 up {0=mds0=up:active}, 1 up:standby
>> >> >>> >>      osdmap e172126: 235 osds: 235 up, 235 in
>> >> >>> >>             flags sortbitwise,require_jewel_osds
>> >> >>> >>       pgmap v18008949: 5696 pgs, 2 pools, 291 TB data, 112 Mobjects
>> >> >>> >>             874 TB used, 407 TB / 1282 TB avail
>> >> >>> >>                 5670 active+clean
>> >> >>> >>                   13 active+clean+scrubbing+deep
>> >> >>> >>                   13 active+clean+scrubbing
>> >> >>> >>   client io 760 B/s rd, 0 op/s rd, 0 op/s wr

>> >> >>> >>

>> >> >>> >> I've tried rebooting both mds servers. I've started a rolling reboot
>> >> >>> >> across all of my osd nodes, but each node takes about 10 minutes to
>> >> >>> >> fully rejoin, so it's going to take a while. Any recommendations
>> >> >>> >> other than reboot?

>> >> >>> >

>> >> >>> > As it says in the log, your MDSs are going read only because of
>> >> >>> > errors writing to the OSDs:
>> >> >>> > 2017-03-28 08:04:12.379747 7f25ed0af700 -1 log_channel(cluster) log [ERR] : failed to store backtrace on ino 10003a398a6 object, pool 20, errno -2

>> >> >>> >

>> >> >>> > These messages are also scary and indicate that something has gone
>> >> >>> > seriously wrong, either with the storage of the metadata or
>> >> >>> > internally with the MDS:
>> >> >>> > 2017-03-28 08:04:12.251543 7f25ef2b5700 -1 log_channel(cluster) log [ERR] : bad/negative dir size on 608 f(v9 m2017-03-28 07:56:45.803267 -223=-221+-2)
>> >> >>> > 2017-03-28 08:04:12.251564 7f25ef2b5700 -1 log_channel(cluster) log [ERR] : unmatched fragstat on 608, inode has f(v10 m2017-03-28 07:56:45.803267 -223=-221+-2), dirfrags have f(v0 m2017-03-28 07:56:45.803267)

>> >> >>> >

>> >> >>> > The case that I know of that causes ENOENT on object writes is when
>> >> >>> > the pool no longer exists.  You can set "debug objecter = 10" on the
>> >> >>> > MDS and look for a message like "check_op_pool_dne tid <something>
>> >> >>> > concluding pool <pool> dne".
>> >> >>> >
>> >> >>> > Otherwise, go look at the OSD logs from the timestamp where the
>> >> >>> > failed write is happening to see if there's anything there.
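
The diagnosis above hinges on two fields in the "failed to store backtrace" line: the pool ID and the errno. A small Python sketch that pulls them out of an MDS log line and translates the errno into its symbolic name is shown here. The regular expression is written against the single line quoted in this thread, so treat the exact message format as an assumption. The set of existing pool IDs is hypothetical and would in practice come from the cluster.

```python
import errno
import re
from typing import NamedTuple, Optional

BACKTRACE_ERR = re.compile(
    r"failed to store backtrace on ino (?P<ino>[0-9a-f]+) object, "
    r"pool (?P<pool>\d+), errno (?P<errno>-?\d+)"
)


class BacktraceFailure(NamedTuple):
    ino: int
    pool: int
    error: str


def parse_backtrace_failure(line: str) -> Optional[BacktraceFailure]:
    """Extract inode, pool ID and errno name from an MDS error line."""
    match = BACKTRACE_ERR.search(line)
    if match is None:
        return None
    code = abs(int(match.group("errno")))
    return BacktraceFailure(
        ino=int(match.group("ino"), 16),  # the log prints the inode in hex
        pool=int(match.group("pool")),
        error=errno.errorcode.get(code, str(code)),
    )


line = (
    "2017-03-28 08:04:12.379747 7f25ed0af700 -1 log_channel(cluster) log "
    "[ERR] : failed to store backtrace on ino 10003a398a6 object, pool 20, "
    "errno -2"
)
failure = parse_backtrace_failure(line)
existing_pools = {1, 2}  # hypothetical IDs of the pools still in the cluster

if failure is not None and failure.pool not in existing_pools:
    print(f"pool {failure.pool} is gone; writes fail with {failure.error}")
```

A write that fails with ENOENT against a pool ID that is no longer in the cluster's pool list points at the deleted-pool case rather than at real metadata damage.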

>> >> >>> >

>> >> >>> > John

>> >> >>> >

>> >> >>> >

>> >> >>> >

>> >> >>> >>

>> >> >>> >> Attached are my mds logs during the failure.

>> >> >>> >>

>> >> >>> >> Any ideas?

>> >> >>> >>

>> >> >>> >> _______________________________________________

>> >> >>> >> ceph-users mailing list

>> >> >>> >> ceph-users@xxxxxxxxxxxxxx

>> >> >>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
