Re: MDS Read-Only state in production CephFS

Yes, it should just be a question of deleting them.  When I tried it
here, I found that nothing in the deletion path objected to the
non-existence of the data pool, so it shouldn't complain.

If you want to make sure it's safe to subsequently install jewel
releases that might not have the fix, then make sure all vestiges are
gone by doing a "ceph daemon mds.<id> flush journal" on the mds node
after you're done deleting things.

John
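
A minimal sketch of that cleanup sequence, assuming the filesystem is
mounted at /mnt/cephfs, the affected directory is /mnt/cephfs/ec-test and
the active MDS id is mds0 (all three names are placeholders; substitute
your own):

    # remove everything whose layout still points at the deleted data pool
    rm -rf /mnt/cephfs/ec-test
    # then, on the MDS node, flush the journal so no stale references remain
    ceph daemon mds.mds0 flush journal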

On Wed, Mar 29, 2017 at 12:34 PM, Brady Deetz <bdeetz@xxxxxxxxx> wrote:
> At this point, I probably have some cleanup to perform. I imagine I need to
> delete any directory or file pointed at that pool. Anything else?
>
> On Mar 29, 2017 3:46 AM, "John Spray" <jspray@xxxxxxxxxx> wrote:
>>
>> On Wed, Mar 29, 2017 at 12:59 AM, Brady Deetz <bdeetz@xxxxxxxxx> wrote:
>> > That worked for us!
>> >
>> > Thank you very much for throwing that together in such a short time.
>> >
>> > How can I buy you a beer? Bitcoin?
>>
>> No problem, I appreciate the testing.
>>
>> John
>>
>> >
>> > On Mar 28, 2017 4:13 PM, "John Spray" <jspray@xxxxxxxxxx> wrote:
>> >>
>> >> On Tue, Mar 28, 2017 at 8:44 PM, Brady Deetz <bdeetz@xxxxxxxxx> wrote:
>> >> > Thanks John. Since we're on 10.2.5, the mds package has a dependency
>> >> > on 10.2.6.
>> >> >
>> >> > Do you feel it is safe to perform a cluster upgrade to 10.2.6 in this
>> >> > state?
>> >>
>> >> Yes, shouldn't be an issue to upgrade the whole system to 10.2.6 while
>> >> you're at it.  Just make a mental note that the "10.2.6-1.gdf5ca2d" is
>> >> a different 10.2.6 than the official release.
>> >>
>> >> I forget how picky the dependencies are; if they demand the *exact*
>> >> same version (including the trailing -1.gdf5ca2d) then I would just
>> >> use the candidate fix version for all the packages on the node where
>> >> you're running the MDS.
>> >>
>> >> John
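
A rough sketch of what installing the full matching set might look like on
the MDS node, assuming all of the 10.2.6-1.gdf5ca2d RPMs have already been
downloaded into the current directory (the glob below is illustrative; yum
resolves the inter-package dependencies in a single transaction):

    # upgrade every downloaded candidate-fix package together
    yum localinstall ./*10.2.6-1.gdf5ca2d.el7.x86_64.rpm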
>> >>
>> >> > [root@mds0 ceph-admin]# rpm -Uvh
>> >> > ceph-mds-10.2.6-1.gdf5ca2d.el7.x86_64.rpm
>> >> > error: Failed dependencies:
>> >> >         ceph-base = 1:10.2.6-1.gdf5ca2d.el7 is needed by
>> >> > ceph-mds-1:10.2.6-1.gdf5ca2d.el7.x86_64
>> >> >         ceph-mds = 1:10.2.5-0.el7 is needed by (installed)
>> >> > ceph-1:10.2.5-0.el7.x86_64
>> >> >
>> >> >
>> >> >
>> >> > On Tue, Mar 28, 2017 at 2:37 PM, John Spray <jspray@xxxxxxxxxx>
>> >> > wrote:
>> >> >>
>> >> >> On Tue, Mar 28, 2017 at 7:12 PM, Brady Deetz <bdeetz@xxxxxxxxx>
>> >> >> wrote:
>> >> >> > Thank you very much. I've located the directory whose layout is set
>> >> >> > against that pool. I've dug around to attempt to create a pool with
>> >> >> > the same ID as the deleted one, but for fairly obvious reasons, that
>> >> >> > doesn't seem to be possible.
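
For reference, one rough way to locate directories whose layout points at a
particular pool, assuming the filesystem is mounted at /mnt/cephfs (the mount
point is a placeholder): scan for directories carrying an explicit layout and
read the ceph.dir.layout.pool virtual xattr. Directories without their own
layout return "No such attribute", which is discarded here; with the pool
already deleted, the attribute may show a bare pool id rather than a name.

    find /mnt/cephfs -type d \
      -exec getfattr -n ceph.dir.layout.pool --absolute-names {} + 2>/dev/null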
>> >> >>
>> >> >> So there's a candidate fix on a branch called wip-19401-jewel; you can
>> >> >> see builds here:
>> >> >>
>> >> >> https://shaman.ceph.com/repos/ceph/wip-19401-jewel/df5ca2d8e3f930ddae5708c50c6495c03b3dc078/
>> >> >> -- click through to one of those and do "repo url" to get to some
>> >> >> built artifacts.
>> >> >>
>> >> >> Hopefully you're running one of centos 7, ubuntu xenial or ubuntu
>> >> >> trusty, and therefore one of those builds will work for you (use the
>> >> >> "default" variants rather than the "notcmalloc" variants) -- you
>> >> >> should only need to pick out the ceph-mds package rather than
>> >> >> upgrading everything.
>> >> >>
>> >> >> Cheers,
>> >> >> John
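
As a sketch, picking out just the ceph-mds package on CentOS 7 might look
like the following; the repo path is a placeholder for whatever the shaman
"repo url" page lists, and (as the later messages show) rpm's dependency
checks may force installing the full matching package set instead:

    # placeholder URL; take the real path from the shaman "repo url" link
    curl -O https://<shaman-repo-url>/el7/x86_64/ceph-mds-10.2.6-1.gdf5ca2d.el7.x86_64.rpm
    rpm -Uvh ceph-mds-10.2.6-1.gdf5ca2d.el7.x86_64.rpm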
>> >> >>
>> >> >>
>> >> >> > On Tue, Mar 28, 2017 at 1:08 PM, John Spray <jspray@xxxxxxxxxx>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> On Tue, Mar 28, 2017 at 6:45 PM, Brady Deetz <bdeetz@xxxxxxxxx>
>> >> >> >> wrote:
>> >> >> >> > If I follow the recommendations of this doc, do you suspect we
>> >> >> >> > will recover?
>> >> >> >> >
>> >> >> >> > http://docs.ceph.com/docs/jewel/cephfs/disaster-recovery/
>> >> >> >>
>> >> >> >> You might, but it's overkill and introduces its own risks -- your
>> >> >> >> metadata isn't really corrupt; you're just hitting a bug in the
>> >> >> >> running code where it's overreacting.  I'm writing a patch now.
>> >> >> >>
>> >> >> >> John
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> > On Tue, Mar 28, 2017 at 12:37 PM, Brady Deetz <bdeetz@xxxxxxxxx> wrote:
>> >> >> >> >>
>> >> >> >> >> I did do that. We were experimenting with an ec-backed pool on the
>> >> >> >> >> fs. It was stuck in an incomplete+creating state overnight for only
>> >> >> >> >> 128 pgs, so I deleted the pool this morning. At the time of
>> >> >> >> >> deletion, the only issue was the stuck 128 pgs.
>> >> >> >> >>
>> >> >> >> >> On Tue, Mar 28, 2017 at 12:29 PM, John Spray <jspray@xxxxxxxxxx> wrote:
>> >> >> >> >>>
>> >> >> >> >>> Did you at some point add a new data pool to the filesystem, and
>> >> >> >> >>> then remove the pool?  With a little investigation I've found that
>> >> >> >> >>> the MDS currently doesn't handle that properly:
>> >> >> >> >>> http://tracker.ceph.com/issues/19401
>> >> >> >> >>>
>> >> >> >> >>> John
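
For context, the add-then-remove sequence being described is roughly the
following; the pool, filesystem and directory names are made up, and the
exact commands run in this cluster are not shown in the thread:

    ceph osd pool create cephfs_ec_data 128 128 erasure    # new EC data pool
    ceph fs add_data_pool cephfs cephfs_ec_data             # attach it to the filesystem
    setfattr -n ceph.dir.layout.pool -v cephfs_ec_data /mnt/cephfs/ec-test
    # ... later, removing the pool while metadata still references it
    # triggers http://tracker.ceph.com/issues/19401
    ceph fs rm_data_pool cephfs cephfs_ec_data
    ceph osd pool delete cephfs_ec_data cephfs_ec_data --yes-i-really-really-mean-it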
>> >> >> >> >>>
>> >> >> >> >>> On Tue, Mar 28, 2017 at 6:11 PM, John Spray <jspray@xxxxxxxxxx> wrote:
>> >> >> >> >>> > On Tue, Mar 28, 2017 at 5:54 PM, Brady Deetz <bdeetz@xxxxxxxxx> wrote:
>> >> >> >> >>> >> Running Jewel 10.2.5 on my production cephfs cluster and came
>> >> >> >> >>> >> into this ceph status
>> >> >> >> >>> >>
>> >> >> >> >>> >> [ceph-admin@mds1 brady]$ ceph status
>> >> >> >> >>> >>     cluster 6f91f60c-7bc0-4aaa-a136-4a90851fbe10
>> >> >> >> >>> >>      health HEALTH_WARN
>> >> >> >> >>> >>             mds0: Behind on trimming (2718/30)
>> >> >> >> >>> >>             mds0: MDS in read-only mode
>> >> >> >> >>> >>      monmap e17: 5 mons at
>> >> >> >> >>> >> {mon0=10.124.103.60:6789/0,mon1=10.124.103.61:6789/0,mon2=10.124.103.62:6789/0,osd2=10.124.103.72:6789/0,osd3=10.124.103.73:6789/0}
>> >> >> >> >>> >>             election epoch 378, quorum 0,1,2,3,4
>> >> >> >> >>> >> mon0,mon1,mon2,osd2,osd3
>> >> >> >> >>> >>       fsmap e6817: 1/1/1 up {0=mds0=up:active}, 1
>> >> >> >> >>> >> up:standby
>> >> >> >> >>> >>      osdmap e172126: 235 osds: 235 up, 235 in
>> >> >> >> >>> >>             flags sortbitwise,require_jewel_osds
>> >> >> >> >>> >>       pgmap v18008949: 5696 pgs, 2 pools, 291 TB data, 112
>> >> >> >> >>> >> Mobjects
>> >> >> >> >>> >>             874 TB used, 407 TB / 1282 TB avail
>> >> >> >> >>> >>                 5670 active+clean
>> >> >> >> >>> >>                   13 active+clean+scrubbing+deep
>> >> >> >> >>> >>                   13 active+clean+scrubbing
>> >> >> >> >>> >>   client io 760 B/s rd, 0 op/s rd, 0 op/s wr
>> >> >> >> >>> >>
>> >> >> >> >>> >> I've tried rebooting both mds servers. I've started a rolling
>> >> >> >> >>> >> reboot across all of my osd nodes, but each node takes about 10
>> >> >> >> >>> >> minutes to fully rejoin, so it's going to take a while. Any
>> >> >> >> >>> >> recommendations other than reboot?
>> >> >> >> >>> >
>> >> >> >> >>> > As it says in the log, your MDSs are going read only because of
>> >> >> >> >>> > errors writing to the OSDs:
>> >> >> >> >>> > 2017-03-28 08:04:12.379747 7f25ed0af700 -1 log_channel(cluster) log
>> >> >> >> >>> > [ERR] : failed to store backtrace on ino 10003a398a6 object, pool
>> >> >> >> >>> > 20, errno -2
>> >> >> >> >>> >
>> >> >> >> >>> > These messages are also scary and indicate that something has
>> >> >> >> >>> > gone seriously wrong, either with the storage of the metadata or
>> >> >> >> >>> > internally with the MDS:
>> >> >> >> >>> > 2017-03-28 08:04:12.251543 7f25ef2b5700 -1 log_channel(cluster) log
>> >> >> >> >>> > [ERR] : bad/negative dir size on 608 f(v9 m2017-03-28 07:56:45.803267
>> >> >> >> >>> > -223=-221+-2)
>> >> >> >> >>> > 2017-03-28 08:04:12.251564 7f25ef2b5700 -1 log_channel(cluster) log
>> >> >> >> >>> > [ERR] : unmatched fragstat on 608, inode has f(v10 m2017-03-28
>> >> >> >> >>> > 07:56:45.803267 -223=-221+-2), dirfrags have f(v0 m2017-03-28
>> >> >> >> >>> > 07:56:45.803267)
>> >> >> >> >>> >
>> >> >> >> >>> > The case that I know of that causes ENOENT on object writes is
>> >> >> >> >>> > when the pool no longer exists.  You can set "debug objecter = 10"
>> >> >> >> >>> > on the MDS and look for a message like "check_op_pool_dne tid
>> >> >> >> >>> > <something> concluding pool <pool> dne".
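
One way to raise that debug level on a running MDS, assuming the admin
socket is reachable on the MDS host, the daemon id is mds0 and the default
log path is in use (both the id and the path are assumptions):

    ceph daemon mds.mds0 config set debug_objecter 10
    # then watch the MDS log for the line described above
    grep check_op_pool_dne /var/log/ceph/ceph-mds.mds0.log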
>> >> >> >> >>> >
>> >> >> >> >>> > Otherwise, go look at the OSD logs from the timestamp where the
>> >> >> >> >>> > failed write is happening to see if there's anything there.
>> >> >> >> >>> >
>> >> >> >> >>> > John
>> >> >> >> >>> >
>> >> >> >> >>> >
>> >> >> >> >>> >
>> >> >> >> >>> >>
>> >> >> >> >>> >> Attached are my mds logs during the failure.
>> >> >> >> >>> >>
>> >> >> >> >>> >> Any ideas?
>> >> >> >> >>> >>
>> >> >> >> >>> >> _______________________________________________
>> >> >> >> >>> >> ceph-users mailing list
>> >> >> >> >>> >> ceph-users@xxxxxxxxxxxxxx
>> >> >> >> >>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >> >> >>> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >
>> >> >> >
>> >> >> >
>> >> >
>> >> >
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


