Thank you very much. I've located the directory whose layout points at that pool. I've dug around trying to find a way to create a pool with the same ID as the deleted one, but for fairly obvious reasons, that option doesn't seem to exist.
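For anyone following along, here is a sketch of how the directory's layout can be inspected and repointed via the CephFS layout vxattrs (the mount point and path are placeholders, and the setfattr may not be accepted while the MDS is read-only):

  # show the directory's current layout, including the pool it targets
  getfattr -n ceph.dir.layout /mnt/cephfs/some/dir

  # repoint the layout at a data pool that still exists, e.g. the default one
  setfattr -n ceph.dir.layout.pool -v cephfs_data /mnt/cephfs/some/dir

New files created under that directory would then go to the surviving pool; existing files keep the layout they were written with.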
On Tue, Mar 28, 2017 at 1:08 PM, John Spray <jspray@xxxxxxxxxx> wrote:
On Tue, Mar 28, 2017 at 6:45 PM, Brady Deetz <bdeetz@xxxxxxxxx> wrote:
> If I follow the recommendations of this doc, do you suspect we will recover?
>
> http://docs.ceph.com/docs/jewel/cephfs/disaster-recovery/
You might, but it's overkill and introduces its own risks -- your
metadata isn't really corrupt, you're just hitting a bug in the
running code where it's overreacting. I'm writing a patch now.
John
> On Tue, Mar 28, 2017 at 12:37 PM, Brady Deetz <bdeetz@xxxxxxxxx> wrote:
>>
>> I did do that. We were experimenting with an EC-backed pool on the fs. It
>> was stuck in an incomplete+creating state overnight for only 128 pgs, so I
>> deleted the pool this morning. At the time of deletion, the only issue was
>> the 128 stuck pgs.
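>>
>> (If it helps, a quick way to spot the dangling reference -- hedging a bit
>> on exact output by release: "ceph fs ls" lists the data pools the
>> filesystem still references, and "ceph osd pool ls detail" lists the pools
>> that actually exist, so comparing the two should show whether the deleted
>> pool is still attached to the fs.)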
>>
>> On Tue, Mar 28, 2017 at 12:29 PM, John Spray <jspray@xxxxxxxxxx> wrote:
>>>
>>> Did you at some point add a new data pool to the filesystem, and then
>>> remove the pool? With a little investigation I've found that the MDS
>>> currently doesn't handle that properly:
>>> http://tracker.ceph.com/issues/19401
>>>
>>> John
>>>
>>> On Tue, Mar 28, 2017 at 6:11 PM, John Spray <jspray@xxxxxxxxxx> wrote:
>>> > On Tue, Mar 28, 2017 at 5:54 PM, Brady Deetz <bdeetz@xxxxxxxxx> wrote:
>>> >> Running Jewel 10.2.5 on my production cephfs cluster and came in to
>>> >> find this ceph status:
>>> >>
>>> >> [ceph-admin@mds1 brady]$ ceph status
>>> >> cluster 6f91f60c-7bc0-4aaa-a136-4a90851fbe10
>>> >> health HEALTH_WARN
>>> >> mds0: Behind on trimming (2718/30)
>>> >> mds0: MDS in read-only mode
>>> >> monmap e17: 5 mons at
>>> >>
>>> >> {mon0=10.124.103.60:6789/0,mon1=10.124.103.61:6789/0,mon2=10.124.103.62:6789/0,osd2=10.124.103.72:6789/0,osd3=10.124.103.73:6789/0}
>>> >> election epoch 378, quorum 0,1,2,3,4
>>> >> mon0,mon1,mon2,osd2,osd3
>>> >> fsmap e6817: 1/1/1 up {0=mds0=up:active}, 1 up:standby
>>> >> osdmap e172126: 235 osds: 235 up, 235 in
>>> >> flags sortbitwise,require_jewel_osds
>>> >> pgmap v18008949: 5696 pgs, 2 pools, 291 TB data, 112 Mobjects
>>> >> 874 TB used, 407 TB / 1282 TB avail
>>> >> 5670 active+clean
>>> >> 13 active+clean+scrubbing+deep
>>> >> 13 active+clean+scrubbing
>>> >> client io 760 B/s rd, 0 op/s rd, 0 op/s wr
>>> >>
>>> >> I've tried rebooting both mds servers. I've started a rolling reboot
>>> >> across all of my osd nodes, but each node takes about 10 minutes to
>>> >> fully rejoin, so it's going to take a while. Any recommendations other
>>> >> than reboot?
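>>> >>
>>> >> (Aside, in case it's useful: for a rolling reboot, setting the noout
>>> >> flag first with "ceph osd set noout" keeps the cluster from rebalancing
>>> >> while each node is down, and "ceph osd unset noout" clears it once
>>> >> everything has rejoined.)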
>>> >
>>> > As it says in the log, your MDSs are going read only because of errors
>>> > writing to the OSDs:
>>> > 2017-03-28 08:04:12.379747 7f25ed0af700 -1 log_channel(cluster) log
>>> > [ERR] : failed to store backtrace on ino 10003a398a6 object, pool 20,
>>> > errno -2
>>> >
>>> > These messages are also scary and indicate that something has gone
>>> > seriously wrong, either with the storage of the metadata or internally
>>> > with the MDS:
>>> > 2017-03-28 08:04:12.251543 7f25ef2b5700 -1 log_channel(cluster) log
>>> > [ERR] : bad/negative dir size on 608 f(v9 m2017-03-28 07:56:45.803267
>>> > -223=-221+-2)
>>> > 2017-03-28 08:04:12.251564 7f25ef2b5700 -1 log_channel(cluster) log
>>> > [ERR] : unmatched fragstat on 608, inode has f(v10 m2017-03-28
>>> > 07:56:45.803267 -223=-221+-2), dirfrags have f(v0 m2017-03-28
>>> > 07:56:45.803267)
>>> >
>>> > The case that I know of that causes ENOENT on object writes is when
>>> > the pool no longer exists. You can set "debug objecter = 10" on the
>>> > MDS and look for a message like "check_op_pool_dne tid <something>
>>> > concluding pool <pool> dne".
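>>> >
>>> > A sketch of that, assuming the active MDS is mds0 and the default log
>>> > location (adjust the daemon name and path to match your setup):
>>> >
>>> >   ceph tell mds.mds0 injectargs '--debug_objecter 10'
>>> >   grep check_op_pool_dne /var/log/ceph/ceph-mds.mds0.log
>>> >
>>> > The same setting can also go under [mds] in ceph.conf as
>>> > "debug objecter = 10", followed by a daemon restart.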
>>> >
>>> > Otherwise, go look at the OSD logs from the timestamp where the failed
>>> > write is happening to see if there's anything there.
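>>> >
>>> > (Something along the lines of
>>> >   grep '2017-03-28 08:04' /var/log/ceph/ceph-osd.*.log
>>> > on each OSD node, assuming default log paths, should narrow down which
>>> > OSD rejected the write and why.)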
>>> >
>>> > John
>>> >
>>> >
>>> >
>>> >>
>>> >> Attached are my mds logs during the failure.
>>> >>
>>> >> Any ideas?
>>> >>
>>> >> _______________________________________________
>>> >> ceph-users mailing list
>>> >> ceph-users@xxxxxxxxxxxxxx
>>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >>
>>
>>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com