Re: MDS Read-Only state in production CephFS

Did you at some point add a new data pool to the filesystem, and then
remove the pool?  With a little investigation I've found that the MDS
currently doesn't handle that properly:
http://tracker.ceph.com/issues/19401
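If that is what happened, one way to confirm it (a sketch against a generic Jewel-era cluster; the filesystem name "cephfs" is an assumption, substitute your own) is to compare the data pools the filesystem still references against the pools that actually exist:

```shell
# List each filesystem's metadata and data pools by name
ceph fs ls

# Dump the filesystem map, including the data_pools list by pool ID
ceph fs get cephfs

# List the pools that currently exist; a pool ID referenced in the
# filesystem map but absent here points at the removed-pool bug in
# issue 19401
ceph osd pool ls detail
```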

John

On Tue, Mar 28, 2017 at 6:11 PM, John Spray <jspray@xxxxxxxxxx> wrote:
> On Tue, Mar 28, 2017 at 5:54 PM, Brady Deetz <bdeetz@xxxxxxxxx> wrote:
>> Running Jewel 10.2.5 on my production cephfs cluster and came into this ceph
>> status
>>
>> [ceph-admin@mds1 brady]$ ceph status
>>     cluster 6f91f60c-7bc0-4aaa-a136-4a90851fbe10
>>      health HEALTH_WARN
>>             mds0: Behind on trimming (2718/30)
>>             mds0: MDS in read-only mode
>>      monmap e17: 5 mons at
>> {mon0=10.124.103.60:6789/0,mon1=10.124.103.61:6789/0,mon2=10.124.103.62:6789/0,osd2=10.124.103.72:6789/0,osd3=10.124.103.73:6789/0}
>>             election epoch 378, quorum 0,1,2,3,4 mon0,mon1,mon2,osd2,osd3
>>       fsmap e6817: 1/1/1 up {0=mds0=up:active}, 1 up:standby
>>      osdmap e172126: 235 osds: 235 up, 235 in
>>             flags sortbitwise,require_jewel_osds
>>       pgmap v18008949: 5696 pgs, 2 pools, 291 TB data, 112 Mobjects
>>             874 TB used, 407 TB / 1282 TB avail
>>                 5670 active+clean
>>                   13 active+clean+scrubbing+deep
>>                   13 active+clean+scrubbing
>>   client io 760 B/s rd, 0 op/s rd, 0 op/s wr
>>
>> I've tried rebooting both MDS servers. I've started a rolling reboot across
>> all of my OSD nodes, but each node takes about 10 minutes to fully rejoin,
>> so it's going to take a while. Any recommendations other than rebooting?
>
> As it says in the log, your MDSs are going read-only because of errors
> writing to the OSDs:
> 2017-03-28 08:04:12.379747 7f25ed0af700 -1 log_channel(cluster) log
> [ERR] : failed to store backtrace on ino 10003a398a6 object, pool 20,
> errno -2
>
> These messages are also scary and indicate that something has gone
> seriously wrong, either with the storage of the metadata or internally
> with the MDS:
> 2017-03-28 08:04:12.251543 7f25ef2b5700 -1 log_channel(cluster) log
> [ERR] : bad/negative dir size on 608 f(v9 m2017-03-28 07:56:45.803267
> -223=-221+-2)
> 2017-03-28 08:04:12.251564 7f25ef2b5700 -1 log_channel(cluster) log
> [ERR] : unmatched fragstat on 608, inode has f(v10 m2017-03-28
> 07:56:45.803267 -223=-221+-2), dirfrags have f(v0 m2017-03-28
> 07:56:45.803267)
>
> The case that I know of that causes ENOENT on object writes is when
> the pool no longer exists.  You can set "debug objecter = 10" on the
> MDS and look for a message like "check_op_pool_dne tid <something>
> concluding pool <pool> dne".
>
> Otherwise, go look at the OSD logs from the timestamp where the failed
> write is happening to see if there's anything there.
>
> John
>
>
>
>>
>> Attached are my mds logs during the failure.
>>
>> Any ideas?
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
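The diagnostic steps John describes can be sketched as shell commands. This is a hedged sketch, not a prescribed procedure: the MDS daemon name "mds0" and the log path are taken from this thread and may differ on other clusters, and the object name follows the convention that a file's backtrace lives on its first object, `<inode>.00000000`.

```shell
# Raise objecter debugging on the running MDS via its admin socket
ceph daemon mds.mds0 config set debug_objecter 10

# Watch the MDS log for the objecter concluding a pool no longer exists
grep 'check_op_pool_dne' /var/log/ceph/ceph-mds.mds0.log

# Map the failing backtrace object to its OSDs, then inspect that OSD's
# log around the failure timestamp. Substitute the name of pool 20 from
# the error above; if the pool was deleted, this command itself fails,
# which is telling.
ceph osd map <pool-name> 10003a398a6.00000000
```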
