Re: Ceph mds stuck on replay forever

Regards,
Zhi Zhang (David)
Contact: zhang.david2011@xxxxxxxxx
              zhangz.david@xxxxxxxxxxx


On Mon, Jan 11, 2016 at 5:06 PM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> On Mon, Jan 11, 2016 at 11:33 AM, Zhi Zhang <zhang.david2011@xxxxxxxxx> wrote:
>> Hi Guys,
>>
>> Last week one of our clusters became unavailable because the MDSes were
>> always stuck in the replay state.
>>
>> From the log, we found some clues:
>>
>> The MDS wants to replay the log from 31885100783 to 31935430656. The
>> last prefetch expects 4194304 bytes from one MDS log object. At this
>> point, requested_pos is advanced to write_pos (31935430656), so no
>> further prefetch will be issued.
>>
>> But this MDS log object (already the latest one) is only 4192831 bytes,
>> so read_pos can finally reach only 31935429183, which is less than the
>> expected write_pos. The MDS wants to read more for replay, but there is
>> no more log data to read, so the MDS is stuck forever.
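>>
>> To make the failure mode concrete, here is a minimal C++ sketch (a
>> simplified model with field names taken from the log output, not the
>> actual Ceph Journaler code) of why replay can never finish once the
>> header's write_pos points past the data that actually reached disk:
>>
>> #include <cassert>
>> #include <cstdint>
>>
>> struct JournalerModel {
>>     uint64_t read_pos;       // next byte replay will consume
>>     uint64_t received_pos;   // end of data actually returned by reads
>>     uint64_t requested_pos;  // end of data already requested from RADOS
>>     uint64_t write_pos;      // journal end according to the header
>>
>>     // Prefetch issues reads only while requested_pos < write_pos.
>>     void prefetch() {
>>         if (requested_pos < write_pos)
>>             requested_pos = write_pos;   // the final 4194304-byte read
>>     }
>>
>>     // A read comes back short when the backing object is truncated.
>>     void finish_read(uint64_t bytes) { received_pos += bytes; }
>>
>>     // Simplification of _is_readable(): replay needs more buffered
>>     // data, but with received_pos stuck below write_pos and nothing
>>     // left to request, this stays false forever.
>>     bool is_readable() const { return received_pos >= write_pos; }
>> };
>>
>> int main() {
>>     JournalerModel j{31889294570ULL, 31931236352ULL,
>>                      31931236352ULL, 31935430656ULL};
>>     j.prefetch();              // requested_pos -> 31935430656
>>     j.finish_read(4192831);    // object holds 4192831 of 4194304 bytes
>>     assert(!j.is_readable());  // received_pos == 31935429183: stuck
>>     return 0;
>> }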
>>
>> See below mds logs:
>> ----------------------------------------------------------------------------------------
>> 2016-01-08 13:26:34.468773 7f16a78a7700 10 mds.0.log replay start,
>> from 31885100783 to 31935430656
>>
>> 2016-01-08 13:26:35.164194 7f16a50a2700 10 mds.0.journaler(ro)
>> _issue_read reading 31931236352~4194304, read pointers
>> 31889294570/31931236352/31935430656
>> 2016-01-08 13:26:35.164219 7f16a50a2700 10 mds.0.objecter _op_submit
>> op 0x536ba80
>> 2016-01-08 13:26:35.164235 7f16a50a2700 10 mds.0.objecter _calc_target
>>  pgid 2.ccfa8253 acting [0,6]
>> 2016-01-08 13:26:35.164241 7f16a50a2700 20 mds.0.objecter _get_session
>> s=0x53b0f00 osd=0 2
>> 2016-01-08 13:26:35.164243 7f16a50a2700 20 mds.0.objecter  note: not
>> requesting commit
>> 2016-01-08 13:26:35.164256 7f16a50a2700 10 mds.0.objecter _op_submit
>> oid 200.00001dbd @2 @2 [read 0~4194304] tid 0 osd.0
>>
>> 2016-01-08 13:26:35.172434 7f16a98ab700 10 mds.0.objecter ms_dispatch
>> 0x5370000 osd_op_reply(22 200.00001dbd [read 0~4192831] v0'0 uv42142
>> ondisk = 0) v6
>> 2016-01-08 13:26:35.172493 7f16a98ab700 10 mds.0.objecter in handle_osd_op_reply
>>
>> 2016-01-08 13:26:35.172832 7f16a78a7700 10 mds.0.journaler(ro)
>> _finish_read got 31931236352~4192831
>> 2016-01-08 13:26:35.172852 7f16a78a7700 10 mds.0.journaler(ro)
>> _assimilate_prefetch 31931236352~4192831
>> 2016-01-08 13:26:35.172896 7f16a78a7700 10 mds.0.journaler(ro)
>> _assimilate_prefetch read_buf now 31889339375~46089808, read pointers
>> 31889339375/31935429183/31935430656
>>
>> 2016-01-08 13:26:42.541242 7f16a50a2700 10 mds.0.journaler(ro)
>> _is_readable read_buf.length() == 0, but need 12 for next entry;
>> fetch_len is 41943040
>> 2016-01-08 13:26:42.541251 7f16a50a2700 10 mds.0.journaler(ro)
>> _is_readable: not readable, returning false
>> 2016-01-08 13:26:42.541261 7f16a50a2700 10 mds.0.journaler(ro) _prefetch
>> 2016-01-08 13:26:42.541272 7f16a50a2700 10 mds.0.journaler(ro)
>> _prefetch fetch_len=41943040 read_pos=31935429183 adjustment=1473
>> target=31977373696 write_pos=31935430656
>> 2016-01-08 13:26:42.541282 7f16a50a2700 10 mds.0.journaler(ro)
>> _is_readable read_buf.length() == 0, but need 12 for next entry;
>> fetch_len is 41943040
>> 2016-01-08 13:26:42.541292 7f16a50a2700 10 mds.0.journaler(ro)
>> _is_readable: not readable, returning false
>> 2016-01-08 13:26:42.541302 7f16a50a2700 10 mds.0.journaler(ro)
>> wait_for_readable at 31935429183 onreadable 0x7f16a50a1d40
>> ----------------------------------------------------------------------------------------
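>>
>> As a sanity check on the _prefetch line above, the numbers are
>> self-consistent: the adjustment rounds the prefetch target up to the
>> next 4 MiB object boundary. A quick C++ check (a sketch reproducing the
>> arithmetic; the variable names are mine, not Ceph's):
>>
>> #include <cassert>
>> #include <cstdint>
>>
>> int main() {
>>     const uint64_t period    = 4194304ULL;      // 4 MiB journal objects
>>     const uint64_t read_pos  = 31935429183ULL;  // from the log above
>>     const uint64_t fetch_len = 41943040ULL;     // 40 MiB readahead
>>
>>     // Round read_pos + fetch_len up to an object boundary.
>>     uint64_t raw_target = read_pos + fetch_len;
>>     uint64_t remainder  = raw_target % period;
>>     uint64_t adjustment = remainder ? period - remainder : 0;
>>     uint64_t target     = raw_target + adjustment;
>>
>>     assert(adjustment == 1473);        // matches the log
>>     assert(target == 31977373696ULL);  // matches the log
>>
>>     // No read is issued anyway: requested_pos already equals
>>     // write_pos (31935430656), so read_pos stays at 31935429183.
>>     return 0;
>> }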
>>
>
> This looks like http://tracker.ceph.com/issues/13167. Which version of
> Ceph are you using?

I am using ceph-0.94.1. I will try to merge your fix.

Thanks.

>
> Regards
> Yan, Zheng
>
>>
>> Among the MDS log objects, 200.00001dbd is the latest one; its size is
>> 4192831 bytes. The write_pos recorded in the 200.00000000 header object
>> is 31935430656.
>>
>> The strange thing is the modification time of these two objects: the
>> 200.00000000 object's mtime is older than the 200.00001dbd object's.
>> Does this mean the MDS updates the log metadata first and then writes
>> the log data? In our case, did writing the log data to 200.00001dbd
>> somehow fail? (A rough model of the two objects follows the listing
>> below.)
>>
>> ----------------------------------------------------------------------------------------
>> [ceph@xxx /data/lib/ceph/osd/ceph-0/current/2.94_head/DIR_4/DIR_9/DIR_4]$ ls -lh
>> total 4.0K
>> -rw-r--r-- 1 root root 90 Jan  8 02:06 200.00000000__head_844F3494__2
>>
>> [ceph@xxx /data/lib/ceph/osd/ceph-0/current/2.53_head/DIR_3/DIR_5/DIR_2]$ ls -l
>> total 4096
>> -rw-r--r-- 1 root root 4192831 Jan  8 02:09 200.00001dbd__head_CCFA8253__2
>> ----------------------------------------------------------------------------------------
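>>
>> For context, here is a rough model of the two kinds of journal objects
>> involved (an assumed, simplified layout for illustration; the real
>> header encoding lives in the Journaler code in the Ceph source):
>>
>> #include <cstdint>
>>
>> // 200.00000000 - the journal header object: it records positions,
>> // not entry data (field set assumed from this discussion).
>> struct JournalHeaderModel {
>>     uint64_t trimmed_pos;
>>     uint64_t expire_pos;
>>     uint64_t write_pos;   // 31935430656 in this incident
>> };
>>
>> // 200.00000001 .. 200.00001dbd - fixed-size 4 MiB data objects holding
>> // the journal entries. Object 0x1dbd covers bytes
>> // [31931236352, 31935430656), but only 4192831 of its 4194304 bytes
>> // exist on disk, so the tail 1473 bytes promised by the header are gone.
>>
>> // Note: the header is flushed on its own schedule, independently of
>> // the data writes, so the two mtimes alone cannot prove a per-entry
>> // "metadata first, data second" ordering.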
>>
>>
>> Regards,
>> Zhi Zhang (David)
>> Contact: zhang.david2011@xxxxxxxxx
>>               zhangz.david@xxxxxxxxxxx