Regards, Zhi Zhang (David) Contact: zhang.david2011@xxxxxxxxx zhangz.david@xxxxxxxxxxx On Mon, Jan 11, 2016 at 5:06 PM, Yan, Zheng <ukernel@xxxxxxxxx> wrote: > On Mon, Jan 11, 2016 at 11:33 AM, Zhi Zhang <zhang.david2011@xxxxxxxxx> wrote: >> Hi Guys, >> >> Last week one of our clusters became unavailable because MDSes always >> stuck on replay state. >> >> From the log, we found some clues: >> >> MDS wants to replay log from 31885100783 to 31935430656. The last >> prefetch expects 4194304 bytes from one mds log. At this point, the >> requested_pos is increased to write_pos which is 31935430656, so there >> will be no prefetch any more. >> >> But this mds log (already the latest one) is only 4192831 bytes, so >> finally read_pos can only reach 31935429183, which is less than >> expected write_pos. MDS wants to read more for replay, but there is no >> mds log can be read, hence MDS is stuck forever. >> >> See below mds logs: >> ---------------------------------------------------------------------------------------- >> 2016-01-08 13:26:34.468773 7f16a78a7700 10 mds.0.log replay start, >> from 31885100783 to 31935430656 >> >> 2016-01-08 13:26:35.164194 7f16a50a2700 10 mds.0.journaler(ro) >> _issue_read reading 31931236352~4194304, read pointers >> 31889294570/31931236352/31935430656 >> 2016-01-08 13:26:35.164219 7f16a50a2700 10 mds.0.objecter _op_submit >> op 0x536ba80 >> 2016-01-08 13:26:35.164235 7f16a50a2700 10 mds.0.objecter _calc_target >> pgid 2.ccfa8253 acting [0,6] >> 2016-01-08 13:26:35.164241 7f16a50a2700 20 mds.0.objecter _get_session >> s=0x53b0f00 osd=0 2 >> 2016-01-08 13:26:35.164243 7f16a50a2700 20 mds.0.objecter note: not >> requesting commit >> 2016-01-08 13:26:35.164256 7f16a50a2700 10 mds.0.objecter _op_submit >> oid 200.00001dbd @2 @2 [read 0~4194304] tid 0 osd.0 >> >> 2016-01-08 13:26:35.172434 7f16a98ab700 10 mds.0.objecter ms_dispatch >> 0x5370000 osd_op_reply(22 200.00001dbd [read 0~4192831] v0'0 uv42142 >> ondisk = 0) v6 >> 2016-01-08 13:26:35.172493 7f16a98ab700 10 mds.0.objecter in handle_osd_op_reply >> >> 2016-01-08 13:26:35.172832 7f16a78a7700 10 mds.0.journaler(ro) >> _finish_read got 31931236352~4192831 >> 2016-01-08 13:26:35.172852 7f16a78a7700 10 mds.0.journaler(ro) >> _assimilate_prefetch 31931236352~4192831 >> 2016-01-08 13:26:35.172896 7f16a78a7700 10 mds.0.journaler(ro) >> _assimilate_prefetch read_buf now 31889339375~46089808, read pointers >> 31889339375/31935429183/31935430656 >> >> 2016-01-08 13:26:42.541242 7f16a50a2700 10 mds.0.journaler(ro) >> _is_readable read_buf.length() == 0, but need 12 for next entry; >> fetch_len is 41943040 >> 2016-01-08 13:26:42.541251 7f16a50a2700 10 mds.0.journaler(ro) >> _is_readable: not readable, returning false >> 2016-01-08 13:26:42.541261 7f16a50a2700 10 mds.0.journaler(ro) _prefetch >> 2016-01-08 13:26:42.541272 7f16a50a2700 10 mds.0.journaler(ro) >> _prefetch fetch_len=41943040 read_pos=31935429183 adjustment=1473 >> target=31977373696 write_pos=31935430656 >> 2016-01-08 13:26:42.541282 7f16a50a2700 10 mds.0.journaler(ro) >> _is_readable read_buf.length() == 0, but need 12 for next entry; >> fetch_len is 41943040 >> 2016-01-08 13:26:42.541292 7f16a50a2700 10 mds.0.journaler(ro) >> _is_readable: not readable, returning false >> 2016-01-08 13:26:42.541302 7f16a50a2700 10 mds.0.journaler(ro) >> wait_for_readable at 31935429183 onreadable 0x7f16a50a1d40 >> ---------------------------------------------------------------------------------------- >> > > this is like http://tracker.ceph.com/issues/13167. which version ceph > are you using? I am using ceph-0.94.1. I will try to merge your this fix. Thanks. > > Regards > Yan, Zheng > >> >> From mds log object, 200.00001dbd is the latest one whose size is >> 4192831 bytes. The write_pos in 200.00000000 object is 31935430656. >> >> The strange thing is the modified time of this two objects. >> 200.00000000 object's modified time is older than 200.00001dbd >> object's. Does this mean mds update log meta info first, then write >> log data? In our case, writing log data to 200.00001dbd fails somehow? >> >> ---------------------------------------------------------------------------------------- >> [ceph@xxx /data/lib/ceph/osd/ceph-0/current/2.94_head/DIR_4/DIR_9/DIR_4]$ ls -lh >> total 4.0K >> -rw-r--r-- 1 root root 90 Jan 8 02:06 200.00000000__head_844F3494__2 >> >> [ceph@xxx /data/lib/ceph/osd/ceph-0/current/2.53_head/DIR_3/DIR_5/DIR_2]$ ls -l >> total 4096 >> -rw-r--r-- 1 root root 4192831 Jan 8 02:09 200.00001dbd__head_CCFA8253__2 >> ---------------------------------------------------------------------------------------- >> >> >> Regards, >> Zhi Zhang (David) >> Contact: zhang.david2011@xxxxxxxxx >> zhangz.david@xxxxxxxxxxx -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html