Hi Greg,

No, unfortunately I haven't found any resolution to it. We are using CephFS;
the whole installation is on 0.94.4. What I did notice is that performance is
extremely poor while backfilling is happening. I wonder whether timeouts of
some kind could cause PGs to get stuck in replay. I lowered the 'osd max
backfills' parameter today from the default 10 all the way down to 1 to see
if it improves things (the commands I used are sketched further down, under
the quoted 'pg dump' output). Client read/write performance has definitely
improved since then; whether it also fixes the stuck-in-replay situation, I'm
still waiting to see.

Andras


On 10/27/15, 2:06 PM, "Gregory Farnum" <gfarnum@xxxxxxxxxx> wrote:

> On Tue, Oct 27, 2015 at 11:03 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>> On Thu, Oct 22, 2015 at 3:58 PM, Andras Pataki
>> <apataki@xxxxxxxxxxxxxxxxxxxx> wrote:
>>> Hi ceph users,
>>>
>>> We've upgraded to 0.94.4 (all ceph daemons got restarted) and are in the
>>> middle of doing some rebalancing due to crush changes (removing some
>>> disks). During the rebalance, I see that some placement groups get stuck
>>> in 'active+clean+replay' for a long time (essentially until I restart
>>> the OSD they are on). All IO for these PGs gets queued, and clients hang.
>>>
>>> 'ceph health detail' shows the blocked ops:
>>>
>>> 4 ops are blocked > 2097.15 sec
>>> 1 ops are blocked > 131.072 sec
>>> 2 ops are blocked > 2097.15 sec on osd.41
>>> 2 ops are blocked > 2097.15 sec on osd.119
>>> 1 ops are blocked > 131.072 sec on osd.124
>>>
>>> ceph pg dump | grep replay
>>> dumped all in format plain
>>> 2.121b 3836 0 0 0 0 15705994377 3006 3006 active+clean+replay 2015-10-22 14:12:01.104564 123840'2258640 125080:1252265 [41,111] 41 [41,111] 41 114515'2258631 2015-10-20 18:44:09.757620 114515'2258631 2015-10-20 18:44:09.757620
>>> 2.b4 3799 0 0 0 0 15604827445 3003 3003 active+clean+replay 2015-10-22 13:57:25.490150 119558'2322127 125084:1174759 [119,75] 119 [119,75] 119 114515'2322124 2015-10-20 11:00:51.448239 114515'2322124 2015-10-17 09:22:14.676006
>>>
>>> Both osd.41 and osd.119 are doing this "replay".
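
A quick inline note for anyone hitting the same thing: the "ops are blocked"
lines in 'ceph health detail' only tell you which OSD is affected; to see
what the individual ops are actually waiting on, you can ask the OSD over
its admin socket. This is a sketch from memory, assuming the default admin
socket path and that you run it on the host that carries the OSD:

    # ops currently in flight on osd.41, with their age and current state
    ceph daemon osd.41 dump_ops_in_flight

    # recently completed/slow ops, with the full event timeline for each
    ceph daemon osd.41 dump_historic_ops

The same works for osd.119. It does not fix anything; it just shows where
each op is parked while the PG sits in replay.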
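
And for completeness, since I mention it at the top of this mail: this is
roughly how I applied the 'osd max backfills' change. Treat it as a sketch
rather than an exact transcript of what I ran, and adapt it to your own
setup:

    # push the new value into all running OSDs; injectargs takes effect
    # immediately, without restarting the daemons
    ceph tell osd.* injectargs '--osd-max-backfills 1'

and, so that the setting survives OSD restarts, in ceph.conf:

    [osd]
        osd max backfills = 1

Going from 10 concurrent backfills per OSD down to 1 makes the rebalance
take longer, but it leaves far more disk time for client IO, which matches
what we are seeing here.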
>>>
>>> The end of the log of osd.41:
>>>
>>> 2015-10-22 10:44:35.727000 7f037929b700 0 -- 10.4.36.105:6827/98624 >> 10.4.36.170:6913/121602 pipe(0x3b4d0000 sd=125 :6827 s=2 pgs=618 cs=1 l=0 c=0x374398c0).fault with nothing to send, going to standby
>>> 2015-10-22 10:50:25.954404 7f038adae700 0 -- 10.4.36.105:6827/98624 >> 10.4.36.105:6809/141110 pipe(0x3adff000 sd=229 :6827 s=2 pgs=94 cs=3 l=0 c=0x3e9d0940).fault with nothing to send, going to standby
>>> 2015-10-22 12:11:28.029214 7f03a0e0d700 0 -- 10.4.36.105:6827/98624 >> 10.4.36.106:6864/102556 pipe(0x40afe000 sd=621 :6827 s=2 pgs=91 cs=3 l=0 c=0x3acf5860).fault with nothing to send, going to standby
>>> 2015-10-22 12:45:45.404765 7f038050d700 0 -- 10.4.36.105:6827/98624 >> 10.4.36.102:6837/77957 pipe(0x39cbe000 sd=578 :6827 s=0 pgs=0 cs=0 l=0 c=0x37b3cec0).accept connect_seq 1 vs existing 1 state standby
>>> 2015-10-22 12:45:45.405232 7f038050d700 0 -- 10.4.36.105:6827/98624 >> 10.4.36.102:6837/77957 pipe(0x39cbe000 sd=578 :6827 s=0 pgs=0 cs=0 l=0 c=0x37b3cec0).accept connect_seq 2 vs existing 1 state standby
>>> 2015-10-22 12:52:49.062752 7f036525c700 0 -- 10.4.36.105:6827/98624 >> 10.4.36.105:6809/141110 pipe(0x3f637000 sd=405 :6827 s=0 pgs=0 cs=0 l=0 c=0x37b3ba20).accept connect_seq 3 vs existing 3 state standby
>>> 2015-10-22 12:52:49.063169 7f036525c700 0 -- 10.4.36.105:6827/98624 >> 10.4.36.105:6809/141110 pipe(0x3f637000 sd=405 :6827 s=0 pgs=0 cs=0 l=0 c=0x37b3ba20).accept connect_seq 4 vs existing 3 state standby
>>> 2015-10-22 13:02:16.573546 7f038050d700 0 -- 10.4.36.105:6827/98624 >> 10.4.36.102:6837/77957 pipe(0x39cbe000 sd=578 :6827 s=2 pgs=110 cs=3 l=0 c=0x37b92000).fault with nothing to send, going to standby
>>> 2015-10-22 13:07:58.667432 7f036525c700 0 -- 10.4.36.105:6827/98624 >> 10.4.36.105:6809/141110 pipe(0x3f637000 sd=405 :6827 s=2 pgs=146 cs=5 l=0 c=0x3e9d0940).fault with nothing to send, going to standby
>>> 2015-10-22 13:25:35.020722 7f038191a700 0 -- 10.4.36.105:6827/98624 >> 10.4.36.111:6841/71447 pipe(0x3e78e000 sd=205 :6827 s=2 pgs=82 cs=3 l=0 c=0x36bf5860).fault with nothing to send, going to standby
>>> 2015-10-22 13:45:48.610068 7f0361620700 0 -- 10.4.36.105:6827/98624 >> 10.4.36.105:6841/99063 pipe(0x3e43b000 sd=539 :6827 s=0 pgs=0 cs=0 l=0 c=0x373e11e0).accept we reset (peer sent cseq 1), sending RESETSESSION
>>> 2015-10-22 13:45:48.880698 7f0361620700 0 -- 10.4.36.105:6827/98624 >> 10.4.36.105:6841/99063 pipe(0x3e43b000 sd=539 :6827 s=2 pgs=199 cs=1 l=0 c=0x373e11e0).reader missed message?
>>> skipped from seq 0 to 825623574
>>> 2015-10-22 14:11:32.967937 7f035d9e4700 0 -- 10.4.36.105:6827/98624 >> 10.4.36.105:6802/98037 pipe(0x3ce82000 sd=63 :43711 s=2 pgs=144 cs=3 l=0 c=0x3bf8c100).fault with nothing to send, going to standby
>>> 2015-10-22 14:12:35.338635 7f03afffb700 0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 30.079053 secs
>>> 2015-10-22 14:12:35.338875 7f03afffb700 0 log_channel(cluster) log [WRN] : slow request 30.079053 seconds old, received at 2015-10-22 14:12:05.259156: osd_op(client.734338.0:50618164 100000b8f73.000003ef [read 0~65536] 2.338a921b ack+read+known_if_redirected e124995) currently waiting for replay end
>>> 2015-10-22 14:12:35.339050 7f03afffb700 0 log_channel(cluster) log [WRN] : slow request 30.063995 seconds old, received at 2015-10-22 14:12:05.274213: osd_op(client.734338.0:50618166 100000b8f73.000003ef [read 65536~131072] 2.338a921b ack+read+known_if_redirected e124995) currently waiting for replay end
>>> 2015-10-22 14:13:11.817243 7f03afffb700 0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 66.557970 secs
>>> 2015-10-22 14:13:11.817408 7f03afffb700 0 log_channel(cluster) log [WRN] : slow request 66.557970 seconds old, received at 2015-10-22 14:12:05.259156: osd_op(client.734338.0:50618164 100000b8f73.000003ef [read 0~65536] 2.338a921b ack+read+known_if_redirected e124995) currently waiting for replay end
>>> 2015-10-22 14:13:11.817459 7f03afffb700 0 log_channel(cluster) log [WRN] : slow request 66.542913 seconds old, received at 2015-10-22 14:12:05.274213: osd_op(client.734338.0:50618166 100000b8f73.000003ef [read 65536~131072] 2.338a921b ack+read+known_if_redirected e124995) currently waiting for replay end
>>> 2015-10-22 14:14:07.789899 7f03afffb700 0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 122.530339 secs
>>> 2015-10-22 14:14:07.790050 7f03afffb700 0 log_channel(cluster) log [WRN] : slow request 122.530339 seconds old, received at 2015-10-22 14:12:05.259156: osd_op(client.734338.0:50618164 100000b8f73.000003ef [read 0~65536] 2.338a921b ack+read+known_if_redirected e124995) currently waiting for replay end
>>> 2015-10-22 14:14:07.790234 7f03afffb700 0 log_channel(cluster) log [WRN] : slow request 122.515282 seconds old, received at 2015-10-22 14:12:05.274213: osd_op(client.734338.0:50618166 100000b8f73.000003ef [read 65536~131072] 2.338a921b ack+read+known_if_redirected e124995) currently waiting for replay end
>>> 2015-10-22 14:16:08.916570 7f03afffb700 0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 243.657192 secs
>>> 2015-10-22 14:16:08.916766 7f03afffb700 0 log_channel(cluster) log [WRN] : slow request 243.657192 seconds old, received at 2015-10-22 14:12:05.259156: osd_op(client.734338.0:50618164 100000b8f73.000003ef [read 0~65536] 2.338a921b ack+read+known_if_redirected e124995) currently waiting for replay end
>>> 2015-10-22 14:16:08.916789 7f03afffb700 0 log_channel(cluster) log [WRN] : slow request 243.642134 seconds old, received at 2015-10-22 14:12:05.274213: osd_op(client.734338.0:50618166 100000b8f73.000003ef [read 65536~131072] 2.338a921b ack+read+known_if_redirected e124995) currently waiting for replay end
>>> 2015-10-22 14:20:06.142134 7f03afffb700 0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 480.882700 secs
>>> 2015-10-22 14:20:06.142151 7f03afffb700 0 log_channel(cluster) log [WRN] : slow request 480.882700 seconds old, received at 2015-10-22 14:12:05.259156: osd_op(client.734338.0:50618164 100000b8f73.000003ef [read 0~65536] 2.338a921b ack+read+known_if_redirected e124995) currently waiting for replay end
>>> 2015-10-22 14:20:06.142410 7f03afffb700 0 log_channel(cluster) log [WRN] : slow request 480.867642 seconds old, received at 2015-10-22 14:12:05.274213: osd_op(client.734338.0:50618166 100000b8f73.000003ef [read 65536~131072] 2.338a921b ack+read+known_if_redirected e124995) currently waiting for replay end
>>> 2015-10-22 14:28:05.979668 7f03afffb700 0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 960.720279 secs
>>> 2015-10-22 14:28:05.979684 7f03afffb700 0 log_channel(cluster) log [WRN] : slow request 960.720279 seconds old, received at 2015-10-22 14:12:05.259156: osd_op(client.734338.0:50618164 100000b8f73.000003ef [read 0~65536] 2.338a921b ack+read+known_if_redirected e124995) currently waiting for replay end
>>> 2015-10-22 14:28:05.979720 7f03afffb700 0 log_channel(cluster) log [WRN] : slow request 960.705221 seconds old, received at 2015-10-22 14:12:05.274213: osd_op(client.734338.0:50618166 100000b8f73.000003ef [read 65536~131072] 2.338a921b ack+read+known_if_redirected e124995) currently waiting for replay end
>>>
>>> The process seems to be pretty busy (using about 20% of a CPU core or
>>> so), i.e. definitely not locked up.
>>>
>>> When I restart it, the problem seems to resolve itself. However, the
>>> shutdown of the OSD takes a long time (half an hour or more).
>>> During this time the osd is spewing megabytes of messages like:
>>>
>>> 2015-10-22 14:40:52.225902 7f03a83cc700 10 filestore(/var/lib/ceph/osd/ceph-41) remove 2.ad4_head/6d49ead4/10000022ccf.000036f0/head//2 = 0
>>> 2015-10-22 14:40:52.225919 7f03a83cc700 15 filestore(/var/lib/ceph/osd/ceph-41) remove 2.ad4_head/5959ead4/10000147a83.00000807/head//2
>>> 2015-10-22 14:40:52.243392 7f03a83cc700 20 filestore(/var/lib/ceph/osd/ceph-41) lfn_unlink: clearing omap on 5959ead4/10000147a83.00000807/head//2 in cid 2.ad4_head
>>> 2015-10-22 14:40:52.252792 7f03a83cc700 10 filestore(/var/lib/ceph/osd/ceph-41) remove 2.ad4_head/5959ead4/10000147a83.00000807/head//2 = 0
>>> 2015-10-22 14:40:52.252809 7f03a83cc700 15 filestore(/var/lib/ceph/osd/ceph-41) remove 2.ad4_head/ff59ead4/10000022e89.0000af9c/head//2
>>> 2015-10-22 14:40:52.252897 7f03a83cc700 20 filestore(/var/lib/ceph/osd/ceph-41) lfn_unlink: clearing omap on ff59ead4/10000022e89.0000af9c/head//2 in cid 2.ad4_head
>>> 2015-10-22 14:40:52.253132 7f03a83cc700 10 filestore oid: ff59ead4/10000022e89.0000af9c/head//2 not skipping op, *spos 11788505.0.14
>>> 2015-10-22 14:40:52.253142 7f03a83cc700 10 filestore > header.spos 0.0.0
>>> 2015-10-22 14:40:52.253145 7f03a83cc700 20 filestore remove_map_header: removing 1604853 oid ff59ead4/10000022e89.0000af9c/head//2
>>> 2015-10-22 14:40:52.253163 7f03a83cc700 20 filestore clear_header: clearing seq 1604853
>>> 2015-10-22 14:40:52.253828 7f03a83cc700 10 filestore(/var/lib/ceph/osd/ceph-41) remove 2.ad4_head/ff59ead4/10000022e89.0000af9c/head//2 = 0
>>> 2015-10-22 14:40:52.253847 7f03a83cc700 15 filestore(/var/lib/ceph/osd/ceph-41) remove 2.ad4_head/f379ead4/100000b1bc9.00000146/head//2
>>> 2015-10-22 14:40:52.253948 7f03a83cc700 20 filestore(/var/lib/ceph/osd/ceph-41) lfn_unlink: clearing omap on f379ead4/100000b1bc9.00000146/head//2 in cid 2.ad4_head
>>> 2015-10-22 14:40:52.254195 7f03a83cc700 10 filestore oid: f379ead4/100000b1bc9.00000146/head//2 not skipping op, *spos 11788505.0.15
>>> 2015-10-22 14:40:52.254206 7f03a83cc700 10 filestore > header.spos 0.0.0
>>> 2015-10-22 14:40:52.254210 7f03a83cc700 20 filestore remove_map_header: removing 1604854 oid f379ead4/100000b1bc9.00000146/head//2
>>> 2015-10-22 14:40:52.254229 7f03a83cc700 20 filestore clear_header: clearing seq 1604854
>>> 2015-10-22 14:40:52.254953 7f03a83cc700 10 filestore(/var/lib/ceph/osd/ceph-41) remove 2.ad4_head/f379ead4/100000b1bc9.00000146/head//2 = 0
>>> 2015-10-22 14:40:52.254972 7f03a83cc700 15 filestore(/var/lib/ceph/osd/ceph-41) remove 2.ad4_head/7489ead4/10000022af1.00007093/head//2
>>> 2015-10-22 14:40:52.255086 7f03a83cc700 20 filestore(/var/lib/ceph/osd/ceph-41) lfn_unlink: clearing omap on 7489ead4/10000022af1.00007093/head//2 in cid 2.ad4_head
>>> 2015-10-22 14:40:52.255318 7f03a83cc700 10 filestore oid: 7489ead4/10000022af1.00007093/head//2 not skipping op, *spos 11788505.0.16
>>>
>>> Seemingly it is doing some very large cleanup before shutting down.
>>>
>>> Today I've had something like a dozen such OSDs, out of about a hundred
>>> that we have on the cluster. Any ideas what might be going on here? On
>>> some of these I've waited a few hours, but it doesn't seem to resolve
>>> itself. Should I wait even longer?
>>
>> Have you learned anything else about this? Replay should definitely
>> not be taking that long. :(
>>
>> Do I correctly assume you're using CephFS on top of this pool?
>
> Oh, and what version are you running?

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com