We are running with a replica count (size) of 2, and min_size is also
2. A small amount of data is still sitting around from when we were
running the default size of 3.
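To make that setting concrete: as far as I understand it, a placement group only serves I/O while the number of available replicas is at least min_size, so with size 2 and min_size 2 there is no headroom for even one replica to be missing. A rough Python sketch of that rule follows. The function name `pg_accepts_io` is made up for illustration and is not a Ceph API; this models the rule only, not what the OSDs actually do internally.

```python
def pg_accepts_io(available_replicas: int, min_size: int) -> bool:
    """Illustrative rule only (hypothetical helper, not a Ceph API):
    a PG serves I/O while it has at least min_size replicas available."""
    return available_replicas >= min_size


if __name__ == "__main__":
    size, min_size = 2, 2  # the pool settings described in this post
    for replicas_down in range(size + 1):
        available = size - replicas_down
        allowed = pg_accepts_io(available, min_size)
        print(f"{replicas_down} replica(s) down -> I/O allowed: {allowed}")
```

With the old default of size 3 and min_size 2, the same check would still pass with one replica missing.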
Looks like the problem started around here:

2017-06-22 14:54:29.173982 7f3c39f6f700 0 log_channel(cluster) log [INF] : 1.2c9 deep-scrub ok
2017-06-22 14:54:29.690401 7f3c6e03d700 -1 osd.13 25313 heartbeat_check: no reply from osd.8 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:09.690398)
2017-06-22 14:54:29.690423 7f3c6e03d700 -1 osd.13 25313 heartbeat_check: no reply from osd.10 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:09.690398)
2017-06-22 14:54:29.690429 7f3c6e03d700 -1 osd.13 25313 heartbeat_check: no reply from osd.11 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:09.690398)
2017-06-22 14:54:29.907210 7f3c3776a700 -1 osd.13 25313 heartbeat_check: no reply from osd.8 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:09.907207)
2017-06-22 14:54:29.907221 7f3c3776a700 -1 osd.13 25313 heartbeat_check: no reply from osd.10 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:09.907207)
2017-06-22 14:54:29.907227 7f3c3776a700 -1 osd.13 25313 heartbeat_check: no reply from osd.11 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:09.907207)
2017-06-22 14:54:30.690551 7f3c6e03d700 -1 osd.13 25313 heartbeat_check: no reply from osd.8 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:10.690548)
2017-06-22 14:54:30.690573 7f3c6e03d700 -1 osd.13 25313 heartbeat_check: no reply from osd.10 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:10.690548)
2017-06-22 14:54:30.690579 7f3c6e03d700 -1 osd.13 25313 heartbeat_check: no reply from osd.11 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:10.690548)
2017-06-22 14:54:31.690708 7f3c6e03d700 -1 osd.13 25313 heartbeat_check: no reply from osd.8 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:11.690706)
2017-06-22 14:54:31.690729 7f3c6e03d700 -1 osd.13 25313 heartbeat_check: no reply from osd.10 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:11.690706)
2017-06-22 14:54:31.690735 7f3c6e03d700 -1 osd.13 25313 heartbeat_check: no reply from osd.11 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:11.690706)
2017-06-22 14:54:32.690862 7f3c6e03d700 -1 osd.13 25313 heartbeat_check: no reply from osd.8 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:12.690860)
2017-06-22 14:54:32.690884 7f3c6e03d700 -1 osd.13 25313 heartbeat_check: no reply from osd.10 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:12.690860)
2017-06-22 14:54:32.690890 7f3c6e03d700 -1 osd.13 25313 heartbeat_check: no reply from osd.11 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:12.690860)
2017-06-22 14:54:32.955768 7f3c5675c700 0 -- 172.16.31.7:6805/7128624 >> 172.16.31.3:6804/54002870 pipe(0x7f3ca7475400 sd=116 :6805 s=2 pgs=15531 cs=1 l=0 c=0x7f3c935ee700).fault with nothing to send, going to standby
2017-06-22 14:54:32.958675 7f3c2ea0e700 0 -- 172.16.31.7:0/2128624 >> 172.16.31.3:6808/54002870 pipe(0x7f3c9c150000 sd=189 :0 s=1 pgs=0 cs=0 l=1 c=0x7f3c97726880).fault
2017-06-22 14:54:32.958712 7f3c2c3e8700 0 -- 172.16.31.7:0/2128624 >> 172.16.31.3:6810/54002870 pipe(0x7f3ca1727400 sd=233 :0 s=1 pgs=0 cs=0 l=1 c=0x7f3c9cb16300).fault
2017-06-22 14:54:34.176427 7f3c33a5e700 0 -- 172.16.31.7:6805/7128624 >> 172.16.31.3:6800/55002870 pipe(0x7f3c99679400 sd=216 :6805 s=0 pgs=0 cs=0 l=0 c=0x7f3c9532d200).accept connect_seq 0 vs existing 0 state connecting
2017-06-22 14:54:34.545873 7f3c3ef79700 0 log_channel(cluster) log [INF] : 2.1b5 continuing backfill to osd.30 from (25014'10407450,25294'10411861] MIN to 25294'10411861
2017-06-22 14:54:34.546531 7f3c3e778700 0 log_channel(cluster) log [INF] : 2.145 continuing backfill to osd.30 from (25014'10399385,25294'10404028] MIN to 25294'10404028
2017-06-22 14:54:34.546551 7f3c43782700 0 log_channel(cluster) log [INF] : 1.2b3 continuing backfill to osd.30 from (24856'173854,25294'177823] MIN to 25294'177823
2017-06-22 14:54:57.873097 7f3c27763700 0 -- 172.16.31.7:6805/7128624 >> 172.16.31.4:6803/57002857 pipe(0x7f3c95e0b400 sd=188 :6805 s=0 pgs=0 cs=0 l=0 c=0x7f3c9fc71f80).accept we reset (peer sent cseq 1), sending RESETSESSION
2017-06-22 14:54:57.874965 7f3c27763700 0 -- 172.16.31.7:6805/7128624 >> 172.16.31.4:6803/57002857 pipe(0x7f3c95e0b400 sd=188 :6805 s=2 pgs=15769 cs=1 l=0 c=0x7f3c9fc71f80).reader missed message? skipped from seq 0 to 1739054688
2017-06-22 14:54:57.875902 7f3c27763700 0 -- 172.16.31.7:6805/7128624 >> 172.16.31.4:6803/57002857 pipe(0x7f3c9af11400 sd=188 :6805 s=0 pgs=0 cs=0 l=0 c=0x7f3c9fc72e80).accept we reset (peer sent cseq 2), sending RESETSESSION
2017-06-22 14:54:57.878969 7f3c27763700 0 -- 172.16.31.7:6805/7128624 >> 172.16.31.4:6803/57002857 pipe(0x7f3c9af11400 sd=188 :6805 s=2 pgs=15771 cs=1 l=0 c=0x7f3c9fc72e80).reader missed message? skipped from seq 0 to 2095419103
2017-06-22 14:54:57.880075 7f3c27763700 0 -- 172.16.31.7:6805/7128624 >> 172.16.31.4:6803/57002857 pipe(0x7f3c9af10000 sd=188 :6805 s=0 pgs=0 cs=0 l=0 c=0x7f3c87f3b480).accept we reset (peer sent cseq 2), sending RESETSESSION
2017-06-22 14:54:57.880781 7f3c27763700 0 -- 172.16.31.7:6805/7128624 >> 172.16.31.4:6803/57002857 pipe(0x7f3c9af10000 sd=188 :6805 s=2 pgs=15772 cs=1 l=0 c=0x7f3c87f3b480).reader missed message? skipped from seq 0 to 1022945821
2017-06-22 14:54:57.881842 7f3c27763700 0 -- 172.16.31.7:6805/7128624 >> 172.16.31.4:6803/57002857 pipe(0x7f3c99679400 sd=188 :6805 s=0 pgs=0 cs=0 l=0 c=0x7f3c91431a80).accept we reset (peer sent cseq 2), sending RESETSESSION
2017-06-22 14:54:57.902933 7f3c27763700 0 -- 172.16.31.7:6805/7128624 >> 172.16.31.4:6803/57002857 pipe(0x7f3c99679400 sd=188 :6805 s=2 pgs=15773 cs=1 l=0 c=0x7f3c91431a80).fault with nothing to send, going to standby
2017-06-22 14:56:52.538631 7f3c6e03d700 0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 31.862172 secs
2017-06-22 14:56:52.538641 7f3c6e03d700 0 log_channel(cluster) log [WRN] : slow request 31.665672 seconds old, received at 2017-06-22 14:56:20.872915: osd_op(client.365488.1:6212453 2.debad545 10009cc83a6.00000666 [read 0~4194304 [1@-1]] snapc 0=[] ack+read+known_if_redirected e25348) currently waiting for active
2017-06-22 14:56:52.538646 7f3c6e03d700 0 log_channel(cluster) log [WRN] : slow request 31.862172 seconds old, received at 2017-06-22 14:56:20.676415: osd_op(client.365488.1:6212450 2.f781b45 10009cc83a6.00000664 [read 0~4194304 [1@-1]] snapc 0=[] ack+read+known_if_redirected e25348) currently waiting for active
2017-06-22 14:57:18.140672 7f3c6e03d700 0 log_channel(cluster) log [WRN] : 3 slow requests, 1 included below; oldest blocked for > 57.464203 secs
2017-06-22 14:57:18.140683 7f3c6e03d700 0 log_channel(cluster) log [WRN] : slow request 30.554865 seconds old, received at 2017-06-22 14:56:47.585754: osd_op(client.364255.1:1681646 2.b387afb5 1000a234aea.00000136 [write 0~4194304 [1@-1]] snapc 1=[] ondisk+write+known_if_redirected e25351) currently waiting for active

On 06/23/2017 03:28 PM, David Turner wrote: