Hi,

I sometimes have laggy SSD drives (Intel S3610), maybe a firmware or controller bug. (I have tested different firmware versions; it doesn't help.) When this occurs, I get this kind of error:

Aug 29 15:43:37 ceph5-7 kernel: [447163.801090] sd 0:0:3:0: Power-on or device reset occurred

Just before this, the disk lags for 1-2 minutes and iowait increases a lot. Then the monitor marks the OSD down (at almost the same time as the disk reset; I'm not sure it's related).

Is it possible to reduce the threshold to force the OSD down faster (maybe 30s, for example)? See the sketch after the logs for what I had in mind. (I'm running Ceph Nautilus 14.2.6.)

The OSD log is:

2020-08-29 15:43:37.610 7fa129e17700 0 bluestore(/var/lib/ceph/osd/ceph-43) _txc_state_proc slow aio_wait, txc = 0x5618a5f27200, latency = 93.062018
2020-08-29 15:43:37.610 7fa129e17700 0 bluestore(/var/lib/ceph/osd/ceph-43) _txc_state_proc slow aio_wait, txc = 0x5618a2956c00, latency = 93.031658
2020-08-29 15:43:37.610 7fa1025b7700 -1 bdev(0x5618a06c4000 /var/lib/ceph/osd/ceph-43/block) read stalled read 0x10543123000~1000 (direct) since 447078s, timeout is 5s
2020-08-29 15:43:37.610 7fa1025b7700 0 bluestore(/var/lib/ceph/osd/ceph-43) log_latency_fn slow operation observed for _do_read, latency = 93.0167s, num_ios = 3072
2020-08-29 15:43:37.642 7fa129e17700 0 bluestore(/var/lib/ceph/osd/ceph-43) _txc_state_proc slow aio_wait, txc = 0x56197c0f1200, latency = 92.892911
2020-08-29 15:43:37.642 7fa129e17700 0 bluestore(/var/lib/ceph/osd/ceph-43) _txc_state_proc slow aio_wait, txc = 0x56194151db00, latency = 82.923060
2020-08-29 15:43:37.646 7fa129e17700 0 bluestore(/var/lib/ceph/osd/ceph-43) _txc_state_proc slow aio_wait, txc = 0x56197c0f0900, latency = 92.611975
2020-08-29 15:43:37.646 7fa129e17700 0 bluestore(/var/lib/ceph/osd/ceph-43) _txc_state_proc slow aio_wait, txc = 0x56190ea76000, latency = 92.439439
2020-08-29 15:43:37.646 7fa129e17700 0 bluestore(/var/lib/ceph/osd/ceph-43) _txc_state_proc slow aio_wait, txc = 0x561964172000, latency = 92.861119
2020-08-29 15:43:37.646 7fa0fedb0700 -1 bdev(0x5618a06c4000 /var/lib/ceph/osd/ceph-43/block) read stalled read 0xf10fa5c000~1000 (direct) since 447091s, timeout is 5s
2020-08-29 15:43:37.646 7fa0fd5ad700 -1 bdev(0x5618a06c4000 /var/lib/ceph/osd/ceph-43/block) read stalled read 0xe08a1000~1000 (direct) since 447079s, timeout is 5s
2020-08-29 15:43:37.646 7fa0fedb0700 0 bluestore(/var/lib/ceph/osd/ceph-43) log_latency_fn slow operation observed for _do_read, latency = 79.5633s, num_ios = 2048
2020-08-29 15:43:37.646 7fa0fd5ad700 0 bluestore(/var/lib/ceph/osd/ceph-43) log_latency_fn slow operation observed for _do_read, latency = 92.3464s, num_ios = 18
2020-08-29 15:43:37.646 7fa1045bb700 -1 bdev(0x5618a06c4000 /var/lib/ceph/osd/ceph-43/block) read stalled read 0x103a3da1000~1000 (direct) since 447078s, timeout is 5s
2020-08-29 15:43:37.646 7fa0ffdb2700 -1 bdev(0x5618a06c4000 /var/lib/ceph/osd/ceph-43/block) read stalled read 0x100a4d26000~1000 (direct) since 447078s, timeout is 5s
2020-08-29 15:43:37.646 7fa0ffdb2700 0 bluestore(/var/lib/ceph/osd/ceph-43) log_latency_fn slow operation observed for _do_read, latency = 93.022s, num_ios = 1024
2020-08-29 15:43:37.646 7fa1045bb700 0 bluestore(/var/lib/ceph/osd/ceph-43) log_latency_fn slow operation observed for _do_read, latency = 92.5954s, num_ios = 1536
2020-08-29 15:43:37.646 7fa1005b3700 -1 bdev(0x5618a06c4000 /var/lib/ceph/osd/ceph-43/block) read stalled read 0x1003a505000~1000 (direct) since 447079s, timeout is 5s
2020-08-29 15:43:37.646 7fa101db6700 -1 bdev(0x5618a06c4000 /var/lib/ceph/osd/ceph-43/block) read stalled read 0x1012ea45000~1000 (direct) since 447080s, timeout is 5s
2020-08-29 15:43:37.646 7fa1005b3700 0 bluestore(/var/lib/ceph/osd/ceph-43) log_latency_fn slow operation observed for _do_read, latency = 92.3891s, num_ios = 3584
2020-08-29 15:43:37.646 7fa101db6700 0 bluestore(/var/lib/ceph/osd/ceph-43) log_latency_fn slow operation observed for _do_read, latency = 91.1452s, num_ios = 2048
2020-08-29 15:43:37.646 7fa129e17700 0 bluestore(/var/lib/ceph/osd/ceph-43) _txc_state_proc slow aio_wait, txc = 0x56191dbce300, latency = 82.924723
2020-08-29 15:43:37.646 7fa0fcdac700 -1 bdev(0x5618a06c4000 /var/lib/ceph/osd/ceph-43/block) read stalled read 0xfe7a8b7000~1000 (direct) since 447083s, timeout is 5s
2020-08-29 15:43:37.646 7fa0fcdac700 0 bluestore(/var/lib/ceph/osd/ceph-43) log_latency_fn slow operation observed for _do_read, latency = 87.9135s, num_ios = 512
2020-08-29 15:43:37.646 7fa129e17700 0 bluestore(/var/lib/ceph/osd/ceph-43) _txc_state_proc slow aio_wait, txc = 0x56196a946300, latency = 82.923270
2020-08-29 15:43:37.646 7fa129e17700 0 bluestore(/var/lib/ceph/osd/ceph-43) _txc_state_proc slow aio_wait, txc = 0x5618c70d8f00, latency = 92.344641
2020-08-29 15:43:37.646 7fa129e17700 0 bluestore(/var/lib/ceph/osd/ceph-43) _txc_state_proc slow aio_wait, txc = 0x5619748d1500, latency = 92.333448
d for _txc_committed_kv, latency = 91.2568s, txc = 0x56196ab26f00
2020-08-29 15:43:37.774 7fa10e5cf700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.43 down, but it is still running
.....
.....
2020-08-29 15:43:37.774 7fa10e5cf700 0 log_channel(cluster) log [DBG] : map e604017 wrongly marked me down at e604016
2020-08-29 15:43:39.494 7fa0fd5ad700 -1 bdev(0x5618a06c4000 /var/lib/ceph/osd/ceph-43/block) aio_submit retries 15

monitor logs
---------------

mon1
----
2020-08-29 15:43:09.836 7f3ed3030700 0 log_channel(audit) log [DBG] : from='client.835295911 X.X.0.43:0/3364930077' entity='client.admin' cmd=[{"prefix":"df","format":"json"}]: dispatch
2020-08-29 15:43:10.140 7f3ed5835700 0 log_channel(cluster) log [WRN] : Health check update: 715 slow ops, oldest one blocked for 62 sec, daemons [osd.13,osd.14,osd.15,osd.16,osd.17,osd.20,osd.22,osd.23,osd.26,osd.27]... have slow ops. (SLOW_OPS)

mon2
----
2020-08-29 15:43:20.140 7f3ed5835700 0 log_channel(cluster) log [WRN] : Health check update: 817 slow ops, oldest one blocked for 73 sec, daemons [osd.13,osd.14,osd.15,osd.16,osd.17,osd.20,osd.22,osd.23,osd.26,osd.27]... have slow ops. (SLOW_OPS)

mon3
------
2020-08-29 15:43:08.066 7f3d213c5700 0 mon.ceph5-2@1(peon) e5 handle_command mon_command({"prefix": "osd blacklist", "blacklistop": "add", "addr": "10.5.0.93:0/955438826"} v 0) v1
2020-08-29 15:43:08.066 7f3d213c5700 0 log_channel(audit) log [INF] : from='client.835295776 X.X.0.92:0/2918303586' entity='client.admin' cmd=[{"prefix": "osd blacklist", "blacklistop": "add", "addr": "10.5.0.93:0/955438826"}]: dispatch
2020-08-29 15:43:09.974 7f3d213c5700 0 mon.ceph5-2@1(peon) e5 handle_command mon_command({"format":"json","prefix":"df"} v 0) v1
2020-08-29 15:43:09.974 7f3d213c5700 0 log_channel(audit) log [DBG] : from='client.? X.X.0.44:0/728500154' entity='client.admin' cmd=[{"format":"json","prefix":"df"}]: dispatch
2020-08-29 15:43:10.330 7f3d213c5700 0 mon.ceph5-2@1(peon) e5 handle_command mon_command({"format":"json","prefix":"df"} v 0) v1
2020-08-29 15:43
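
For reference, this is roughly the change I had in mind; it is only a sketch of my understanding, not something I have tested, so please correct me if these are not the right knobs. As far as I can tell, the relevant settings would be osd_heartbeat_grace and mon_osd_adjust_heartbeat_grace (I assume the laggy-history adjustment is why it can take much longer than the default grace):

    # grace period before an unresponsive OSD can be reported/marked down
    # (default is 20s as far as I know, yet here it took ~90s)
    ceph config set global osd_heartbeat_grace 10

    # stop the monitors from inflating the grace for OSDs with a laggy
    # history (default true) -- my assumption that this is what matters here
    ceph config set mon mon_osd_adjust_heartbeat_grace false

Is that the right direction, or is there a better/safer way to get a stalled OSD marked down faster?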