How to reduce the OSD down interval on a laggy disk?

Hi,

I sometimes have laggy SSD drives (Intel S3610); maybe a firmware or controller bug.
(I have tested different firmware versions; it doesn't help.)

When this occurs,
I get this kind of error:

Aug 29 15:43:37 ceph5-7 kernel: [447163.801090] sd 0:0:3:0: Power-on or device reset occurred

Just before this,
the disk lags for 1-2 minutes and iowait increases a lot.
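For what it's worth, this is roughly how I watch the stall while it happens (sdX is just a placeholder for the affected device):

  iostat -x 1 /dev/sdX        # %util sits at 100% and await climbs while the disk stalls
  dmesg -wT | grep -i reset   # follow the kernel log to catch the device reset message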

Then the monitor marks it down (at almost the same time as the disk reset; I'm not sure whether that's related).

Is it possible to reduce the threshold so the OSD is marked down faster? (Maybe 30s, for example.)

(I'm running Ceph Nautilus 14.2.6.)
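For reference, the settings I was thinking of tuning are the heartbeat ones, but I'm not sure they are the right knobs for this case, so treat the sketch below as untested:

  # ceph.conf sketch (untested); osd_heartbeat_grace is used by both the OSDs and
  # the monitors, so I'd set it in [global] (or in both [osd] and [mon]):
  [global]
  osd_heartbeat_interval = 3              # default 6s between peer heartbeats
  osd_heartbeat_grace = 10                # default 20s before peers report an OSD down

  [mon]
  mon_osd_adjust_heartbeat_grace = false  # stop the monitor from stretching the grace
                                          # period for OSDs it already considers laggy

  # or at runtime, without restarting:
  ceph tell osd.* injectargs '--osd_heartbeat_interval 3 --osd_heartbeat_grace 10'
  ceph tell mon.* injectargs '--mon_osd_adjust_heartbeat_grace false'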




The OSD log is:

2020-08-29 15:43:37.610 7fa129e17700  0 bluestore(/var/lib/ceph/osd/ceph-43) _txc_state_proc slow aio_wait, txc = 0x5618a5f27200, latency = 93.062018
2020-08-29 15:43:37.610 7fa129e17700  0 bluestore(/var/lib/ceph/osd/ceph-43) _txc_state_proc slow aio_wait, txc = 0x5618a2956c00, latency = 93.031658
2020-08-29 15:43:37.610 7fa1025b7700 -1 bdev(0x5618a06c4000 /var/lib/ceph/osd/ceph-43/block) read stalled read  0x10543123000~1000 (direct) since 447078s, timeout is 5s
2020-08-29 15:43:37.610 7fa1025b7700  0 bluestore(/var/lib/ceph/osd/ceph-43) log_latency_fn slow operation observed for _do_read, latency = 93.0167s, num_ios = 3072
2020-08-29 15:43:37.642 7fa129e17700  0 bluestore(/var/lib/ceph/osd/ceph-43) _txc_state_proc slow aio_wait, txc = 0x56197c0f1200, latency = 92.892911
2020-08-29 15:43:37.642 7fa129e17700  0 bluestore(/var/lib/ceph/osd/ceph-43) _txc_state_proc slow aio_wait, txc = 0x56194151db00, latency = 82.923060
2020-08-29 15:43:37.646 7fa129e17700  0 bluestore(/var/lib/ceph/osd/ceph-43) _txc_state_proc slow aio_wait, txc = 0x56197c0f0900, latency = 92.611975
2020-08-29 15:43:37.646 7fa129e17700  0 bluestore(/var/lib/ceph/osd/ceph-43) _txc_state_proc slow aio_wait, txc = 0x56190ea76000, latency = 92.439439
2020-08-29 15:43:37.646 7fa129e17700  0 bluestore(/var/lib/ceph/osd/ceph-43) _txc_state_proc slow aio_wait, txc = 0x561964172000, latency = 92.861119
2020-08-29 15:43:37.646 7fa0fedb0700 -1 bdev(0x5618a06c4000 /var/lib/ceph/osd/ceph-43/block) read stalled read  0xf10fa5c000~1000 (direct) since 447091s, timeout is 5s
2020-08-29 15:43:37.646 7fa0fd5ad700 -1 bdev(0x5618a06c4000 /var/lib/ceph/osd/ceph-43/block) read stalled read  0xe08a1000~1000 (direct) since 447079s, timeout is 5s
2020-08-29 15:43:37.646 7fa0fedb0700  0 bluestore(/var/lib/ceph/osd/ceph-43) log_latency_fn slow operation observed for _do_read, latency = 79.5633s, num_ios = 2048
2020-08-29 15:43:37.646 7fa0fd5ad700  0 bluestore(/var/lib/ceph/osd/ceph-43) log_latency_fn slow operation observed for _do_read, latency = 92.3464s, num_ios = 18
2020-08-29 15:43:37.646 7fa1045bb700 -1 bdev(0x5618a06c4000 /var/lib/ceph/osd/ceph-43/block) read stalled read  0x103a3da1000~1000 (direct) since 447078s, timeout is 5s
2020-08-29 15:43:37.646 7fa0ffdb2700 -1 bdev(0x5618a06c4000 /var/lib/ceph/osd/ceph-43/block) read stalled read  0x100a4d26000~1000 (direct) since 447078s, timeout is 5s
2020-08-29 15:43:37.646 7fa0ffdb2700  0 bluestore(/var/lib/ceph/osd/ceph-43) log_latency_fn slow operation observed for _do_read, latency = 93.022s, num_ios = 1024
2020-08-29 15:43:37.646 7fa1045bb700  0 bluestore(/var/lib/ceph/osd/ceph-43) log_latency_fn slow operation observed for _do_read, latency = 92.5954s, num_ios = 1536
2020-08-29 15:43:37.646 7fa1005b3700 -1 bdev(0x5618a06c4000 /var/lib/ceph/osd/ceph-43/block) read stalled read  0x1003a505000~1000 (direct) since 447079s, timeout is 5s
2020-08-29 15:43:37.646 7fa101db6700 -1 bdev(0x5618a06c4000 /var/lib/ceph/osd/ceph-43/block) read stalled read  0x1012ea45000~1000 (direct) since 447080s, timeout is 5s
2020-08-29 15:43:37.646 7fa1005b3700  0 bluestore(/var/lib/ceph/osd/ceph-43) log_latency_fn slow operation observed for _do_read, latency = 92.3891s, num_ios = 3584
2020-08-29 15:43:37.646 7fa101db6700  0 bluestore(/var/lib/ceph/osd/ceph-43) log_latency_fn slow operation observed for _do_read, latency = 91.1452s, num_ios = 2048
2020-08-29 15:43:37.646 7fa129e17700  0 bluestore(/var/lib/ceph/osd/ceph-43) _txc_state_proc slow aio_wait, txc = 0x56191dbce300, latency = 82.924723
2020-08-29 15:43:37.646 7fa0fcdac700 -1 bdev(0x5618a06c4000 /var/lib/ceph/osd/ceph-43/block) read stalled read  0xfe7a8b7000~1000 (direct) since 447083s, timeout is 5s
2020-08-29 15:43:37.646 7fa0fcdac700  0 bluestore(/var/lib/ceph/osd/ceph-43) log_latency_fn slow operation observed for _do_read, latency = 87.9135s, num_ios = 512
2020-08-29 15:43:37.646 7fa129e17700  0 bluestore(/var/lib/ceph/osd/ceph-43) _txc_state_proc slow aio_wait, txc = 0x56196a946300, latency = 82.923270
2020-08-29 15:43:37.646 7fa129e17700  0 bluestore(/var/lib/ceph/osd/ceph-43) _txc_state_proc slow aio_wait, txc = 0x5618c70d8f00, latency = 92.344641
2020-08-29 15:43:37.646 7fa129e17700  0 bluestore(/var/lib/ceph/osd/ceph-43) _txc_state_proc slow aio_wait, txc = 0x5619748d1500, latency = 92.333448
d for _txc_committed_kv, latency = 91.2568s, txc = 0x56196ab26f00
2020-08-29 15:43:37.774 7fa10e5cf700  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.43 down, but it is still running
.....
.....
2020-08-29 15:43:37.774 7fa10e5cf700  0 log_channel(cluster) log [DBG] : map e604017 wrongly marked me down at e604016
2020-08-29 15:43:39.494 7fa0fd5ad700 -1 bdev(0x5618a06c4000 /var/lib/ceph/osd/ceph-43/block) aio_submit retries 15




monitor logs
---------------

mon1
----
2020-08-29 15:43:09.836 7f3ed3030700  0 log_channel(audit) log [DBG] : from='client.835295911 X.X.0.43:0/3364930077' entity='client.admin' cmd=[{"prefix":"df","format":"json"}]: dispatch
2020-08-29 15:43:10.140 7f3ed5835700  0 log_channel(cluster) log [WRN] : Health check update: 715 slow ops, oldest one blocked for 62 sec, daemons [osd.13,osd.14,osd.15,osd.16,osd.17,osd.20,osd.22,osd.23,osd.26,osd.27]... have slow ops. (SLOW_OPS)


mon2
----
2020-08-29 15:43:20.140 7f3ed5835700  0 log_channel(cluster) log [WRN] : Health check update: 817 slow ops, oldest one blocked for 73 sec, daemons [osd.13,osd.14,osd.15,osd.16,osd.17,osd.20,osd.22,osd.23,osd.26,osd.27]... have slow ops. (SLOW_OPS)



mon3
------
2020-08-29 15:43:08.066 7f3d213c5700  0 mon.ceph5-2@1(peon) e5 handle_command mon_command({"prefix": "osd blacklist", "blacklistop": "add", "addr": "10.5.0.93:0/955438826"} v 0) v1
2020-08-29 15:43:08.066 7f3d213c5700  0 log_channel(audit) log [INF] : from='client.835295776 X.X.0.92:0/2918303586' entity='client.admin' cmd=[{"prefix": "osd blacklist", "blacklistop": "add", "addr": "10.5.0.93:0/955438826"}]: dispatch
2020-08-29 15:43:09.974 7f3d213c5700  0 mon.ceph5-2@1(peon) e5 handle_command mon_command({"format":"json","prefix":"df"} v 0) v1
2020-08-29 15:43:09.974 7f3d213c5700  0 log_channel(audit) log [DBG] : from='client.? X.X.0.44:0/728500154' entity='client.admin' cmd=[{"format":"json","prefix":"df"}]: dispatch
2020-08-29 15:43:10.330 7f3d213c5700  0 mon.ceph5-2@1(peon) e5 handle_command mon_command({"format":"json","prefix":"df"} v 0) v1
2020-08-29 15:43


