Looks like disk I/O is too slow. You can try tuning ceph.conf with
settings like "osd client op priority":
http://docs.ceph.com/docs/jewel/rados/configuration/osd-config-ref/
(which is not loading for me at the moment...)
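For example, a minimal sketch of what that could look like (values are
illustrative, not tested recommendations; in Jewel, osd client op
priority already defaults to 63):

    [osd]
    # Weight client ops above recovery ops in the OSD op queue (range 1-63)
    osd client op priority = 63
    osd recovery op priority = 1
    # Throttle recovery/backfill so client I/O is starved less often
    osd max backfills = 1
    osd recovery max active = 1

Note this only shifts queue priority from recovery to client traffic;
it won't help if the underlying disks are simply saturated.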
On 01/05/2017 04:43 PM, Oliver Dzombic wrote:
Hi,
any idea of the root cause of this? Inside a KVM VM running a qcow2
image on CephFS, dmesg shows (a sketch of the assumed guest disk setup
follows the log):
[846193.473396] ata1.00: status: { DRDY }
[846196.231058] ata1: soft resetting link
[846196.386714] ata1.01: NODEV after polling detection
[846196.391048] ata1.00: configured for MWDMA2
[846196.391053] ata1.00: retrying FLUSH 0xea Emask 0x4
[846196.391671] ata1: EH complete
[1019646.935659] UDP: bad checksum. From 122.224.153.109:46252 to 193.24.210.48:161 ulen 49
[1107679.421951] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[1107679.423407] ata1.00: failed command: FLUSH CACHE EXT
[1107679.424871] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
[1107679.427596] ata1.00: status: { DRDY }
[1107684.482035] ata1: link is slow to respond, please be patient (ready=0)
[1107689.480237] ata1: device not ready (errno=-16), forcing hardreset
[1107689.480267] ata1: soft resetting link
[1107689.637701] ata1.00: configured for MWDMA2
[1107689.637707] ata1.00: retrying FLUSH 0xea Emask 0x4
[1107704.638255] ata1.00: qc timeout (cmd 0xea)
[1107704.638282] ata1.00: FLUSH failed Emask 0x4
[1107709.687013] ata1: link is slow to respond, please be patient (ready=0)
[1107710.095069] ata1: soft resetting link
[1107710.246403] ata1.01: NODEV after polling detection
[1107710.247225] ata1.00: configured for MWDMA2
[1107710.247229] ata1.00: retrying FLUSH 0xea Emask 0x4
[1107710.248170] ata1: EH complete
[1199723.323256] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[1199723.324769] ata1.00: failed command: FLUSH CACHE EXT
[1199723.326734] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
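For context, the ata1/MWDMA2 lines above come from an IDE-emulated
guest disk backed by a qcow2 file on a CephFS mount, i.e. something
along the lines of this hypothetical invocation (paths, memory size,
and cache mode are illustrative, not the actual configuration):

    qemu-system-x86_64 -enable-kvm -m 4096 \
        -drive file=/mnt/cephfs/vms/guest.qcow2,format=qcow2,if=ide,cache=writeback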
The host machine is running kernel 4.5.4. Host dmesg shows:
[1235641.055673] INFO: task qemu-kvm:18287 blocked for more than 120 seconds.
[1235641.056066] Not tainted 4.5.4ceph-vps-default #1
[1235641.056315] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1235641.056583] qemu-kvm D ffff8812f939bb58 0 18287 1 0x00000080
[1235641.056587] ffff8812f939bb58 ffff881034c02b80 ffff881b7044ab80 ffff8812f939c000
[1235641.056590] 0000000000000000 7fffffffffffffff ffff881c7ffd7b70 ffffffff818c1d90
[1235641.056592] ffff8812f939bb70 ffffffff818c1525 ffff88103fa16d00 ffff8812f939bc18
[1235641.056594] Call Trace:
[1235641.056603] [<ffffffff818c1d90>] ? bit_wait+0x50/0x50
[1235641.056605] [<ffffffff818c1525>] schedule+0x35/0x80
[1235641.056609] [<ffffffff818c41d1>] schedule_timeout+0x231/0x2d0
[1235641.056613] [<ffffffff8115a19c>] ? ktime_get+0x3c/0xb0
[1235641.056622] [<ffffffff818c1d90>] ? bit_wait+0x50/0x50
[1235641.056624] [<ffffffff818c0b96>] io_schedule_timeout+0xa6/0x110
[1235641.056626] [<ffffffff818c1dab>] bit_wait_io+0x1b/0x60
[1235641.056627] [<ffffffff818c1950>] __wait_on_bit+0x60/0x90
[1235641.056632] [<ffffffff811eb46b>] wait_on_page_bit+0xcb/0xf0
[1235641.056636] [<ffffffff8112c6e0>] ? autoremove_wake_function+0x40/0x40
[1235641.056638] [<ffffffff811eb58f>] __filemap_fdatawait_range+0xff/0x180
[1235641.056641] [<ffffffff811eda61>] ? __filemap_fdatawrite_range+0xd1/0x100
[1235641.056644] [<ffffffff811eb624>] filemap_fdatawait_range+0x14/0x30
[1235641.056646] [<ffffffff811edb9f>] filemap_write_and_wait_range+0x3f/0x70
[1235641.056649] [<ffffffff814383f9>] ceph_fsync+0x69/0x5c0
[1235641.056656] [<ffffffff811678dd>] ? do_futex+0xfd/0x530
[1235641.056663] [<ffffffff812a737d>] vfs_fsync_range+0x3d/0xb0
[1235641.056668] [<ffffffff810038e9>] ? syscall_trace_enter_phase1+0x139/0x150
[1235641.056670] [<ffffffff812a744d>] do_fsync+0x3d/0x70
[1235641.056673] [<ffffffff812a7703>] SyS_fdatasync+0x13/0x20
[1235641.056676] [<ffffffff818c506e>] entry_SYSCALL_64_fastpath+0x12/0x71
This sometimes happens on a healthy cluster running
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374),
with the OSD servers on kernel 4.5.5.
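"Healthy" here meaning the usual checks come back clean, e.g.
(standard Ceph CLI; the exact commands run are illustrative):

    ceph -s              # overall status; expect HEALTH_OK
    ceph health detail   # per-check breakdown if anything is off
    ceph osd perf        # per-OSD commit/apply latency, to spot slow disks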
Sometimes it causes the VM to refuse I/O and it has to be restarted;
sometimes it continues without intervention.
Any input is appreciated. Thank you!
--
~~~~~~
David Welch
DevOps
ARS
http://thinkars.com