cephfs ata1.00: status: { DRDY }

Hi,

any idea what the root cause of this could be? Inside a KVM VM whose
qcow2 image sits on CephFS, dmesg shows the following (a small sketch
for pulling the FLUSH-related events out of the log follows the excerpt):

[846193.473396] ata1.00: status: { DRDY }
[846196.231058] ata1: soft resetting link
[846196.386714] ata1.01: NODEV after polling detection
[846196.391048] ata1.00: configured for MWDMA2
[846196.391053] ata1.00: retrying FLUSH 0xea Emask 0x4
[846196.391671] ata1: EH complete
[1019646.935659] UDP: bad checksum. From 122.224.153.109:46252 to
193.24.210.48:161 ulen 49
[1107679.421951] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action
0x6 frozen
[1107679.423407] ata1.00: failed command: FLUSH CACHE EXT
[1107679.424871] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
[1107679.427596] ata1.00: status: { DRDY }
[1107684.482035] ata1: link is slow to respond, please be patient (ready=0)
[1107689.480237] ata1: device not ready (errno=-16), forcing hardreset
[1107689.480267] ata1: soft resetting link
[1107689.637701] ata1.00: configured for MWDMA2
[1107689.637707] ata1.00: retrying FLUSH 0xea Emask 0x4
[1107704.638255] ata1.00: qc timeout (cmd 0xea)
[1107704.638282] ata1.00: FLUSH failed Emask 0x4
[1107709.687013] ata1: link is slow to respond, please be patient (ready=0)
[1107710.095069] ata1: soft resetting link
[1107710.246403] ata1.01: NODEV after polling detection
[1107710.247225] ata1.00: configured for MWDMA2
[1107710.247229] ata1.00: retrying FLUSH 0xea Emask 0x4
[1107710.248170] ata1: EH complete
[1199723.323256] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action
0x6 frozen
[1199723.324769] ata1.00: failed command: FLUSH CACHE EXT
[1199723.326734] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
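
All the guest messages above are the same pattern: the emulated disk's
FLUSH CACHE EXT (command 0xea) does not complete before the libata
command timeout, so the guest's error handler freezes the port,
soft-resets the link and retries the flush. To see how often this
happens and to line the timestamps up with the host log, the
FLUSH-related events can be pulled out of a saved guest dmesg. This is
only a rough sketch; the regex covers just the message formats visible
in the excerpt, and the log file is passed as an argument:

#!/usr/bin/env python3
# Rough sketch: pull the FLUSH-related ata events out of a saved dmesg
# dump (e.g. "dmesg > guest-dmesg.txt" inside the VM) so their
# timestamps can be compared with events on the host.  The regex only
# covers the message formats visible in the excerpt above.
import re
import sys

PATTERN = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<dev>ata\d+(?:\.\d+)?): "
    r"(?P<msg>failed command: FLUSH CACHE EXT"
    r"|qc timeout \(cmd 0xea\)"
    r"|retrying FLUSH 0xea.*"
    r"|FLUSH failed.*)"
)

def main(path):
    events = []
    with open(path) as log:
        for line in log:
            m = PATTERN.search(line)
            if m:
                events.append((float(m.group("ts")),
                               m.group("dev"),
                               m.group("msg").strip()))
    for ts, dev, msg in events:
        print(f"{ts:14.3f}  {dev:<8}  {msg}")
    print(f"total FLUSH-related events: {len(events)}")

if __name__ == "__main__":
    main(sys.argv[1])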


The host machine is running kernel 4.5.4.


Host machine dmesg:


[1235641.055673] INFO: task qemu-kvm:18287 blocked for more than 120
seconds.
[1235641.056066]       Not tainted 4.5.4ceph-vps-default #1
[1235641.056315] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[1235641.056583] qemu-kvm        D ffff8812f939bb58     0 18287      1
0x00000080
[1235641.056587]  ffff8812f939bb58 ffff881034c02b80 ffff881b7044ab80
ffff8812f939c000
[1235641.056590]  0000000000000000 7fffffffffffffff ffff881c7ffd7b70
ffffffff818c1d90
[1235641.056592]  ffff8812f939bb70 ffffffff818c1525 ffff88103fa16d00
ffff8812f939bc18
[1235641.056594] Call Trace:
[1235641.056603]  [<ffffffff818c1d90>] ? bit_wait+0x50/0x50
[1235641.056605]  [<ffffffff818c1525>] schedule+0x35/0x80
[1235641.056609]  [<ffffffff818c41d1>] schedule_timeout+0x231/0x2d0
[1235641.056613]  [<ffffffff8115a19c>] ? ktime_get+0x3c/0xb0
[1235641.056622]  [<ffffffff818c1d90>] ? bit_wait+0x50/0x50
[1235641.056624]  [<ffffffff818c0b96>] io_schedule_timeout+0xa6/0x110
[1235641.056626]  [<ffffffff818c1dab>] bit_wait_io+0x1b/0x60
[1235641.056627]  [<ffffffff818c1950>] __wait_on_bit+0x60/0x90
[1235641.056632]  [<ffffffff811eb46b>] wait_on_page_bit+0xcb/0xf0
[1235641.056636]  [<ffffffff8112c6e0>] ? autoremove_wake_function+0x40/0x40
[1235641.056638]  [<ffffffff811eb58f>] __filemap_fdatawait_range+0xff/0x180
[1235641.056641]  [<ffffffff811eda61>] ?
__filemap_fdatawrite_range+0xd1/0x100
[1235641.056644]  [<ffffffff811eb624>] filemap_fdatawait_range+0x14/0x30
[1235641.056646]  [<ffffffff811edb9f>]
filemap_write_and_wait_range+0x3f/0x70
[1235641.056649]  [<ffffffff814383f9>] ceph_fsync+0x69/0x5c0
[1235641.056656]  [<ffffffff811678dd>] ? do_futex+0xfd/0x530
[1235641.056663]  [<ffffffff812a737d>] vfs_fsync_range+0x3d/0xb0
[1235641.056668]  [<ffffffff810038e9>] ?
syscall_trace_enter_phase1+0x139/0x150
[1235641.056670]  [<ffffffff812a744d>] do_fsync+0x3d/0x70
[1235641.056673]  [<ffffffff812a7703>] SyS_fdatasync+0x13/0x20
[1235641.056676]  [<ffffffff818c506e>] entry_SYSCALL_64_fastpath+0x12/0x71
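
The trace shows qemu-kvm blocked for more than 120 seconds in
fdatasync() -> ceph_fsync(), i.e. waiting for page writeback of the
qcow2 image on the CephFS mount. To check whether sync latency on the
CephFS client spikes at the same moments, something like the following
can be run on the host. It is only a minimal sketch; the mount point
and test file name are assumptions and need to be adjusted:

#!/usr/bin/env python3
# Minimal sketch, to be run on the host: time fdatasync() on a file on
# the CephFS mount that holds the qcow2 images.  MOUNT and the test
# file name are assumptions -- adjust them to the real paths.
import os
import time

MOUNT = "/mnt/cephfs"                      # assumed CephFS mount point
TEST_FILE = os.path.join(MOUNT, "fsync-latency-probe.tmp")
BLOCK = b"\0" * (4 * 1024 * 1024)          # 4 MiB per write, arbitrary size

fd = os.open(TEST_FILE, os.O_CREAT | os.O_WRONLY, 0o600)
try:
    for i in range(30):
        os.write(fd, BLOCK)
        t0 = time.monotonic()
        os.fdatasync(fd)                   # the same syscall qemu-kvm is blocked in
        dt = time.monotonic() - t0
        print(f"fdatasync #{i:02d}: {dt * 1000:8.1f} ms")
        time.sleep(1)
finally:
    os.close(fd)
    os.unlink(TEST_FILE)

If fdatasync() on the CephFS mount itself occasionally takes tens of
seconds, the FLUSH timeouts in the guest would just be the visible end
of that stall.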


This happens occasionally on an otherwise healthy cluster running
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)

with the OSD servers on kernel 4.5.5.

Sometimes the VM ends up refusing I/O and has to be restarted;
sometimes it recovers and simply continues.



Any input is appreciated. Thank you!


-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Address:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

Registered at Amtsgericht Hanau, HRB 93402
Managing Director: Oliver Dzombic

Tax No.: 35 236 3622 1
VAT ID: DE274086107

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



