Hi,
We are running VM images on Ceph using RBD. We are seeing a problem where one of our VMs stalls because its IO does not complete: iostat inside the VM shows requests stuck in the queue, and disk utilisation for the Ceph-backed vdisks sits at 100%.
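For anyone trying to reproduce the observation, this is roughly how the stall shows up inside the guest (the device name vdb is just an example):

iostat -x vdb 1
# during a stall: avgqu-sz stays above zero and %util is pinned at 100,
# while r/s and w/s drop to (nearly) zero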
Upon investigation the problem seems to be with the messenger worker (msgr-worker-0) thread of one OSD. Restarting that OSD process fixes the problem: the IO completes, the VM unfreezes and life continues without a problem, until it happens again. Recurrence is anywhere between 30 minutes and more than 24 hours. The affected VMs are always in the same pool, but it is not always the same VM; the problem occurs with different VMs on different hypervisors, on different Ceph nodes and with different OSDs.
When the problem occurs, we see network traffic suddenly jump from <100 Mbit/s to >4 Gbit/s of continuous traffic between the hypervisor and one OSD, in these cases always one of our HDD OSDs. The traffic is not visible inside the VM, only on the hypervisor.
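(For reference, the flow is easy to spot on the hypervisor with something along these lines; the interface name is just an example:

iftop -nNP -i eth1
# the single hypervisor<->OSD TCP connection carrying the >4 Gbit/s shows up as one flow
)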
If the client is rebooted, the problem is gone. If the OSD is restarted, the problem is gone.
This has been happening several times per day since we made several changes at the same time:
- added physical RAM to the Ceph nodes
- moved from fixed 'bluestore cache size hdd|ssd' and 'bluestore cache kv max' to 'bluestore cache autotune = 1' and 'osd memory target = 20401094656' (a sketch of how to check the values an OSD actually runs with follows right after this list)
- updated Ceph on the nodes from 12.2.8 to 12.2.11
- updated the clients from 12.2.8 to 12.2.11
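A sketch of how the values an OSD actually runs with can be checked via the admin socket (osd.23 is the example OSD from further below; note that 'config show' uses the underscore spelling of the option names):

ceph daemon osd.23 config show | egrep 'bluestore_cache_autotune|osd_memory_target|bluestore_cache_size'
# and what the cache autotuner has actually allocated so far:
ceph daemon osd.23 dump_mempools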
We have since upgraded the Ceph nodes to 12.2.12, but that did not fix this problem.
My request is that someone takes a look at our findings below and gives some insight into whether this is a bug or a misconfiguration, or perhaps has an idea of where to take a closer look.
Our setup is:
8 identical nodes, each with 4 HDDs (8 TB, 7k rpm) and 6 SSDs (4 TB). There are a number of pools that map, via CRUSH rules, to either the HDDs or the SSDs. The pool that always has this problem is called 'prod_slow' and maps to the HDDs.
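(The pool-to-device mapping can be double-checked from the cluster side; a quick sketch:

ceph osd pool get prod_slow crush_rule
ceph osd crush rule dump
# the rule returned by the first command should be the one referencing the HDD devices/root
)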
I tracked down the OSD by looking at the port the client receives most of its traffic from (all of the 4 Gbit/s is read traffic, outgoing from Ceph, incoming to the client).
root@ceph-03:~# netstat -tlpn|grep 6804
tcp 0 0 10.60.8.11:6804 0.0.0.0:* LISTEN 3741/ceph-osd
tcp 0 0 10.60.6.11:6804 0.0.0.0:* LISTEN 3730/ceph-osd
root@ceph-03:~# ps uafx|grep 3730
ceph 3730 44.3 6.3 20214604 16848524 ? Ssl Jun05 524:14 /usr/bin/ceph-osd -f --cluster ceph --id 23 --setuser ceph --setgroup ceph
root@ceph-03:~# ps -L -p3730
PID LWP TTY TIME CMD
3730 3730 ? 00:00:05 ceph-osd
3730 3778 ? 00:00:00 log
3730 3791 ? 05:19:49 msgr-worker-0
3730 3802 ? 00:01:18 msgr-worker-1
3730 3810 ? 00:01:25 msgr-worker-2
3730 3842 ? 00:00:00 service
3730 3845 ? 00:00:00 admin_socket
3730 4015 ? 00:00:00 ceph-osd
3730 4017 ? 00:00:00 safe_timer
3730 4018 ? 00:00:03 safe_timer
3730 4019 ? 00:00:00 safe_timer
3730 4020 ? 00:00:00 safe_timer
3730 4021 ? 00:00:14 bstore_aio
3730 4023 ? 00:00:05 bstore_aio
3730 4280 ? 00:00:32 rocksdb:bg0
3730 4634 ? 00:00:00 dfin
3730 4635 ? 00:00:12 finisher
3730 4636 ? 00:00:51 bstore_kv_sync
3730 4637 ? 00:00:12 bstore_kv_final
3730 4638 ? 00:00:27 bstore_mempool
3730 5803 ? 00:03:08 ms_dispatch
3730 5804 ? 00:00:00 ms_local
3730 5805 ? 00:00:00 ms_dispatch
3730 5806 ? 00:00:00 ms_local
3730 5807 ? 00:00:00 ms_dispatch
3730 5808 ? 00:00:00 ms_local
3730 5809 ? 00:00:00 ms_dispatch
3730 5810 ? 00:00:00 ms_local
3730 5811 ? 00:00:00 ms_dispatch
3730 5812 ? 00:00:00 ms_local
3730 5813 ? 00:00:00 ms_dispatch
3730 5814 ? 00:00:00 ms_local
3730 5815 ? 00:00:00 ms_dispatch
3730 5816 ? 00:00:00 ms_local
3730 5817 ? 00:00:00 safe_timer
3730 5818 ? 00:00:00 fn_anonymous
3730 5819 ? 00:00:02 safe_timer
3730 5820 ? 00:00:00 tp_peering
3730 5821 ? 00:00:00 tp_peering
3730 5822 ? 00:00:00 fn_anonymous
3730 5823 ? 00:00:00 fn_anonymous
3730 5824 ? 00:00:00 safe_timer
3730 5825 ? 00:00:00 safe_timer
3730 5826 ? 00:00:00 safe_timer
3730 5827 ? 00:00:00 safe_timer
3730 5828 ? 00:00:00 osd_srv_agent
3730 5829 ? 00:01:15 tp_osd_tp
3730 5830 ? 00:01:27 tp_osd_tp
3730 5831 ? 00:01:40 tp_osd_tp
3730 5832 ? 00:00:49 tp_osd_tp
3730 5833 ? 03:09:49 tp_osd_tp
3730 5834 ? 00:00:00 tp_osd_remove
3730 5835 ? 00:00:00 tp_osd_recovery
3730 5836 ? 00:00:00 tp_osd_cmd
3730 5837 ? 00:00:06 osd_srv_heartbt
3730 5950 ? 00:00:00 signal_handler
3730 140250 ? 00:00:02 rocksdb:bg0
root@ceph-03:~#
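(As a cross-check, the same port-to-OSD mapping can also be read from the cluster map, e.g.:

ceph osd dump | grep '^osd.23 '
# lists the public and cluster addresses/ports osd.23 advertises
)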
Stracing those threads shows a lot of detail; I'll add it at the bottom of this post.
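(Roughly the invocation, with the LWP taken from the ps -L output above; the exact filter here is an example, not necessarily what we ran:

strace -f -tt -p 3791 -e trace=network -o /tmp/osd23-msgr.strace
# 3791 = msgr-worker-0; -tt adds microsecond timestamps, -e trace=network limits it to socket syscalls
)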
After increasing the debug logging on the OSD (and finally shutting the OSD down), the log shows:
2019-06-06 09:25:43.647703 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con 0x55dd1ba1c800 session 0x55dbc1918780
2019-06-06 09:25:43.649637 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con 0x55dc9eb08800 session 0x55dbc1918a00
2019-06-06 09:25:43.651376 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con 0x55dd1ba1c800 session 0x55dbc1918780
2019-06-06 09:25:43.653280 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con 0x55dc9eb08800 session 0x55dbc1918a00
2019-06-06 09:25:43.655156 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con 0x55dd1ba1c800 session 0x55dbc1918780
2019-06-06 09:25:43.656798 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con 0x55dc9eb08800 session 0x55dbc1918a00
2019-06-06 09:25:43.658183 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con 0x55dd1ba1c800 session 0x55dbc1918780
2019-06-06 09:25:43.659937 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con 0x55dc9eb08800 session 0x55dbc1918a00
2019-06-06 09:25:43.661915 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con 0x55dd1ba1c800 session 0x55dbc1918780
2019-06-06 09:25:43.664783 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con 0x55dc9eb08800 session 0x55dbc1918a00
The 'ms_handle_reset con' messages appear about 500 times per second. This continues until we restart the OSD. Once the OSD is up again, it recovers the degraded PGs and all is fine again.
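For anyone who wants to reproduce this: a sketch of how the messenger debug level can be raised at runtime and how the resets per second can be counted; the debug values here are examples, not necessarily what we used:

ceph tell osd.23 injectargs '--debug_ms 1 --debug_osd 10'
# count ms_handle_reset occurrences per second in the OSD log
grep ms_handle_reset /var/log/ceph/ceph-osd.23.log | cut -d' ' -f2 | cut -d. -f1 | sort | uniq -c
# back to the defaults afterwards
ceph tell osd.23 injectargs '--debug_ms 0/5 --debug_osd 1/5'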
I can see the traffic isn't actually coming from the hard disk itself: when checking on the Ceph node, the HDD is doing very little. I think this traffic is either served from the cache, or it is just messaging between the client and Ceph, with the client retrying to get the data over and over.
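Roughly the checks behind that statement (device and interface names are examples):

iostat -x sdb 1                          # sdb = the data disk of osd.23; close to idle during the event
sar -n DEV 1                             # while the public interface is pushing >4 Gbit/s at the same time
ceph daemon osd.23 dump_ops_in_flight    # shows whether the OSD has client ops queued at all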
A strace of the process during the problem state is here: https://gist.github.com/mvernimmen/5fe39d4722aa8dcff8294285460db021
ceph.conf is here: https://gist.github.com/mvernimmen/91f21b5ab8274920017fa2de743a5743
This is where my debugging knowledge starts to run thin. I'd love to learn how to continue, but I need some help here. Thank you for your time!
Best regards,
Textkernel BV, Nieuwendammerkade 26/a5, 1022 AB, Amsterdam, NL