Hi,
We are running VM images on Ceph using RBD. We are seeing a problem where one of our VMs stalls because its IO does not complete: iostat inside the VM shows requests stuck in the queue, and disk utilisation for the Ceph-backed vdisks sits at 100%.
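For anyone trying to reproduce the observation, this is roughly how the stall shows up inside the guest (the device name vdb is just an example):

iostat -x vdb 1
# during a stall: avgqu-sz stays above zero and %util is pinned at 100,
# while r/s and w/s drop to (nearly) zero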
Upon investigation the problem seems to be with the messenger worker (msgr-worker-0) thread of one OSD. Restarting that OSD process fixes the problem: the IO completes, the VM unfreezes and life continues without a problem, until it happens again. Recurrence is anywhere between 30 minutes and more than 24 hours. The affected VMs are always in the same pool, but it is not always the same VM; the problem occurs with different VMs on different hypervisors, on different Ceph nodes and with different OSDs.
When the problem occurs, we see network traffic suddenly jump from <100 Mbit/s to >4 Gbit/s of continuous traffic between the hypervisor and one OSD, in these cases always one of our HDD OSDs. The traffic is not visible inside the VM, only on the hypervisor.
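(For reference, the flow is easy to spot on the hypervisor with something along these lines; the interface name is just an example:

iftop -nNP -i eth1
# the single hypervisor<->OSD TCP connection carrying the >4 Gbit/s shows up as one flow
)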
If the client is rebooted, the problem is gone. If the OSD is restarted, the problem is gone.
This has been happening several times per day since we made several changes at the same time:
- added physical RAM to the Ceph nodes
- moved from fixed 'bluestore cache size hdd|ssd' and 'bluestore cache kv max' to 'bluestore cache autotune = 1' and 'osd memory target = 20401094656' (a sketch of how to check the values an OSD actually runs with follows right after this list)
- updated Ceph on the nodes from 12.2.8 to 12.2.11
- updated the clients from 12.2.8 to 12.2.11
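A sketch of how the values an OSD actually runs with can be checked via the admin socket (osd.23 is the example OSD from further below; note that 'config show' uses the underscore spelling of the option names):

ceph daemon osd.23 config show | egrep 'bluestore_cache_autotune|osd_memory_target|bluestore_cache_size'
# and what the cache autotuner has actually allocated so far:
ceph daemon osd.23 dump_mempools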
We have since upgraded the Ceph nodes to 12.2.12, but that did not fix this problem.
My request is that someone takes a look at our findings below and gives some insight into whether this is a bug or a misconfiguration, or perhaps has an idea of where to take a closer look.
Our setup is:
8 identical nodes, each with 4 HDDs (8 TB, 7k rpm) and 6 SSDs (4 TB). There are a number of pools that map, via CRUSH rules, to either the HDDs or the SSDs. The pool that always has this problem is called 'prod_slow' and maps to the HDDs.
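(The pool-to-device mapping can be double-checked from the cluster side; a quick sketch:

ceph osd pool get prod_slow crush_rule
ceph osd crush rule dump
# the rule returned by the first command should be the one referencing the HDD devices/root
)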
I tracked down the OSD by looking at the port the client receives most of its traffic from (all of the 4 Gbit/s is read traffic, outgoing from Ceph, incoming to the client).
root@ceph-03:~# netstat -tlpn|grep 6804
tcp 0 0 10.60.8.11:6804 0.0.0.0:* LISTEN 3741/ceph-osd
tcp 0 0 10.60.6.11:6804 0.0.0.0:* LISTEN 3730/ceph-osd
root@ceph-03:~# ps uafx|grep 3730
ceph 3730 44.3 6.3 20214604 16848524 ? Ssl Jun05 524:14 /usr/bin/ceph-osd -f --cluster ceph --id 23 --setuser ceph --setgroup ceph
root@ceph-03:~# ps -L -p3730
PID LWP TTY TIME CMD
3730 3730 ? 00:00:05 ceph-osd
3730 3778 ? 00:00:00 log
3730 3791 ? 05:19:49 msgr-worker-0
3730 3802 ? 00:01:18 msgr-worker-1
3730 3810 ? 00:01:25 msgr-worker-2
3730 3842 ? 00:00:00 service
3730 3845 ? 00:00:00 admin_socket
3730 4015 ? 00:00:00 ceph-osd
3730 4017 ? 00:00:00 safe_timer
3730 4018 ? 00:00:03 safe_timer
3730 4019 ? 00:00:00 safe_timer
3730 4020 ? 00:00:00 safe_timer
3730 4021 ? 00:00:14 bstore_aio
3730 4023 ? 00:00:05 bstore_aio
3730 4280 ? 00:00:32 rocksdb:bg0
3730 4634 ? 00:00:00 dfin
3730 4635 ? 00:00:12 finisher
3730 4636 ? 00:00:51 bstore_kv_sync
3730 4637 ? 00:00:12 bstore_kv_final
3730 4638 ? 00:00:27 bstore_mempool
3730 5803 ? 00:03:08 ms_dispatch
3730 5804 ? 00:00:00 ms_local
3730 5805 ? 00:00:00 ms_dispatch
3730 5806 ? 00:00:00 ms_local
3730 5807 ? 00:00:00 ms_dispatch
3730 5808 ? 00:00:00 ms_local
3730 5809 ? 00:00:00 ms_dispatch
3730 5810 ? 00:00:00 ms_local
3730 5811 ? 00:00:00 ms_dispatch
3730 5812 ? 00:00:00 ms_local
3730 5813 ? 00:00:00 ms_dispatch
3730 5814 ? 00:00:00 ms_local
3730 5815 ? 00:00:00 ms_dispatch
3730 5816 ? 00:00:00 ms_local
3730 5817 ? 00:00:00 safe_timer
3730 5818 ? 00:00:00 fn_anonymous
3730 5819 ? 00:00:02 safe_timer
3730 5820 ? 00:00:00 tp_peering
3730 5821 ? 00:00:00 tp_peering
3730 5822 ? 00:00:00 fn_anonymous
3730 5823 ? 00:00:00 fn_anonymous
3730 5824 ? 00:00:00 safe_timer
3730 5825 ? 00:00:00 safe_timer
3730 5826 ? 00:00:00 safe_timer
3730 5827 ? 00:00:00 safe_timer
3730 5828 ? 00:00:00 osd_srv_agent
3730 5829 ? 00:01:15 tp_osd_tp
3730 5830 ? 00:01:27 tp_osd_tp
3730 5831 ? 00:01:40 tp_osd_tp
3730 5832 ? 00:00:49 tp_osd_tp
3730 5833 ? 03:09:49 tp_osd_tp
3730 5834 ? 00:00:00 tp_osd_remove
3730 5835 ? 00:00:00 tp_osd_recovery
3730 5836 ? 00:00:00 tp_osd_cmd
3730 5837 ? 00:00:06 osd_srv_heartbt
3730 5950 ? 00:00:00 signal_handler
3730 140250 ? 00:00:02 rocksdb:bg0
root@ceph-03:~#
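(As a cross-check, the same port-to-OSD mapping can also be read from the cluster map, e.g.:

ceph osd dump | grep '^osd.23 '
# lists the public and cluster addresses/ports osd.23 advertises
)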
Stracing those threads shows a lot of detail; I'll add it at the bottom of this post.
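(Roughly the invocation, with the LWP taken from the ps -L output above; the exact filter here is an example, not necessarily what we ran:

strace -f -tt -p 3791 -e trace=network -o /tmp/osd23-msgr.strace
# 3791 = msgr-worker-0; -tt adds microsecond timestamps, -e trace=network limits it to socket syscalls
)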
After increasing the debug logging on the OSD (and finally shutting the OSD down), the log shows:
2019-06-06 09:25:43.647703 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con 0x55dd1ba1c800 session 0x55dbc1918780
2019-06-06 09:25:43.649637 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con 0x55dc9eb08800 session 0x55dbc1918a00
2019-06-06 09:25:43.651376 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con 0x55dd1ba1c800 session 0x55dbc1918780
2019-06-06 09:25:43.653280 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con 0x55dc9eb08800 session 0x55dbc1918a00
2019-06-06 09:25:43.655156 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con 0x55dd1ba1c800 session 0x55dbc1918780
2019-06-06 09:25:43.656798 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con 0x55dc9eb08800 session 0x55dbc1918a00
2019-06-06 09:25:43.658183 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con 0x55dd1ba1c800 session 0x55dbc1918780
2019-06-06 09:25:43.659937 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con 0x55dc9eb08800 session 0x55dbc1918a00
2019-06-06 09:25:43.661915 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con 0x55dd1ba1c800 session 0x55dbc1918780
2019-06-06 09:25:43.664783 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con 0x55dc9eb08800 session 0x55dbc1918a00
The 'ms_handle_reset con' messages appear about 500 times per second. This continues until we restart the OSD. Once the OSD is up again, it recovers the degraded PGs and all is fine again.
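For anyone who wants to reproduce this: a sketch of how the messenger debug level can be raised at runtime and how the resets per second can be counted; the debug values here are examples, not necessarily what we used:

ceph tell osd.23 injectargs '--debug_ms 1 --debug_osd 10'
# count ms_handle_reset occurrences per second in the OSD log
grep ms_handle_reset /var/log/ceph/ceph-osd.23.log | cut -d' ' -f2 | cut -d. -f1 | sort | uniq -c
# back to the defaults afterwards
ceph tell osd.23 injectargs '--debug_ms 0/5 --debug_osd 1/5'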
I can see the traffic isn't actually coming from the hard disk itself: when checking on the Ceph node, the HDD is doing very little. I think this traffic is either served from the cache, or it is just messaging between the client and Ceph, with the client retrying to get the data over and over.
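Roughly the checks behind that statement (device and interface names are examples):

iostat -x sdb 1                          # sdb = the data disk of osd.23; close to idle during the event
sar -n DEV 1                             # while the public interface is pushing >4 Gbit/s at the same time
ceph daemon osd.23 dump_ops_in_flight    # shows whether the OSD has client ops queued at all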
A strace of the process during the problem state is here: https://gist.github.com/mvernimmen/5fe39d4722aa8dcff8294285460db021
ceph.conf is here: https://gist.github.com/mvernimmen/91f21b5ab8274920017fa2de743a5743
This is where my debugging knowledge starts to run thin. I'd love to learn how to continue, but I need some help here. Thank you for your time!
Best regards,
Textkernel BV, Nieuwendammerkade 26/a5, 1022 AB, Amsterdam, NL