My first guess is to ask what your CRUSH rules are. `ceph osd crush rule dump` along with `ceph osd pool ls detail` would be helpful. Also, a `ceph status` output from a time when the VM RBDs aren't working might explain something.
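Something like this, run from any node with an admin keyring, would capture the relevant state (redirecting to files just makes it easier to attach; the file names are arbitrary):

# ceph osd crush rule dump > crush-rules.txt
# ceph osd pool ls detail > pools.txt
# ceph status > status.txt
# ceph health detail > health.txt

The `ceph status` / `ceph health detail` output is most telling if you grab it while the VMs are actually hung, since that is when any inactive PGs or blocked requests would show up.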
On Thu, Oct 11, 2018 at 1:12 PM Nils Fahldieck - Profihost AG <n.fahldieck@xxxxxxxxxxxx> wrote:
Hi everyone,
for some time now we have been experiencing service outages in our Ceph
cluster whenever there is any change to the HEALTH status, e.g. swapping
storage devices, adding storage devices, rebooting Ceph hosts, during
backfills, etc.
Just now I had a situation where several VMs hung after I rebooted one
Ceph host. We use 3 replicas for each PG and have 3 mons, 3 mgrs, 3 MDS
and 71 OSDs spread over 9 hosts.
We use Ceph as a storage backend for our Proxmox VE (PVE) environment.
The outages manifest as blocked virtual file systems inside the virtual
machines running in our PVE cluster.
It feels similar to stuck and inactive PGs to me. Honestly, though, I'm
not really sure how to debug this problem or which log files to examine.
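If it helps, during the next outage I could capture the PG and request
state with something like this (I'm guessing at which commands are
relevant here):

# ceph health detail
# ceph pg dump_stuck inactive
# ceph pg dump_stuck unclean
# ceph osd blocked-by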
OS: Debian 9
Kernel: 4.12 based upon SLE15-SP1
# ceph version
ceph version 12.2.8-133-gded2f6836f
(ded2f6836f6331a58f5c817fca7bfcd6c58795aa) luminous (stable)
Can someone guide me? I'm more than happy to provide more information
as needed.
Thanks in advance
Nils
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com