Re: Help: pool not responding

Thank you for your time.
Dimitar Boichev <Dimitar.Boichev@...> writes:

> 
> I am sure that I speak for the majority of people reading this, when I say
> that I didn't get anything from your emails.
> Could you provide more debug information ?
> Like (but not limited to):
> ceph -s 
> ceph health details
> ceph osd tree

In fact, I asked what I need to provide because, honestly, I do not know.

Here is ceph -s:

    cluster ac7bc476-3a02-453d-8e5c-606ab6f022ca
     health HEALTH_WARN
            4 pgs incomplete
            4 pgs stuck inactive
            4 pgs stuck unclean
     monmap e8: 3 mons at
{0=10.1.0.12:6789/0,1=10.1.0.14:6789/0,2=10.1.0.17:6789/0}
            election epoch 832, quorum 0,1,2 0,1,2
     osdmap e2400: 3 osds: 3 up, 3 in
      pgmap v5883297: 288 pgs, 4 pools, 391 GB data, 100 kobjects
            1090 GB used, 4481 GB / 5571 GB avail
                 284 active+clean
                   4 incomplete

ceph health detail:

    cluster ac7bc476-3a02-453d-8e5c-606ab6f022ca
     health HEALTH_WARN
            4 pgs incomplete
            4 pgs stuck inactive
            4 pgs stuck unclean
     monmap e8: 3 mons at
{0=10.1.0.12:6789/0,1=10.1.0.14:6789/0,2=10.1.0.17:6789/0}
            election epoch 832, quorum 0,1,2 0,1,2
     osdmap e2400: 3 osds: 3 up, 3 in
      pgmap v5883297: 288 pgs, 4 pools, 391 GB data, 100 kobjects
            1090 GB used, 4481 GB / 5571 GB avail
                 284 active+clean
                   4 incomplete
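For what it's worth, the ceph health detail output above came out the same as
ceph -s here. If it would help, these are the commands I could run next to
list the four incomplete PGs by id and dump the state of one of them (the PG
id 2.1f below is only a placeholder, not real output from my cluster):

    # list the stuck PGs together with their ids
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean

    # query one of the incomplete PGs in detail (placeholder id)
    ceph pg 2.1f query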

ceph osd tree:

ID WEIGHT  TYPE NAME              UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 5.42999 root default                                             
-2 1.81000     host proxmox-quad3                                   
 0 1.81000         osd.0               up  1.00000          1.00000 
-3 1.81000     host proxmox-zotac                                   
 1 1.81000         osd.1               up  1.00000          1.00000 
-4 1.81000     host proxmox-hp                                      
 3 1.81000         osd.3               up  1.00000          1.00000 
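There is no osd.2 in the tree; I assume that is the id of the OSD I removed
(see below). If useful, I could also check per-OSD space usage and which OSDs
one of the incomplete PGs maps to (again, 2.1f is only a placeholder id):

    # per-OSD utilisation
    ceph osd df

    # up/acting OSD set for one of the incomplete PGs
    ceph pg map 2.1f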


> 
> I am really having a bad time trying to decode the exact problems.
> First you had network issues, then osd failed (in the same time or after?),
> Then the cluser did not have enough free space to recover I suppose  ?
> 
It is a three server/osd test/evaluation system with Ceph and Proxmox PVE.
The load is very light and there is a lot of free space.

So:

- I NEVER had network issues. People TOLD me that I must have network
problems. I changed cables and switches just in case, but nothing improved.
- One disk had bad sectors, so I added another disk/OSD and then removed the
failing OSD, following the official documentation (the commands I used are
sketched after this list, as far as I remember). After that the cluster ran
fine for two months, so there was enough free space and the cluster had
recovered.
- Then one day I discovered that the Proxmox backup had hung, and I saw that
it was because Ceph was not responding.
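
Roughly from memory, the removal followed the standard sequence from the
documentation; the osd id 2 below is my assumption of what I typed at the
time, not a verified record:

    # mark the failing OSD out and wait for the data to rebalance
    ceph osd out 2

    # stop the ceph-osd daemon on its host (via the init system), then
    # remove it from the CRUSH map, delete its key and remove the OSD
    ceph osd crush remove osd.2
    ceph auth del osd.2
    ceph osd rm 2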


> Regarding the slow SSD disks, what disks are you using ?

I said SSHD, which is a standard HDD with an SSD cache. It spins at 7200 rpm,
but in benchmarks it performs better than a 10,000 rpm disk.

Thanks again,
Mario


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


