Re: PGS stuck inactive and osd down


On 12 May 2016, at 19:27, Vincenzo Pii <vincenzo.pii@xxxxxxxxxxxxx> wrote:

I have installed a new Ceph cluster with ceph-ansible, using the same version and playbook that had worked before (with some necessary changes to the variables).

The only major difference is that one OSD host (osd3) now has a disk twice as big as the others, and therefore a different CRUSH weight (see the crushmap excerpt below).

The Ceph version is Jewel (10.2.0 (3a9fba20ec743699b69bd0181dd6c54dc01c64b9)), and the setup has a single monitor node (there will be three in production) and three OSD hosts.

Any help to find the issue will be highly appreciated!

# ceph status
    cluster f7f42c59-b8ec-4d68-bb09-41f7a10c6223
     health HEALTH_ERR
            448 pgs are stuck inactive for more than 300 seconds
            448 pgs stuck inactive
     monmap e1: 1 mons at {sbb=10.2.48.205:6789/0}
            election epoch 3, quorum 0 sbb
      fsmap e8: 0/0/1 up
     osdmap e10: 3 osds: 0 up, 0 in
            flags sortbitwise
      pgmap v11: 448 pgs, 4 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                 448 creating
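The telling line above is "osdmap e10: 3 osds: 0 up, 0 in": the OSD daemons never registered with the monitor, which is why all 448 PGs stay in "creating". A first diagnostic pass could look like this (illustrative commands to run on the monitor and OSD hosts; the systemd instance id and log paths vary per host and distribution):

```shell
# On the monitor: confirm which OSDs exist and their up/in state
ceph osd tree

# On each OSD host: check whether the daemon is even running
systemctl status ceph-osd@0        # instance id varies per host

# Look at the OSD log for bind/heartbeat errors; a misconfigured
# public_network/cluster_network typically shows up here
journalctl -u ceph-osd@0 --no-pager | tail -n 50
# or: tail -n 50 /var/log/ceph/ceph-osd.0.log
```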

From the crushmap:

host osd1 {
        id -2           # do not change unnecessarily
        # weight 1.811
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 1.811
}
host osd2 {
        id -3           # do not change unnecessarily
        # weight 1.811
        alg straw
        hash 0  # rjenkins1
        item osd.1 weight 1.811
}
host osd3 {
        id -4           # do not change unnecessarily
        # weight 3.630
        alg straw
        hash 0  # rjenkins1
        item osd.2 weight 3.630
}
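For what it's worth, mixed weights like these are normal: the CRUSH weight defaults to the device size in TiB, so a disk twice as large simply gets roughly double the weight (3.630 vs 1.811 here), and that alone should not keep PGs inactive. If you want to inspect or edit the installed map directly, the usual round-trip is (illustrative; the file paths are arbitrary):

```shell
# Dump the compiled crushmap from the cluster and decompile it
ceph osd getcrushmap -o /tmp/crushmap.bin
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt

# ... edit /tmp/crushmap.txt as needed ...

# Recompile and inject it back
crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new
ceph osd setcrushmap -i /tmp/crushmap.new
```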

Vincenzo Pii | TERALYTICS
DevOps Engineer

Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland 
phone: +41 (0) 79 191 11 08
email: vincenzo.pii@xxxxxxxxxxxxxx
www.teralytics.net




Problem found: I had misconfigured the public_network and cluster_network variables for some of the hosts (I had moved some configuration to host_vars).
It was easy to spot once I checked the logs on those hosts.
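For anyone hitting the same symptom: a quick way to sanity-check this class of mistake is to verify that every OSD host's address actually falls inside the subnet you configured as public_network (and cluster_network, if set). A minimal sketch using Python's standard ipaddress module; the addresses and subnet below are made-up examples, not the real values from this cluster:

```python
import ipaddress

# Hypothetical example values: the subnet configured as public_network
# in ceph.conf, and the address of each OSD host.
public_network = ipaddress.ip_network("10.2.48.0/24")

osd_hosts = {
    "osd1": "10.2.48.211",
    "osd2": "10.2.48.212",
    "osd3": "10.2.49.213",  # wrong subnet: this OSD would fail to register
}

# An OSD whose address is outside public_network cannot bind where the
# monitor expects it, and stays down/out.
for name, addr in osd_hosts.items():
    ok = ipaddress.ip_address(addr) in public_network
    print(f"{name}: {addr} {'OK' if ok else 'NOT in ' + str(public_network)}")
```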

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
