On Wed, Mar 25, 2015 at 1:20 AM, Udo Lembke <ulembke@xxxxxxxxxxxx> wrote:
> Hi,
> due to two more hosts (now 7 storage nodes) I want to create a new
> ec-pool and get a strange effect:
>
> ceph@admin:~$ ceph health detail
> HEALTH_WARN 2 pgs degraded; 2 pgs stuck degraded; 2 pgs stuck unclean; 2
> pgs stuck undersized; 2 pgs undersized

This is the big clue: you have two undersized PGs!

> pg 22.3e5 is stuck unclean since forever, current state
> active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647]

2147483647 is the largest number you can represent in a signed 32-bit
integer. There's an output error of some kind which is fixed elsewhere;
this should be "-1". So for whatever reason (in general it's hard for
CRUSH to select N entries out of N choices), CRUSH hasn't been able to
map an OSD to this slot for you. You'll want to figure out why that is
and fix it.
-Greg

> pg 22.240 is stuck unclean since forever, current state
> active+undersized+degraded, last acting [38,85,17,74,2147483647,10,58]
> pg 22.3e5 is stuck undersized for 406.614447, current state
> active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647]
> pg 22.240 is stuck undersized for 406.616563, current state
> active+undersized+degraded, last acting [38,85,17,74,2147483647,10,58]
> pg 22.3e5 is stuck degraded for 406.614566, current state
> active+undersized+degraded, last acting [76,15,82,11,57,29,2147483647]
> pg 22.240 is stuck degraded for 406.616679, current state
> active+undersized+degraded, last acting [38,85,17,74,2147483647,10,58]
> pg 22.3e5 is active+undersized+degraded, acting
> [76,15,82,11,57,29,2147483647]
> pg 22.240 is active+undersized+degraded, acting
> [38,85,17,74,2147483647,10,58]
>
> But I have only 91 OSDs (84 SATA + 7 SSDs), not 2147483647!
> Where the heck did the 2147483647 come from?
>
> I ran the following commands:
> ceph osd erasure-code-profile set 7hostprofile k=5 m=2
> ruleset-failure-domain=host
> ceph osd pool create ec7archiv 1024 1024 erasure 7hostprofile
>
> my version:
> ceph -v
> ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)
>
>
> I found an issue in my crush map - one SSD was in the map twice:
> host ceph-061-ssd {
>         id -16          # do not change unnecessarily
>         # weight 0.000
>         alg straw
>         hash 0  # rjenkins1
> }
> root ssd {
>         id -13          # do not change unnecessarily
>         # weight 0.780
>         alg straw
>         hash 0  # rjenkins1
>         item ceph-01-ssd weight 0.170
>         item ceph-02-ssd weight 0.170
>         item ceph-03-ssd weight 0.000
>         item ceph-04-ssd weight 0.170
>         item ceph-05-ssd weight 0.170
>         item ceph-06-ssd weight 0.050
>         item ceph-07-ssd weight 0.050
>         item ceph-061-ssd weight 0.000
> }
>
> Host ceph-061-ssd doesn't exist, and osd-61 is the SSD from ceph-03-ssd,
> but after fixing the crush map the issue with osd 2147483647 still exists.
>
> Any idea how to fix that?
>
> regards
>
> Udo
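
For anyone following up on Greg's advice to figure out why CRUSH leaves
that seventh slot empty, here is a minimal sketch of one way to
investigate, assuming the standard crushtool workflow; the file names,
the <ruleset> placeholder, and the retry value of 100 are illustrative
and not taken from this thread:

# Dump and decompile the current CRUSH map:
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# Test how the pool's rule maps 7 shards; every line printed by
# --show-bad-mappings is a PG that CRUSH could not fully map:
crushtool -i crushmap.bin --test --rule <ruleset> --num-rep 7 --show-bad-mappings

# With k+m equal to the number of hosts, CRUSH can simply run out of
# retries. A common mitigation is to add "step set_choose_tries 100" to
# the erasure rule in crushmap.txt, then recompile and inject the map:
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new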