Re: Undersized pgs problem

Bob,
Thanks for the explanation, that sounds reasonable! But how could it happen
that the host is down while its OSDs are still IN the cluster?
I mean, the NOOUT flag is not set and my timeouts are all at their defaults...

But if I remember correctly, the host was not completely down: it was
pingable, but no other services were reachable, such as SSH.
Is it possible that the OSDs were still sending some information to the
monitors, making them look like they were IN?
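
In case it helps, here is roughly what I plan to check next time this happens
(just a sketch, assuming the default option names and that the admin socket is
reachable on that node; the mon name is from my cluster):

    # confirm that noout is really not set
    ceph osd dump | grep flags
    # how long a down OSD stays IN before the monitors mark it out
    ceph daemon mon.slpeah002 config get mon_osd_down_out_interval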

2015-11-29 2:10 GMT+08:00 Bob R <bobr@xxxxxxxxxxxxxx>:
> Vasiliy,
>
> Your OSDs are marked as 'down' but 'in'.
>
> "Ceph OSDs have two known states that can be combined. Up and Down only
> tells you whether the OSD is actively involved in the cluster. OSD states
> also are expressed in terms of cluster replication: In and Out. Only when a
> Ceph OSD is tagged as Out does the self-healing process occur"
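>
> As a rough sketch (the ids are just the ones from your osd tree), marking the
> dead host's OSDs out by hand should start re-replication without waiting for
> the down-out timer:
>
>     ceph osd out 1 33 34 35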
>
> Bob
>
> On Fri, Nov 27, 2015 at 6:15 AM, Mart van Santen <mart@xxxxxxxxxxxx> wrote:
>>
>>
>> Dear Vasiliy,
>>
>>
>>
>> On 11/27/2015 02:00 PM, Irek Fasikhov wrote:
>>
>> Is your time synchronized?
>>
>> With best regards, Irek Fasikhov
>> Mob.: +79229045757
>>
>> 2015-11-27 15:57 GMT+03:00 Vasiliy Angapov <angapov@xxxxxxxxx>:
>>>
>>> > It seems that you played around with the crushmap and did something
>>> > wrong.
>>> > Compare the output of 'ceph osd tree' with the crushmap. Some 'osd'
>>> > devices have been renamed to 'device'; I think that is your problem.
>>> Is this actually a mistake? What I did was remove a bunch of OSDs from
>>> my cluster, which is why the numbering is sparse. But is it an issue to
>>> have sparse OSD numbering?
>>
>>
>> I think this is normal and should not be a problem. I have had this
>> before as well.
>>
>>>
>>> > Hi.
>>> > Vasiliy, yes, it is a problem with the crushmap. Look at the weight:
>>> > -3 14.56000     host slpeah001
>>> > -2 14.56000     host slpeah002
>>> What exactly is wrong here?
>>
>>
>> I do not know how the weights of the hosts contribute to determining where
>> the third copy of a PG is stored. As you explained, you have enough space on
>> all hosts, but if the host weights do not add up, the crushmap may conclude
>> that it is unable to place the PGs. What you can try is to artificially raise
>> the weights of these hosts and see if CRUSH starts mapping the third copies
>> of the PGs onto the available hosts.
>>
>> I had a similar problem in the past; it was solved by moving to the latest
>> CRUSH tunables. But be aware that this can cause massive data movement.
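>>
>> A rough sketch of both approaches (untested on your cluster; edit the host
>> weights in crush.txt to whatever value you want to try):
>>
>>     # dump and decompile the current crushmap
>>     ceph osd getcrushmap -o crush.bin
>>     crushtool -d crush.bin -o crush.txt
>>     # edit the host weights in crush.txt, then recompile and inject it
>>     crushtool -c crush.txt -o crush.new
>>     ceph osd setcrushmap -i crush.new
>>
>>     # or switch to the latest tunables (expect heavy data movement)
>>     ceph osd crush tunables optimal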
>>
>>
>>>
>>> I also found out that my OSD logs are full of such records:
>>> 2015-11-26 08:31:19.273268 7fe4f49b1700  0 cephx: verify_authorizer
>>> could not get service secret for service osd secret_id=2924
>>> 2015-11-26 08:31:19.273276 7fe4f49b1700  0 --
>>> 192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x41fd1000
>>> sd=79 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee1a520).accept: got bad
>>> authorizer
>>> 2015-11-26 08:31:24.273207 7fe4f49b1700  0 auth: could not find
>>> secret_id=2924
>>> 2015-11-26 08:31:24.273225 7fe4f49b1700  0 cephx: verify_authorizer
>>> could not get service secret for service osd secret_id=2924
>>> 2015-11-26 08:31:24.273231 7fe4f49b1700  0 --
>>> 192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x3f90b000
>>> sd=79 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee1a3c0).accept: got bad
>>> authorizer
>>> 2015-11-26 08:31:29.273199 7fe4f49b1700  0 auth: could not find
>>> secret_id=2924
>>> 2015-11-26 08:31:29.273215 7fe4f49b1700  0 cephx: verify_authorizer
>>> could not get service secret for service osd secret_id=2924
>>> 2015-11-26 08:31:29.273222 7fe4f49b1700  0 --
>>> 192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x41fd1000
>>> sd=79 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee1a260).accept: got bad
>>> authorizer
>>> 2015-11-26 08:31:34.273469 7fe4f49b1700  0 auth: could not find
>>> secret_id=2924
>>> 2015-11-26 08:31:34.273482 7fe4f49b1700  0 cephx: verify_authorizer
>>> could not get service secret for service osd secret_id=2924
>>> 2015-11-26 08:31:34.273486 7fe4f49b1700  0 --
>>> 192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x3f90b000
>>> sd=79 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee1a100).accept: got bad
>>> authorizer
>>> 2015-11-26 08:31:39.273310 7fe4f49b1700  0 auth: could not find
>>> secret_id=2924
>>> 2015-11-26 08:31:39.273331 7fe4f49b1700  0 cephx: verify_authorizer
>>> could not get service secret for service osd secret_id=2924
>>> 2015-11-26 08:31:39.273342 7fe4f49b1700  0 --
>>> 192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x41fcc000
>>> sd=98 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee19fa0).accept: got bad
>>> authorizer
>>> 2015-11-26 08:31:44.273753 7fe4f49b1700  0 auth: could not find
>>> secret_id=2924
>>> 2015-11-26 08:31:44.273769 7fe4f49b1700  0 cephx: verify_authorizer
>>> could not get service secret for service osd secret_id=2924
>>> 2015-11-26 08:31:44.273776 7fe4f49b1700  0 --
>>> 192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x41fcc000
>>> sd=98 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee189a0).accept: got bad
>>> authorizer
>>> 2015-11-26 08:31:49.273412 7fe4f49b1700  0 auth: could not find
>>> secret_id=2924
>>> 2015-11-26 08:31:49.273431 7fe4f49b1700  0 cephx: verify_authorizer
>>> could not get service secret for service osd secret_id=2924
>>> 2015-11-26 08:31:49.273455 7fe4f49b1700  0 --
>>> 192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x41fd1000
>>> sd=98 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee19080).accept: got bad
>>> authorizer
>>> 2015-11-26 08:31:54.273293 7fe4f49b1700  0 auth: could not find
>>> secret_id=2924
>>>
>>> What does it mean? Google says it might be a time sync issue, but my
>>> clocks are perfectly synchronized...
>>
>>
>> Normally you get a warning in "ceph status" if the time is out of sync.
>> Nevertheless, you can try restarting the OSDs. I had timing issues in the
>> past and discovered that it sometimes helps to restart the daemons *after*
>> syncing the clocks, before they accept the new time. This was mostly the
>> case with monitors, though.
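>>
>> Something along these lines (a sketch only; the sysvinit command assumes a
>> default Hammer install, so adjust to your init system, and osd.1 is just an
>> example id):
>>
>>     # look for clock-skew warnings and check the NTP peers on each node
>>     ceph health detail
>>     ntpq -p
>>     # restart an OSD after the clocks are back in sync
>>     /etc/init.d/ceph restart osd.1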
>>
>>
>>
>> Regards,
>>
>>
>> Mart
>>
>>
>>
>>
>>>
>>> 2015-11-26 21:05 GMT+08:00 Irek Fasikhov <malmyzh@xxxxxxxxx>:
>>> > Hi.
>>> > Vasiliy, yes, it is a problem with the crushmap. Look at the weight:
>>> > " -3 14.56000     host slpeah001
>>> >  -2 14.56000     host slpeah002
>>> >  "
>>> >
>>> > With best regards, Irek Fasikhov
>>> > Mob.: +79229045757
>>> >
>>> > 2015-11-26 13:16 GMT+03:00 CIT RT - Kamil Kuramshin
>>> > <Kamil.Kuramshin@xxxxxxxx>:
>>> >>
>>> >> It seems that you played around with the crushmap and did something
>>> >> wrong.
>>> >> Compare the output of 'ceph osd tree' with the crushmap. Some 'osd'
>>> >> devices have been renamed to 'device'; I think that is your problem.
>>> >>
>>> >> Sent from a mobile device.
>>> >>
>>> >>
>>> >> -----Original Message-----
>>> >> From: Vasiliy Angapov <angapov@xxxxxxxxx>
>>> >> To: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>>> >> Sent: Thu, 26 Nov 2015 7:53
>>> >> Subject:  Undersized pgs problem
>>> >>
>>> >> Hi, colleagues!
>>> >>
>>> >> I have a small 4-node Ceph cluster (0.94.2); all pools have size 3 and
>>> >> min_size 1.
>>> >> Last night one host failed, and the cluster was unable to rebalance,
>>> >> reporting a lot of undersized PGs.
>>> >>
>>> >> root@slpeah002:[~]:# ceph -s
>>> >>     cluster 78eef61a-3e9c-447c-a3ec-ce84c617d728
>>> >>      health HEALTH_WARN
>>> >>             1486 pgs degraded
>>> >>             1486 pgs stuck degraded
>>> >>             2257 pgs stuck unclean
>>> >>             1486 pgs stuck undersized
>>> >>             1486 pgs undersized
>>> >>             recovery 80429/555185 objects degraded (14.487%)
>>> >>             recovery 40079/555185 objects misplaced (7.219%)
>>> >>             4/20 in osds are down
>>> >>             1 mons down, quorum 1,2 slpeah002,slpeah007
>>> >>      monmap e7: 3 mons at
>>> >>
>>> >>
>>> >> {slpeah001=192.168.254.11:6780/0,slpeah002=192.168.254.12:6780/0,slpeah007=172.31.252.46:6789/0}
>>> >>             election epoch 710, quorum 1,2 slpeah002,slpeah007
>>> >>      osdmap e14062: 20 osds: 16 up, 20 in; 771 remapped pgs
>>> >>       pgmap v7021316: 4160 pgs, 5 pools, 1045 GB data, 180 kobjects
>>> >>             3366 GB used, 93471 GB / 96838 GB avail
>>> >>             80429/555185 objects degraded (14.487%)
>>> >>             40079/555185 objects misplaced (7.219%)
>>> >>                 1903 active+clean
>>> >>                 1486 active+undersized+degraded
>>> >>                  771 active+remapped
>>> >>   client io 0 B/s rd, 246 kB/s wr, 67 op/s
>>> >>
>>> >>   root@slpeah002:[~]:# ceph osd tree
>>> >> ID  WEIGHT   TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>> >>  -1 94.63998 root default
>>> >>  -9 32.75999     host slpeah007
>>> >>  72  5.45999         osd.72          up  1.00000          1.00000
>>> >>  73  5.45999         osd.73          up  1.00000          1.00000
>>> >>  74  5.45999         osd.74          up  1.00000          1.00000
>>> >>  75  5.45999         osd.75          up  1.00000          1.00000
>>> >>  76  5.45999         osd.76          up  1.00000          1.00000
>>> >>  77  5.45999         osd.77          up  1.00000          1.00000
>>> >> -10 32.75999     host slpeah008
>>> >>  78  5.45999         osd.78          up  1.00000          1.00000
>>> >>  79  5.45999         osd.79          up  1.00000          1.00000
>>> >>  80  5.45999         osd.80          up  1.00000          1.00000
>>> >>  81  5.45999         osd.81          up  1.00000          1.00000
>>> >>  82  5.45999         osd.82          up  1.00000          1.00000
>>> >>  83  5.45999         osd.83          up  1.00000          1.00000
>>> >>  -3 14.56000     host slpeah001
>>> >>   1  3.64000          osd.1         down  1.00000          1.00000
>>> >>  33  3.64000         osd.33        down  1.00000          1.00000
>>> >>  34  3.64000         osd.34        down  1.00000          1.00000
>>> >>  35  3.64000         osd.35        down  1.00000          1.00000
>>> >>  -2 14.56000     host slpeah002
>>> >>   0  3.64000         osd.0           up  1.00000          1.00000
>>> >>  36  3.64000         osd.36          up  1.00000          1.00000
>>> >>  37  3.64000         osd.37          up  1.00000          1.00000
>>> >>  38  3.64000         osd.38          up  1.00000          1.00000
>>> >>
>>> >> Crushmap:
>>> >>
>>> >>  # begin crush map
>>> >> tunable choose_local_tries 0
>>> >> tunable choose_local_fallback_tries 0
>>> >> tunable choose_total_tries 50
>>> >> tunable chooseleaf_descend_once 1
>>> >> tunable chooseleaf_vary_r 1
>>> >> tunable straw_calc_version 1
>>> >> tunable allowed_bucket_algs 54
>>> >>
>>> >> # devices
>>> >> device 0 osd.0
>>> >> device 1 osd.1
>>> >> device 2 device2
>>> >> device 3 device3
>>> >> device 4 device4
>>> >> device 5 device5
>>> >> device 6 device6
>>> >> device 7 device7
>>> >> device 8 device8
>>> >> device 9 device9
>>> >> device 10 device10
>>> >> device 11 device11
>>> >> device 12 device12
>>> >> device 13 device13
>>> >> device 14 device14
>>> >> device 15 device15
>>> >> device 16 device16
>>> >> device 17 device17
>>> >> device 18 device18
>>> >> device 19 device19
>>> >> device 20 device20
>>> >> device 21 device21
>>> >> device 22 device22
>>> >> device 23 device23
>>> >> device 24 device24
>>> >> device 25 device25
>>> >> device 26 device26
>>> >> device 27 device27
>>> >> device 28 device28
>>> >> device 29 device29
>>> >> device 30 device30
>>> >> device 31 device31
>>> >> device 32 device32
>>> >> device 33 osd.33
>>> >> device 34 osd.34
>>> >> device 35 osd.35
>>> >> device 36 osd.36
>>> >> device 37 osd.37
>>> >> device 38 osd.38
>>> >> device 39 device39
>>> >> device 40 device40
>>> >> device 41 device41
>>> >> device 42 device42
>>> >> device 43 device43
>>> >> device 44 device44
>>> >> device 45 device45
>>> >> device 46 device46
>>> >> device 47 device47
>>> >> device 48 device48
>>> >> device 49 device49
>>> >> device 50 device50
>>> >> device 51 device51
>>> >> device 52 device52
>>> >> device 53 device53
>>> >> device 54 device54
>>> >> device 55 device55
>>> >> device 56 device56
>>> >> device 57 device57
>>> >> device 58 device58
>>> >> device 59 device59
>>> >> device 60 device60
>>> >> device 61 device61
>>> >> device 62 device62
>>> >> device 63 device63
>>> >> device 64 device64
>>> >> device 65 device65
>>> >> device 66 device66
>>> >> device 67 device67
>>> >> device 68 device68
>>> >> device 69 device69
>>> >> device 70 device70
>>> >> device 71 device71
>>> >> device 72 osd.72
>>> >> device 73 osd.73
>>> >> device 74 osd.74
>>> >> device 75 osd.75
>>> >> device 76 osd.76
>>> >> device 77 osd.77
>>> >> device 78 osd.78
>>> >> device 79 osd.79
>>> >> device 80 osd.80
>>> >> device 81 osd.81
>>> >> device 82 osd.82
>>> >> device 83 osd.83
>>> >>
>>> >> # types
>>> >> type 0 osd
>>> >> type 1 host
>>> >> type 2 chassis
>>> >> type 3 rack
>>> >> type 4 row
>>> >> type 5 pdu
>>> >> type 6 pod
>>> >> type 7 room
>>> >> type 8 datacenter
>>> >> type 9 region
>>> >> type 10 root
>>> >>
>>> >> # buckets
>>> >> host slpeah007 {
>>> >>         id -9           # do not change unnecessarily
>>> >>         # weight 32.760
>>> >>         alg straw
>>> >>         hash 0  # rjenkins1
>>> >>         item osd.72 weight 5.460
>>> >>         item osd.73 weight 5.460
>>> >>         item osd.74 weight 5.460
>>> >>         item osd.75 weight 5.460
>>> >>         item osd.76 weight 5.460
>>> >>         item osd.77 weight 5.460
>>> >> }
>>> >> host slpeah008 {
>>> >>         id -10          # do not change unnecessarily
>>> >>         # weight 32.760
>>> >>         alg straw
>>> >>         hash 0  # rjenkins1
>>> >>         item osd.78 weight 5.460
>>> >>         item osd.79 weight 5.460
>>> >>         item osd.80 weight 5.460
>>> >>         item osd.81 weight 5.460
>>> >>         item osd.82 weight 5.460
>>> >>         item osd.83 weight 5.460
>>> >> }
>>> >> host slpeah001 {
>>> >>         id -3           # do not change unnecessarily
>>> >>         # weight 14.560
>>> >>         alg straw
>>> >>         hash 0  # rjenkins1
>>> >>         item osd.1 weight 3.640
>>> >>         item osd.33 weight 3.640
>>> >>         item osd.34 weight 3.640
>>> >>         item osd.35 weight 3.640
>>> >> }
>>> >> host slpeah002 {
>>> >>         id -2           # do not change unnecessarily
>>> >>         # weight 14.560
>>> >>         alg straw
>>> >>         hash 0  # rjenkins1
>>> >>         item osd.0 weight 3.640
>>> >>         item osd.36 weight 3.640
>>> >>         item osd.37 weight 3.640
>>> >>         item osd.38 weight 3.640
>>> >> }
>>> >> root default {
>>> >>         id -1           # do not change unnecessarily
>>> >>         # weight 94.640
>>> >>         alg straw
>>> >>         hash 0  # rjenkins1
>>> >>         item slpeah007 weight 32.760
>>> >>         item slpeah008 weight 32.760
>>> >>         item slpeah001 weight 14.560
>>> >>         item slpeah002 weight 14.560
>>> >> }
>>> >>
>>> >> # rules
>>> >> rule default {
>>> >>         ruleset 0
>>> >>         type replicated
>>> >>         min_size 1
>>> >>         max_size 10
>>> >>         step take default
>>> >>         step chooseleaf firstn 0 type host
>>> >>         step emit
>>> >> }
>>> >>
>>> >> # end crush map
>>> >>
>>> >>
>>> >>
>>> >> This is odd, because the pools have size 3 and I have 3 hosts alive, so
>>> >> why is it saying that undersized PGs are present? It makes me feel like
>>> >> CRUSH is not working properly.
>>> >> There is not much data in the cluster currently, about 3 TB, and as you
>>> >> can see from the osd tree, each host has at least 14 TB of disk space on
>>> >> its OSDs.
>>> >> So I'm a bit stuck now...
>>> >> How can I find the source of the trouble?
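>>> >>
>>> >> For what it's worth, this is roughly what I have been looking at so far
>>> >> (the PG id below is only an example):
>>> >>
>>> >>     ceph pg dump_stuck unclean
>>> >>     ceph pg 5.3f query
>>> >>     ceph osd crush show-tunables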
>>> >>
>>> >> Thanks in advance!
>>> >
>>
>>
>>
>>
>>
>> --
>> Mart van Santen
>> Greenhost
>> E: mart@xxxxxxxxxxxx
>> T: +31 20 4890444
>> W: https://greenhost.nl
>>
>> A PGP signature can be attached to this e-mail,
>> you need PGP software to verify it.
>> My public key is available in keyserver(s)
>> see: http://tinyurl.com/openpgp-manual
>>
>> PGP Fingerprint: CA85 EB11 2B70 042D AF66  B29A 6437 01A1 10A3 D3A5
>>
>>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



