Btw, in my configuration "mon osd downout subtree limit" is set to "host". Does it influence things?

2015-11-29 14:38 GMT+08:00 Vasiliy Angapov <angapov@xxxxxxxxx>:
> Bob,
> Thanks for the explanation, sounds reasonable! But how could it happen that the
> host is down while its OSDs are still IN the cluster?
> I mean, the NOOUT flag is not set and my timeouts are all at their defaults...
>
> But if I remember correctly the host was not completely down: it was
> pingable, but no other services were reachable, like SSH or anything else.
> Is it possible that the OSDs were still sending some information to the
> monitors, making them look IN?
>
> 2015-11-29 2:10 GMT+08:00 Bob R <bobr@xxxxxxxxxxxxxx>:
>> Vasiliy,
>>
>> Your OSDs are marked as 'down' but 'in'.
>>
>> "Ceph OSDs have two known states that can be combined. Up and Down only
>> tells you whether the OSD is actively involved in the cluster. OSD states
>> also are expressed in terms of cluster replication: In and Out. Only when a
>> Ceph OSD is tagged as Out does the self-healing process occur"
>>
>> Bob
>>
>> On Fri, Nov 27, 2015 at 6:15 AM, Mart van Santen <mart@xxxxxxxxxxxx> wrote:
>>>
>>> Dear Vasiliy,
>>>
>>> On 11/27/2015 02:00 PM, Irek Fasikhov wrote:
>>>
>>> Is your time synchronized?
>>>
>>> Best regards, Irek Fasikhov
>>> Mob.: +79229045757
>>>
>>> 2015-11-27 15:57 GMT+03:00 Vasiliy Angapov <angapov@xxxxxxxxx>:
>>>>
>>>> > It seems that you played around with the crushmap and did something
>>>> > wrong.
>>>> > Compare the look of 'ceph osd tree' and the crushmap. There are some 'osd'
>>>> > devices renamed to 'device'; I think that is where your problem is.
>>>> Is this actually a mistake? What I did is remove a bunch of OSDs from
>>>> my cluster, that's why the numbering is sparse. But is it an issue to
>>>> have sparse numbering of OSDs?
>>>
>>> I think this is normal and should be no problem. I had this previously as well.
>>>
>>>> > Hi.
>>>> > Vasiliy, yes, it is a problem with the crushmap. Look at the weights:
>>>> > -3 14.56000     host slpeah001
>>>> > -2 14.56000     host slpeah002
>>>> What exactly is wrong here?
>>>
>>> I do not know how the weights of the hosts contribute to determining where to
>>> store the third copy of a PG. As you explained, you have enough space on all
>>> hosts, but if the weights of the hosts do not add up, CRUSH may come to the
>>> conclusion that it is not able to place the PGs. What you can try is to
>>> artificially raise the weights of these hosts, to see if it starts mapping the
>>> third copies of the PGs onto the available hosts.
>>>
>>> I had a similar problem in the past; it was solved by upgrading to the latest
>>> crush tunables. But be aware that this can cause massive data movement.
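>>>
>>> (For reference, a rough sketch of both suggestions, not tested against your
>>> cluster; the file names and the "optimal" tunables profile are only examples:)
>>>
>>> ceph osd getcrushmap -o /tmp/crush.bin          # export the current map
>>> crushtool -d /tmp/crush.bin -o /tmp/crush.txt   # decompile to editable text
>>> # edit the host weights in /tmp/crush.txt, then recompile and inject it:
>>> crushtool -c /tmp/crush.txt -o /tmp/crush.new
>>> ceph osd setcrushmap -i /tmp/crush.new
>>> # or switch to newer tunables (expect data movement):
>>> ceph osd crush tunables optimal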
>>>>
>>>> I also found out that my OSD logs are full of such records:
>>>>
>>>> 2015-11-26 08:31:19.273268 7fe4f49b1700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=2924
>>>> 2015-11-26 08:31:19.273276 7fe4f49b1700 0 -- 192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x41fd1000 sd=79 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee1a520).accept: got bad authorizer
>>>> 2015-11-26 08:31:24.273207 7fe4f49b1700 0 auth: could not find secret_id=2924
>>>> 2015-11-26 08:31:24.273225 7fe4f49b1700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=2924
>>>> 2015-11-26 08:31:24.273231 7fe4f49b1700 0 -- 192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x3f90b000 sd=79 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee1a3c0).accept: got bad authorizer
>>>> 2015-11-26 08:31:29.273199 7fe4f49b1700 0 auth: could not find secret_id=2924
>>>> 2015-11-26 08:31:29.273215 7fe4f49b1700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=2924
>>>> 2015-11-26 08:31:29.273222 7fe4f49b1700 0 -- 192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x41fd1000 sd=79 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee1a260).accept: got bad authorizer
>>>> 2015-11-26 08:31:34.273469 7fe4f49b1700 0 auth: could not find secret_id=2924
>>>> 2015-11-26 08:31:34.273482 7fe4f49b1700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=2924
>>>> 2015-11-26 08:31:34.273486 7fe4f49b1700 0 -- 192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x3f90b000 sd=79 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee1a100).accept: got bad authorizer
>>>> 2015-11-26 08:31:39.273310 7fe4f49b1700 0 auth: could not find secret_id=2924
>>>> 2015-11-26 08:31:39.273331 7fe4f49b1700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=2924
>>>> 2015-11-26 08:31:39.273342 7fe4f49b1700 0 -- 192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x41fcc000 sd=98 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee19fa0).accept: got bad authorizer
>>>> 2015-11-26 08:31:44.273753 7fe4f49b1700 0 auth: could not find secret_id=2924
>>>> 2015-11-26 08:31:44.273769 7fe4f49b1700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=2924
>>>> 2015-11-26 08:31:44.273776 7fe4f49b1700 0 -- 192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x41fcc000 sd=98 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee189a0).accept: got bad authorizer
>>>> 2015-11-26 08:31:49.273412 7fe4f49b1700 0 auth: could not find secret_id=2924
>>>> 2015-11-26 08:31:49.273431 7fe4f49b1700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=2924
>>>> 2015-11-26 08:31:49.273455 7fe4f49b1700 0 -- 192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x41fd1000 sd=98 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee19080).accept: got bad authorizer
>>>> 2015-11-26 08:31:54.273293 7fe4f49b1700 0 auth: could not find secret_id=2924
>>>>
>>>> What does it mean? Google says it might be a time sync issue, but my
>>>> clocks are perfectly synchronized...
>>>
>>> Normally you get a warning in "ceph status" if the time is out of sync.
>>> Nevertheless, you can try to restart the OSDs. I had timing issues in the
>>> past and discovered that it sometimes helps to restart the daemons *after*
>>> syncing the time, since otherwise they did not accept the new time. But this
>>> was mostly the case with monitors, though.
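>>>
>>> (For example; the exact service names depend on your distribution and init
>>> system, and osd.36 below is only an illustration:)
>>>
>>> ntpq -p                              # verify every node really has sane NTP peers
>>> ceph health detail | grep -i clock   # monitors report clock skew here
>>> service ceph restart osd.36          # sysvinit; with systemd: systemctl restart ceph-osd@36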
>>>
>>> Regards,
>>>
>>> Mart
>>>
>>>>
>>>> 2015-11-26 21:05 GMT+08:00 Irek Fasikhov <malmyzh@xxxxxxxxx>:
>>>> > Hi.
>>>> > Vasiliy, yes, it is a problem with the crushmap. Look at the weights:
>>>> > " -3 14.56000     host slpeah001
>>>> >   -2 14.56000     host slpeah002 "
>>>> >
>>>> > Best regards, Irek Fasikhov
>>>> > Mob.: +79229045757
>>>> >
>>>> > 2015-11-26 13:16 GMT+03:00 ЦИТ РТ-Курамшин Камиль Фидаилевич
>>>> > <Kamil.Kuramshin@xxxxxxxx>:
>>>> >>
>>>> >> It seems that you played around with the crushmap and did something
>>>> >> wrong.
>>>> >> Compare the look of 'ceph osd tree' and the crushmap. There are some 'osd'
>>>> >> devices renamed to 'device'; I think that is where your problem is.
>>>> >>
>>>> >> Sent from a mobile device.
>>>> >>
>>>> >> -----Original Message-----
>>>> >> From: Vasiliy Angapov <angapov@xxxxxxxxx>
>>>> >> To: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>>>> >> Sent: Thu, 26 Nov 2015 7:53
>>>> >> Subject: Undersized pgs problem
>>>> >>
>>>> >> Hi, colleagues!
>>>> >>
>>>> >> I have a small 4-node Ceph cluster (0.94.2); all pools have size 3,
>>>> >> min_size 1.
>>>> >> Tonight one host failed and the cluster was unable to rebalance, saying
>>>> >> there are a lot of undersized PGs.
>>>> >>
>>>> >> root@slpeah002:[~]:# ceph -s
>>>> >>     cluster 78eef61a-3e9c-447c-a3ec-ce84c617d728
>>>> >>      health HEALTH_WARN
>>>> >>             1486 pgs degraded
>>>> >>             1486 pgs stuck degraded
>>>> >>             2257 pgs stuck unclean
>>>> >>             1486 pgs stuck undersized
>>>> >>             1486 pgs undersized
>>>> >>             recovery 80429/555185 objects degraded (14.487%)
>>>> >>             recovery 40079/555185 objects misplaced (7.219%)
>>>> >>             4/20 in osds are down
>>>> >>             1 mons down, quorum 1,2 slpeah002,slpeah007
>>>> >>      monmap e7: 3 mons at {slpeah001=192.168.254.11:6780/0,slpeah002=192.168.254.12:6780/0,slpeah007=172.31.252.46:6789/0}
>>>> >>             election epoch 710, quorum 1,2 slpeah002,slpeah007
>>>> >>      osdmap e14062: 20 osds: 16 up, 20 in; 771 remapped pgs
>>>> >>       pgmap v7021316: 4160 pgs, 5 pools, 1045 GB data, 180 kobjects
>>>> >>             3366 GB used, 93471 GB / 96838 GB avail
>>>> >>             80429/555185 objects degraded (14.487%)
>>>> >>             40079/555185 objects misplaced (7.219%)
>>>> >>                 1903 active+clean
>>>> >>                 1486 active+undersized+degraded
>>>> >>                  771 active+remapped
>>>> >>   client io 0 B/s rd, 246 kB/s wr, 67 op/s
>>>> >>
>>>> >> root@slpeah002:[~]:# ceph osd tree
>>>> >> ID  WEIGHT   TYPE NAME           UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>>> >>  -1 94.63998 root default
>>>> >>  -9 32.75999     host slpeah007
>>>> >>  72  5.45999         osd.72           up  1.00000          1.00000
>>>> >>  73  5.45999         osd.73           up  1.00000          1.00000
>>>> >>  74  5.45999         osd.74           up  1.00000          1.00000
>>>> >>  75  5.45999         osd.75           up  1.00000          1.00000
>>>> >>  76  5.45999         osd.76           up  1.00000          1.00000
>>>> >>  77  5.45999         osd.77           up  1.00000          1.00000
>>>> >> -10 32.75999     host slpeah008
>>>> >>  78  5.45999         osd.78           up  1.00000          1.00000
>>>> >>  79  5.45999         osd.79           up  1.00000          1.00000
>>>> >>  80  5.45999         osd.80           up  1.00000          1.00000
>>>> >>  81  5.45999         osd.81           up  1.00000          1.00000
>>>> >>  82  5.45999         osd.82           up  1.00000          1.00000
>>>> >>  83  5.45999         osd.83           up  1.00000          1.00000
>>>> >>  -3 14.56000     host slpeah001
>>>> >>   1  3.64000         osd.1          down  1.00000          1.00000
>>>> >>  33  3.64000         osd.33         down  1.00000          1.00000
>>>> >>  34  3.64000         osd.34         down  1.00000          1.00000
>>>> >>  35  3.64000         osd.35         down  1.00000          1.00000
>>>> >>  -2 14.56000     host slpeah002
>>>> >>   0  3.64000         osd.0            up  1.00000          1.00000
>>>> >>  36  3.64000         osd.36           up  1.00000          1.00000
>>>> >>  37  3.64000         osd.37           up  1.00000          1.00000
>>>> >>  38  3.64000         osd.38           up  1.00000          1.00000
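>>>> >>
>>>> >> (For completeness, the stuck PGs and the OSDs CRUSH picked for them can be
>>>> >> listed as follows; the PG id passed to "ceph pg map" is just a placeholder
>>>> >> to be taken from the previous output:)
>>>> >>
>>>> >> ceph health detail | grep undersized | head
>>>> >> ceph pg dump_stuck unclean | head
>>>> >> ceph pg map <pgid>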
>>>> >>
>>>> >> Crushmap:
>>>> >>
>>>> >> # begin crush map
>>>> >> tunable choose_local_tries 0
>>>> >> tunable choose_local_fallback_tries 0
>>>> >> tunable choose_total_tries 50
>>>> >> tunable chooseleaf_descend_once 1
>>>> >> tunable chooseleaf_vary_r 1
>>>> >> tunable straw_calc_version 1
>>>> >> tunable allowed_bucket_algs 54
>>>> >>
>>>> >> # devices
>>>> >> device 0 osd.0
>>>> >> device 1 osd.1
>>>> >> device 2 device2
>>>> >> device 3 device3
>>>> >> device 4 device4
>>>> >> device 5 device5
>>>> >> device 6 device6
>>>> >> device 7 device7
>>>> >> device 8 device8
>>>> >> device 9 device9
>>>> >> device 10 device10
>>>> >> device 11 device11
>>>> >> device 12 device12
>>>> >> device 13 device13
>>>> >> device 14 device14
>>>> >> device 15 device15
>>>> >> device 16 device16
>>>> >> device 17 device17
>>>> >> device 18 device18
>>>> >> device 19 device19
>>>> >> device 20 device20
>>>> >> device 21 device21
>>>> >> device 22 device22
>>>> >> device 23 device23
>>>> >> device 24 device24
>>>> >> device 25 device25
>>>> >> device 26 device26
>>>> >> device 27 device27
>>>> >> device 28 device28
>>>> >> device 29 device29
>>>> >> device 30 device30
>>>> >> device 31 device31
>>>> >> device 32 device32
>>>> >> device 33 osd.33
>>>> >> device 34 osd.34
>>>> >> device 35 osd.35
>>>> >> device 36 osd.36
>>>> >> device 37 osd.37
>>>> >> device 38 osd.38
>>>> >> device 39 device39
>>>> >> device 40 device40
>>>> >> device 41 device41
>>>> >> device 42 device42
>>>> >> device 43 device43
>>>> >> device 44 device44
>>>> >> device 45 device45
>>>> >> device 46 device46
>>>> >> device 47 device47
>>>> >> device 48 device48
>>>> >> device 49 device49
>>>> >> device 50 device50
>>>> >> device 51 device51
>>>> >> device 52 device52
>>>> >> device 53 device53
>>>> >> device 54 device54
>>>> >> device 55 device55
>>>> >> device 56 device56
>>>> >> device 57 device57
>>>> >> device 58 device58
>>>> >> device 59 device59
>>>> >> device 60 device60
>>>> >> device 61 device61
>>>> >> device 62 device62
>>>> >> device 63 device63
>>>> >> device 64 device64
>>>> >> device 65 device65
>>>> >> device 66 device66
>>>> >> device 67 device67
>>>> >> device 68 device68
>>>> >> device 69 device69
>>>> >> device 70 device70
>>>> >> device 71 device71
>>>> >> device 72 osd.72
>>>> >> device 73 osd.73
>>>> >> device 74 osd.74
>>>> >> device 75 osd.75
>>>> >> device 76 osd.76
>>>> >> device 77 osd.77
>>>> >> device 78 osd.78
>>>> >> device 79 osd.79
>>>> >> device 80 osd.80
>>>> >> device 81 osd.81
>>>> >> device 82 osd.82
>>>> >> device 83 osd.83
>>>> >>
>>>> >> # types
>>>> >> type 0 osd
>>>> >> type 1 host
>>>> >> type 2 chassis
>>>> >> type 3 rack
>>>> >> type 4 row
>>>> >> type 5 pdu
>>>> >> type 6 pod
>>>> >> type 7 room
>>>> >> type 8 datacenter
>>>> >> type 9 region
>>>> >> type 10 root
>>>> >>
>>>> >> # buckets
>>>> >> host slpeah007 {
>>>> >>     id -9       # do not change unnecessarily
>>>> >>     # weight 32.760
>>>> >>     alg straw
>>>> >>     hash 0      # rjenkins1
>>>> >>     item osd.72 weight 5.460
>>>> >>     item osd.73 weight 5.460
>>>> >>     item osd.74 weight 5.460
>>>> >>     item osd.75 weight 5.460
>>>> >>     item osd.76 weight 5.460
>>>> >>     item osd.77 weight 5.460
>>>> >> }
>>>> >> host slpeah008 {
>>>> >>     id -10      # do not change unnecessarily
>>>> >>     # weight 32.760
>>>> >>     alg straw
>>>> >>     hash 0      # rjenkins1
>>>> >>     item osd.78 weight 5.460
>>>> >>     item osd.79 weight 5.460
>>>> >>     item osd.80 weight 5.460
>>>> >>     item osd.81 weight 5.460
>>>> >>     item osd.82 weight 5.460
>>>> >>     item osd.83 weight 5.460
>>>> >> }
>>>> >> host slpeah001 {
>>>> >>     id -3       # do not change unnecessarily
>>>> >>     # weight 14.560
>>>> >>     alg straw
>>>> >>     hash 0      # rjenkins1
>>>> >>     item osd.1 weight 3.640
>>>> >>     item osd.33 weight 3.640
>>>> >>     item osd.34 weight 3.640
>>>> >>     item osd.35 weight 3.640
>>>> >> }
>>>> >> host slpeah002 {
>>>> >>     id -2       # do not change unnecessarily
>>>> >>     # weight 14.560
>>>> >>     alg straw
>>>> >>     hash 0      # rjenkins1
>>>> >>     item osd.0 weight 3.640
>>>> >>     item osd.36 weight 3.640
>>>> >>     item osd.37 weight 3.640
>>>> >>     item osd.38 weight 3.640
>>>> >> }
>>>> >> root default {
>>>> >>     id -1       # do not change unnecessarily
>>>> >>     # weight 94.640
>>>> >>     alg straw
>>>> >>     hash 0      # rjenkins1
>>>> >>     item slpeah007 weight 32.760
>>>> >>     item slpeah008 weight 32.760
>>>> >>     item slpeah001 weight 14.560
>>>> >>     item slpeah002 weight 14.560
>>>> >> }
>>>> >>
>>>> >> # rules
>>>> >> rule default {
>>>> >>     ruleset 0
>>>> >>     type replicated
>>>> >>     min_size 1
>>>> >>     max_size 10
>>>> >>     step take default
>>>> >>     step chooseleaf firstn 0 type host
>>>> >>     step emit
>>>> >> }
>>>> >>
>>>> >> # end crush map
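>>>> >>
>>>> >> (For reference, whether a map can actually place three replicas across
>>>> >> distinct hosts can be checked offline with crushtool; the file names below
>>>> >> are only examples and the rule/replica counts match the pools above:)
>>>> >>
>>>> >> ceph osd getcrushmap -o /tmp/cm.bin
>>>> >> crushtool -i /tmp/cm.bin --test --rule 0 --num-rep 3 --show-bad-mappings
>>>> >> # every line printed by --show-bad-mappings is an input that could not be
>>>> >> # mapped to 3 distinct hosts with the current map and tunables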
>>>> >>
>>>> >> This is odd, because the pools have size 3 and I have 3 hosts alive, so why
>>>> >> is it saying that undersized PGs are present? It makes me feel like CRUSH
>>>> >> is not working properly.
>>>> >> There is not much data in the cluster currently, about 3 TB, and as you can
>>>> >> see from the osd tree, each host has a minimum of 14 TB of disk space on
>>>> >> its OSDs.
>>>> >> So I'm a bit stuck now...
>>>> >> How can I find the source of the trouble?
>>>> >>
>>>> >> Thanks in advance!
>>>
>>> --
>>> Mart van Santen
>>> Greenhost
>>> E: mart@xxxxxxxxxxxx
>>> T: +31 20 4890444
>>> W: https://greenhost.nl
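
P.S. The value actually in effect can be confirmed on one of the monitor hosts
via the admin socket; the monitor name below is just an example taken from the
quorum list above:

ceph daemon mon.slpeah002 config get mon_osd_downout_subtree_limit

As far as I understand the config reference, this option is the smallest CRUSH
unit type that Ceph will *not* automatically mark out, so with it set to "host"
the OSDs of an entire failed host would be expected to stay "in" until someone
marks them out by hand (e.g. "ceph osd out <id>").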