Do you have the time synchronized?
Best regards, Фасихов Ирек Нургаязович
Mobile: +79229045757
2015-11-27 15:57 GMT+03:00 Vasiliy Angapov <angapov@xxxxxxxxx>:
> It seems that you played around with the crushmap and did something wrong.
> Compare the output of 'ceph osd tree' with the crushmap. Some 'osd' devices are renamed to 'device'; I think that is your problem.
Is this actually a mistake? What I did was remove a bunch of OSDs from
my cluster, which is why the numbering is sparse. But is it a problem to
have sparse OSD numbering?
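
For reference, a quick way to compare the crushmap with the tree (a rough sketch, assuming the admin keyring is available on the node it is run from):

ceph osd getcrushmap -o crush.bin    # dump the compiled crushmap
crushtool -d crush.bin -o crush.txt  # decompile it to plain text
ceph osd tree                        # compare ids and weights against crush.txt

As far as I understand, the 'deviceNN' lines in the decompiled map are just placeholders for unused ids.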
> Hi.
> Vasiliy, yes, it is a problem with the crushmap. Look at the weights:
> -3 14.56000 host slpeah001
> -2 14.56000 host slpeah002
What exactly is wrong here?
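
One thing I can try is to check whether the current map can still place 3 replicas at all (a sketch, using the crush.bin dumped above; rule 0 is the only rule in the map):

crushtool -i crush.bin --test --rule 0 --num-rep 3 --show-bad-mappings
crushtool -i crush.bin --test --rule 0 --num-rep 3 --show-statistics

If the first command prints bad mappings, CRUSH cannot find three distinct hosts for some PGs with the current weights and tunables.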
I also found that my OSD logs are full of records like these:
2015-11-26 08:31:19.273268 7fe4f49b1700 0 cephx: verify_authorizer
could not get service secret for service osd secret_id=2924
2015-11-26 08:31:19.273276 7fe4f49b1700 0 --
192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x41fd1000
sd=79 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee1a520).accept: got bad
authorizer
2015-11-26 08:31:24.273207 7fe4f49b1700 0 auth: could not find secret_id=2924
2015-11-26 08:31:24.273225 7fe4f49b1700 0 cephx: verify_authorizer
could not get service secret for service osd secret_id=2924
2015-11-26 08:31:24.273231 7fe4f49b1700 0 --
192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x3f90b000
sd=79 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee1a3c0).accept: got bad
authorizer
2015-11-26 08:31:29.273199 7fe4f49b1700 0 auth: could not find secret_id=2924
2015-11-26 08:31:29.273215 7fe4f49b1700 0 cephx: verify_authorizer
could not get service secret for service osd secret_id=2924
2015-11-26 08:31:29.273222 7fe4f49b1700 0 --
192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x41fd1000
sd=79 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee1a260).accept: got bad
authorizer
2015-11-26 08:31:34.273469 7fe4f49b1700 0 auth: could not find secret_id=2924
2015-11-26 08:31:34.273482 7fe4f49b1700 0 cephx: verify_authorizer
could not get service secret for service osd secret_id=2924
2015-11-26 08:31:34.273486 7fe4f49b1700 0 --
192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x3f90b000
sd=79 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee1a100).accept: got bad
authorizer
2015-11-26 08:31:39.273310 7fe4f49b1700 0 auth: could not find secret_id=2924
2015-11-26 08:31:39.273331 7fe4f49b1700 0 cephx: verify_authorizer
could not get service secret for service osd secret_id=2924
2015-11-26 08:31:39.273342 7fe4f49b1700 0 --
192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x41fcc000
sd=98 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee19fa0).accept: got bad
authorizer
2015-11-26 08:31:44.273753 7fe4f49b1700 0 auth: could not find secret_id=2924
2015-11-26 08:31:44.273769 7fe4f49b1700 0 cephx: verify_authorizer
could not get service secret for service osd secret_id=2924
2015-11-26 08:31:44.273776 7fe4f49b1700 0 --
192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x41fcc000
sd=98 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee189a0).accept: got bad
authorizer
2015-11-26 08:31:49.273412 7fe4f49b1700 0 auth: could not find secret_id=2924
2015-11-26 08:31:49.273431 7fe4f49b1700 0 cephx: verify_authorizer
could not get service secret for service osd secret_id=2924
2015-11-26 08:31:49.273455 7fe4f49b1700 0 --
192.168.254.18:6816/110740 >> 192.168.254.12:0/1011754 pipe(0x41fd1000
sd=98 :6816 s=0 pgs=0 cs=0 l=1 c=0x3ee19080).accept: got bad
authorizer
2015-11-26 08:31:54.273293 7fe4f49b1700 0 auth: could not find secret_id=2924
What does it mean? Google says it might be a time sync issue, but my
clocks are perfectly synchronized...
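
For what it's worth, a quick way to double-check, in case I missed something (a sketch; the hostnames are the ones from the osd tree below):

ceph health detail | grep -i skew    # any monitor-reported clock skew
for h in slpeah001 slpeah002 slpeah007 slpeah008; do ssh $h date +%s.%N; done
ntpq -p                              # NTP peer status, run on each node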
2015-11-26 21:05 GMT+08:00 Irek Fasikhov <malmyzh@xxxxxxxxx>:
> Hi.
> Vasiliy, yes, it is a problem with the crushmap. Look at the weights:
> " -3 14.56000 host slpeah001
> -2 14.56000 host slpeah002
> "
>
> Best regards, Фасихов Ирек Нургаязович
> Mobile: +79229045757
>
> 2015-11-26 13:16 GMT+03:00 ЦИТ РТ-Курамшин Камиль Фидаилевич
> <Kamil.Kuramshin@xxxxxxxx>:
>>
>> It seems that you played around with the crushmap and did something wrong.
>> Compare the output of 'ceph osd tree' with the crushmap. Some 'osd'
>> devices are renamed to 'device'; I think that is your problem.
>>
>> Sent from a mobile device.
>>
>>
>> -----Original Message-----
>> From: Vasiliy Angapov <angapov@xxxxxxxxx>
>> To: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>> Sent: Thu, 26 Nov 2015 7:53
>> Subject: Undersized pgs problem
>>
>> Hi, colleagues!
>>
>> I have a small 4-node Ceph cluster (0.94.2); all pools have size 3 and
>> min_size 1.
>> Last night one host failed, and the cluster was unable to rebalance, saying
>> there are a lot of undersized pgs.
>>
>> root@slpeah002:[~]:# ceph -s
>> cluster 78eef61a-3e9c-447c-a3ec-ce84c617d728
>> health HEALTH_WARN
>> 1486 pgs degraded
>> 1486 pgs stuck degraded
>> 2257 pgs stuck unclean
>> 1486 pgs stuck undersized
>> 1486 pgs undersized
>> recovery 80429/555185 objects degraded (14.487%)
>> recovery 40079/555185 objects misplaced (7.219%)
>> 4/20 in osds are down
>> 1 mons down, quorum 1,2 slpeah002,slpeah007
>> monmap e7: 3 mons at
>>
>> {slpeah001=192.168.254.11:6780/0,slpeah002=192.168.254.12:6780/0,slpeah007=172.31.252.46:6789/0}
>> election epoch 710, quorum 1,2 slpeah002,slpeah007
>> osdmap e14062: 20 osds: 16 up, 20 in; 771 remapped pgs
>> pgmap v7021316: 4160 pgs, 5 pools, 1045 GB data, 180 kobjects
>> 3366 GB used, 93471 GB / 96838 GB avail
>> 80429/555185 objects degraded (14.487%)
>> 40079/555185 objects misplaced (7.219%)
>> 1903 active+clean
>> 1486 active+undersized+degraded
>> 771 active+remapped
>> client io 0 B/s rd, 246 kB/s wr, 67 op/s
>>
>> root@slpeah002:[~]:# ceph osd tree
>> ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> -1 94.63998 root default
>> -9 32.75999 host slpeah007
>> 72 5.45999 osd.72 up 1.00000 1.00000
>> 73 5.45999 osd.73 up 1.00000 1.00000
>> 74 5.45999 osd.74 up 1.00000 1.00000
>> 75 5.45999 osd.75 up 1.00000 1.00000
>> 76 5.45999 osd.76 up 1.00000 1.00000
>> 77 5.45999 osd.77 up 1.00000 1.00000
>> -10 32.75999 host slpeah008
>> 78 5.45999 osd.78 up 1.00000 1.00000
>> 79 5.45999 osd.79 up 1.00000 1.00000
>> 80 5.45999 osd.80 up 1.00000 1.00000
>> 81 5.45999 osd.81 up 1.00000 1.00000
>> 82 5.45999 osd.82 up 1.00000 1.00000
>> 83 5.45999 osd.83 up 1.00000 1.00000
>> -3 14.56000 host slpeah001
>> 1 3.64000 osd.1 down 1.00000 1.00000
>> 33 3.64000 osd.33 down 1.00000 1.00000
>> 34 3.64000 osd.34 down 1.00000 1.00000
>> 35 3.64000 osd.35 down 1.00000 1.00000
>> -2 14.56000 host slpeah002
>> 0 3.64000 osd.0 up 1.00000 1.00000
>> 36 3.64000 osd.36 up 1.00000 1.00000
>> 37 3.64000 osd.37 up 1.00000 1.00000
>> 38 3.64000 osd.38 up 1.00000 1.00000
>>
>> Crushmap:
>>
>> # begin crush map
>> tunable choose_local_tries 0
>> tunable choose_local_fallback_tries 0
>> tunable choose_total_tries 50
>> tunable chooseleaf_descend_once 1
>> tunable chooseleaf_vary_r 1
>> tunable straw_calc_version 1
>> tunable allowed_bucket_algs 54
>>
>> # devices
>> device 0 osd.0
>> device 1 osd.1
>> device 2 device2
>> device 3 device3
>> device 4 device4
>> device 5 device5
>> device 6 device6
>> device 7 device7
>> device 8 device8
>> device 9 device9
>> device 10 device10
>> device 11 device11
>> device 12 device12
>> device 13 device13
>> device 14 device14
>> device 15 device15
>> device 16 device16
>> device 17 device17
>> device 18 device18
>> device 19 device19
>> device 20 device20
>> device 21 device21
>> device 22 device22
>> device 23 device23
>> device 24 device24
>> device 25 device25
>> device 26 device26
>> device 27 device27
>> device 28 device28
>> device 29 device29
>> device 30 device30
>> device 31 device31
>> device 32 device32
>> device 33 osd.33
>> device 34 osd.34
>> device 35 osd.35
>> device 36 osd.36
>> device 37 osd.37
>> device 38 osd.38
>> device 39 device39
>> device 40 device40
>> device 41 device41
>> device 42 device42
>> device 43 device43
>> device 44 device44
>> device 45 device45
>> device 46 device46
>> device 47 device47
>> device 48 device48
>> device 49 device49
>> device 50 device50
>> device 51 device51
>> device 52 device52
>> device 53 device53
>> device 54 device54
>> device 55 device55
>> device 56 device56
>> device 57 device57
>> device 58 device58
>> device 59 device59
>> device 60 device60
>> device 61 device61
>> device 62 device62
>> device 63 device63
>> device 64 device64
>> device 65 device65
>> device 66 device66
>> device 67 device67
>> device 68 device68
>> device 69 device69
>> device 70 device70
>> device 71 device71
>> device 72 osd.72
>> device 73 osd.73
>> device 74 osd.74
>> device 75 osd.75
>> device 76 osd.76
>> device 77 osd.77
>> device 78 osd.78
>> device 79 osd.79
>> device 80 osd.80
>> device 81 osd.81
>> device 82 osd.82
>> device 83 osd.83
>>
>> # types
>> type 0 osd
>> type 1 host
>> type 2 chassis
>> type 3 rack
>> type 4 row
>> type 5 pdu
>> type 6 pod
>> type 7 room
>> type 8 datacenter
>> type 9 region
>> type 10 root
>>
>> # buckets
>> host slpeah007 {
>> id -9 # do not change unnecessarily
>> # weight 32.760
>> alg straw
>> hash 0 # rjenkins1
>> item osd.72 weight 5.460
>> item osd.73 weight 5.460
>> item osd.74 weight 5.460
>> item osd.75 weight 5.460
>> item osd.76 weight 5.460
>> item osd.77 weight 5.460
>> }
>> host slpeah008 {
>> id -10 # do not change unnecessarily
>> # weight 32.760
>> alg straw
>> hash 0 # rjenkins1
>> item osd.78 weight 5.460
>> item osd.79 weight 5.460
>> item osd.80 weight 5.460
>> item osd.81 weight 5.460
>> item osd.82 weight 5.460
>> item osd.83 weight 5.460
>> }
>> host slpeah001 {
>> id -3 # do not change unnecessarily
>> # weight 14.560
>> alg straw
>> hash 0 # rjenkins1
>> item osd.1 weight 3.640
>> item osd.33 weight 3.640
>> item osd.34 weight 3.640
>> item osd.35 weight 3.640
>> }
>> host slpeah002 {
>> id -2 # do not change unnecessarily
>> # weight 14.560
>> alg straw
>> hash 0 # rjenkins1
>> item osd.0 weight 3.640
>> item osd.36 weight 3.640
>> item osd.37 weight 3.640
>> item osd.38 weight 3.640
>> }
>> root default {
>> id -1 # do not change unnecessarily
>> # weight 94.640
>> alg straw
>> hash 0 # rjenkins1
>> item slpeah007 weight 32.760
>> item slpeah008 weight 32.760
>> item slpeah001 weight 14.560
>> item slpeah002 weight 14.560
>> }
>>
>> # rules
>> rule default {
>> ruleset 0
>> type replicated
>> min_size 1
>> max_size 10
>> step take default
>> step chooseleaf firstn 0 type host
>> step emit
>> }
>>
>> # end crush map
>>
>>
>>
>> This is odd, because the pools have size 3 and I have 3 hosts alive, so why
>> is it saying that undersized pgs are present? It makes me feel like
>> CRUSH is not working properly.
>> There is not much data in the cluster at the moment, about 3 TB, and as
>> you can see from the osd tree, each host has at least 14 TB of disk
>> space on its OSDs.
>> So I'm a bit stuck now...
>> How can I find the source of the trouble?
>>
>> Thanks in advance!
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>