Re: Cluster falls when one node is switched off

Hello,

Thanks for the update, and I totally agree that it should not try to do 2x
replication on the single remaining storage node.

I'll try to reproduce what you're seeing tomorrow on my test cluster; I need
to move some data around first.
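
In the meantime, one way to sanity-check the placement logic without touching
a live cluster is to run the compiled CRUSH map through crushtool and simulate
the second host being out. A minimal sketch, with the /tmp path just an example
(and the crushtool options quoted from memory, so do check the man page):

  ceph osd getcrushmap -o /tmp/crushmap
  crushtool -i /tmp/crushmap --test --rule 0 --num-rep 2 \
      --weight 3 0 --weight 4 0 --weight 5 0 --show-mappings | head

With "step chooseleaf firstn 0 type host" I'd expect each mapping to contain at
most one OSD from ceph1-node once osd.3-5 are weighted out, i.e. the PGs should
simply go undersized rather than picking up a second copy on the same host.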

Christian

On Wed, 25 May 2016 08:58:54 +0700 Никитенко Виталий wrote:

> Sorry, the previous map was not the right one; this is the correct map:
> 
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
> tunable straw_calc_version 1
> 
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
> device 4 osd.4
> device 5 osd.5
> 
> # types
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> type 4 row
> type 5 pdu
> type 6 pod
> type 7 room
> type 8 datacenter
> type 9 region
> type 10 root
> 
> # buckets
> host ceph1-node {
>         id -2           # do not change unnecessarily
>         # weight 0.030
>         alg straw
>         hash 0  # rjenkins1
>         item osd.0 weight 0.010
>         item osd.1 weight 0.010
>         item osd.2 weight 0.010
> }
> host ceph2-node {
>         id -3           # do not change unnecessarily
>         # weight 0.030
>         alg straw
>         hash 0  # rjenkins1
>         item osd.3 weight 0.010
>         item osd.4 weight 0.010
>         item osd.5 weight 0.010
> }
> root default {
>         id -1           # do not change unnecessarily
>         # weight 0.060
>         alg straw
>         hash 0  # rjenkins1
>         item ceph1-node weight 0.030
>         item ceph2-node weight 0.030
> }
> 
> # rules
> rule replicated_ruleset {
>         ruleset 0
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type host
>         step emit
> }
> 
> # end crush map
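
As a side note, it can be worth confirming that the text above really matches
what the monitors are serving, since an edited map only takes effect once it is
compiled and re-injected. A quick round-trip (the temporary paths are just
examples):

  ceph osd getcrushmap -o /tmp/cm.bin
  crushtool -d /tmp/cm.bin -o /tmp/cm.txt
  diff /tmp/cm.txt crush-map.txt

If you ever edit the text version, compile it back with "crushtool -c
/tmp/cm.txt -o /tmp/cm.bin" and load it with "ceph osd setcrushmap -i
/tmp/cm.bin".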
> 
> > Firefly, but be aware that this version is EoL and no longer receiving
> > updates.
> 
> I installed the new version:
> root@ceph1-node:~# ceph --version
> ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
> 
> I created a 3 GB file on the pool and turned off one host; after 10 minutes
> there were messages like these:
> 
> 2016-05-24 17:18:23.804162 mon.0 [INF] pgmap v172: 640 pgs: 10
> active+recovering+degraded+remapped, 408 active+undersized+degraded, 117
> active+remapped, 105 active+degraded+remapped; 2118 MB data, 2315 MB
> used, 28371 MB / 30686 MB avail; 517/1098 objects degraded (47.086%);
> 234/1098 objects misplaced (21.311%); 5746 kB/s, 1 objects/s recovering
> 2016-05-24 17:18:28.268437 mon.0 [INF] pgmap v173: 640 pgs: 11
> active+recovering+degraded+remapped, 408 active+undersized+degraded, 120
> active+remapped, 101 active+degraded+remapped; 2118 MB data, 2331 MB
> used, 28355 MB / 30686 MB avail; 513/1098 objects degraded (46.721%);
> 234/1098 objects misplaced (21.311%); 4507 kB/s, 1 objects/s recovering
> 2016-05-24 17:18:32.759455 mon.0 [INF] pgmap v174: 640 pgs: 13
> active+recovering+degraded+remapped, 408 active+undersized+degraded, 125
> active+remapped, 94 active+degraded+remapped; 2118 MB data, 2375 MB
> used, 28311 MB / 30686 MB avail; 499/1098 objects degraded (45.446%);
> 234/1098 objects misplaced (21.311%); 9729 kB/s, 2 objects/s recovering
> 2016-05-24 17:18:35.314436 mon.0 [INF] pgmap v175: 640 pgs: 11
> active+recovering+degraded+remapped, 408 active+undersized+degraded, 130
> active+remapped, 91 active+degraded+remapped; 2118 MB data, 2395 MB
> used, 28291 MB / 30686 MB avail; 491/1098 objects degraded (44.718%);
> 234/1098 objects misplaced (21.311%); 11285 kB/s, 2 objects/s recovering
> 2016-05-24 17:18:36.634583 mon.0 [INF] pgmap v176: 640 pgs: 12
> active+recovering+degraded+remapped, 408 active+undersized+degraded, 130
> active+remapped, 90 active+degraded+remapped; 2118 MB data, 2403 MB
> used, 28283 MB / 30686 MB avail; 489/1098 objects degraded (44.536%);
> 234/1098 objects misplaced (21.311%); 6608 kB/s, 1 objects/s recovering
> 2016-05-24 17:18:39.724440 mon.0 [INF] pgmap v177: 640 pgs: 15
> active+recovering+degraded+remapped, 408 active+undersized+degraded, 133
> active+remapped, 84 active+degraded+remapped; 2118 MB data, 2428 MB
> used, 28258 MB / 30686 MB avail; 477/1098 objects degraded (43.443%);
> 234/1098 objects misplaced (21.311%); 13084 kB/s, 3 objects/s recovering
> 2016-05-24 17:18:44.009854 mon.0 [INF] pgmap v178: 640 pgs: 12
> active+recovering+degraded+remapped, 408 active+undersized+degraded, 137
> active+remapped, 83 active+degraded+remapped; 2118 MB data, 2447 MB
> used, 28239 MB / 30686 MB avail; 474/1098 objects degraded (43.169%);
> 234/1098 objects misplaced (21.311%); 9650 kB/s, 2 objects/s recovering
> 2016-05-24 17:18:48.822643 mon.0 [INF] pgmap v179: 640 pgs: 10
> active+recovering+degraded+remapped, 408 active+undersized+degraded, 142
> active+remapped, 80 active+degraded+remapped; 2118 MB data, 2493 MB
> used, 28193 MB / 30686 MB avail; 469/1098 objects degraded (42.714%);
> 234/1098 objects misplaced (21.311%); 3857 kB/s, 0 objects/s
> recovering
> 
> 
> root@ceph1-node:~# ceph -s
>     cluster 808ee682-c121-4867-9fe4-a347d95bf3f0
>      health HEALTH_WARN
>             503 pgs degraded
>             12 pgs recovering
>             408 pgs stuck degraded
>             640 pgs stuck unclean
>             408 pgs stuck undersized
>             408 pgs undersized
>             recovery 474/1098 objects degraded (43.169%)
>             recovery 234/1098 objects misplaced (21.311%)
>             1 mons down, quorum 0,2 ceph1-node,ceph-mon2
>      monmap e1: 3 mons at
> {ceph-mon2=192.168.241.20:6789/0,ceph1-node=192.168.241.2:6789/0,ceph2-node=192.168.241.12:6789/0}
> election epoch 18, quorum 0,2 ceph1-node,ceph-mon2
>      osdmap e58: 6 osds: 3 up, 3 in; 232 remapped pgs
>       pgmap v178: 640 pgs, 2 pools, 2118 MB data, 549 objects
>             2447 MB used, 28239 MB / 30686 MB avail
>             474/1098 objects degraded (43.169%)
>             234/1098 objects misplaced (21.311%)
>                  408 active+undersized+degraded
>                  137 active+remapped
>                   83 active+degraded+remapped
>                   12 active+recovering+degraded+remapped
> recovery io 9650 kB/s, 2 objects/s
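
To see where those active+remapped PGs are actually going, it may help to look
at the up and acting sets of a few of them, for example (the pgid below is just
a placeholder):

  ceph pg dump | grep remapped | head
  ceph pg <pgid> query | grep -A 3 -E '"up"|"acting"'

If both the up and the acting entries are drawn from osd.0-2 only, that would
confirm that copies are indeed ending up on the same host.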
> 
> 
> 
> iostat -x 1
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     7.58    0.00    3.79     0.00    63.64    33.60     0.71  187.20    0.00  187.20 187.20  70.91
> sdb               0.00     0.00   15.15    2.27  1745.45     7.20   201.17     3.52  202.26  202.80  198.67  40.17  70.00
> sdc               0.00     0.00   28.79    9.85  3781.82  3119.70   357.25     6.26  161.96  125.26  269.23  24.24  93.64
> sdd               0.00     2.27   15.91   26.52  1842.42 13575.76   726.86    11.55  287.14  139.43  375.77  22.86  96.97
> rbd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            4.17    0.00   95.83    0.00    0.00    0.00
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     6.92    0.00    3.85     0.00    43.08    22.40     0.65  168.00    0.00  168.00 168.00  64.62
> sdb               0.00     0.00    1.54    0.77   196.92     3.08   173.33     0.29  122.67  172.00   24.00  66.67  15.38
> sdc               0.00     0.77    5.38    6.92   787.69  3156.92   641.00     3.32  263.25  198.29  313.78  47.75  58.77
> sdd               0.00    23.85    9.23   51.54  1083.08 16794.62   588.38    15.09  264.71  177.67  280.30  16.56 100.62
> rbd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
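
For what it's worth, that iostat snapshot shows sdd pegged at around 100% util
while the recovery writes land on the surviving OSDs. It doesn't answer the
placement question, but if the recovery load itself becomes a problem the usual
knobs are osd_max_backfills and osd_recovery_max_active; a rough example of
turning them down at runtime (the values are only illustrative):

  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'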
> 
> > "ceph osd tree" output may help, as well as removing ceph1-node2 from
> > the picture.
> 
> root@ceph1-node:~# ceph osd tree
> ID WEIGHT  TYPE NAME           UP/DOWN REWEIGHT PRIMARY-AFFINITY 
> -1 0.05997 root default                                          
> -2 0.02998     host ceph1-node                                   
>  0 0.00999         osd.0            up  1.00000          1.00000 
>  1 0.00999         osd.1            up  1.00000          1.00000 
>  2 0.00999         osd.2            up  1.00000          1.00000 
> -3 0.02998     host ceph2-node                                   
>  3 0.00999         osd.3          down        0          1.00000 
>  4 0.00999         osd.4          down        0          1.00000 
>  5 0.00999         osd.5          down        0          1.00000 
> 
> > Have you verified (ceph osd pool get <poolname> size / min_size) that all
> > your pools are actually set like this?
> 
>  root@ceph1-node:~# ceph osd pool get hdd size
> size: 2
> root@ceph1-node:~# ceph osd pool get hdd min_size
> 2016-05-24 17:22:52.171706 7fe7b787d700  0 -- :/135882111 >>
> 192.168.241.12:6789/0 pipe(0x7fe7bc059cf0 sd=3 :0 s=1 pgs=0 cs=0 l=1
> c=0x7fe7bc05dfe0).fault
> min_size: 1
> 
>  root@ceph1-node:~# ceph osd dump  
> pool 1 'hdd' replicated size 2 min_size 1 crush_ruleset 0 object_hash
> rjenkins pg_num 512 pgp_num 512 last_change 53 flags hashpspool
> stripe_width 0
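
Since "ceph -s" reports 2 pools but only 'hdd' shows up in that dump excerpt,
it may be worth confirming the second pool (presumably the default 'rbd' pool,
though that is a guess) is set the same way, e.g.:

  ceph osd dump | grep 'replicated size'
  ceph osd pool get rbd size     # assuming the other pool really is 'rbd'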
> 
> 
> After the remapping finished:
> 
> root@ceph1-node:~# ceph -s
> 2016-05-24 17:23:10.123542 7f2c001cf700  0 -- :/623268863 >>
> 192.168.241.12:6789/0 pipe(0x7f2bfc059cd0 sd=3 :0 s=1 pgs=0 cs=0 l=1
> c=0x7f2bfc05dfc0).fault
>     cluster 808ee682-c121-4867-9fe4-a347d95bf3f0
>      health HEALTH_WARN
>             408 pgs degraded
>             262 pgs stuck degraded
>             640 pgs stuck unclean
>             262 pgs stuck undersized
>             408 pgs undersized
>             recovery 315/1098 objects degraded (28.689%)
>             recovery 234/1098 objects misplaced (21.311%)
>             1 mons down, quorum 0,2 ceph1-node,ceph-mon2
>      monmap e1: 3 mons at
> {ceph-mon2=192.168.241.20:6789/0,ceph1-node=192.168.241.2:6789/0,ceph2-node=192.168.241.12:6789/0}
> election epoch 18, quorum 0,2 ceph1-node,ceph-mon2
>      osdmap e63: 6 osds: 3 up, 3 in; 232 remapped pgs
>       pgmap v209: 640 pgs, 2 pools, 2118 MB data, 549 objects
>             3149 MB used, 27537 MB / 30686 MB avail
>             315/1098 objects degraded (28.689%)
>             234/1098 objects misplaced (21.311%)
>                  408 active+undersized+degraded
>                  232 active+remapped
> 
> 
> Any idea why Ceph is creating redundant copies on the local disks of a
> single host?
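
One thing that does line up with the 10-minute delay: mon_osd_down_out_interval
defaults to 600 seconds, so roughly ten minutes after the host dies its OSDs
get marked out, the CRUSH mapping is recalculated with only ceph1-node's disks
in play, and the resulting data movement is presumably what is filling up the
local OSDs. If the second node is expected to come back, setting the noout flag
before (or right after) it goes down keeps its OSDs in the map and avoids that
shuffle:

  ceph osd set noout      # stop down OSDs from being marked out automatically
  # ... bring ceph2-node back ...
  ceph osd unset noout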
> 
> 
> 
> 24.05.2016, 12:53, "Christian Balzer" <chibi@xxxxxxx>:
> > Hello,
> >
> > On Tue, 24 May 2016 10:28:02 +0700 Никитенко Виталий wrote:
> >
> >>  Hello!
> >>  I have a cluster of 2 nodes with 3 OSDs each. The cluster is about 80%
> >>  full.
> >
> > According to your CRUSH map that's not quite true, namely the ceph1-node2
> > entry.
> >
> > And while that bucket, again according to your CRUSH map, isn't in the
> > default root, I wonder WHERE it is and whether it confuses Ceph into
> > believing that there is actually a third node.
> >
> > "ceph osd tree" output may help, as well as removing ceph1-node2 from
> > the picture.
> >
> >>  df -H
> >>  /dev/sdc1 27G 24G 3.9G 86% /var/lib/ceph/osd/ceph-1
> >>  /dev/sdd1 27G 20G 6.9G 75% /var/lib/ceph/osd/ceph-2
> >>  /dev/sdb1 27G 24G 3.5G 88% /var/lib/ceph/osd/ceph-0
> >>
> >>  When I switch off one server, then after 10 minutes the PGs begin to be
> >>  remapped.
> >
> > [snip]
> >>  As a result, one disk overflows and the cluster goes down. Why does Ceph
> >>  remap PGs at all? It was supposed to simply mark all PGs as
> >>  active+degraded while the second node is down.
> >
> > Yes, I agree, that shouldn't happen with a properly configured 2 node
> > cluster.
> >
> >>  ceph version 0.80.11
> >
> > Not aware of any bugs in there and in fact I did test a 2 node cluster
> > with Firefly, but be aware that this version is EoL and no longer
> > receiving updates.
> >
> >>  root@ceph1-node:~# cat /etc/ceph/ceph.conf
> >>  [global]
> >>  fsid = b66c7daa-d6d8-46c7-9e61-15adbb749ed7
> >>  mon_initial_members = ceph1-node, ceph2-node, ceph-mon2
> >>  mon_host = 192.168.241.97,192.168.241.110,192.168.241.123
> >>  auth_cluster_required = cephx
> >>  auth_service_required = cephx
> >>  auth_client_required = cephx
> >>  filestore_xattr_use_omap = true
> >>  osd_pool_default_size = 2
> >>  osd_pool_default_min_size = 1
> >
> > Have you verified (ceph osd pool get <poolname> size / min_size) that all
> > your pools are actually set like this?
> >
> > Regards,
> >
> > Christian
> >>  mon_clock_drift_allowed = 2
> >>
> >>  root@ceph1-node:~#cat crush-map.txt
> >>  # begin crush map
> >>  tunable choose_local_tries 0
> >>  tunable choose_local_fallback_tries 0
> >>  tunable choose_total_tries 50
> >>  tunable chooseleaf_descend_once 1
> >>  tunable straw_calc_version 1
> >>
> >>  # devices
> >>  device 0 osd.0
> >>  device 1 osd.1
> >>  device 2 osd.2
> >>  device 3 osd.3
> >>  device 4 osd.4
> >>  device 5 osd.5
> >>
> >>  # types
> >>  type 0 osd
> >>  type 1 host
> >>  type 2 chassis
> >>  type 3 rack
> >>  type 4 row
> >>  type 5 pdu
> >>  type 6 pod
> >>  type 7 room
> >>  type 8 datacenter
> >>  type 9 region
> >>  type 10 root
> >>
> >>  # buckets
> >>  host ceph1-node {
> >>          id -2 # do not change unnecessarily
> >>          # weight 0.060
> >>          alg straw
> >>          hash 0 # rjenkins1
> >>          item osd.0 weight 0.020
> >>          item osd.1 weight 0.020
> >>          item osd.2 weight 0.020
> >>  }
> >>  host ceph2-node
> >>  { id -3 # do not change unnecessarily
> >>          # weight 0.060
> >>          alg straw
> >>          hash 0 # rjenkins1
> >>          item osd.3 weight 0.020
> >>          item osd.4 weight 0.020
> >>          item osd.5 weight 0.020
> >>  }
> >>  root default {
> >>          id -1 # do not change unnecessarily
> >>          # weight 0.120
> >>          alg straw
> >>          hash 0 # rjenkins1
> >>          item ceph1-node weight 0.060
> >>          item ceph2-node weight 0.060
> >>  }
> >>  host ceph1-node2 {
> >>          id -4 # do not change unnecessarily
> >>          # weight 3.000
> >>          alg straw
> >>          hash 0 # rjenkins1
> >>          item osd.0 weight 1.000
> >>          item osd.1 weight 1.000
> >>          item osd.2 weight 1.000
> >>  }
> >>
> >>  # rules
> >>  rule replicated_ruleset {
> >>          ruleset 0
> >>          type replicated
> >>          min_size 1
> >>          max_size 10
> >>          step take default
> >>          step chooseleaf firstn 0 type host
> >>          step emit
> >>  }
> >>  # end crush map
> >>
> >
> > --
> > Christian Balzer Network/Systems Engineer
> > chibi@xxxxxxx Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



