Re: osd balancing question


 



Hello,

On Tue, 3 Jan 2017 15:47:09 +0200 Yair Magnezi wrote:

> Hello
> 
> 1) Does the re-weight / load balancing take place only within the
> same node?

Not in general, but it could certainly happen if the change is small and
involves only a single PG, for example.
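
If you want to verify where data actually moves, a rough way (osd.50 and the
file names here are just examples) is to snapshot the PG mappings before and
after the weight change and diff them:

ceph pg dump pgs_brief > /tmp/pgs.before
ceph osd crush reweight osd.50 0.90
ceph pg dump pgs_brief > /tmp/pgs.after
diff /tmp/pgs.before /tmp/pgs.after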

> 2) I'm raising the target OSD weight but nothing is happening. I expect to
> see some data movement but nothing is there; only when decreasing the
> weight can I see back-filling taking place. Is this normal?
>
Your log snippets are from OSD logs, where these backfill operations are
not logged.
Do a "watch ceph -s" in a separate window during these ops and/or look at
the "ceph.log" on a MON node.

That said, if the change in weight is very small, nothing might happen.
 
Christian

> 
> ceph osd tree | grep 50
> 50  0.89999         osd.50                  up  1.00000          1.00000
> 
> 
> root@ecprdbcph04-opens:/var/log/ceph# ceph osd df | grep 50
> 14 0.75000  1.00000   888G   694G    194G 78.10 0.99 119
> 36 0.86800  1.00000   888G   608G    279G 68.50 0.87 146
> *50 0.84999  1.00000   888G   520G    368G 58.51 0.74 122*
> 52 0.86800  1.00000   888G   650G    238G 73.16 0.93 144
> 37 0.86800  1.00000   888G   650G    238G 73.19 0.93 134
> 
> root@ecprdbcph04-opens:/var/log/ceph# ceph osd crush  reweight osd.50  0.90
> reweighted item id 50 name 'osd.50' to 0.9 in crush map
> 
> 
> 2017-01-03 08:32:39.532287 7f1a42319700  0 -- 10.63.4.18:6838/84978 >> 10.63.4.18:6814/81943 pipe(0x7f1a976fa000 sd=382 :6838 s=0 pgs=0 cs=0 l=0 c=0x7f1a9ef04840).accept connect_seq 12 vs existing 12 state standby
> 2017-01-03 08:32:39.532353 7f1a42319700  0 -- 10.63.4.18:6838/84978 >> 10.63.4.18:6814/81943 pipe(0x7f1a976fa000 sd=382 :6838 s=0 pgs=0 cs=0 l=0 c=0x7f1a9ef04840).accept connect_seq 13 vs existing 12 state standby
> 2017-01-03 08:32:39.573405 7f1a3cd25700  0 -- 10.63.4.18:6838/84978 >> 10.63.4.18:6842/85573 pipe(0x7f1a9606f000 sd=475 :6838 s=0 pgs=0 cs=0 l=0 c=0x7f1a9ef06d60).accept connect_seq 11 vs existing 11 state standby
> 2017-01-03 08:32:39.573457 7f1a3cd25700  0 -- 10.63.4.18:6838/84978 >> 10.63.4.18:6842/85573 pipe(0x7f1a9606f000 sd=475 :6838 s=0 pgs=0 cs=0 l=0 c=0x7f1a9ef06d60).accept connect_seq 12 vs existing 11 state standby
> 
> Thanks
> 
> 
> 
> Yair Magnezi
> Storage & Data Protection TL // Kenshoo
> Office +972 7 32862423 // Mobile +972 50 575-2955
> 
> 
> 
> On Tue, Jan 3, 2017 at 3:17 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> 
> >
> > Hello,
> >
> > On Tue, 3 Jan 2017 14:57:16 +0200 Yair Magnezi wrote:
> >
> > > Hello Christian .
> > > Sorry for my mistake, it's Infernalis we're running (9.2.1).
> > >
> > With docs being down I'm not certain, but that isn't the latest Infernalis
> > AFAIR.
> > But before any upgrades, you want that cluster to be stable and healthy.
> >
> > > our tree looks like this -->
> > >
> > Thanks, so 6 nodes, no corner cases here then.
> >
> > "ceph osd df" as well, but I assume from the original mail that all your
> > OSDS are the same size.
> >
> > [snip]
> >
> > > we have an ongoing capacity issue as you can see below (although we're
> > > only using less than 80%)
> > >
> > That's getting pretty close to the limits (with the default values), as
> > Ceph really isn't very good at keeping things balanced.
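> >
> > For reference, the defaults are 85% (near-full) and 95% (full); on a MON
> > host you can read the active values from the admin socket, something like
> > (replace <id> with your MON's ID):
> >
> > ceph daemon mon.<id> config get mon_osd_nearfull_ratio
> > ceph daemon mon.<id> config get mon_osd_full_ratio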
> >
> > >
> > > root@ecprdbcph01-opens:/var/lib/ceph/osd/ceph-11/current# ceph df
> > > GLOBAL:
> > >     SIZE       AVAIL      RAW USED     %RAW USED
> > >     53329G     11219G       42110G         78.96
> > >
> > >
> > > osd.12 is near full at 85%
> > > osd.16 is near full at 85%
> > > osd.17 is near full at 87%
> > > osd.19 is near full at 85%
> > > osd.22 is near full at 87%
> > > osd.24 is near full at 87%
> > > osd.29 is near full at 85%
> > > osd.33 is near full at 86%
> > > osd.39 is near full at 85%
> > > osd.42 is near full at 87%
> > > osd.45 is near full at 87%
> > > osd.47 is near full at 87%
> > > osd.49 is near full at 88%
> > > osd.58 is near full at 87%
> > >
> > >
> > At this number of near-full OSDs I'd strongly recommend adding more
> > OSDs/nodes, because even with a perfectly balanced cluster you'd still be
> > in trouble if a node or even a single OSD were to fail.
> >
> > >
> > > I'm trying to decrease the weight as you've suggested but it looks like we
> > > have some trouble:
> > >
> > I wrote "RAISE" as in "increase" the weight of OSDs that have significantly
> > less data than others.
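> >
> > For example (the OSD id and step here are purely illustrative, pick the
> > least utilized OSDs from "ceph osd df" and bump them in small steps):
> >
> > ceph osd crush reweight osd.36 0.90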
> >
> > > ceph osd crush reweight osd.11 0.98
> > >
> > > tail -f ceph-osd.11.log
> > >
> > >
> > > 2017-01-03 07:38:41.952538 7f9a5c7e1700  0 -- 10.63.4.1:6808/3301342 >> 10.63.4.19:6827/2264381 pipe(0x7f9ad3df4000 sd=442 :6808 s=0 pgs=0 cs=0 l=0 c=0x7f9ac2530000).accept connect_seq 34 vs existing 33 state standby
> > > 2017-01-03 07:41:46.566313 7f9a73871700  0 -- 10.63.4.1:6808/3301342 >> 10.63.4.1:6830/3303583 pipe(0x7f9ac80d5000 sd=376 :6808 s=0 pgs=0 cs=0 l=0 c=0x7f9ac2530160).accept connect_seq 4 vs existing 4 state standby
> > > 2017-01-03 07:41:46.566370 7f9a73871700  0 -- 10.63.4.1:6808/3301342 >> 10.63.4.1:6830/3303583 pipe(0x7f9ac80d5000 sd=376 :6808 s=0 pgs=0 cs=0 l=0 c=0x7f9ac2530160).accept connect_seq 5 vs existing 4 state standby
> > > 2017-01-03 07:41:46.585562 7f9a631d9700  0 -- 10.63.4.1:6808/3301342 >> 10.63.4.1:6824/3303035 pipe(0x7f9ab9940000 sd=283 :6808 s=0 pgs=0 cs=0 l=0 c=0x7f9ac2532ec0).accept connect_seq 5 vs existing 5 state standby
> > > 2017-01-03 07:41:46.585608 7f9a631d9700  0 -- 10.63.4.1:6808/3301342 >> 10.63.4.1:6824/3303035 pipe(0x7f9ab9940000 sd=283 :6808 s=0 pgs=0 cs=0 l=0 c=0x7f9ac2532ec0).accept connect_seq 6 vs existing 5 state standby
> > >
> > > In general I've also tried to use reweight-by-utilization but it
> > > doesn't seem to work so well.
> > >
> > Latest Hammer or Jewel supposedly have a much improved one.
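> >
> > If I remember right, those versions also have a dry-run variant, so you
> > can preview the changes first, something like (the 110% threshold is just
> > an example):
> >
> > ceph osd test-reweight-by-utilization 110
> > ceph osd reweight-by-utilization 110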
> >
> > >
> > > Is there any known bug with our version? Will a restart of the OSDs
> > > solve this issue? (It was mentioned in one of the forum's threads but it
> > > was related to Firefly.)
> > >
> >
> > See above about versions, restart shouldn't be needed but then again
> > recent experiences do suggest that the "Windows approach" (turning it
> > off and on again) seems to help with Ceph at times, too.
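> >
> > If you do restart OSDs, do it one at a time and wait for the cluster to
> > settle in between; the exact command depends on your init system, e.g.:
> >
> > restart ceph-osd id=11           (Ubuntu 14.04 / upstart)
> > systemctl restart ceph-osd@11    (systemd-based installs)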
> >
> > Christian
> >
> > > Many Thanks .
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > Yair Magnezi
> > > Storage & Data Protection TL // Kenshoo
> > > Office +972 7 32862423 // Mobile +972 50 575-2955
> > >
> > >
> > >
> > > On Tue, Jan 3, 2017 at 1:41 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> > >
> > > >
> > > > Hello,
> > > >
> > > > On Tue, 3 Jan 2017 13:08:50 +0200 Yair Magnezi wrote:
> > > >
> > > > > Hello cephers
> > > > > We're running firefly  ( 9.2.1 )
> > > >
> > > > One of these two is wrong, you're either running Firefly (0.8.x, old
> > > > and unsupported) or Infernalis (9.2.x, non-LTS and thus also unsupported).
> > > >
> > > >
> > > > > I'm trying to rebalance our cluster's OSDs and for some reason it
> > > > > looks like the rebalance is going the wrong way:
> > > >
> > > > A "ceph osd tree" would be helpful for starters.
> > > >
> > > > > What I'm trying to do is to reduce the load on osd-14 (ceph osd
> > > > > crush reweight osd.14 0.75), but what I see is that the backfill
> > > > > process is moving PGs to osd-29, which is also 86% full.
> > > > > I wonder why CRUSH doesn't map to the less occupied OSDs (3, 4, 6
> > > > > for example).
> > > > > Any input is much appreciated.
> > > > >
> > > >
> > > > CRUSH isn't particularly predictable from a human perspective and often
> > > > data movements will involve steps that are not anticipated.
> > > > CRUSH also does NOT know nor consider the utilization of OSDs, only
> > > > their weight counts.
> > > >
> > > > If you're having extreme imbalances, RAISE the weight of the least
> > > > utilized OSDs first (and in very small increments until you get a
> > > > feeling for things).
> > > > Do this in a manner to keep the weights of hosts more or less the same
> > > > in the end.
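> > > >
> > > > "ceph osd tree" prints the summed weight next to each host bucket, so
> > > > after each change you can sanity-check that the hosts stay roughly
> > > > equal, e.g.:
> > > >
> > > > ceph osd tree | grep host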
> > > >
> > > > Christian
> > > >
> > > > >
> > > > >
> > > > > 2017-01-03 05:59:20.877705 7f3e6a0d6700  0 log_channel(cluster) log [INF] : *2.2cb starting backfill to osd.29 from* (0'0,0'0] MAX to 131306'8029954
> > > > > 2017-01-03 05:59:20.877841 7f3e670d0700  0 log_channel(cluster) log [INF] : 2.30d starting backfill to osd.10 from (0'0,0'0] MAX to 131306'8721158
> > > > > 2017-01-03 05:59:31.374323 7f3e356b0700  0 -- 10.63.4.3:6826/3125306 >> 10.63.4.5:6821/3162046 pipe(0x7f3e9d513000 sd=322 :6826 s=0 pgs=0 cs=0 l=0 c=0x7f3ea72b5de0).accept connect_seq 1605 vs existing 1605 state standby
> > > > > 2017-01-03 05:59:31.374440 7f3e356b0700  0 -- 10.63.4.3:6826/3125306 >> 10.63.4.5:6821/3162046 pipe(0x7f3e9d513000 sd=322 :6826 s=0 pgs=0 cs=0 l=0 c=0x7f3ea72b5de0).accept connect_seq 1606 vs existing 1605 state standby
> > > > > ^C
> > > > > root@ecprdbcph03-opens:/var/log/ceph# df -h
> > > > > Filesystem                           Size  Used Avail Use% Mounted on
> > > > > udev                                  32G  4.0K   32G   1% /dev
> > > > > tmpfs                                6.3G  1.4M  6.3G   1% /run
> > > > > /dev/dm-1                            106G  4.1G   96G   5% /
> > > > > none                                 4.0K     0  4.0K   0% /sys/fs/cgroup
> > > > > none                                 5.0M     0  5.0M   0% /run/lock
> > > > > none                                  32G     0   32G   0% /run/shm
> > > > > none                                 100M     0  100M   0% /run/user
> > > > > /dev/sdk2                            465M   50M  391M  12% /boot
> > > > > /dev/sdk1                            512M  3.4M  509M   1% /boot/efi
> > > > > ec-mapr-prd:/mapr/ec-mapr-prd/homes  262T  143T  119T  55% /export/home
> > > > > /dev/sde1                            889G  640G  250G  72% /var/lib/ceph/osd/ceph-3
> > > > > /dev/sdf1                            889G  656G  234G  74% /var/lib/ceph/osd/ceph-4
> > > > > /dev/sdg1                            889G  583G  307G  66% /var/lib/ceph/osd/ceph-6
> > > > > /dev/sda1                            889G  559G  331G  63% /var/lib/ceph/osd/ceph-8
> > > > > /dev/sdb1                            889G  651G  239G  74% /var/lib/ceph/osd/ceph-10
> > > > > /dev/sdc1                            889G  751G  139G  85% /var/lib/ceph/osd/ceph-12
> > > > > /dev/sdh1                            889G  759G  131G  86% /var/lib/ceph/osd/ceph-14
> > > > > /dev/sdi1                            889G  763G  127G  86% /var/lib/ceph/osd/ceph-16
> > > > > /dev/sdj1                            889G  732G  158G  83% /var/lib/ceph/osd/ceph-18
> > > > > /dev/sdd1                            889G  756G  134G  86% /var/lib/ceph/osd/ceph-29
> > > > > root@ecprdbcph03-opens:/var/log/ceph#
> > > > >
> > > > > Thanks
> > > > >
> > > > >
> > > > >
> > > > > Yair Magnezi
> > > > > Storage & Data Protection TL // Kenshoo
> > > > >
> > > >
> > > >
> > > > --
> > > > Christian Balzer        Network/Systems Engineer
> > > > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > > > http://www.gol.com/
> > > >
> > >
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> >
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


