Re: OSD rebalancing issue - should drives be distributed equally over all nodes

Hi Thomas,

How does your crush map/tree look?
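
If you're not sure offhand, something like this (standard ceph CLI, run from an admin/mon node) will show how your hosts and OSDs are bucketed and weighted, and lets you decompile the full map to look at the rules:

    ceph osd tree
    ceph osd crush tree
    # dump and decompile the crush map to inspect the rules
    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt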

If your CRUSH failure domain is host, then your 96x 8T disks will be no more useful than your 1.6T disks, because the smallest failure domain is your limiting factor.
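
Rough numbers, using the disk counts from your mail below (raw, before replication, and assuming your pools are 3x replicated -- adjust if you're using EC):

    4 hosts x 48 x 1.6T ~=  76.8T each, ~307T combined
    2 hosts x 48 x 7.2T ~= 345.6T each, ~691T combined

With size-3 replication and host as the failure domain, each host can hold at most one of the three copies of any PG, so the two big hosts top out at a third of the data each; everything else has to land on the ~307T of 1.6T capacity, which is why those OSDs run toward nearfull while the big ones stay comparatively empty.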

So you can either redistribute your disks so that each host holds 16x 8T + 32x 1.6T, or you can group your 1.6T nodes into buckets (chassis, perhaps), move the 8T nodes into their own chassis, and set your failure domain to chassis. That would likely give you a much more even distribution.
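
If you go the chassis route, a rough sketch would look something like the below. The bucket, host, rule, and pool names are placeholders -- pick a grouping that gives the chassis reasonably comparable weights -- and expect a lot of data movement once the pools switch over:

    # create chassis buckets under the default root
    ceph osd crush add-bucket chassis1 chassis
    ceph osd crush move chassis1 root=default
    # ... repeat for chassis2, chassis3, ...

    # move the host buckets into their chassis (names as shown by `ceph osd tree`)
    ceph osd crush move nodeA chassis=chassis1
    ceph osd crush move nodeB chassis=chassis1
    # ... and so on for the remaining hosts

    # replicated rule that spreads copies across chassis instead of hosts
    ceph osd crush rule create-replicated replicated_chassis default chassis

    # point the pool(s) at the new rule
    ceph osd pool set <poolname> crush_rule replicated_chassis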

I imagine right now your 1.6T disks are nearfull, and your 8T disks are anything but.
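
You can check that quickly with

    ceph osd df tree

where the %USE column should show the 1.6T OSDs sitting near the nearfull ratio and the 8T OSDs well below it.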

Be careful with something like this, however, because you will probably run into some IOPS discrepancies due to the difference in spindles per TB across 'chassis'.

Hope that helps.

Reed

> On Sep 23, 2019, at 4:07 AM, Thomas <74cmonty@xxxxxxxxx> wrote:
> 
> Hi,
> 
> I'm facing several issues with my ceph cluster (2x MDS, 6x OSD nodes).
> Here I would like to focus on the issue with pgs backfill_toofull.
> I assume this is related to the fact that the data distribution on my
> OSDs is not balanced.
> 
> This is the current ceph status:
> root@ld3955:~# ceph -s
>   cluster:
>     id:     6b1b5117-6e08-4843-93d6-2da3cf8a6bae
>     health: HEALTH_ERR
>             1 MDSs report slow metadata IOs
>             78 nearfull osd(s)
>             1 pool(s) nearfull
>             Reduced data availability: 2 pgs inactive, 2 pgs peering
>             Degraded data redundancy: 304136/153251211 objects degraded (0.198%), 57 pgs degraded, 57 pgs undersized
>             Degraded data redundancy (low space): 265 pgs backfill_toofull
>             3 pools have too many placement groups
>             74 slow requests are blocked > 32 sec
>             80 stuck requests are blocked > 4096 sec
> 
>   services:
>     mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 98m)
>     mgr: ld5505(active, since 3d), standbys: ld5506, ld5507
>     mds: pve_cephfs:1 {0=ld3976=up:active} 1 up:standby
>     osd: 368 osds: 368 up, 367 in; 302 remapped pgs
> 
>   data:
>     pools:   5 pools, 8868 pgs
>     objects: 51.08M objects, 195 TiB
>     usage:   590 TiB used, 563 TiB / 1.1 PiB avail
>     pgs:     0.023% pgs not active
>              304136/153251211 objects degraded (0.198%)
>              1672190/153251211 objects misplaced (1.091%)
>              8564 active+clean
>              196  active+remapped+backfill_toofull
>              57   active+undersized+degraded+remapped+backfill_toofull
>              35   active+remapped+backfill_wait
>              12   active+remapped+backfill_wait+backfill_toofull
>              2    active+remapped+backfilling
>              2    peering
> 
>   io:
>     recovery: 18 MiB/s, 4 objects/s
> 
> 
> Currently I'm using 6 OSD nodes.
> Node A
> 48x 1.6TB HDD
> Node B
> 48x 1.6TB HDD
> Node C
> 48x 1.6TB HDD
> Node D
> 48x 1.6TB HDD
> Node E
> 48x 7.2TB HDD
> Node F
> 48x 7.2TB HDD
> 
> Question:
> Is it advisable to distribute the drives equally over all nodes?
> If yes, how should this be executed w/o ceph disruption?
> 
> Regards
> Thomas
> 
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
