On Fri, 31 Aug 2012, Xiaopong Tran wrote:
> Hi,
>
> Ceph storage on each disk in the cluster is very unbalanced. On each
> node, the data seems to go to one or two disks, while other disks
> are almost empty.
>
> I can't find anything wrong from the crush map, it's just the
> default for now. Attached is the crush map.

This is usually a problem with the pg_num for the pool you are using.
Can you include the output from 'ceph osd dump | grep ^pool'?

By default, pools get 8 pgs, which will distribute poorly.

sage

> Here is the current situation on node s100001:
>
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sdb1       932G  4.3G  927G   1% /disk1
> /dev/sdc1       932G  4.3G  927G   1% /disk2
> /dev/sdd1       932G  4.3G  927G   1% /disk3
> /dev/sde1       932G  4.3G  927G   1% /disk4
> /dev/sdf1       932G  4.3G  927G   1% /disk5
> /dev/sdg1       932G  4.3G  927G   1% /disk6
> /dev/sdh1       932G  4.3G  927G   1% /disk7
> /dev/sdi1       932G  4.3G  927G   1% /disk8
> /dev/sdj1       932G  4.3G  927G   1% /disk9
> /dev/sdk1       932G  445G  487G  48% /disk10
>
> Here, we can see that almost all data go to one osd only, while the
> others are almost empty.
>
> And here's the situation on node s200001:
>
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sdb1       932G  443G  489G  48% /disk1
> /dev/sdc1       932G  4.3G  927G   1% /disk2
> /dev/sdd1       932G  4.3G  927G   1% /disk3
> /dev/sde1       932G  4.3G  927G   1% /disk4
> /dev/sdf1       932G  4.3G  927G   1% /disk5
> /dev/sdg1       932G  4.3G  927G   1% /disk6
> /dev/sdh1       932G  4.3G  927G   1% /disk7
> /dev/sdi1       932G  4.3G  927G   1% /disk8
> /dev/sdj1       932G  449G  483G  49% /disk9
> /dev/sdk1       932G  4.3G  927G   1% /disk10
>
> The situation is a bit better, but not much; the data are stored
> mainly on two disks.
>
> Here is a better situation, on node s100002:
>
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sdb1       1.9T  453G  1.4T  25% /disk1
> /dev/sdc1       1.9T  4.3G  1.9T   1% /disk2
> /dev/sdd1       1.9T  4.4G  1.9T   1% /disk3
> /dev/sde1       1.9T  4.3G  1.9T   1% /disk4
> /dev/sdf1       1.9T  457G  1.4T  25% /disk5
> /dev/sdg1       1.9T  443G  1.4T  24% /disk6
> /dev/sdh1       1.9T  4.4G  1.9T   1% /disk7
> /dev/sdi1       1.9T  4.4G  1.9T   1% /disk8
> /dev/sdj1       1.9T  427G  1.5T  23% /disk9
> /dev/sdk1       1.9T  4.4G  1.9T   1% /disk10
>
> It's better than the other two, but still not what I expected. I
> expected the data to be spread out according to the weight of each
> osd, as defined in the crush map, or at least as close to that as
> possible. It might be just some obviously stupid config error, but
> I don't know. This can't be normal, can it?
>
> Thanks for any hint.
>
> Xiaopong
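
For context, a rough sketch of how one might check and address this,
assuming the data sits in one of the default pools (the pool name 'data'
below is only an example; use whatever 'ceph osd dump | grep ^pool'
actually reports):

    # Show pg_num for every pool; a value of 8 would explain the poor spread.
    ceph osd dump | grep ^pool

    # On releases that allow changing pg_num on an existing pool, raise it
    # (and pgp_num) to roughly 100 * num_osds / replica_count, rounded to a
    # power of two:
    ceph osd pool set data pg_num 1024
    ceph osd pool set data pgp_num 1024

    # Otherwise, create a new pool with an adequate PG count up front and
    # migrate the data into it:
    ceph osd pool create newpool 1024

The exact numbers are illustrative; the point is simply that far more than
8 placement groups are needed before CRUSH can balance data across all
the OSDs.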