Hello,
Thank you for the feedback Jan, much appreciated! I won't post the
whole tree as it is rather long, but here is an example of one of our
hosts. All of the OSDs and hosts are weighted the same, with the
exception of one host that is missing an OSD due to a broken backplane. We
are only using hosts as buckets, so there is no rack/DC level. We have not
manually adjusted the CRUSH map at all for this cluster.
 -1 302.26959 root default
-24  14.47998     host osd23
192   1.81000         osd.192   up   1.00000   1.00000
193   1.81000         osd.193   up   1.00000   1.00000
194   1.81000         osd.194   up   1.00000   1.00000
195   1.81000         osd.195   up   1.00000   1.00000
199   1.81000         osd.199   up   1.00000   1.00000
200   1.81000         osd.200   up   1.00000   1.00000
201   1.81000         osd.201   up   1.00000   1.00000
202   1.81000         osd.202   up   1.00000   1.00000
I appreciate your input and will likely follow the same path you
did, slowly increasing the PGs and adjusting the weights as necessary.
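For the weight adjustments I may start from reweight-by-utilization rather than picking OSDs by hand - a rough, untested sketch, and I understand the threshold argument and exact behaviour vary a bit between releases:

# review the tree first, then reweight OSDs more than 20% above the average utilization
ceph osd tree
ceph osd reweight-by-utilization 120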
If anyone else has any further suggestions I'd love to hear them as well!
- Daniel
On 06/02/2015 01:33 PM, Jan Schermer wrote:
Post the output from your “ceph osd tree”.
We were in a similar situation: some of the OSDs were quite full while others had more than 50% free. This is exactly why we increased the number of PGs, and it helped to some degree.
Are all your hosts the same size? Does your CRUSH map select a host in the end? If you have only a few hosts with differing numbers of OSDs, the distribution will be poor (IMHO).
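If you are not sure, something like this dumps the rules so you can check which bucket type the final choose/chooseleaf step descends to:

ceph osd crush rule dump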
Anyway, when we started increasing the PG count we first created the PGs themselves (pg_num) in small increments, since that put a lot of load on the OSDs and we were seeing slow requests with larger increases.
So something like this:
for i in `seq 4096 64 8192` ; do ceph osd pool set poolname pg_num $i ; done
This ate a few gigs from the drives (1-2GB if I remember correctly).
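Between steps it may also be worth waiting until the new PGs have finished creating before bumping pg_num again - a minimal sketch, assuming your release reports a "creating" state in the status output:

while ceph status | grep -q creating ; do sleep 10 ; done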
Once that was finished we increased pgp_num in larger and larger increments - at first 64 at a time, then 512 at a time as we approached the target (16384 in our case). This does allocate more space temporarily, and it seems to move data around more or less at random - one minute an OSD is fine, the next it is nearing full. One of us basically had to watch the process the whole time, reweighting the devices that were almost full.
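Roughly the shape of what we did - the pool name, step sizes and health greps below are illustrative rather than exactly what we ran, since the health strings differ between releases:

# ramp pgp_num in growing steps towards the target
for i in `seq 4160 64 8192` `seq 8704 512 16384` ; do
    ceph osd pool set poolname pgp_num $i
    # let backfill/recovery settle before the next step
    while ceph -s | grep -q -e backfill -e recover ; do sleep 60 ; done
    # then check for (near)full OSDs and reweight by hand if needed, e.g.:
    #   ceph osd reweight <osd-id> 0.9
done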
With the increased number of PGs it became much simpler: the overhead was smaller, every piece of work was smaller, and all the management operations ran a lot more smoothly.
YMMV - our data distribution was poor from the start, hosts had differing weights due to differing numbers of OSDs, and there were some historical remnants from when we tried to load-balance the data by hand. We ended up in a much better state, but not a perfect one - some OSDs still have much more free space than others.
We haven't touched the CRUSH map at all during this process; once we do, and set the newer tunables, the data distribution should become much more even.
I'd love to hear the others' input, since we are not sure why this problem is present at all - I'd expect CRUSH to fill all the OSDs to the same, or close-enough, level, yet in reality we have OSDs with weight 1.0 that are almost empty and others with weight 0.5 that are nearly full… When adding data it does seem (subjectively) to be distributed evenly...
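For spotting the outliers, newer releases have "ceph osd df", which puts weight and actual utilization side by side - the column index below matches our output, adjust it to yours (the header and summary lines will float to the ends of the sort):

# sort OSDs by %USE, highest first
ceph osd df | sort -rnk7 | head -20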
Jan
On 02 Jun 2015, at 18:52, Daniel Maraio <dmaraio@xxxxxxxxxx> wrote:
Hello,
I have some questions about the size of my placement groups and how I can get a more even distribution. We currently have 160 2TB OSDs across 20 chassis, and 133TB used in our radosgw pool with a replica size of 2. We want to move to 3 replicas but are concerned we may fill up some of our OSDs: some have ~1.1TB free while others have only ~600GB free. The radosgw pool has 4096 PGs; looking at the documentation I probably want to increase this to 8192, but we have decided to hold off on that for now.
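For reference, the usual rule of thumb from the docs works out roughly like this for our cluster, assuming we do go to 3 replicas:

# target PG count ~= (OSDs * 100) / replicas, rounded up to the next power of two
echo $(( 160 * 100 / 3 ))    # ~5333, which rounds up to 8192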
So, now for the PG usage. I dumped the PG stats and noticed that there are two distinct groups of PG sizes in my cluster: about 1024 PGs are each around 17-18GB, while the rest are all around 34-36GB. Any idea why there are two groups? We only have the one pool with data in it, though there are several different buckets in the radosgw pool. The data ranges from small images to 4-6MB audio files. Will increasing the number of PGs on this pool provide a more even distribution?
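In case it helps, this is roughly how I bucketed the PG sizes - the bytes column in "ceph pg dump" moves around between releases, so adjust $7 to match your header line:

# count PGs per whole-GB size bucket
ceph pg dump 2>/dev/null | awk '$1 ~ /^[0-9]+\./ {print int($7/1024/1024/1024)}' | sort -n | uniq -c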
Another thing to note is that the initial cluster was built lopsided, with a mix of 4TB and 2TB OSDs. We have since removed all the 4TB disks and are using 2TB disks across the entire cluster. I'm not sure whether this had any impact.
Thank you for your time and I would appreciate any insight the community can offer.
- Daniel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com