Post the output of “ceph osd tree”.

We were in a similar situation: some of the OSDs were quite full while others had >50% free. This is exactly why we increased the number of PGs, and it helped to some degree.

Are all your hosts the same size? Does your CRUSH map select a host in the end? If so, and you have only a few hosts with differing numbers of OSDs, the distribution will be poor (IMHO).

Anyway, when we started increasing the PG numbers we first created the PGs themselves (pg_num) in small increments, since that puts a lot of load on the OSDs and we were seeing slow requests with large increases. So something like this:

    for i in `seq 4096 64 8192` ; do ceph osd pool set poolname pg_num $i ; done

This ate a few gigs from the drives (1-2GB if I remember correctly).

Once that was finished we increased pgp_num in larger and larger increments: at first 64 at a time, then 512 at a time as we approached the target (16384 in our case). This does allocate more space temporarily, and it seems to move data around at random: one minute an OSD is fine, the next it is nearing full. One of us basically had to watch the process all the time, reweighting the devices that were almost full.

As the number of PGs increased, this became much simpler: the overhead was smaller, every bit of work was smaller and all the management operations ran a lot more smoothly.

YMMV. Our data distribution was poor from the start, hosts had differing weights due to differing numbers of OSDs, and there were some historical remnants from when we tried to load-balance the data by hand. We ended up in a much better state, but not a perfect one: some OSDs still have much more free space than others. We haven’t touched the CRUSH map at all during this process; once we do and set newer tunables, the data distribution should be much more even.

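The stepwise pg_num / pgp_num increase described above could be sketched roughly like this. It is only an illustration, not the exact procedure we ran: the helper names (run, wait_for_settle, step_pool_setting), the DRY_RUN switch and the "wait until no PGs are creating or peering" policy are all assumptions made for the example, and "poolname" is a placeholder. Only "ceph osd pool set" and "ceph pg stat" are real commands here.

```shell
#!/usr/bin/env bash
# Sketch only: raise pg_num or pgp_num in small steps, pausing between steps.
# Helper names, DRY_RUN and the settle policy are illustrative assumptions.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Assumed policy: wait while "ceph pg stat" still mentions creating or
# peering PGs. Skipped entirely in dry-run mode.
wait_for_settle() {
    [ "${DRY_RUN:-0}" = "1" ] && return 0
    while ceph pg stat | grep -Eq 'creating|peering'; do
        sleep 10
    done
}

# Usage: step_pool_setting POOL KEY FROM STEP TO
# KEY is pg_num or pgp_num; values run from FROM to TO in increments of STEP.
step_pool_setting() {
    local pool=$1 key=$2 from=$3 step=$4 to=$5 n
    for n in $(seq "$from" "$step" "$to"); do
        run ceph osd pool set "$pool" "$key" "$n"
        wait_for_settle
    done
}
```

To preview the commands without touching the cluster, something like "DRY_RUN=1 step_pool_setting poolname pg_num 4096 64 8192" prints one "ceph osd pool set" line per step. The pause between steps is the part the one-liner above lacks; you would still want a human watching for nearly full OSDs while pgp_num is being raised.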
I’d love to hear the others’ input, since we are not sure why exactly this problem is present at all. I’d expect it to fill all the OSDs to the same or a close-enough level, but in reality we have OSDs with weight 1.0 which are almost empty and others with weight 0.5 which are nearly full… When adding data it seems to (subjectively) distribute it evenly...

Jan

> On 02 Jun 2015, at 18:52, Daniel Maraio <dmaraio@xxxxxxxxxx> wrote:
>
> Hello,
>
> I have some questions about the size of my placement groups and how I can get a more even distribution. We currently have 160 2TB OSDs across 20 chassis. We have 133TB used in our radosgw pool with a replica size of 2. We want to move to 3 replicas but are concerned we may fill up some of our OSDs. Some OSDs have ~1.1TB free while others only have ~600GB free. The radosgw pool has 4096 PGs; looking at the documentation I probably want to increase this up to 8192, but we have decided to hold off on that for now.
>
> So, now for the PG usage. I dumped out the PG stats and noticed that there are two groups of PG sizes in my cluster. There are about 1024 PGs that are each around 17-18GB in size. The rest of the PGs are all around 34-36GB in size. Any idea why there are two distinct groups? We only have the one pool with data in it, though there are several different buckets in the radosgw pool. The data in the pool ranges from small images to 4-6 MB audio files. Will increasing the number of PGs on this pool provide a more even distribution?
>
> Another thing to note is that the initial cluster was built lopsided, with some 4TB OSDs and some 2TB. We have removed all the 4TB disks and are only using 2TB disks across the entire cluster. Not sure if this would have had any impact.
>
> Thank you for your time and I would appreciate any insight the community can offer.
>
> - Daniel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com