On 8/26/19 7:39 AM, Wido den Hollander wrote:
On 8/26/19 1:35 PM, Simon Oosthoek wrote:
On 26-08-19 13:25, Simon Oosthoek wrote:
On 26-08-19 13:11, Wido den Hollander wrote:
<snip>
The reweight might actually cause even more confusion for the balancer.
The balancer uses upmap mode, which re-allocates PGs to different OSDs
if needed.
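For reference, enabling and inspecting the upmap balancer looks roughly
like this (all standard commands; upmap requires luminous-or-newer
clients):

  # Require clients new enough to understand pg-upmap entries
  ceph osd set-require-min-compat-client luminous

  # Switch the balancer module to upmap mode and enable it
  ceph balancer mode upmap
  ceph balancer on

  # Check what the balancer is doing
  ceph balancer status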
Looking at the output sent earlier, I have some replies. See below.
<snip>
Looking at this output, the balancing seems OK, but from a different
perspective.
PGs are allocated to OSDs, not objects or data. All OSDs have 95~97
Placement Groups allocated.
That's good! An almost perfect distribution.
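As a quick check, the per-OSD PG counts (and the data behind them) are
visible in ceph osd df:

  # The PGS column should be roughly uniform (~95-97 here), while
  # %USE can still diverge when the PGs themselves differ in size
  ceph osd df tree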
The problem that now arises is the difference in the size of these
Placement Groups, as they hold different objects.
This is one of the side effects of larger disks: the PGs on them will
grow, and this leads to imbalance between the OSDs.
I *think* that increasing the number of PGs on this cluster would help,
but only for the pools which will contain most of the data.
This will consume a bit more CPU power and memory, but on modern systems
this should be less of a problem.
The good thing is that with Nautilus you can also scale the number of
PGs back down if it becomes a problem.
More PGs mean smaller PGs, and thus a better data distribution.
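A sketch of how that looks on Nautilus; the pool name and target are
placeholders for whatever holds most of your data:

  # Raise the PG count for the data-heavy pool. Since Nautilus,
  # pgp_num follows pg_num automatically.
  ceph osd pool set cephfs_data pg_num 2048

  # Nautilus can also scale back down, or manage this for you:
  ceph osd pool set cephfs_data pg_autoscale_mode on
  ceph osd pool autoscale-status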
<snip>
That makes sense; dividing the data into smaller chunks makes it more
flexible. The OSD nodes are quite underloaded, even with turbo
recovery mode on (10, not 32 ;-).
When the cluster is in HEALTH_OK again, I'll increase the PGs for the
cephfs pools...
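(The "turbo recovery mode" knob above is presumably osd_max_backfills;
as an illustration of how such settings are raised at runtime:)

  # Allow more concurrent backfills/recoveries per OSD (10 here,
  # not the even more aggressive 32)
  ceph config set osd osd_max_backfills 10
  ceph config set osd osd_recovery_max_active 10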
On second thought, I reverted my reweight commands and adjusted the PGs,
which were quite low for some of the pools. The reason they were low is
that when we first created them, we expected them to be rarely used, but
then we started filling them just for the sake of filling them, and
these are probably the cause of the imbalance.
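For completeness, reverting a reweight just means setting the override
weight back to 1.0 (the OSD id is illustrative):

  # Reset the override reweight on OSD 12 back to the default
  ceph osd reweight 12 1.0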
You should make sure that the pools which contain the most data have the
most PGs.
Although ~100 PGs per OSD is the recommendation, it won't hurt to have
~200 PGs as long as you have enough CPU power and memory. More PGs will
mean better data distribution with such large disks.
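The usual back-of-the-envelope calculation for a pool's pg_num follows
the pgcalc guidance; all numbers below are illustrative:

  # (num_osds * target_pgs_per_osd * pool_data_share) / replica_size,
  # rounded to a power of two. E.g. 40 OSDs, ~200 PGs per OSD, a pool
  # holding ~80% of the data, 3x replication:
  echo $(( 40 * 200 * 80 / 100 / 3 ))   # ~2133 -> round to 2048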
Memory is probably the biggest concern, since the pglog can eat up a
surprising amount of memory with lots of PGs on the OSD. I suspect we
should consider having the pglog controlled by the priority cache
manager and setting the lengths based on the amount of memory we want
assigned to it. Perhaps even changing them dynamically based on the pool
and current workload. In the long run, we should probably have a much
longer log on disk and a shorter log in memory regardless.
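For anyone who wants to bound this today, the pglog length is governed
by existing OSD options; the priority-cache integration sketched above
does not exist yet, and the values here are illustrative:

  # Shorten the per-PG log to trade recovery metadata for RAM
  ceph config set osd osd_min_pg_log_entries 1500
  ceph config set osd osd_max_pg_log_entries 3000

  # The overall memory budget the OSD's priority cache works against
  ceph config set osd osd_memory_target 4294967296   # 4 GiB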
Mark
The cluster now has over 8% misplaced objects, so that can take a while...
Cheers
/Simon
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com