On Sun, 28 Jan 2024 at 23:02, Adrian Sevcenco <Adrian.Sevcenco@xxxxxxx> wrote:
>
> >> is it wrong to think of PGs like a kind of object bucket (S3 like)?
> >
> > Mostly, yes.
> so .. in a PG there are no "file data" but pieces of "file data"?
> so 100 GB file with 2x replication will be placed in more than 2 PGs?
> Is there some kind of strips or chunks that a given data conglomerate is split into
> and then _those_ are put in multiple PGs with the rule that they have to be in a Replicated=X ?

PGs have no fixed size; they accept objects until the OSD(s) they sit on cannot accept more data. For repl=3, that means three full copies of each object, held by three PG copies that are chosen to be on three different OSDs, by default also on three different OSD hosts.

That said, if you use RadosGW or RBD (for VM disk images, for example), those two will split your large "files" into many smaller objects, at sizes like 2M or 4M each. So if you upload a 10G S3 object to RadosGW, or have your virtualization platform assign a 10G "disk" to a VM, that 10G of data will not sit in one PG: it will be cut into 2-4M pieces with different names like 10G_image.1001, 10G_image.1002, and those pieces will spread out over many different PGs.

This has the benefit of making it possible to store really huge images on many small drives, and it also spreads load, since the VM may itself ask for IO at several places on its disk, and those requests will be served by different drives in your ceph cluster. If the VM image ended up as a single object in a single PG, all IO for it would always hit that one drive and you would get really bad performance, worse than if the hypervisor had the drive locally attached.

So for most usage, as a normal user, the libraries underneath will do the object splitting for you and it is totally transparent. Depending on how you use the cluster and at which level you are looking, those final objects in ceph do not get split again by the cluster itself, unless you are using Erasure Coding. With raw rados calls one can make 100G objects, but most usage goes via librbd or radosgw or something else on top of rados, and hence the splitting is done for you.

> >> if so, is the size of PG the limit of a file size? (for replicated case)?

Mostly no. The corner case here is if you have OSDs with very different amounts of free space. This is mostly visible in reports like "ceph df", because it will say a pool (made up of PGs) has a max-free based on the OSD with the least free space, multiplied by the number of OSDs in the pool. This is because PG placement is semi-random, and depending on what you write, it could happen that all data goes to the OSD with the least free space every time, so the pool goes full long before the largest OSD is filled. The fewer OSDs you have, and the larger the difference in size between them, the worse the situation becomes.

As an example, let's say you have a pool that is told by crush to end up on these three OSD hosts with one drive each:

HostA: 10TB free
HostB: 10TB free
HostC: 10G free

In this case "ceph df" will say "you should consider this pool to have (raw) 30G free at most", and if you have repl=3 on this pool it can only take 10G of data before either the primary data or one of the two copies makes HostC go full. This situation makes "ceph df" tell you weird numbers, but also the most truthful ones, since they tell you what your clients will experience.
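To make that arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. The function name and the simplification are mine, and the real "ceph df" MAX AVAIL calculation also takes full ratios and CRUSH weights into account, so treat this as an illustration only:

    def pool_max_avail(osd_free_bytes, replicas):
        # The pool can only grow until the OSD with the least free space
        # is full, so the raw limit is min(free) times the number of OSDs...
        raw_limit = min(osd_free_bytes) * len(osd_free_bytes)
        # ...and with replication, clients only get 1/replicas of that.
        return raw_limit / replicas

    GB = 1000 ** 3
    TB = 1000 ** 4
    # HostA: 10TB free, HostB: 10TB free, HostC: 10G free
    print(pool_max_avail([10 * TB, 10 * TB, 10 * GB], replicas=3) / GB)
    # -> 10.0, i.e. roughly 10G of client data before HostC goes full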
If you end up in a situation like the above, you will also notice that if any OSD breaks, HostC has no chance of hosting extra copies during repairs, since it will go full almost immediately. When the cluster is unbalanced like this, your redundancy suffers too.

If your cluster is holding important data, you would make sure there are more OSD hosts than the replication factor, so the cluster can repair copies onto another host when one host dies. And since there are soft limits at 85, 90 and 95% OSD capacity that stop certain operations from driving the OSDs completely full, you would want to keep a small cluster at 50-60% used at most, so that if one host dies, all of its data can fit onto the others without passing the first 85% limit while all PGs still hold three copies in total. If you have many OSD hosts, each one represents a small percentage of the total, and it is easier for the cluster to handle a host failure using space from all the others to rebuild; but at some point, with 100+ OSD hosts, the chances grow that some host is in planned or unplanned maintenance at any given time.

So while PGs don't have fixed size limits, they do have a current size, and when moving or balancing data, ceph moves whole PGs around, not files or objects. That is why it is preferred to have something like 100+ PGs per OSD, so that ceph can balance data in pieces of roughly 1% of the drive's capacity at a time, instead of shuffling an entire OSD's worth of data back and forth.
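The "keep a small cluster at 50-60% used" rule of thumb above can be sanity-checked with the same kind of back-of-the-envelope math. Again a simplified sketch with my own names; it assumes equal-sized hosts and that a failed host's data is redistributed evenly over the survivors, which is not exactly how ceph computes anything:

    def usage_after_host_failure(num_hosts, used_fraction):
        # When one host dies, its data gets re-replicated onto the
        # remaining hosts, so the survivors end up holding
        # num_hosts / (num_hosts - 1) times their previous usage.
        return used_fraction * num_hosts / (num_hosts - 1)

    # 4 hosts at 60% used: survivors end up around 80%, still under
    # the first 85% soft limit.
    print(usage_after_host_failure(4, 0.60))   # 0.8
    # 4 hosts at 75% used: survivors would hit 100%, so the repair
    # cannot finish without driving the OSDs full.
    print(usage_after_host_failure(4, 0.75))   # 1.0
    # 10 hosts at 70% used: survivors end up around 78%, so a larger
    # cluster absorbs a host failure much more easily.
    print(usage_after_host_failure(10, 0.70))  # ~0.778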