On Mon, 15 Apr 2024 at 13:09, Mitsumasa KONDO <kondo.mitsumasa@xxxxxxxxx> wrote:
> Hi Menguy-san,
>
> Thank you for your reply. Users who do large IO on tiny volumes are a
> nuisance to cloud providers.
>
> I checked my Ceph cluster, which has 40 SSDs. Each OSD on a 1TB SSD holds
> about 50 placement groups, so each PG covers approximately 20GB of space.
> If we create a small 8GB volume, I had a feeling it wouldn't be
> distributed well, but it turns out to be distributed well.

RBD images get split into 2 MiB or 4 MiB pieces when stored in Ceph, so an
8 GiB RBD image becomes 4096 or 2048 separate objects that end up
"randomly" on the PGs of the pool it lives in. That means that if you read
or write the whole RBD image from start to end, you spread the load across
all the OSDs.

I think it works something like this: you ask librbd for an 8 GiB image
named "myimage", and underneath it creates myimage.0, myimage.1, myimage.2
and so on. The PG placement depends on the object name, which of course
differs for all the pieces, so they end up on different PGs, thereby
spreading the load.

If Ceph did not do this, you could never make an RBD image larger than the
smallest free space on any of the pool's OSDs, and it would also mean that
the RBD client talked to the same single OSD for everything, which would
not be a good way to use a cluster's resources evenly.

-- 
May the most significant bit of your life be positive.
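To put rough numbers on the splitting, here is a minimal sketch (plain Python, no Ceph libraries) assuming the default 4 MiB object size. The rbd_data.<id> prefix shown is a made-up example of the naming that newer format-2 images use; older images use the <imagename>.<n> style mentioned above.

    # Minimal sketch: how an RBD image is split into backing RADOS objects.
    # Assumes the default 4 MiB object size; the image-id prefix below is
    # purely hypothetical, for illustration only.

    IMAGE_SIZE = 8 * 1024**3      # 8 GiB volume
    OBJECT_SIZE = 4 * 1024**2     # default RBD object size (2^22 bytes)

    num_objects = IMAGE_SIZE // OBJECT_SIZE
    print(f"{num_objects} backing objects")       # -> 2048

    # Each data object gets its own name (rbd_data.<image id>.<index> in
    # zero-padded hex for format-2 images); each distinct name hashes to
    # its own PG, which is what spreads the image across OSDs.
    prefix = "rbd_data.abcdef0123"                # hypothetical image id
    for idx in (0, 1, num_objects - 1):
        print(f"{prefix}.{idx:016x}")

Feeding a few of those object names (with your real pool name and image id) to "ceph osd map <poolname> <objectname>" should show them landing on different PGs and different OSD sets, which is the spreading described above.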