Hi Zoltan,
We have a very similar setup with one of our upstream community
performance test clusters. 60 4TB PM983 drives spread across 10 nodes.
We get similar numbers to what you are initially seeing (scaled down to
60 drives) though with somewhat lower random read IOPS (we tend to max
out at around 2M with 60 drives on this HW). I haven't seen any issues
with quincy like what you are describing, but on this cluster most of
the tests have been on bare metal. One issue we have noticed with the
PM983 drives is that they may be more susceptible to non-optimal write
patterns causing slowdowns vs other NVMe drives in the lab. We actually
had to issue a last minute PR for quincy to change the disk allocation
behavior to deal with it. See:
https://github.com/ceph/ceph/pull/45771
https://github.com/ceph/ceph/pull/45884
I don't *think* this is the issue you are hitting since the fix in
#45884 should have taken care of it, but it might be something to keep
in the back of your mind. Otherwise, the fact that you are seeing such
a dramatic difference across both small and large read/write benchmarks
makes me think there is something else going on. Is there any chance
that some other bottleneck is being imposed when the pods and volumes
are deleted and recreated? Might be worth looking at memory and CPU
usage of the OSDs in all of the cases and RocksDB flushing/compaction
stats from the OSD logs. A quick check with collectl/iostat/sar during
the slow case would also confirm that none of the drives are showing
high latency or built-up IOs in the device queues.
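As a rough illustration of the kind of per-device check meant here, the
sketch below derives an average I/O latency and an average queue depth
from two snapshots of /proc/diskstats, which is the same kernel data
iostat reads. The function names and the 10 second interval are made up
for the example, and it is only a sketch of the idea, not a replacement
for collectl/iostat/sar.

```python
import time
from typing import Dict, NamedTuple, Tuple


class DiskSample(NamedTuple):
    ios: int          # reads + writes completed
    io_ms: int        # milliseconds spent on reads + writes
    weighted_ms: int  # weighted milliseconds spent doing I/O


def parse_diskstats(text: str) -> Dict[str, DiskSample]:
    """Parse the contents of /proc/diskstats into per-device counters."""
    samples = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 14:
            continue  # skip blank or malformed lines
        # f[2] is the device name; f[3]/f[7] are completed reads/writes,
        # f[6]/f[10] are ms spent reading/writing, f[13] is weighted ms.
        samples[f[2]] = DiskSample(
            ios=int(f[3]) + int(f[7]),
            io_ms=int(f[6]) + int(f[10]),
            weighted_ms=int(f[13]),
        )
    return samples


def summarize(before: Dict[str, DiskSample],
              after: Dict[str, DiskSample],
              interval_s: float) -> Dict[str, Tuple[float, float]]:
    """Return {device: (avg latency in ms per I/O, avg queue depth)}."""
    out = {}
    for name, b in before.items():
        a = after.get(name)
        if a is None:
            continue
        d_ios = a.ios - b.ios
        await_ms = (a.io_ms - b.io_ms) / d_ios if d_ios else 0.0
        avg_queue = (a.weighted_ms - b.weighted_ms) / (interval_s * 1000.0)
        out[name] = (await_ms, avg_queue)
    return out


def sample_live(interval_s: float = 10.0) -> Dict[str, Tuple[float, float]]:
    """Take two snapshots of /proc/diskstats interval_s seconds apart."""
    with open("/proc/diskstats") as fh:
        before = parse_diskstats(fh.read())
    time.sleep(interval_s)
    with open("/proc/diskstats") as fh:
        after = parse_diskstats(fh.read())
    return summarize(before, after, interval_s)
```

Calling sample_live() on an OSD node during the fast and the slow runs
and comparing the nvme entries should show whether the drives
themselves are seeing higher latency or deeper queues in the slow case.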
If you want to go deeper down the rabbit hole you can try running my
wallclock profiler against one of your OSDs in the fast/slow cases, but
you'll have to make sure it has access to debug symbols:
https://github.com/markhpc/uwpmp.git
Run it like:
./uwpmp -n 10000 -p <pid of ceph-osd> -b libdw > output.txt
If the libdw backend is having problems you can use -b libdwarf instead,
but it's much slower and takes longer to collect as many samples (you
might want to do -n 1000 instead).
Mark
On 7/25/22 11:17, Zoltan Langi wrote:
Hi people, we have an interesting issue here and I would like to ask if
anyone has seen anything like this before.
First: our system:
The ceph version is 17.2.1 but we have also seen the same behaviour on
16.2.9.
Our kernel version is 5.13.0-51 and our NVMe disks are Samsung PM983.
In our deployment we have 12 nodes in total, 72 disks and 2 OSDs per
disk, which makes 144 OSDs in total.
The deployment was done by ceph-rook with default values, with 6 CPU
cores and 4GB of memory allocated to each OSD.
The issue we are experiencing: We create for example 100 volumes via
ceph-csi and attach them to kubernetes pods via rbd. That is 100
volumes in total, 2GB each. We run fio performance tests (read, write,
mixed) on them so the volumes are being used heavily. Ceph delivers
good performance, no problems at all.
Performance we get, for example: read iops: 3371027 write iops: 727714
read bw: 79.9 GB/s write bw: 31.2 GB/s
After the tests are complete, these volumes just sit there doing
nothing for a longer period of time, for example 48 hours. After that,
we clean the pods up, clean the volumes up and delete them.
We then recreate the volumes and pods once more, same spec (2GB each,
100 pods), and run the same tests once again. We don’t even get half
the performance that we measured before leaving the pods sitting there
doing nothing for 2 days.
Performance we get after deleting and recreating the volumes and
rerunning the tests: read iops: 1716239 write iops: 370631 read bw: 37.8
GB/s write bw: 7.47 GB/s
We can clearly see that it’s a big performance loss.
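To put the two sets of numbers quoted above side by side, here is a
small sketch that computes how much of the original performance is
retained after the volumes are recreated. The figures are copied from
this message; the dictionary keys and the function name are only
illustrative.

```python
# Figures from the first run (fresh cluster) and from the rerun after the
# volumes sat idle for about 48 hours and were deleted and recreated.
baseline = {
    "read_iops": 3371027,
    "write_iops": 727714,
    "read_bw_gbs": 79.9,
    "write_bw_gbs": 31.2,
}
rerun = {
    "read_iops": 1716239,
    "write_iops": 370631,
    "read_bw_gbs": 37.8,
    "write_bw_gbs": 7.47,
}


def retained_fraction(before: dict, after: dict) -> dict:
    """Fraction of the baseline value that the rerun still achieves."""
    return {metric: after[metric] / before[metric] for metric in before}


fractions = retained_fraction(baseline, rerun)
for metric, frac in fractions.items():
    print(f"{metric}: {frac:.1%} of baseline")
```

IOPS and read bandwidth drop to roughly half of the first run, while
write bandwidth drops to roughly a quarter, so large writes are hit
noticeably harder than the other workloads.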
If we clean up the ceph deployment, wipe the disks out completely and
redeploy, the cluster once again delivers great performance.
We haven’t seen such behaviour with ceph version 14.x
Has anyone seen such a thing? Thanks in advance!
Zoltan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx