Is rook/CSI still not using efficient rbd object maps? It could be that you issued a new benchmark while Ceph was still busy (inefficiently) removing the old rbd images. This is quite a stretch, but it could be worth exploring.

On Mon, Jul 25, 2022, 21:42 Mark Nelson <mnelson@xxxxxxxxxx> wrote:

> I don't think so if this is just plain old RBD. RBD shouldn't require a bunch of RocksDB iterator seeks in the read/write hot path, and writes should pretty quickly clear out tombstones as part of the memtable flush and compaction process, even in the slow case. Maybe in some kind of pathologically bad read-only corner case with no onode cache, but it would be bad for more reasons than what's happening in that tracker ticket imho (even reading onodes from the RocksDB block cache is significantly slower than BlueStore's onode cache).
>
> If RBD mirror (or snapshots) are involved, that could be a different story though. I believe that to deal with deletes in that case we have to go through iteration/deletion loops that have the same root issue as what's going on in the tracker ticket, and it can end up impacting client IO. Gabi and Paul are testing/reworking how the snapmapper works, and I've started a sort of catch-all PR for improving our RocksDB tunings/glue here:
>
> https://github.com/ceph/ceph/pull/47221
>
> Mark
>
> On 7/25/22 12:48, Frank Schilder wrote:
> > Could it be related to this performance death trap: https://tracker.ceph.com/issues/55324 ?
> >
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Mark Nelson <mnelson@xxxxxxxxxx>
> > Sent: 25 July 2022 18:50
> > To: ceph-users@xxxxxxx
> > Subject: Re: weird performance issue on ceph
> >
> > Hi Zoltan,
> >
> > We have a very similar setup with one of our upstream community performance test clusters: 60 4TB PM983 drives spread across 10 nodes. We get similar numbers to what you are initially seeing (scaled down to 60 drives), though with somewhat lower random read IOPS (we tend to max out at around 2M with 60 drives on this HW). I haven't seen any issues with Quincy like what you are describing, but on this cluster most of the tests have been on bare metal. One issue we have noticed with the PM983 drives is that they may be more susceptible to non-optimal write patterns causing slowdowns than other NVMe drives in the lab. We actually had to issue a last-minute PR for Quincy to change the disk allocation behavior to deal with it. See:
> >
> > https://github.com/ceph/ceph/pull/45771
> > https://github.com/ceph/ceph/pull/45884
> >
> > I don't *think* this is the issue you are hitting, since the fix in #45884 should have taken care of it, but it might be something to keep in the back of your mind. Otherwise, the fact that you are seeing such a dramatic difference across both small and large read/write benchmarks makes me think there is something else going on. Is there any chance that some other bottleneck is being imposed when the pods and volumes are deleted and recreated? It might be worth looking at memory and CPU usage of the OSDs in all of the cases, and at the RocksDB flushing/compaction stats from the OSD logs. Also do a quick check with collectl/iostat/sar during the slow case to make sure none of the drives are showing high latency or built-up IOs in the device queues.
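> >
> > For example, something along these lines (an illustrative, untested sketch; substitute your own OSD IDs, device names and log paths, and in a Rook deployment run the "ceph daemon" commands from inside the OSD containers where the admin sockets live):
> >
> > # per-device latency and queue depth while the slow benchmark is running
> > iostat -x 1
> > sar -d 1
> >
> > # CPU and memory usage of the OSD processes on one node
> > top -b -n 1 -p "$(pgrep -d, ceph-osd)"
> >
> > # per-OSD internals via the admin socket (perf counters and memory pools)
> > ceph daemon osd.0 perf dump
> > ceph daemon osd.0 dump_mempools
> >
> > # RocksDB flush/compaction activity in the OSD log
> > grep -iE 'compaction|flush' /var/log/ceph/ceph-osd.0.log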
> >
> > If you want to go deeper down the rabbit hole, you can try running my wallclock profiler against one of your OSDs in the fast/slow cases, but you'll have to make sure it has access to debug symbols:
> >
> > https://github.com/markhpc/uwpmp.git
> >
> > Run it like:
> >
> > ./uwpmp -n 10000 -p <pid of ceph-osd> -b libdw > output.txt
> >
> > If the libdw backend is having problems, you can use -b libdwarf instead, but it's much slower and takes longer to collect as many samples (you might want to do -n 1000 instead).
> >
> > Mark
> >
> > On 7/25/22 11:17, Zoltan Langi wrote:
> >> Hi people, we have an interesting issue here and I would like to ask if anyone has seen anything like this before.
> >>
> >> First, our system:
> >>
> >> The Ceph version is 17.2.1, but we have also seen the same behaviour on 16.2.9.
> >>
> >> Our kernel version is 5.13.0-51 and our NVMe disks are Samsung PM983.
> >>
> >> In our deployment we have 12 nodes and 72 disks in total, with 2 OSDs per disk, which makes 144 OSDs in total.
> >>
> >> The deployment was done by ceph-rook with default values: 6 CPU cores and 4 GB of memory allocated to each OSD.
> >>
> >> The issue we are experiencing: we create, for example, 100 volumes via ceph-csi and attach them to Kubernetes pods via rbd. We are talking about 100 volumes in total, 2 GB each. We run fio performance tests (read, write, mixed) on them, so the volumes are being used heavily. Ceph delivers good performance, no problems at all.
> >>
> >> Performance we get, for example: read IOPS 3371027, write IOPS 727714, read BW 79.9 GB/s, write BW 31.2 GB/s.
> >>
> >> After the tests are complete, these volumes just sit there doing nothing for a longer period of time, for example 48 hours. After that, we clean up the pods, clean up the volumes and delete them.
> >>
> >> We then recreate the volumes and pods once more with the same spec (100 pods, 2 GB each) and run the same tests again. We don't even get half the performance we measured before leaving the pods sitting there doing nothing for 2 days.
> >>
> >> Performance we get after deleting the volumes, recreating them and rerunning the tests: read IOPS 1716239, write IOPS 370631, read BW 37.8 GB/s, write BW 7.47 GB/s.
> >>
> >> We can clearly see that it's a big performance loss.
> >>
> >> If we clean up the Ceph deployment, wipe the disks completely and redeploy, the cluster once again delivers great performance.
> >>
> >> We haven't seen such behaviour with Ceph version 14.x.
> >>
> >> Has anyone seen such a thing? Thanks in advance!
> >>
> >> Zoltan
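
Coming back to the two guesses at the top of this mail, a quick way to check them would be something like this (an untested sketch; <pool> and <image> are placeholders for whatever names ceph-csi generated):

# does a CSI-provisioned image actually have object-map enabled?
rbd info <pool>/<image> | grep features

# are old images still sitting in the RBD trash waiting to be purged?
rbd trash ls <pool>

# is the cluster still doing background work while the new benchmark runs?
ceph -s
ceph osd pool stats <pool>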