Anyway, I uploaded them to an image hosting site:

Picture1: https://ibb.co/jZfBW9g
Picture2: https://ibb.co/ftnp8Sg
Picture3: https://ibb.co/Qrt140Z
Picture4: https://ibb.co/945Hhc1
Picture5: https://ibb.co/VJXhkm0
Picture6: https://ibb.co/mrpgHPv

Please match them up with the previous email, and finally you can see the performance graphs I have collected.
Many thanks,
Zoltan

On 01.08.22 at 17:53, Mark Nelson wrote:
Hi Zoltan,

It doesn't look like your pictures showed up, for me at least. Very interesting results though! Are (or were) the drives particularly full when you've run into the performance problems that the discard option appears to fix? There have been some discussions in the past regarding online discard vs. periodic discard à la fstrim. The gist of it is that online trim has performance implications, but there are also (eventual) performance implications if you let the drive get too full before doing an offline trim (which can itself be impactful). There's been quite a bit of discussion about it on the mailing list and in PRs:

https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/YFQKVCAMHHQ72AMTL2MQAA7QN7YCJ7GA/
https://github.com/ceph/ceph/pull/14727

Specifically, see this comment on how it can affect garbage collection, but also the effect of burst TRIM commands on the FTL:

https://github.com/ceph/ceph/pull/14727#issuecomment-342399578

And some performance testing by Igor here:

https://github.com/ceph/ceph/pull/20723#pullrequestreview-104218724

It would be very interesting to see whether you get a similar performance improvement if we had an fstrim-like discard option you could run before the new test. There's a tracker ticket for it, but AFAIK no one has actually implemented anything yet:

https://tracker.ceph.com/issues/38494

Regarding whether it's safe to have (async) discard enabled... maybe? :) We left it disabled by default because we didn't want to deal with having to situationally disable it for drives with buggy firmware, along with some of the other problems associated with online discard.
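For reference, the "periodic discard à la fstrim" model discussed above is what util-linux already ships for mounted filesystems as fstrim.timer; a minimal sketch of such a unit is below. Note this only illustrates the periodic model for filesystems; BlueStore uses the raw device directly, which is exactly why the fstrim-like option in the tracker ticket would have to be implemented inside BlueStore itself.

```ini
# Sketch of a periodic-discard systemd timer (the fstrim model).
# Distros ship an equivalent as fstrim.timer with util-linux;
# this does NOT apply to raw BlueStore OSD devices.
[Unit]
Description=Discard unused filesystem blocks weekly

[Timer]
OnCalendar=weekly
AccuracySec=1h
Persistent=true

[Install]
WantedBy=timers.target
```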
Having said that, in your case it sounds like enabling it is yielding good results with the PM983 and your workload.

There's a really good (but slightly old now) article on LWN detailing the discussion the kernel engineers were having about all of this at the LSFMM Summit a few years ago:

https://lwn.net/Articles/787272/

In the comments, Chris Mason mentions the same delete issue we probably need to tackle (see Igor's comment linked above):

"The XFS async trim implementation is pretty reasonable, and it can be a big win in some workloads. Basically anything that gets pushed out of the critical section of the transaction commit can have a huge impact on performance. The major thing it's missing is a way to throttle new deletes from creating a never ending stream of discards, but I don't think any of the filesystems are doing that yet."

Mark

On 8/1/22 08:36, Zoltan Langi wrote:

Hey Frank and Mark,

Thanks for your response, and sorry about coming back a bit late, but I needed to test something that takes time.

How I reproduced this issue: I created 100 volumes with ceph-csi, ran 3 sets of tests, let the volumes sit for 48 hours, then deleted the volumes, recreated them and ran the tests 3x in a row.

Picture1 clearly shows the performance degradation. We run the first test (first read, then write) at 09:20 and it finishes at 09:45. At 11:00 we run the next test, which finishes at 11:20 and is already struggling with the read IOPS; the write IOPS drop a lot, while the read looks more like a saw-tooth graph. At 11:40 I reran the test and the write has now settled at a bad level: no more saw-tooth pattern, and the write sticks to the degraded level.

Now let's have a look at the bandwidth graph (picture2): compare the 09:40-10:05 part with the 12:00-12:25 part. Those are identical tests. It dropped a lot.
The only way to recover from this state is to recreate the BlueStore devices from scratch.

We have enabled the following options in rook-ceph:

bdev_enable_discard = true
bdev_async_discard = true

Now let's have a look at the speed comparison. Data from last Friday, before the volumes sat for 48 hours (picture3, picture4): we see 3 tests. Test 1: 16:40-19:00, Test 2: 20:00-21:35, Test 3: 21:40-23:30. We see slight write degradation, but it stays roughly the same for the rest of the time.

Now the test runs from today (picture5, picture6): we see 3 tests. Test 1: 09:20-11:00, Test 2: 11:05-12:40, Test 3: 13:10-14:40.

As we can see, after enabling these options the system delivers constant speeds, without the degradation and huge performance loss we saw before.

Has anyone come across behaviour like this before? We haven't seen any mention of these options in the official docs, just in pull requests. Is it safe to use these options in production at all?

Many thanks,
Zoltan

On 25.07.22 at 21:42, Mark Nelson wrote:

I don't think so, if this is just plain old RBD. RBD shouldn't require a bunch of RocksDB iterator seeks in the read/write hot path, and writes should pretty quickly clear out tombstones as part of the memtable flush and compaction process, even in the slow case. Maybe in some kind of pathologically bad read-only corner case with no onode cache, but it would be bad for more reasons than what's happening in that tracker ticket IMHO (even reading onodes from the RocksDB block cache is significantly slower than BlueStore's onode cache).

If RBD mirror (or snapshots) are involved, that could be a different story though. I believe that to deal with deletes in that case we have to go through iteration/deletion loops that have the same root issue as what's going on in the tracker ticket, and it can end up impacting client IO.
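For anyone else on Rook wanting to try the two options above: a sketch of how they could be applied cluster-wide is below, via Rook's config override ConfigMap. The ConfigMap name, namespace and section placement are assumptions based on the standard Rook mechanism; verify against the docs for your Rook version before applying.

```yaml
# Sketch (assumed names): Rook merges the "config" key of this
# ConfigMap into the generated ceph.conf on all daemons.
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: rook-ceph
data:
  config: |
    [osd]
    bdev_enable_discard = true
    bdev_async_discard = true
```

OSDs read these options at startup, so a rolling restart of the OSD pods would be needed for the change to take effect.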
Gabi and Paul are testing/reworking how the snapmapper works, and I've started a sort of catch-all PR for improving our RocksDB tunings/glue here:

https://github.com/ceph/ceph/pull/47221

Mark

On 7/25/22 12:48, Frank Schilder wrote:

Could it be related to this performance death trap: https://tracker.ceph.com/issues/55324 ?

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Mark Nelson <mnelson@xxxxxxxxxx>
Sent: 25 July 2022 18:50
To: ceph-users@xxxxxxx
Subject: Re: weird performance issue on ceph

Hi Zoltan,

We have a very similar setup with one of our upstream community performance test clusters: 60 4TB PM983 drives spread across 10 nodes. We get similar numbers to what you are initially seeing (scaled down to 60 drives), though with somewhat lower random read IOPS (we tend to max out at around 2M with 60 drives on this hardware). I haven't seen any issues with Quincy like what you are describing, but on this cluster most of the tests have been on bare metal.

One issue we have noticed with the PM983 drives is that they may be more susceptible to non-optimal write patterns causing slowdowns than other NVMe drives in the lab. We actually had to issue a last-minute PR for Quincy to change the disk allocation behavior to deal with it. See:

https://github.com/ceph/ceph/pull/45771
https://github.com/ceph/ceph/pull/45884

I don't *think* this is the issue you are hitting, since the fix in #45884 should have taken care of it, but it might be something to keep in the back of your mind. Otherwise, the fact that you are seeing such a dramatic difference across both small and large read/write benchmarks makes me think there is something else going on. Is there any chance that some other bottleneck is being imposed when the pods and volumes are deleted and recreated? It might be worth looking at memory and CPU usage of the OSDs in all of the cases, and at RocksDB flushing/compaction stats from the OSD logs.
Also, do a quick check with collectl/iostat/sar during the slow case to make sure none of the drives are showing latency and built-up IOs in the device queues.

If you want to go deeper down the rabbit hole, you can try running my wallclock profiler against one of your OSDs in the fast/slow cases, but you'll have to make sure it has access to debug symbols:

https://github.com/markhpc/uwpmp.git

Run it like:

./uwpmp -n 10000 -p <pid of ceph-osd> -b libdw > output.txt

If the libdw backend is having problems you can use -b libdwarf instead, but it's much slower and takes longer to collect as many samples (you might want to do -n 1000 instead).

Mark

On 7/25/22 11:17, Zoltan Langi wrote:

Hi people, we've got an interesting issue here and I would like to ask if anyone has seen anything like this before.

First, our system: the Ceph version is 17.2.1, but we have also seen the same behaviour on 16.2.9. Our kernel version is 5.13.0-51 and our NVMe disks are Samsung PM983. In our deployment we have 12 nodes and 72 disks in total, with 2 OSDs per disk, making 144 OSDs. The deployment was done by ceph-rook with default values: 6 CPU cores and 4GB of memory allocated to each OSD.

The issue we are experiencing: we create, for example, 100 volumes via ceph-csi and attach them to Kubernetes pods via RBD. We are talking about 100 volumes in total, 2GB each. We run fio performance tests (read, write, mixed) on them, so the volumes are being used heavily. Ceph delivers good performance, no problems at all. Performance we get, for example:

read iops: 3371027
write iops: 727714
read bw: 79.9 GB/s
write bw: 31.2 GB/s

After the tests are complete, these volumes just sit there doing nothing for a longer period of time, for example 48 hours. After that, we clean up the pods, clean up the volumes and delete them. We recreate the volumes and pods once more with the same spec (100 pods, 2GB each), then run the same tests once again.
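For anyone trying to reproduce numbers like these, a fio job file in the spirit of the tests described might look like the sketch below. The thread doesn't give block sizes, queue depths or runtimes, so every value here is an assumption, as is the mount path; it is run once per pod against that pod's mounted RBD volume.

```ini
; Sketch only -- bs, iodepth, runtime and filename are assumed,
; not taken from the thread. Runs read, write, then mixed phases.
[global]
ioengine=libaio
direct=1
bs=4k
iodepth=32
size=2G
time_based=1
runtime=300
filename=/mnt/rbd-vol/testfile

[randread]
rw=randread
stonewall

[randwrite]
rw=randwrite
stonewall

[mixed]
rw=randrw
rwmixread=70
stonewall
```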
We don't even get half the performance that we measured before leaving the pods sitting there doing nothing for 2 days. Performance we get after deleting the volumes, recreating them and rerunning the tests:

read iops: 1716239
write iops: 370631
read bw: 37.8 GB/s
write bw: 7.47 GB/s

We can clearly see that it's a big performance loss. If we clean up the Ceph deployment, wipe the disks completely and redeploy, the cluster once again delivers great performance. We haven't seen such behaviour with Ceph version 14.x.

Has anyone seen such a thing? Thanks in advance!

Zoltan

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx