Anyway, I uploaded them to an image hosting site:

Picture1: https://ibb.co/jZfBW9g
Picture2: https://ibb.co/ftnp8Sg
Picture3: https://ibb.co/Qrt140Z
Picture4: https://ibb.co/945Hhc1
Picture5: https://ibb.co/VJXhkm0
Picture6: https://ibb.co/mrpgHPv

Please match them up with the previous email, and finally you can see the performance graphs I have collected.
Many thanks,
Zoltan

On 01.08.22 at 17:53, Mark Nelson wrote:
Hi Zoltan,

It doesn't look like your pictures showed up, for me at least. Very interesting results though! Are (or were) the drives particularly full when you've run into the performance problems that the discard option appears to fix? There have been some discussions in the past regarding online discard vs. periodic discard à la fstrim. The gist of it is that online trim has performance implications, but there are also (eventual) performance implications if you let the drive get too full before doing an offline trim (which can itself be impactful). There's been quite a bit of discussion about it on the mailing list and in PRs:

https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/YFQKVCAMHHQ72AMTL2MQAA7QN7YCJ7GA/
https://github.com/ceph/ceph/pull/14727

Specifically, see this comment on how it can affect garbage collection, but also the effect of burst TRIM commands on the FTL:

https://github.com/ceph/ceph/pull/14727#issuecomment-342399578

And some performance testing by Igor here:

https://github.com/ceph/ceph/pull/20723#pullrequestreview-104218724

It would be very interesting to see whether you get a similar performance improvement if we had an fstrim-like discard option you could run before the new test. There's a tracker ticket for it, but AFAIK no one has actually implemented anything yet:

https://tracker.ceph.com/issues/38494

Regarding whether it's safe to have (async) discard enabled... maybe? :) We left it disabled by default because we didn't want to deal with having to situationally disable it for drives with buggy firmware, along with some of the other problems associated with online discard.
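For reference, the "periodic discard à la fstrim" model discussed above is what util-linux already ships for mounted filesystems as fstrim.timer; a minimal sketch of such a unit is below. Note this only illustrates the periodic model for filesystems; BlueStore uses the raw device directly, which is exactly why the fstrim-like option in the tracker ticket would have to be implemented inside BlueStore itself.

```ini
# Sketch of a periodic-discard systemd timer (the fstrim model).
# Distros ship an equivalent as fstrim.timer with util-linux;
# this does NOT apply to raw BlueStore OSD devices.
[Unit]
Description=Discard unused filesystem blocks weekly

[Timer]
OnCalendar=weekly
AccuracySec=1h
Persistent=true

[Install]
WantedBy=timers.target
```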
Having said that, in your case it sounds like enabling it is yielding good results with the PM983 and your workload.

There's a really good (but slightly old now) article on LWN detailing the discussion the kernel engineers were having about all of this at the LSFMM Summit a few years ago:

https://lwn.net/Articles/787272/

In the comments, Chris Mason mentions the same delete issue we probably need to tackle (see Igor's comment linked above):

"The XFS async trim implementation is pretty reasonable, and it can be a big win in some workloads. Basically anything that gets pushed out of the critical section of the transaction commit can have a huge impact on performance. The major thing it's missing is a way to throttle new deletes from creating a never ending stream of discards, but I don't think any of the filesystems are doing that yet."

Mark

On 8/1/22 08:36, Zoltan Langi wrote:

Hey Frank and Mark,

Thanks for your response, and sorry about coming back a bit late, but I needed to test something that takes time.

How I reproduced this issue: I created 100 volumes with ceph-csi, ran 3 sets of tests, let the volumes sit for 48 hours, then deleted the volumes, recreated them and ran the tests 3x in a row.

Picture1 clearly shows the performance degradation. We run the first test (first read, then write) at 09:20 and it finishes at 09:45. At 11:00 we run the next test, which finishes at 11:20 and is already struggling with the read IOPS; the write IOPS drop a lot, while the read looks more like a saw-tooth graph. At 11:40 I reran the test and the write has now settled at a bad level: no more saw-tooth pattern, and the write sticks to the degraded level.

Now let's have a look at the bandwidth graph (picture2): compare the 09:40-10:05 part with the 12:00-12:25 part. Those are identical tests. It dropped a lot.
The only way to recover from this state is to recreate the BlueStore devices from scratch.

We have enabled the following options in rook-ceph:

bdev_enable_discard = true
bdev_async_discard = true

Now let's have a look at the speed comparison. Data from last Friday, before the volumes sat for 48 hours (picture3, picture4): we see 3 tests. Test 1: 16:40-19:00, Test 2: 20:00-21:35, Test 3: 21:40-23:30. We see slight write degradation, but it stays roughly the same for the rest of the time.

Now the test runs from today (picture5, picture6): we see 3 tests. Test 1: 09:20-11:00, Test 2: 11:05-12:40, Test 3: 13:10-14:40.

As we can see, after enabling these options the system delivers constant speeds, without the degradation and huge performance loss we saw before.

Has anyone come across behaviour like this before? We haven't seen any mention of these options in the official docs, just in pull requests. Is it safe to use these options in production at all?

Many thanks,
Zoltan

On 25.07.22 at 21:42, Mark Nelson wrote:

I don't think so, if this is just plain old RBD. RBD shouldn't require a bunch of RocksDB iterator seeks in the read/write hot path, and writes should pretty quickly clear out tombstones as part of the memtable flush and compaction process, even in the slow case. Maybe in some kind of pathologically bad read-only corner case with no onode cache, but it would be bad for more reasons than what's happening in that tracker ticket IMHO (even reading onodes from the RocksDB block cache is significantly slower than BlueStore's onode cache).

If RBD mirror (or snapshots) are involved, that could be a different story though. I believe that to deal with deletes in that case we have to go through iteration/deletion loops that have the same root issue as what's going on in the tracker ticket, and it can end up impacting client IO.
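For anyone else on Rook wanting to try the two options above: a sketch of how they could be applied cluster-wide is below, via Rook's config override ConfigMap. The ConfigMap name, namespace and section placement are assumptions based on the standard Rook mechanism; verify against the docs for your Rook version before applying.

```yaml
# Sketch (assumed names): Rook merges the "config" key of this
# ConfigMap into the generated ceph.conf on all daemons.
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: rook-ceph
data:
  config: |
    [osd]
    bdev_enable_discard = true
    bdev_async_discard = true
```

OSDs read these options at startup, so a rolling restart of the OSD pods would be needed for the change to take effect.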
Gabi and Paul are testing/reworking how the snapmapper works, and I've started a sort of catch-all PR for improving our RocksDB tunings/glue here:

https://github.com/ceph/ceph/pull/47221

Mark

On 7/25/22 12:48, Frank Schilder wrote:

Could it be related to this performance death trap: https://tracker.ceph.com/issues/55324 ?

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Mark Nelson <mnelson@xxxxxxxxxx>
Sent: 25 July 2022 18:50
To: ceph-users@xxxxxxx
Subject: Re: weird performance issue on ceph

Hi Zoltan,

We have a very similar setup with one of our upstream community performance test clusters: 60 4TB PM983 drives spread across 10 nodes. We get similar numbers to what you are initially seeing (scaled down to 60 drives), though with somewhat lower random read IOPS (we tend to max out at around 2M with 60 drives on this hardware). I haven't seen any issues with Quincy like what you are describing, but on this cluster most of the tests have been on bare metal.

One issue we have noticed with the PM983 drives is that they may be more susceptible to non-optimal write patterns causing slowdowns than other NVMe drives in the lab. We actually had to issue a last-minute PR for Quincy to change the disk allocation behavior to deal with it. See:

https://github.com/ceph/ceph/pull/45771
https://github.com/ceph/ceph/pull/45884

I don't *think* this is the issue you are hitting, since the fix in #45884 should have taken care of it, but it might be something to keep in the back of your mind. Otherwise, the fact that you are seeing such a dramatic difference across both small and large read/write benchmarks makes me think there is something else going on. Is there any chance that some other bottleneck is being imposed when the pods and volumes are deleted and recreated? It might be worth looking at memory and CPU usage of the OSDs in all of the cases, and at RocksDB flushing/compaction stats from the OSD logs.
Also, do a quick check with collectl/iostat/sar during the slow case to make sure none of the drives are showing latency and built-up IOs in the device queues.

If you want to go deeper down the rabbit hole, you can try running my wallclock profiler against one of your OSDs in the fast/slow cases, but you'll have to make sure it has access to debug symbols:

https://github.com/markhpc/uwpmp.git

Run it like:

./uwpmp -n 10000 -p <pid of ceph-osd> -b libdw > output.txt

If the libdw backend is having problems you can use -b libdwarf instead, but it's much slower and takes longer to collect as many samples (you might want to do -n 1000 instead).

Mark

On 7/25/22 11:17, Zoltan Langi wrote:

Hi people, we've got an interesting issue here and I would like to ask if anyone has seen anything like this before.

First, our system: the Ceph version is 17.2.1, but we have also seen the same behaviour on 16.2.9. Our kernel version is 5.13.0-51 and our NVMe disks are Samsung PM983. In our deployment we have 12 nodes and 72 disks in total, with 2 OSDs per disk, making 144 OSDs. The deployment was done by ceph-rook with default values: 6 CPU cores and 4GB of memory allocated to each OSD.

The issue we are experiencing: we create, for example, 100 volumes via ceph-csi and attach them to Kubernetes pods via RBD. We are talking about 100 volumes in total, 2GB each. We run fio performance tests (read, write, mixed) on them, so the volumes are being used heavily. Ceph delivers good performance, no problems at all. Performance we get, for example:

read iops: 3371027
write iops: 727714
read bw: 79.9 GB/s
write bw: 31.2 GB/s

After the tests are complete, these volumes just sit there doing nothing for a longer period of time, for example 48 hours. After that, we clean up the pods, clean up the volumes and delete them. We recreate the volumes and pods once more with the same spec (100 pods, 2GB each), then run the same tests once again.
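For anyone trying to reproduce numbers like these, a fio job file in the spirit of the tests described might look like the sketch below. The thread doesn't give block sizes, queue depths or runtimes, so every value here is an assumption, as is the mount path; it is run once per pod against that pod's mounted RBD volume.

```ini
; Sketch only -- bs, iodepth, runtime and filename are assumed,
; not taken from the thread. Runs read, write, then mixed phases.
[global]
ioengine=libaio
direct=1
bs=4k
iodepth=32
size=2G
time_based=1
runtime=300
filename=/mnt/rbd-vol/testfile

[randread]
rw=randread
stonewall

[randwrite]
rw=randwrite
stonewall

[mixed]
rw=randrw
rwmixread=70
stonewall
```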
We don't even get half the performance that we measured before leaving the pods sitting there doing nothing for 2 days. Performance we get after deleting the volumes, recreating them and rerunning the tests:

read iops: 1716239
write iops: 370631
read bw: 37.8 GB/s
write bw: 7.47 GB/s

We can clearly see that it's a big performance loss. If we clean up the Ceph deployment, wipe the disks completely and redeploy, the cluster once again delivers great performance. We haven't seen such behaviour with Ceph version 14.x.

Has anyone seen such a thing? Thanks in advance!

Zoltan

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx