Hi Zoltan,
Sadly it looks like some of the debug symbols are messed up, which makes
things a little rough to debug from this. On the write path, if you look
at the bstore_kv_sync thread:
Good state write test:
+ 86.00% FileJournal::_open_file(long, long, bool)
|+ 86.00% ???
+ 11.50% ???
|+ 0.20% ???
Bad state write test:
Thread 2869223 (bstore_kv_sync) - 1000 samples
+ 73.70% FileJournal::_open_file(long, long, bool)
|+ 73.70% ???
+ 24.90% ???
That's really strange, because FileJournal is part of filestore. There
also seems to be stuff in this trace regarding BtrfsFileStoreBackend and
FuseStore::Stop(). Seems like the debug symbols are basically just
wrong. Is it possible that somehow you ended up with debug symbols for
the wrong version of ceph or something?
Mark
On 8/12/22 11:13, Zoltan Langi wrote:
Hi Mark,
I managed to profile one osd before and after the bad state. We have
downgraded ceph to 14.2.22
Good state with read test:
https://pastebin.com/etreYzQc
Good state with write test:
https://pastebin.com/qrN5MaY6
Bad state with read test:
https://pastebin.com/S1pRiJDq
Bad state with write test:
https://pastebin.com/dEv05eGV
Do you see anything obvious that could give us a clue what is going on?
Many thanks!
Zoltan
Am 02.08.22 um 19:01 schrieb Mark Nelson:
Ah, too bad! I suppose that was too easy. :)
Ok, so my two lines of thought:
1) Something related to the weird performance issues we ran into on
the PM983 after lots of fragmented writes over the drive. I think
we've worked around that with the fix in quincy, but perhaps you are
hitting a manifestation of it that we haven't. The way to investigate
that is to look at the NVMe block device stats with collectl or
iostat and see if you see higher io service times and longer device
queue lengths in the "bad" case vs the "good" case. If you do, it
means that something is making the drive(s) themselves laggy at
fulfilling requests. You might have to look at a bunch of drives in
case there's one acting up before the others do, but that's pretty
easy to do with either tool. For extra bonus points you can use
blktrace/blkparse/iowatcher to see if writes are really being
fragmented (there could be other causes of a drive becoming slow).
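To make the "higher service times and longer queues" check concrete, here is a minimal Python sketch of what iostat computes for you: it derives average IO service time and average queue depth from two counter snapshots shaped like the kernel's /proc/diskstats fields. The sample numbers are made up purely for illustration.

```python
# Sketch: derive iostat-style await and avgqu-sz from two
# /proc/diskstats-style counter snapshots. Field names follow the
# kernel's diskstats layout; the sample numbers below are invented.

def iostat_metrics(prev, curr, interval_s):
    """Return (avg_service_time_ms, avg_queue_depth) between two snapshots."""
    ios = (curr["reads"] - prev["reads"]) + (curr["writes"] - prev["writes"])
    # *_ticks fields are milliseconds spent doing IO of each type
    io_ms = (curr["read_ticks"] - prev["read_ticks"]) + \
            (curr["write_ticks"] - prev["write_ticks"])
    await_ms = io_ms / ios if ios else 0.0
    # weighted ticks / wall time approximates the average in-flight queue depth
    queue = (curr["weighted_ticks"] - prev["weighted_ticks"]) / (interval_s * 1000.0)
    return await_ms, queue

prev = {"reads": 1000, "writes": 2000, "read_ticks": 500,
        "write_ticks": 1500, "weighted_ticks": 4000}
curr = {"reads": 1500, "writes": 3000, "read_ticks": 800,
        "write_ticks": 2400, "weighted_ticks": 10000}

await_ms, queue = iostat_metrics(prev, curr, interval_s=1.0)
print(f"await={await_ms:.2f} ms  avgqu-sz={queue:.2f}")
```

If those numbers climb in the "bad" case while the workload stays the same, the drives themselves are getting slower at servicing requests.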
2) The other thing that comes to mind is RocksDB: either due to just
having more metadata to deal with, or perhaps as a result of having a
ton more objects, not enough onode cache, and having to issue onode
reads to RocksDB when you have cache misses. I believe we have hit-rate
perf counters for the onode cache, but you can get a hint if you
see a bunch of reads (specifically to the DB partition if you've
configured it to be separate) during writes. You may also want to
look at the compaction stats in the OSD log just to make sure it's
not super laggy. You can run this tool against the log to see a
summary and details regarding individual compaction events:
https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py
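Going back to the onode cache hit rate mentioned above, you can pull the counters from the OSD admin socket (`ceph daemon osd.<id> perf dump`). A small Python sketch follows; the counter names `bluestore_onode_hits`/`bluestore_onode_misses` are from memory, so verify them against `perf schema` on your version, and the sample blob is invented:

```python
import json

# Sketch: compute the onode cache hit rate from a `ceph daemon osd.N
# perf dump` JSON blob. Counter names are assumed from memory; check
# `ceph daemon osd.N perf schema` on your version before relying on them.
def onode_hit_rate(perf_dump):
    bs = perf_dump["bluestore"]
    hits = bs["bluestore_onode_hits"]
    misses = bs["bluestore_onode_misses"]
    total = hits + misses
    return hits / total if total else 1.0

# Made-up sample in the shape perf dump returns
sample = json.loads('{"bluestore": {"bluestore_onode_hits": 90000, '
                    '"bluestore_onode_misses": 10000}}')
print(f"onode hit rate: {onode_hit_rate(sample):.1%}")
```

A hit rate that drops sharply in the "bad" state would point at the not-enough-onode-cache theory.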
Those would be the first places I would look. If neither are
helpful, you could try profiling the OSDs using uwpmp as I mentioned
earlier.
Mark
On 8/2/22 09:50, Zoltan Langi wrote:
Hey Mark, I take back that these options solve the issue. I just ran my
tests twice again and here are the results:
https://ibb.co/9vY5xgS
https://ibb.co/71pSCQv
Back to where it was; performance dropped down today. So it seems like the
bdev_enable_discard = true
bdev_async_discard = true
options didn't make any difference in the end and the problem
reappeared, just a bit later.
I have read all the articles you posted, thanks for those; however I am
still struggling with this. Any other recommendation or idea what to
check?
Thanks a lot,
Zoltan
Am 01.08.22 um 17:53 schrieb Mark Nelson:
Hi Zoltan,
It doesn't look like your pictures showed up for me at least. Very
interesting results though! Are (or were) the drives particularly
full when you've run into performance problems that the discard
option appears to fix? There have been some discussions in the
past regarding online discard vs periodic discard ala fstrim. The
gist of it is that there are performance implications for online
trim, but there are (eventual) performance implications if you let the
drive get too full before doing an offline trim (that itself can be
impactful). There's been quite a bit of discussion about it on the
mailing list and in PRs:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/YFQKVCAMHHQ72AMTL2MQAA7QN7YCJ7GA/
https://github.com/ceph/ceph/pull/14727
Specifically, see this comment regarding how it can affect garbage
collection, but also the effect of burst TRIM commands on the FTL:
https://github.com/ceph/ceph/pull/14727#issuecomment-342399578
And some performance testing by Igor here:
https://github.com/ceph/ceph/pull/20723#pullrequestreview-104218724
It would be very interesting to see if you see a similar
performance improvement if we had an fstrim-like discard option you
could run before the new test. There's a tracker ticket for it,
but afaik no one has actually implemented anything yet:
https://tracker.ceph.com/issues/38494
Regarding whether it's safe to have (async) discard enabled...
Maybe? :) We left it disabled by default because we didn't want to
deal with having to situationally disable it for drives with buggy
firmwares and some of the other associated problems with online
discard. Having said that, in your case it sounds like enabling it
is yielding good results with the PM983 and your workload.
There's a really good (but slightly old now) article on LWN
detailing the discussion the kernel engineers were having regarding
all of this at the LSFMM Summit a few years ago:
https://lwn.net/Articles/787272/
In the comments, Chris Mason mentions the same delete issue we
probably need to tackle (see Igor's comment linked above):
"The XFS async trim implementation is pretty reasonable, and it can
be a big win in some workloads. Basically anything that gets pushed
out of the critical section of the transaction commit can have a
huge impact on performance. The major thing it's missing is a way
to throttle new deletes from creating a never ending stream of
discards, but I don't think any of the filesystems are doing that
yet."
Mark
On 8/1/22 08:36, Zoltan Langi wrote:
Hey Frank and Mark,
Thanks for your response and sorry about coming back a bit late,
but I needed to test something that needs time.
How I reproduced this issue: created 100 volumes with ceph-csi, ran
3 sets of tests, let the volumes sit for 48 hours, then deleted
the volumes, recreated them, and ran the tests 3x in a row.
If you look at the picture:
picture1
The picture above clearly shows the performance degradation. We run
the first test (first read, then write) at 09:20; it finishes at
09:45. At 11:00 we run the new test; it finishes at 11:20 and is
already struggling with the read iops, and the write iops drop a lot,
though the read looks more like a saw graph. At 11:40 I reran the
test and now the write has normalised at a bad level: no more saw
pattern, and the write sticks to the bad levels.
Let's have a look at the bandwidth graph:
picture2
Compare the 09:40-10:05 part with the 12:00-12:25 part. Those are
identical tests; throughput dropped a lot. The only way to recover from
this state is to recreate the bluestore devices from scratch.
We have enabled the following options in rook-ceph:
bdev_enable_discard = true
bdev_async_discard = true
Now let's have a look at the speed comparison:
Data from last Friday, before the volumes sat for 48 hours:
picture3
picture4
We see 3 tests. Test 1: 16:40-19:00 Test 2: 20:00-21:35 and Test
3: 21:40-23:30. We see slight write degradation, but it should
stay the same for the rest of the time.
Now let's see the test runs from today:
picture5
picture6
We see 3 tests. Test 1: 09:20-11:00 Test 2: 11:05-12:40 Test 3:
13:10-14:40.
As we see, after enabling these options, the system is delivering
constant speeds without degradation and huge performance loss like
before.
Has anyone come across behaviour like this before?
We haven't seen any mention of these options in the official docs,
just in pull requests. Is it safe to use these options in
production at all?
Many thanks,
Zoltan
Am 25.07.22 um 21:42 schrieb Mark Nelson:
I don't think so if this is just plain old RBD. RBD shouldn't
require a bunch of RocksDB iterator seeks in the read/write hot
path and writes should pretty quickly clear out tombstones as
part of the memtable flush and compaction process even in the
slow case. Maybe in some kind of pathologically bad read-only
corner case with no onode cache but it would be bad for more
reasons than what's happening in that tracker ticket imho (even
reading onodes from rocksdb block cache is significantly slower
than BlueStore's onode cache).
If RBD mirror (or snapshots) are involved that could be a
different story though. I believe to deal with deletes in that
case we have to go through iteration/deletion loops that have
same root issue as what's going on in the tracker ticket and it
can end up impacting client IO. Gabi and Paul are
testing/reworking how the snapmapper works and I've started a
sort of a catch-all PR for improving our RocksDB tunings/glue here:
https://github.com/ceph/ceph/pull/47221
Mark
On 7/25/22 12:48, Frank Schilder wrote:
Could it be related to this performance death trap:
https://tracker.ceph.com/issues/55324 ?
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Mark Nelson <mnelson@xxxxxxxxxx>
Sent: 25 July 2022 18:50
To: ceph-users@xxxxxxx
Subject: Re: weird performance issue on ceph
Hi Zoltan,
We have a very similar setup with one of our upstream community
performance test clusters. 60 4TB PM983 drives spread across 10
nodes.
We get similar numbers to what you are initially seeing (scaled
down to
60 drives) though with somewhat lower random read IOPS (we tend
to max
out at around 2M with 60 drives on this HW). I haven't seen any
issues
with quincy like what you are describing, but on this cluster
most of
the tests have been on bare metal. One issue we have noticed
with the
PM983 drives is that they may be more susceptible to non-optimal
write
patterns causing slowdowns vs other NVMe drives in the lab. We
actually
had to issue a last minute PR for quincy to change the disk
allocation
behavior to deal with it. See:
https://github.com/ceph/ceph/pull/45771
https://github.com/ceph/ceph/pull/45884
I don't *think* this is the issue you are hitting since the fix in
#45884 should have taken care of it, but it might be something
to keep
in the back of your mind. Otherwise, the fact that you are
seeing such
a dramatic difference across both small and large read/write
benchmarks
makes me think there is something else going on. Is there any
chance
that some other bottleneck is being imposed when the pods and
volumes
are deleted and recreated? Might be worth looking at memory and CPU
usage of the OSDs in all of the cases and RocksDB
flushing/compaction
stats from the OSD logs. Also a quick check with
collectl/iostat/sar
during the slow case to make sure none of the drives are showing
latency
and built up IOs in the device queues.
If you want to go deeper down the rabbit hole you can try
running my
wallclock profiler against one of your OSDs in the fast/slow
cases, but
you'll have to make sure it has access to debug symbols:
https://github.com/markhpc/uwpmp.git
run it like:
./uwpmp -n 10000 -p <pid of ceph-osd> -b libdw > output.txt
If the libdw backend is having problems you can use -b libdwarf
instead,
but it's much slower and takes longer to collect as many samples
(you
might want to do -n 1000 instead).
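Once you have a capture, a quick way to compare the fast and slow runs is to pull out the top-level entries. Here is a small Python sketch (not part of uwpmp itself; it assumes the flat "+ NN.NN% symbol" line shape seen in the traces quoted at the top of this thread):

```python
import re

# Sketch: summarize top-level entries from wallclock-profiler output
# lines shaped like "+ 86.00% FileJournal::_open_file(long, long, bool)".
# The line format is assumed from the traces quoted in this thread;
# nested frames (lines starting with "|") are deliberately skipped.
LINE = re.compile(r"^\+ (\d+\.\d+)% (.+)$")

def top_entries(text, limit=5):
    entries = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            entries.append((float(m.group(1)), m.group(2)))
    return sorted(entries, reverse=True)[:limit]

sample = """\
+ 73.70% FileJournal::_open_file(long, long, bool)
|+ 73.70% ???
+ 24.90% ???
"""
for pct, sym in top_entries(sample):
    print(f"{pct:6.2f}%  {sym}")
```

Diffing the top entries between the good and bad captures usually makes it obvious where the extra wall time is going.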
Mark
On 7/25/22 11:17, Zoltan Langi wrote:
Hi people, we got an interesting issue here and I would like to
ask if anyone has seen anything like this before.
First: our system:
The ceph version is 17.2.1 but we have also seen the same behaviour
on 16.2.9.
Our kernel version is 5.13.0-51 and our NVMe disks are Samsung
PM983.
In our deployment we got 12 nodes in total; 72 disks with 2 OSDs per
disk makes 144 OSDs in total.
The deployment was done by ceph-rook with default values: 6 CPU
cores allocated to each OSD and 4GB of memory allocated to each OSD.
The issue we are experiencing: we create for example 100 volumes via
ceph-csi and attach them to kubernetes pods via rbd. We talk
about 100 volumes in total, 2GB each. We run fio performance tests (read,
write,
mixed) on them so the volumes are being used heavily. Ceph
delivers
good performance, no problems at all.
Performance we get, for example: read iops 3371027, write iops 727714,
read bw 79.9 GB/s, write bw 31.2 GB/s.
After the tests are complete, these volumes just sitting there
doing
nothing for a longer period of time for example 48 hours. After
that,
we clean the pods up, clean the volumes up and delete them.
Recreate the volumes and pods once more, same spec (2GB each
100 pods)
then run the same tests once again. We don’t even get half the
performance that we measured before leaving the pods sitting
there doing nothing for 2 days.
Performance we get after deleting the volumes and recreating them,
then rerunning the tests: read iops 1716239, write iops 370631,
read bw 37.8 GB/s, write bw 7.47 GB/s.
We can clearly see that it’s a big performance loss.
If we clean up the ceph deployment, wipe the disks out
completely and
redeploy, the cluster once again delivering great performance.
We haven’t seen such behaviour with ceph version 14.x.
Has anyone seen such a thing? Thanks in advance!
Zoltan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx