Hi Mark,
Of course, I'll share how we triggered it:
Format the drive with a 4k LBA size:
nvme format --lbaf=1 /dev/nvme0n1
Make a file system on the disk, we used ext4:
mkfs.ext4 /dev/nvme0n1
Mount the disk to a mount point:
mount /dev/nvme0n1 /mnt/test/
Run the FIO write test to write data to a file:
fio --randrepeat=0 --verify=0 --ioengine=libaio --direct=1
--gtod_reduce=1 --name=write_seq --filename=/mnt/test/fiotest1 --bs=4M
--iodepth=16 --size=500G --readwrite=write --time_based --ramp_time=2s
--runtime=480m --thread --numjobs=4 --offset_increment=100M
Check the nvme list output. Once the drive usage reaches 500 GB, kill the
fio process and restart it with a new filename so it won't overwrite the
original file (one way to automate this check is sketched below):
fio --randrepeat=0 --verify=0 --ioengine=libaio --direct=1
--gtod_reduce=1 --name=write_seq --filename=/mnt/test/fiotest2 --bs=4M
--iodepth=16 --size=500G --readwrite=write --time_based --ramp_time=2s
--runtime=480m --thread --numjobs=4 --offset_increment=100M
Shortly afterwards you will see the degraded performance, which gets
worse and worse over time.
We used firmware EDA5702Q.
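In case it is useful to anyone, here is a rough, untested sketch of how the
"watch the usage, then restart fio on a new file" step could be scripted. It
assumes nvme-cli with JSON output plus jq, and that the UsedBytes field
reflects the namespace utilisation on this drive (field names may differ
between nvme-cli versions):

# Optional sanity check: confirm which LBA format index is the 4k one
nvme id-ns /dev/nvme0n1 -H | grep 'LBA Format'

# Wait until the namespace reports ~500 GB used, then kill fio so it can
# be restarted with --filename=/mnt/test/fiotest2.
while true; do
    used=$(nvme list -o json | jq '.Devices[] | select(.DevicePath=="/dev/nvme0n1") | .UsedBytes')
    used=${used:-0}
    echo "used bytes: $used"
    if [ "$used" -ge 500000000000 ]; then
        pkill -f 'fio.*fiotest1'
        break
    fi
    sleep 10
done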
Hope this makes sense. We opened a case with the disk supplier; I will
update you if I get any kind of sensible response from them.
Zoltan
Am 26.09.22 um 16:52 schrieb Mark Nelson:
Hi Zoltan,
Great investigation work! I think in my tests the data set typically
was smaller than 500GB/drive. If you have a simple fio test that can
be run against a bare NVMe drive I can try running it on one of our
test nodes. FWIW I kind of suspected that the issue I had to work
around for quincy might have been related to some kind of internal
cache being saturated. I wonder if the drive is fast up until some
limit is hit, after which it reverts to slower flash or something?
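One quick way to check for that kind of cliff (just a sketch, not something
I've verified on your drives, and the job parameters are arbitrary): have fio
log per-second bandwidth during a long sequential write and look for the
point where it steps down, e.g.:

# WARNING: writes directly to the raw device and destroys its contents.
fio --name=cache_cliff --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
    --rw=write --bs=1M --iodepth=16 --time_based --runtime=30m \
    --write_bw_log=cache_cliff --log_avg_msec=1000
# The resulting cache_cliff_bw.*.log holds one averaged bandwidth sample per
# second; a sharp, persistent drop partway through would point at an internal
# cache/SLC-style limit rather than fragmentation.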
Mark
On 9/26/22 06:39, Zoltan Langi wrote:
Hi Mark and the mailing list, we managed to figure out something very
weird that I would like to share with you, and to ask if you have
seen anything like this before.
We started to investigate the drives one by one after Mark's
suggestion that a few OSDs are holding back the cluster, and we noticed
this:
When the disk usage reaches 500GB on a single drive, the drive loses
half of its write performance compared to when it's empty.
To show you, let's see the fio write performance when the disk is empty:
Jobs: 4 (f=4): [W(4)][6.0%][w=1930MiB/s][w=482 IOPS][eta 07h:31m:13s]
We see that, when the disk is empty, the drive achieves almost 1.9 GB/s
throughput and 482 IOPS. Very decent values.
However! When the disk gets to 500 GB full and we start to write a new
file, all of a sudden we get these values:
Jobs: 4 (f=4): [W(4)][0.9%][w=1033MiB/s][w=258 IOPS][eta 07h:55m:43s]
As we can see, we lose significant throughput and IOPS as well.
If we remove all the files and do an fstrim on the disk, the
performance returns to normal again.
If we format the disk (no fstrim needed), we also get the performance
back to normal. That explains why recreating the ceph cluster from
scratch helped us.
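For reference, the cleanup/trim step is just the standard tooling against
the mount point, something along the lines of:

rm -f /mnt/test/fiotest*
fstrim -v /mnt/test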
Have you seen this behaviour before in your deployments?
Thanks,
Zoltan
Am 17.09.22 um 06:58 schrieb Mark Nelson:
Hi Zoltan,
So kind of interesting results. In the "good" write test the OSD
doesn't actually seem to be working very hard. If you look at the
kv sync thread, it's mostly idle with only about 22% of the time in
the thread spent doing real work:
| + 99.90% BlueStore::_kv_sync_thread()
| + 78.60% std::condition_variable::wait(std::unique_lock<std::mutex>&)
| |+ 78.60% pthread_cond_wait
| + 18.00% RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)
...but at least it's actually doing work! For reference though, on
our high-performing setup, with enough concurrency we can push things
hard enough that this thread isn't spending much time in
pthread_cond_wait. In the "bad" state, your example OSD here is
basically doing nothing at all (100% of the time in
pthread_cond_wait!). The tp_osd_tp and the kv sync thread are just
waiting around twiddling their thumbs:
Thread 339848 (bstore_kv_sync) - 1000 samples
+ 100.00% clone
+ 100.00% start_thread
+ 100.00% BlueStore::KVSyncThread::entry()
+ 100.00% BlueStore::_kv_sync_thread()
+ 100.00% std::condition_variable::wait(std::unique_lock<std::mutex>&)
+ 100.00% pthread_cond_wait
My first thought is that you might have one or more OSDs that are
slowing the whole cluster down so that clients are backing up on it
and other OSDs are just waiting around for IO. It might be worth
checking the perf admin socket stats on each OSD to see if you can
narrow down if any of them are having issues.
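Something like the following untested sketch, run on each OSD host, can make
it easier to spot an outlier (the exact counter names can vary a bit between
releases, so adjust as needed):

# Dump a few latency counters for every OSD admin socket on this host
# (with rook, run this inside each OSD pod instead).
for sock in /var/run/ceph/ceph-osd.*.asok; do
    echo "== $sock =="
    ceph daemon "$sock" perf dump | jq '{
        op_r_latency: .osd.op_r_latency,
        op_w_latency: .osd.op_w_latency,
        kv_sync_lat:  .bluestore.kv_sync_lat
    }'
done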
Thanks,
Mark
On 9/16/22 05:57, Zoltan Langi wrote:
Hey people and Mark, the cluster was left overnight to do nothing
and the problem as expected came back in the morning. We managed to
capture the bad states on the exact same OSD-s we captured the good
states earlier:
Here is the output of a read test when the cluster is in a bad
state on the same OSD which I recorded in the good state earlier:
https://pastebin.com/jp5JLWYK
Here is the output of a write test when the cluster is in a bad
state on the same OSD which I recorded in the good state earlier:
The write speed came down from 30.1 GB/s to 17.9 GB/s
https://pastebin.com/9e80L5XY
We are still open to any suggestions, so please feel free to
comment or suggest. :)
Thanks a lot,
Zoltan
Am 15.09.22 um 16:53 schrieb Zoltan Langi:
Hey people and Mark, we managed to capture the good and bad states
separately:
Here is the output of a read test when the cluster is in a bad state:
https://pastebin.com/0HdNapLQ
Here is the output of a write test when the cluster is in a bad
state:
https://pastebin.com/2T2pKu6Q
Here is the output of a read test when the cluster is in a brand
new reinstalled state:
https://pastebin.com/qsKeX0D8
Here is the output of a write test when the cluster is in a brand
new reinstalled state:
https://pastebin.com/nTCuEUAb
Hope anyone can suggest anything, any ideas are welcome! :)
Zoltan
Am 13.09.22 um 14:27 schrieb Zoltan Langi:
Hey Mark,
Sorry about the silence for a while, but a lot of things came up.
We finally managed to fix up the profiler, and here is an output
taken while ceph is under heavy write load, in a pretty bad state,
with throughput not exceeding 12.2 GB/s.
For a good state we would have to recreate the whole thing, so we
thought we would start with the bad state; maybe something obvious is
already visible to someone who knows the OSD internals well.
You find the file here: https://pastebin.com/0HdNapLQ
Thanks a lot in advance,
Zoltan
Am 12.08.22 um 18:25 schrieb Mark Nelson:
Hi Zoltan,
Sadly, it looks like some of the debug symbols are messed up, which
makes things a little rough to debug from this. On the write
path, if you look at the bstore_kv_sync thread:
Good state write test:
+ 86.00% FileJournal::_open_file(long, long, bool)
|+ 86.00% ???
+ 11.50% ???
|+ 0.20% ???
Bad state write test:
Thread 2869223 (bstore_kv_sync) - 1000 samples
+ 73.70% FileJournal::_open_file(long, long, bool)
|+ 73.70% ???
+ 24.90% ???
That's really strange, because FileJournal is part of filestore.
There also seems to be stuff in this trace regarding
BtrfsFileStoreBackend and FuseStore::Stop(). It seems like the
debug symbols are basically just wrong. Is it possible that somehow
you ended up with debug symbols for the wrong version of
ceph or something?
Mark
On 8/12/22 11:13, Zoltan Langi wrote:
Hi Mark,
I managed to profile one OSD before and after the bad state. We have
downgraded ceph to 14.2.22.
Good state with read test:
https://pastebin.com/etreYzQc
Good state with write test:
https://pastebin.com/qrN5MaY6
Bad state with read test:
https://pastebin.com/S1pRiJDq
Bad state with write test:
https://pastebin.com/dEv05eGV
Do you see anything obvious that could give us a clue what is
going on?
Many thanks!
Zoltan
Am 02.08.22 um 19:01 schrieb Mark Nelson:
Ah, too bad! I suppose that was too easy. :)
Ok, so my two lines of thought:
1) Something related to the weird performance issues we ran
into on the PM983 after lots of fragmented writes over the
drive. I think we've worked around that with the fix in
quincy, but perhaps you are hitting a manifestation of it that
we haven't seen. The way to investigate that is to look at the NVMe
block device stats with collectl or iostat and see if you see
higher io service times and longer device queue lengths in the
"bad" case vs the "good" case. If you do, it means that
something is making the drive(s) themselves laggy at
fulfilling requests. You might have to look at a bunch of
drives in case there's one acting up before the others do, but
that's pretty easy to do with either tool. For extra bonus
points you can use blktrace/blkparse/iowatcher to see if writes
are really being fragmented (there could be other causes of a
drive becoming slow).
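For the iostat route, something as simple as this shows per-device service
times and queue depth while the benchmark runs (sketch; column names differ
slightly between sysstat versions):

# Watch w_await (write latency, ms) and aqu-sz (avg queue length) per drive.
iostat -dxm 1 /dev/nvme0n1 /dev/nvme1n1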
2) The other thing that comes to mind is RocksDB... either due to
just having more metadata to deal with, or perhaps as a result
of having a ton more objects, not enough onode cache, and
having to issue onode reads to rocksdb when you have cache
misses. I believe we have hit rate perf counters for the
onode cache, but you can get a hint if you see a bunch of
reads (specifically to the DB partition if you've configured
it to be separate) during writes. You may also want to look at
the compaction stats in the OSD log just to make sure it's not
super laggy. You can run this tool against the log to see a
summary and details regarding individual compaction events:
https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py
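For the onode cache hit rate, a rough check (counter names quoted from
memory, so verify them against your build) is to diff the hit/miss counters
across a write test:

# Sample before and after the test; a large growth in misses suggests
# onode reads are falling through to RocksDB.
ceph daemon osd.0 perf dump | jq '{hits: .bluestore.bluestore_onode_hits,
                                   misses: .bluestore.bluestore_onode_misses}'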
Those would be the first places I would look. If neither are
helpful, you could try profiling the OSDs using uwpmp as I
mentioned earlier.
Mark
On 8/2/22 09:50, Zoltan Langi wrote:
Hey Mark, taking back these options solve the issue. I just
ran my tests twice again and here are the results:
https://ibb.co/9vY5xgS
https://ibb.co/71pSCQv
Back to where it was: performance dropped again today. So it
seems like the
bdev_enable_discard = true
bdev_async_discard = true
options didn't make any difference in the end and the problem
reappeared, just a bit later.
I have read all the articles you posted, thanks for those,
but I am still struggling with this. Any other
recommendations or ideas on what to check?
Thanks a lot,
Zoltan
Am 01.08.22 um 17:53 schrieb Mark Nelson:
Hi Zoltan,
It doesn't look like your pictures showed up for me at
least. Very interesting results though! Are (or were) the
drives particularly full when you've run into performance
problems that the discard option appears to fix? There have
been some discussions in the past regarding online discard
vs periodic discard ala fstrim. The gist of it is that
there are performance implications for online trim, but
there are (eventual) performance implications if you let the
drive get too full before doing an offline trim (which itself
can be impactful). There's been quite a bit of discussion
about it on the mailing list and in PRs:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/YFQKVCAMHHQ72AMTL2MQAA7QN7YCJ7GA/
https://github.com/ceph/ceph/pull/14727
Specifically, see this comment regarding how it can affect
garbage collection, but also the effect of a burst of TRIM
commands on the FTL:
https://github.com/ceph/ceph/pull/14727#issuecomment-342399578
And some performance testing by Igor here:
https://github.com/ceph/ceph/pull/20723#pullrequestreview-104218724
It would be very interesting to see if you see a similar
performance improvement if we had a fstrim like discard
option you could run before the new test. There's a tracker
ticket for it, but afaik no one has actually implemented
anything yet:
https://tracker.ceph.com/issues/38494
Regarding whether it's safe to have (async) discard
enabled... Maybe? :) We left it disabled by default because
we didn't want to deal with having to situationally disable
it for drives with buggy firmware and some of the other
associated problems with online discard. Having said that,
in your case it sounds like enabling it is yielding good
results with the PM983 and your workload.
There's a really good (but slightly old now) article on LWN
detailing the discussion the kernel engineers were having
regarding all of this at the LSFMM Summit a few years ago:
https://lwn.net/Articles/787272/
In the comments, Chris Mason mentions the same delete issue
we probably need to tackle (see Igor's comment linked above):
"The XFS async trim implementation is pretty reasonable, and
it can be a big win in some workloads. Basically anything
that gets pushed out of the critical section of the
transaction commit can have a huge impact on performance.
The major thing it's missing is a way to throttle new
deletes from creating a never ending stream of discards, but
I don't think any of the filesystems are doing that yet."
Mark
On 8/1/22 08:36, Zoltan Langi wrote:
Hey Frank and Mark,
Thanks for your response, and sorry about coming back a bit
late, but I needed to test something that takes time.
How I reproduced this issue: created 100 volumes with
ceph-csi, ran 3 sets of tests, let the volumes sit for 48
hours, then deleted the volumes, recreated them and ran
the tests 3x in a row.
If you look at the picture:
picture1
The picture above clearly shows the performance
degradation. We run the first test (first read, then write) at
09:20 and it finishes at 09:45. At 11:00 we run the next test;
it finishes at 11:20 and is already struggling with the read IOPS,
and the write IOPS drop a lot, though the read looks more like a
saw graph. At 11:40 I reran the test, and now the write has
normalised at a bad level, no more saw pattern, and the write
sticks to the bad values.
Let's have a look at the bandwidth graph:
picture2
Compare the 09:40-10:05 part with the 12:00-12:25 part.
Those are identical tests. It dropped a lot. The only way
to recover from this state is to recreate the bluestore
devices from scratch.
We have enabled the following options in rook-ceph:
bdev_enable_discard = true
bdev_async_discard = true
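For reference, outside of the rook config these could presumably also be set
centrally with the ceph config CLI (an OSD restart is likely needed for them
to take effect):

ceph config set osd bdev_enable_discard true
ceph config set osd bdev_async_discard true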
Now let's have a look at the speed comparison:
Data from last Friday, before the volumes sat for 48 hours:
picture3
picture4
We see 3 tests. Test 1: 16:40-19:00 Test 2: 20:00-21:35 and
Test 3: 21:40-23:30. We see slight write degradation, but
it should stay the same for the rest of the time.
Now let's see the test runs from today:
picture5
picture6
We see 3 tests. Test 1: 09:20-11:00 Test 2: 11:05-12:40
Test 3: 13:10-14:40.
As we see, after enabling these options the system is
delivering constant speeds without degradation and without the
huge performance loss we saw before.
Has anyone come across behaviour like this before? We haven't
seen any mention of these options in the official docs, just
in pull requests. Is it safe to use these options in
production at all?
Many thanks,
Zoltan
Am 25.07.22 um 21:42 schrieb Mark Nelson:
I don't think so if this is just plain old RBD. RBD
shouldn't require a bunch of RocksDB iterator seeks in the
read/write hot path and writes should pretty quickly clear
out tombstones as part of the memtable flush and
compaction process even in the slow case. Maybe in some
kind of pathologically bad read-only corner case with no
onode cache but it would be bad for more reasons than
what's happening in that tracker ticket imho (even reading
onodes from rocksdb block cache is significantly slower
than BlueStore's onode cache).
If RBD mirror (or snapshots) are involved that could be a
different story though. I believe to deal with deletes in
that case we have to go through iteration/deletion loops
that have same root issue as what's going on in the
tracker ticket, and it can end up impacting client IO. Gabi
and Paul are testing/reworking how the snapmapper works,
and I've started a sort of catch-all PR for improving
our RocksDB tunings/glue here:
https://github.com/ceph/ceph/pull/47221
Mark
On 7/25/22 12:48, Frank Schilder wrote:
Could it be related to this performance death trap:
https://tracker.ceph.com/issues/55324 ?
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Mark Nelson <mnelson@xxxxxxxxxx>
Sent: 25 July 2022 18:50
To: ceph-users@xxxxxxx
Subject: Re: weird performance issue on ceph
Hi Zoltan,
We have a very similar setup with one of our upstream
community
performance test clusters. 60 4TB PM983 drives spread
across 10 nodes.
We get similar numbers to what you are initially seeing
(scaled down to
60 drives) though with somewhat lower random read IOPS
(we tend to max
out at around 2M with 60 drives on this HW). I haven't
seen any issues
with quincy like what you are describing, but on this
cluster most of
the tests have been on bare metal. One issue we have
noticed with the
PM983 drives is that they may be more susceptible to
non-optimal write
patterns causing slowdowns vs other NVMe drives in the
lab. We actually
had to issue a last minute PR for quincy to change the
disk allocation
behavior to deal with it. See:
https://github.com/ceph/ceph/pull/45771
https://github.com/ceph/ceph/pull/45884
I don't *think* this is the issue you are hitting since
the fix in
#45884 should have taken care of it, but it might be
something to keep
in the back of your mind. Otherwise, the fact that you
are seeing such
a dramatic difference across both small and large
read/write benchmarks
makes me think there is something else going on. Is there
any chance
that some other bottleneck is being imposed when the pods
and volumes
are deleted and recreated? Might be worth looking at
memory and CPU
usage of the OSDs in all of the cases and RocksDB
flushing/compaction
stats from the OSD logs. Also a quick check with
collectl/iostat/sar
during the slow case to make sure none of the drives are
showing high latency and built-up IOs in the device queues.
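A very cheap extra check for built-up IOs is the sysfs inflight counters
(sketch; these files are present on any reasonably recent kernel):

# Per-device in-flight read/write counts, refreshed every second; numbers
# that stay large during the slow case point at the drives themselves.
watch -n 1 'grep -H . /sys/block/nvme*n1/inflight'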
If you want to go deeper down the rabbit hole you can try
running my
wallclock profiler against one of your OSDs in the
fast/slow cases, but
you'll have to make sure it has access to debug symbols:
https://github.com/markhpc/uwpmp.git
run it like:
./uwpmp -n 10000 -p <pid of ceph-osd> -b libdw > output.txt
If the libdw backend is having problems you can use -b
libdwarf instead,
but it's much slower and takes longer to collect as many
samples (you
might want to do -n 1000 instead).
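For completeness, a hypothetical capture session for one OSD in both states
might look like this (the OSD id and output file names are placeholders):

# Profile the same OSD while the cluster is fast, then again once it degrades.
OSD_PID=$(pgrep -f 'ceph-osd.*--id 12' | head -n1)   # make sure this matches exactly one OSD
./uwpmp -n 10000 -p "$OSD_PID" -b libdw > osd12_fast.txt
# ...wait for the slow state, then repeat:
./uwpmp -n 10000 -p "$OSD_PID" -b libdw > osd12_slow.txt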
Mark
On 7/25/22 11:17, Zoltan Langi wrote:
Hi people, we have got an interesting issue here and I would
like to ask if anyone has seen anything like this before.
First: our system:
The ceph version is 17.2.1, but we have also seen the same
behaviour on 16.2.9.
Our kernel version is 5.13.0-51 and our NVMe disks are
Samsung PM983.
In our deployment we have 12 nodes in total and 72 disks; 2 OSDs per
disk makes 144 OSDs in total.
The deployment was done by ceph-rook with default values: 6 CPU cores
and 4 GB of memory allocated to each OSD.
The issue we are experiencing: we create, for example, 100 volumes via
ceph-csi and attach them to kubernetes pods via rbd. We are talking
about 100 volumes in total, 2GB each. We run fio performance tests
(read, write, mixed) on them, so the volumes are being used heavily.
Ceph delivers good performance, no problems at all.
Performance we get, for example: read IOPS 3371027, write IOPS 727714,
read bw 79.9 GB/s, write bw 31.2 GB/s.
After the tests are complete, these volumes just sit there doing
nothing for a longer period of time, for example 48 hours. After that,
we clean the pods up, clean the volumes up and delete them.
We recreate the volumes and pods once more with the same spec (2GB
each, 100 pods), then run the same tests once again. We don't even get
half the performance that we measured before leaving the pods sitting
there doing nothing for 2 days.
Performance we get after deleting the volumes, recreating them and
rerunning the tests: read IOPS 1716239, write IOPS 370631,
read bw 37.8 GB/s, write bw 7.47 GB/s.
We can clearly see that it’s a big performance loss.
If we clean up the ceph deployment, wipe the disks out completely and
redeploy, the cluster once again delivers great performance.
We haven't seen such behaviour with ceph version 14.x.
Has anyone seen such a thing? Thanks in advance!
Zoltan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx