Hi Zoltan,
So kind of interesting results. In the "good" write test the OSD
doesn't actually seem to be working very hard. If you look at the kv
sync thread, it's mostly idle with only about 22% of the time in the
thread spent doing real work:
1.
| + 99.90% BlueStore::_kv_sync_thread()
2.
| + 78.60% std::condition_variable::wait(std::unique_lock<std::mutex>&)
3.
| |+ 78.60% pthread_cond_wait
4.
| + 18.00%
RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)
...but at least it's actually doing work! For reference though, on our
high performing setup with enough concurrency we can push things hard
enough where this thread isn't spending much time in pthread_cond_wait.
In the "bad" state, your example OSD here is basically doing nothing at
all (100% of the time in pthread_cold_wait!). The tp_osd_tp and the kv
sync thread are just waiting around twiddling their thumbs:
1.
Thread 339848 (bstore_kv_sync) - 1000 samples
2.
+ 100.00% clone
3.
+ 100.00% start_thread
4.
+ 100.00% BlueStore::KVSyncThread::entry()
5.
+ 100.00% BlueStore::_kv_sync_thread()
6.
+ 100.00% std::condition_variable::wait(std::unique_lock<std::mutex>&)
7.
+ 100.00% pthread_cond_wait
My first thought is that you might have one or more OSDs that are
slowing the whole cluster down so that clients are backing up on it and
other OSDs are just waiting around for IO. It might be worth checking
the perf admin socket stats on each OSD to see if you can narrow down if
any of them are having issues.
Thanks,
Mark
On 9/16/22 05:57, Zoltan Langi wrote:
Hey people and Mark, the cluster was left overnight to do nothing and
the problem as expected came back in the morning. We managed to
capture the bad states on the exact same OSD-s we captured the good
states earlier:
Here is the output of a read test when the cluster is in a bad state
on the same OSD which I recorded in the good state earlier:
https://pastebin.com/jp5JLWYK
Here is the output of a write test when the cluster is in a bad state
on the same OSD which I recorded in the good state earlier:
The write speed came down from 30,1GB/s to 17,9GB/s
https://pastebin.com/9e80L5XY
We are still open for any suggestions, so please feel free to comment
or suggest. :)
Thanks a lot,
Zoltan
Am 15.09.22 um 16:53 schrieb Zoltan Langi:
Hey people and Mark, we managed to capture the good and bad states
separately:
Here is the output of a read test when the cluster is in a bad state:
https://pastebin.com/0HdNapLQ
Here is the output of a write test when the cluster is in a bad state:
https://pastebin.com/2T2pKu6Q
Here is the output of a read test when the cluster is in a brand new
reinstalled state:
https://pastebin.com/qsKeX0D8
Here is the output of a write test when the cluster is in a brand new
reinstalled state:
https://pastebin.com/nTCuEUAb
Hope anyone can suggest anything, any ideas are welcome! :)
Zoltan
Am 13.09.22 um 14:27 schrieb Zoltan Langi:
Hey Mark,
Sorry about the silence for a while, but a lot of things came up. We
finally managed to fix up the profiler and here is an output when
the ceph is under heavy write load, in a pretty bad state and its
throughput is not achieving more than 12,2GB/s.
For a good state we have to recreate the whole thing, so we thought
we start with the bad state, maybe something obvious is already
visible for someone who knows the osd internals well.
You find the file here: https://pastebin.com/0HdNapLQ
Tanks a lot in advance,
Zolta
Am 12.08.22 um 18:25 schrieb Mark Nelson:
CAUTION: This email originated from outside the organization. Do
not click links unless you can confirm the sender and know the
content is safe.
Hi Zoltan,
Sadly it looks like some of the debug symbols are messed which
makes things a little rough to debug from this. On the write path
if you look at the bstore_kv_sync thread:
Good state write test:
1.
+ 86.00% FileJournal::_open_file(long, long, bool)
2.
|+ 86.00% ???
3.
+ 11.50% ???
4.
|+ 0.20% ???
Bad state write test:
1.
Thread 2869223 (bstore_kv_sync) - 1000 samples
2.
+ 73.70% FileJournal::_open_file(long, long, bool)
3.
|+ 73.70% ???
4.
+ 24.90% ???
That's really strange, because FileJournal is part of filestore.
There also seems to be stuff in this trace regarding
BtrfsFileStoreBackend and FuseStore::Stop(). Seems like the debug
symbols are basically just wrong. Is it possible that some how you
ended up with debug symbols for the wrong version of ceph or
something?
Mark
On 8/12/22 11:13, Zoltan Langi wrote:
Hi Mark,
I managed to profile one osd before and after the bad state. We
have downgraded ceph to 14.2.22
Good state with read test:
https://pastebin.com/etreYzQc
Good state with write test:
https://pastebin.com/qrN5MaY6
Bad state with read test:
https://pastebin.com/S1pRiJDq
Bad state with write test:
https://pastebin.com/dEv05eGV
Do you see anything obvious that could give us a clue what is
going on?
Many thanks!
Zoltan
Am 02.08.22 um 19:01 schrieb Mark Nelson:
Ah, too bad! I suppose that was too easy. :)
Ok, so my two lines of thought:
1) Something related to the weird performance issues we ran into
on the PM983 after lots of fragmented writes over the drive. I
think we've worked around that with the fix in quincy, but
perhaps you are hitting a manifestation of it that we haven't.
The way to investigate that is to look at the NVMe block device
stats with collectl or iostat and see if you see higher io
service times and longer device queue lengths in the "bad" case
vs the "good" case. If you do, it means that something is making
the drive(s) themselves laggy at fulfilling requests. You might
have to look at a bunch of drives in case there's one acting up
before the others do, but that's pretty easy to do with either
tool. For extra bonus point you can use
blktrace/blkparse/iowatcher to see if writes are really being
fragmented (there could be other causes of a drive becoming slow).
The other thing that comes to mind is RocksDB...either due to
just having more metadata to deal with, or perhaps as a result of
having a ton more objects, not enough onode cache, and having to
issue onode reads to rocksdb when you have cache misses. I
believe we have hit rate perf counters for the onode cache, but
you can get a hint if you see a bunch of reads (specifically to
the DB partition if you've configured it to be separate) during
writes. You may also want to look at the compaction stats in the
OSD log just to make sure it's not super laggy. You can run this
tool against the log to see a summary and details regarding
individual compaction events:
https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py
Those would be the first places I would look. If neither are
helpful, you could try profiling the OSDs using uwpmp as I
mentioned earlier.
Mark
On 8/2/22 09:50, Zoltan Langi wrote:
Hey Mark, taking back these options solve the issue. I just ran
my tests twice again and here are the results:
https://ibb.co/9vY5xgS
https://ibb.co/71pSCQv
Back to where it was, performance dropped down today. So seems
like the
bdev_enable_discard = true
bdev_async_discard = true
options didn't make any difference in the end and the problem
reappeared, just a bit later.
I have read all the articles you posted, thanks for it, however
I am still struggling with this. Any other recommendation or
idea what to check?
Thanks a lot,
Zoltan
Am 01.08.22 um 17:53 schrieb Mark Nelson:
Hi Zoltan,
It doesn't look like your pictures showed up for me at least.
Very interesting results though! Are (or were) the drives
particularly full when you've run into performance problems
that the discard option appears to fix? There have been some
discussions in the past regarding online discard vs periodic
discard ala fstrim. The gist of it is that there are
performance implications for online trim, but there are
(eventual) performance implications if let the drive get too
full before doing an offline trim (that itself can be
impactful). There's been quite a bit of discussion about it on
the mailing list and in PRs:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/YFQKVCAMHHQ72AMTL2MQAA7QN7YCJ7GA/
https://github.com/ceph/ceph/pull/14727
Specifically, see this comment regarding how it can affect
garbage collection but also burst TRIM command effect on the FTL:
https://github.com/ceph/ceph/pull/14727#issuecomment-342399578
And some performance testing by Igor here:
https://github.com/ceph/ceph/pull/20723#pullrequestreview-104218724
It would be very interesting to see if you see a similar
performance improvement if we had a fstrim like discard option
you could run before the new test. There's a tracker ticket
for it, but afaik no one has actually implemented anything yet:
https://tracker.ceph.com/issues/38494
Regarding whether it's safe to have (async) discard enabled...
Maybe? :) We left it disabled by default because we didn't
want to deal with having to situationally disable it for drives
with buggy firmwares and some of the other associated problems
with online discard. Having said that, in your case it sounds
like enabling it is yielding good results with the PM983 and
your workload.
There's a really good (but slightly old now) article on LWN
detailing the discussion the kernel engineers were having
regarding all of this at the LSFMM Summit a few years ago:
https://lwn.net/Articles/787272/
In the comments, Chris Mason mentions the same delete issue we
probably need to tackle (see Igor's comment linked above):
"The XFS async trim implementation is pretty reasonable, and it
can be a big win in some workloads. Basically anything that
gets pushed out of the critical section of the transaction
commit can have a huge impact on performance. The major thing
it's missing is a way to throttle new deletes from creating a
never ending stream of discards, but I don't think any of the
filesystems are doing that yet."
Mark
On 8/1/22 08:36, Zoltan Langi wrote:
Hey Frank and Mark,
Thanks for your response and sorry about coming back a bit
late, but I needed to test something that needs time.
How I reproduced this issue: Created 100 volumes with ceph-csi
ran 3 set of tests, let the volumes sit for 48 hours and then
deleted the volumes, recreated them and ran the tests 3x in a
row.
If you look at the picture:
picture1
The picture above clearly shows the performance degradation.
We run the first test first read then write at 09:20 finishes
at 09:45, at 11:00 we run the new test, 11:20 finishes and
already struggling with the read iops and the write iops drops
a lot, but it is more like a saw graph in case of the read.
11:40 I reran the test and now, the write normalised on a bad
level, no more saw pattern and the write sticks to the bad
levels.
Let's have a look at the bandwidth graph:
picture2
Compare the 09:40-10:05 part and the 12:00-12:25 part. Those
are the identical tests. Dropped a lot. The only way to
recover from this state is to recreate the bluestore devices
from scratch.
We have enabled the following options in rook-ceph:
bdev_enable_discard = true
bdev_async_discard = true
Now let's have a look at the speed comparsion:
Data from last Friday, before the volumes sat for 48 hours:
picture3
picture4
We see 3 tests. Test 1: 16:40-19:00 Test 2: 20:00-21:35 and
Test 3: 21:40-23:30. We see slight write degradation, but it
should stay the same for the rest of the time.
Now let's see the test runs from today:
picture5
picture6
We see 3 tests. Test 1: 09:20-11:00 Test 2: 11:05-12:40 Test
3: 13:10-14:40.
As we see, after enabling these options, the system is
delivering constant speeds without degradation and huge
performance loss like before.
Has anyone came across with something like this behaviour
before? We haven't seen any mention of these options int he
official docs just in pull requests. Is it safe to use these
options in production at all?
Many thanks,
Zoltan
Am 25.07.22 um 21:42 schrieb Mark Nelson:
I don't think so if this is just plain old RBD. RBD
shouldn't require a bunch of RocksDB iterator seeks in the
read/write hot path and writes should pretty quickly clear
out tombstones as part of the memtable flush and compaction
process even in the slow case. Maybe in some kind of
pathologically bad read-only corner case with no onode cache
but it would be bad for more reasons than what's happening in
that tracker ticket imho (even reading onodes from rocksdb
block cache is significantly slower than BlueStore's onode
cache).
If RBD mirror (or snapshots) are involved that could be a
different story though. I believe to deal with deletes in
that case we have to go through iteration/deletion loops that
have same root issue as what's going on in the tracker ticket
and it can end up impacting client IO. Gabi and Paul and
testing/reworking how the snapmapper works and I've started a
sort of a catch-all PR for improving our RocksDB tunings/glue
here:
https://github.com/ceph/ceph/pull/47221
Mark
On 7/25/22 12:48, Frank Schilder wrote:
Could it be related to this performance death trap:
https://tracker.ceph.com/issues/55324 ?
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Mark Nelson <mnelson@xxxxxxxxxx>
Sent: 25 July 2022 18:50
To: ceph-users@xxxxxxx
Subject: Re: weird performance issue on ceph
Hi Zoltan,
We have a very similar setup with one of our upstream community
performance test clusters. 60 4TB PM983 drives spread
across 10 nodes.
We get similar numbers to what you are initially seeing
(scaled down to
60 drives) though with somewhat lower random read IOPS (we
tend to max
out at around 2M with 60 drives on this HW). I haven't seen
any issues
with quincy like what you are describing, but on this
cluster most of
the tests have been on bare metal. One issue we have
noticed with the
PM983 drives is that they may be more susceptible to
non-optimal write
patterns causing slowdowns vs other NVMe drives in the lab.
We actually
had to issue a last minute PR for quincy to change the disk
allocation
behavior to deal with it. See:
https://github.com/ceph/ceph/pull/45771
https://github.com/ceph/ceph/pull/45884
I don't *think* this is the issue you are hitting since the
fix in
#45884 should have taken care of it, but it might be
something to keep
in the back of your mind. Otherwise, the fact that you are
seeing such
a dramatic difference across both small and large read/write
benchmarks
makes me think there is something else going on. Is there
any chance
that some other bottleneck is being imposed when the pods
and volumes
are deleted and recreated? Might be worth looking at memory
and CPU
usage of the OSDs in all of the cases and RocksDB
flushing/compaction
stats from the OSD logs. Also a quick check with
collectl/iostat/sar
during the slow case to make sure none of the drives are
showing latency
and built up IOs in the device queues.
If you want to go deeper down the rabbit hole you can try
running my
wallclock profiler against one of your OSDs in the fast/slow
cases, but
you'll have to make sure it has access to debug symbols:
https://github.com/markhpc/uwpmp.git
run it like:
./uwpmp -n 10000 -p <pid of ceph-osd> -b libdw > output.txt
If the libdw backend is having problems you can use -b
libdwarf instead,
but it's much slower and takes longer to collect as many
samples (you
might want to do -n 1000 instead).
Mark
On 7/25/22 11:17, Zoltan Langi wrote:
Hi people, we got an interesting issue here and I would
like to ask if
anyone seen anything like this before.
First: our system:
The ceph version is 17.2.1 but we also seen the same
behaviour on 16.2.9.
Our kernel version is 5.13.0-51 and our NVMe disks are
Samsung PM983.
In our deployment we got 12 nodes in total, 72 disks and 2
osd per
disk makes 144 osd in total.
The depoyment was done by ceph-rook with default values, 6
CPU cores
allocated to the OSD each and 4GB of memory allocated to
each OSD.
The issue we are experiencing: We create for example 100
volumes via
ceph-csi and attach it to kubernetes pods via rbd. We talk
about 100
volumes in total, 2GB each. We run fio performance tests
(read, write,
mixed) on them so the volumes are being used heavily. Ceph
delivers
good performance, no problems as all.
Performance we get for example: read iops 3371027 write
iops: 727714
read bw: 79.9 GB/s write bw: 31.2 GB/s
After the tests are complete, these volumes just sitting
there doing
nothing for a longer period of time for example 48 hours.
After that,
we clean the pods up, clean the volumes up and delete them.
Recreate the volumes and pods once more, same spec (2GB
each 100 pods)
then run the same tests once again. We don’t even have half
the
performance of that we have measured before leaving the
pods sitting
there doing notning for 2 days.
Performance we get after deleting the volumes and
recreating them,
rerun the tests: read iops: 1716239 write iops: 370631 read
bw: 37.8
GB/s write bw: 7.47 GB/s
We can clearly see that it’s a big performance loss.
If we clean up the ceph deployment, wipe the disks out
completely and
redeploy, the cluster once again delivering great performance.
We haven’t seen such a behaviour with ceph version 14.x
Has anyone seen such a thing? Thanks in advance!
Zoltan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx