Re: weird performance issue on ceph

Hi Zoltan and Mark,

This observation of performance loss when a solid-state drive gets full and/or exceeds a certain number of write OPs is very typical, even for enterprise SSDs. The performance drop can be dramatic. Therefore, I'm reluctant to add untested solid-state drives (SSD/NVMe) to our cluster, because a single bad choice can ruin everything.

For testing, I always fill the entire drive before performing a benchmark. I have found only a few drives that don't suffer from this kind of performance degradation. Manufacturers of such "good" drives usually provide "sustained XYZ" performance specs instead of just "XYZ", for example, "sustained write IOP/s" instead of "write IOP/s". When you start a test on these, they begin well above spec performance and settle down to spec as they fill up. A full drive lives up to its specs for its declared lifetime. The downside is that these drives are usually very expensive; I have never seen a cheap one live up to its specs when full or after a couple of days of fio 4K random writes.
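
For illustration, a minimal sketch of the kind of test I mean, assuming fio and a scratch NVMe device at /dev/nvme0n1 that holds no data you care about (device path and runtime are placeholders):

    # precondition: fill the entire device once with sequential writes
    fio --name=precondition --filename=/dev/nvme0n1 --rw=write --bs=1M \
        --iodepth=32 --ioengine=libaio --direct=1

    # then hammer it with 4K random writes for a long time and check whether
    # the IOP/s settle at (or below) the vendor's "sustained" figure
    fio --name=sustained-4k-write --filename=/dev/nvme0n1 --rw=randwrite \
        --bs=4k --iodepth=32 --numjobs=4 --ioengine=libaio --direct=1 \
        --time_based --runtime=86400 --group_reporting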

I believe the Samsung PM drives have been flagged in earlier posts as "a bit below expectation". There were also a lot of posts about other drives where users got a rude awakening.

I wonder if it might be a good idea to collect such experience somewhere in the ceph documentation, for example, via a link under hardware recommendations -> solid state drives in the docs. Are there legal implications to publishing a list of drives showing their effective sustained performance in a ceph cluster? Maybe based on a standardised benchmark that hammers the drives for a couple of weeks and reports sustained performance under constant max load, in contrast to peak load (which most cheap drives are optimised for, making them less suitable for a constant-load system like ceph)?

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Mark Nelson <mnelson@xxxxxxxxxx>
Sent: 26 September 2022 16:52
To: ceph-users@xxxxxxx
Subject:  Re: weird performance issue on ceph

Hi Zoltan,


Great investigation work!  I think in my tests the data set typically
was smaller than 500GB/drive.  If you have a simple fio test that can be
run against a bare NVMe drive I can try running it on one of our test
nodes.  FWIW I kind of suspected that the issue I had to work around for
quincy might have been related to some kind of internal cache being
saturated.  I wonder if the drive is fast up until some limit is hit,
after which it reverts to slower flash or something?
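
Something along these lines would probably do as a starting point (only a sketch: the device path, block size and job count are guessed from the fio output you quoted, and it destroys any data on the device):

    # 4 concurrent sequential-write jobs with 4M blocks, direct I/O,
    # against the bare NVMe device
    fio --name=nvme-seq-write --filename=/dev/nvme0n1 --rw=write --bs=4M \
        --numjobs=4 --iodepth=16 --ioengine=libaio --direct=1 \
        --time_based --runtime=600 --group_reporting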


Mark


On 9/26/22 06:39, Zoltan Langi wrote:
> Hi Mark and the mailing list, we managed to figure out something very
> weird that I would like to share with you, and to ask if you have seen
> anything like this before.
>
> We started to investigate the drives one by one after Mark's
> suggestion that a few OSDs might be holding back the cluster, and we
> noticed this:
>
> When the disk usage reaches 500GB on a single drive, the drive loses
> half of its write performance compared to when it's empty.
> To show you, let's see the fio write performance when the disk is empty:
> Jobs: 4 (f=4): [W(4)][6.0%][w=1930MiB/s][w=482 IOPS][eta 07h:31m:13s]
> We see that, when the disk is empty, the drive achieves almost 1.9GB/s
> throughput and 482 IOPS. Very decent values.
>
> However, when the disk gets to 500GB full and we start to write a new
> file, all of a sudden we get these values:
> Jobs: 4 (f=4): [W(4)][0.9%][w=1033MiB/s][w=258 IOPS][eta 07h:55m:43s]
> As we can see, we lost significant throughput and IOPS as well.
>
> If we remove all the files and do an fstrim on the disk, the
> performance returns to normal.
>
> If we format the disk (no need to do fstrim), the performance also
> returns to normal. That explains why recreating ceph from
> scratch helped us.
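>
> For reference, the two recovery paths above look roughly like this (a
> sketch; the mount point and device path are placeholders):
>
>     # discard unused blocks on the mounted filesystem
>     fstrim -v /mnt/nvme-test
>
>     # or discard the whole raw device before recreating the filesystem
>     blkdiscard /dev/nvme0n1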
>
> Have you seen this behaviour before in your deployments?
>
> Thanks,
>
> Zoltan
>
> Am 17.09.22 um 06:58 schrieb Mark Nelson:
>>
>>
>> Hi Zoltan,
>>
>>
>> So kind of interesting results.  In the "good" write test the OSD
>> doesn't actually seem to be working very hard.  If you look at the kv
>> sync thread, it's mostly idle with only about 22% of the time in the
>> thread spent doing real work:
>>
>>    | + 99.90% BlueStore::_kv_sync_thread()
>>    | + 78.60% std::condition_variable::wait(std::unique_lock<std::mutex>&)
>>    | |+ 78.60% pthread_cond_wait
>>    | + 18.00% RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)
>>
>>
>> ...but at least it's actually doing work!  For reference though, on
>> our high performing setup with enough concurrency we can push things
>> hard enough where this thread isn't spending much time in
>> pthread_cond_wait.  In the "bad" state, your example OSD here is
>> basically doing nothing at all (100% of the time in
>> pthread_cond_wait!).  The tp_osd_tp and the kv sync thread are just
>> waiting around twiddling their thumbs:
>>
>>    Thread 339848 (bstore_kv_sync) - 1000 samples
>>    + 100.00% clone
>>    + 100.00% start_thread
>>    + 100.00% BlueStore::KVSyncThread::entry()
>>    + 100.00% BlueStore::_kv_sync_thread()
>>    + 100.00% std::condition_variable::wait(std::unique_lock<std::mutex>&)
>>    + 100.00% pthread_cond_wait
>>
>>
>> My first thought is that you might have one or more OSDs that are
>> slowing the whole cluster down so that clients are backing up on it
>> and other OSDs are just waiting around for IO.  It might be worth
>> checking the perf admin socket stats on each OSD to see if you can
>> narrow down if any of them are having issues.
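>>
>> For example, something like this per OSD (a sketch; in a rook
>> deployment you would exec into the OSD pod first, and the exact
>> counter names can vary between releases):
>>
>>     ceph daemon osd.0 perf dump | jq '.osd.op_r_latency, .osd.op_w_latency'
>>
>> Comparing the read/write op latency counters across all OSDs should
>> make a single misbehaving OSD stand out.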
>>
>>
>> Thanks,
>>
>> Mark
>>
>>
>> On 9/16/22 05:57, Zoltan Langi wrote:
>>> Hey people and Mark, the cluster was left overnight to do nothing,
>>> and the problem, as expected, came back in the morning. We managed to
>>> capture the bad states on the exact same OSDs on which we captured the
>>> good states earlier:
>>>
>>> Here is the output of a read test when the cluster is in a bad state
>>> on the same OSD which I recorded in the good state earlier:
>>>
>>> https://pastebin.com/jp5JLWYK
>>>
>>> Here is the output of a write test when the cluster is in a bad
>>> state on the same OSD which I recorded in the good state earlier:
>>>
>>> The write speed came down from 30.1GB/s to 17.9GB/s.
>>>
>>> https://pastebin.com/9e80L5XY
>>>
>>> We are still open to any suggestions, so please feel free to
>>> comment. :)
>>>
>>> Thanks a lot,
>>> Zoltan
>>>
>>> Am 15.09.22 um 16:53 schrieb Zoltan Langi:
>>>> Hey people and Mark, we managed to capture the good and bad states
>>>> separately:
>>>>
>>>> Here is the output of a read test when the cluster is in a bad state:
>>>>
>>>> https://pastebin.com/0HdNapLQ
>>>>
>>>> Here is the output of a write test when the cluster is in a bad state:
>>>>
>>>> https://pastebin.com/2T2pKu6Q
>>>>
>>>> Here is the output of a read test when the cluster is in a brand
>>>> new reinstalled state:
>>>>
>>>> https://pastebin.com/qsKeX0D8
>>>>
>>>> Here is the output of a write test when the cluster is in a brand
>>>> new reinstalled state:
>>>>
>>>> https://pastebin.com/nTCuEUAb
>>>>
>>>> Hope someone can suggest something; any ideas are welcome! :)
>>>>
>>>> Zoltan
>>>>
>>>> Am 13.09.22 um 14:27 schrieb Zoltan Langi:
>>>>> Hey Mark,
>>>>>
>>>>> Sorry about the silence for a while, but a lot of things came up.
>>>>> We finally managed to fix up the profiler, and here is an output
>>>>> taken while ceph is under heavy write load, in a pretty bad state,
>>>>> with throughput not exceeding 12.2GB/s.
>>>>>
>>>>> For a good state we would have to recreate the whole thing, so we
>>>>> thought we'd start with the bad state; maybe something obvious is
>>>>> already visible to someone who knows the OSD internals well.
>>>>>
>>>>> You find the file here: https://pastebin.com/0HdNapLQ
>>>>>
>>>>> Thanks a lot in advance,
>>>>>
>>>>> Zoltan
>>>>>
>>>>> Am 12.08.22 um 18:25 schrieb Mark Nelson:
>>>>>>
>>>>>> Hi Zoltan,
>>>>>>
>>>>>>
>>>>>> Sadly it looks like some of the debug symbols are messed up, which
>>>>>> makes things a little rough to debug from this. On the write path,
>>>>>> if you look at the bstore_kv_sync thread:
>>>>>>
>>>>>>
>>>>>> Good state write test:
>>>>>>
>>>>>>    + 86.00% FileJournal::_open_file(long, long, bool)
>>>>>>    |+ 86.00% ???
>>>>>>    + 11.50% ???
>>>>>>    |+ 0.20% ???
>>>>>>
>>>>>> Bad state write test:
>>>>>>
>>>>>>    Thread 2869223 (bstore_kv_sync) - 1000 samples
>>>>>>    + 73.70% FileJournal::_open_file(long, long, bool)
>>>>>>    |+ 73.70% ???
>>>>>>    + 24.90% ???
>>>>>>
>>>>>> That's really strange, because FileJournal is part of filestore.
>>>>>> There also seems to be stuff in this trace regarding
>>>>>> BtrfsFileStoreBackend and FuseStore::Stop(). Seems like the debug
>>>>>> symbols are basically just wrong. Is it possible that somehow
>>>>>> you ended up with debug symbols for the wrong version of ceph or
>>>>>> something?
>>>>>>
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>>
>>>>>> On 8/12/22 11:13, Zoltan Langi wrote:
>>>>>>> Hi Mark,
>>>>>>>
>>>>>>> I managed to profile one OSD before and after the bad state. We
>>>>>>> have downgraded ceph to 14.2.22.
>>>>>>>
>>>>>>> Good state with read test:
>>>>>>>
>>>>>>> https://pastebin.com/etreYzQc
>>>>>>>
>>>>>>> Good state with write test:
>>>>>>>
>>>>>>> https://pastebin.com/qrN5MaY6
>>>>>>>
>>>>>>> Bad state with read test:
>>>>>>>
>>>>>>> https://pastebin.com/S1pRiJDq
>>>>>>>
>>>>>>> Bad state with write test:
>>>>>>>
>>>>>>> https://pastebin.com/dEv05eGV
>>>>>>>
>>>>>>> Do you see anything obvious that could give us a clue what is
>>>>>>> going on?
>>>>>>>
>>>>>>> Many thanks!
>>>>>>>
>>>>>>> Zoltan
>>>>>>>
>>>>>>> Am 02.08.22 um 19:01 schrieb Mark Nelson:
>>>>>>>> Ah, too bad!  I suppose that was too easy. :)
>>>>>>>>
>>>>>>>>
>>>>>>>> Ok, so my two lines of thought:
>>>>>>>>
>>>>>>>> 1) Something related to the weird performance issues we ran
>>>>>>>> into on the PM983 after lots of fragmented writes over the
>>>>>>>> drive.  I think we've worked around that with the fix in
>>>>>>>> quincy, but perhaps you are hitting a manifestation of it that
>>>>>>>> we haven't. The way to investigate that is to look at the NVMe
>>>>>>>> block device stats with collectl or iostat and see if you see
>>>>>>>> higher io service times and longer device queue lengths in the
>>>>>>>> "bad" case vs the "good" case. If you do, it means that
>>>>>>>> something is making the drive(s) themselves laggy at fulfilling
>>>>>>>> requests.  You might have to look at a bunch of drives in case
>>>>>>>> there's one acting up before the others do, but that's pretty
>>>>>>>> easy to do with either tool.  For extra bonus points you can use
>>>>>>>> blktrace/blkparse/iowatcher to see if writes are really being
>>>>>>>> fragmented (there could be other causes of a drive becoming slow).
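>>>>>>>>
>>>>>>>> As a concrete starting point, something like this (a sketch; the
>>>>>>>> device names are placeholders and the column names differ a bit
>>>>>>>> between sysstat versions, e.g. aqu-sz vs avgqu-sz):
>>>>>>>>
>>>>>>>>     # extended per-device stats once per second; watch w_await
>>>>>>>>     # (write request latency) and aqu-sz (average queue length)
>>>>>>>>     iostat -xm /dev/nvme0n1 /dev/nvme1n1 1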
>>>>>>>>
>>>>>>>> The other thing that comes to mind is RocksDB...either due to
>>>>>>>> just having more metadata to deal with, or perhaps as a result
>>>>>>>> of having a ton more objects, not enough onode cache, and
>>>>>>>> having to issue onode reads to rocksdb when you have cache
>>>>>>>> misses.  I believe we have hit rate perf counters for the onode
>>>>>>>> cache, but you can get a hint if you see a bunch of reads
>>>>>>>> (specifically to the DB partition if you've configured it to be
>>>>>>>> separate) during writes. You may also want to look at the
>>>>>>>> compaction stats in the OSD log just to make sure it's not
>>>>>>>> super laggy.  You can run this tool against the log to see a
>>>>>>>> summary and details regarding individual compaction events:
>>>>>>>>
>>>>>>>>
>>>>>>>> https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Those would be the first places I would look.  If neither are
>>>>>>>> helpful, you could try profiling the OSDs using uwpmp as I
>>>>>>>> mentioned earlier.
>>>>>>>>
>>>>>>>>
>>>>>>>> Mark
>>>>>>>>
>>>>>>>>
>>>>>>>> On 8/2/22 09:50, Zoltan Langi wrote:
>>>>>>>>> Hey Mark, I have to take back the claim that these options solve
>>>>>>>>> the issue. I just ran my tests twice again and here are the results:
>>>>>>>>>
>>>>>>>>> https://ibb.co/9vY5xgS
>>>>>>>>> https://ibb.co/71pSCQv
>>>>>>>>>
>>>>>>>>> We're back to where it was; performance dropped again today. So it
>>>>>>>>> seems like the
>>>>>>>>>
>>>>>>>>>     bdev_enable_discard = true
>>>>>>>>>     bdev_async_discard = true
>>>>>>>>>
>>>>>>>>> options didn't make any difference in the end and the problem
>>>>>>>>> reappeared, just a bit later.
>>>>>>>>>
>>>>>>>>> I have read all the articles you posted, thanks for that;
>>>>>>>>> however, I am still struggling with this. Any other
>>>>>>>>> recommendations or ideas on what to check?
>>>>>>>>>
>>>>>>>>> Thanks a lot,
>>>>>>>>> Zoltan
>>>>>>>>>
>>>>>>>>> Am 01.08.22 um 17:53 schrieb Mark Nelson:
>>>>>>>>>> Hi Zoltan,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> It doesn't look like your pictures showed up for me at least.
>>>>>>>>>> Very interesting results though! Are (or were) the drives
>>>>>>>>>> particularly full when you've run into performance problems
>>>>>>>>>> that the discard option appears to fix?  There have been some
>>>>>>>>>> discussions in the past regarding online discard vs periodic
>>>>>>>>>> discard ala fstrim.  The gist of it is that there are
>>>>>>>>>> performance implications for online trim, but there are
>>>>>>>>>> (eventual) performance implications if you let the drive get too
>>>>>>>>>> full before doing an offline trim (which itself can be
>>>>>>>>>> impactful).  There's been quite a bit of discussion about it
>>>>>>>>>> on the mailing list and in PRs:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/YFQKVCAMHHQ72AMTL2MQAA7QN7YCJ7GA/
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> https://github.com/ceph/ceph/pull/14727
>>>>>>>>>>
>>>>>>>>>> Specifically, see this comment regarding how it can affect
>>>>>>>>>> garbage collection but also burst TRIM command effect on the
>>>>>>>>>> FTL:
>>>>>>>>>>
>>>>>>>>>> https://github.com/ceph/ceph/pull/14727#issuecomment-342399578
>>>>>>>>>>
>>>>>>>>>> And some performance testing by Igor here:
>>>>>>>>>>
>>>>>>>>>> https://github.com/ceph/ceph/pull/20723#pullrequestreview-104218724
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> It would be very interesting to see if you see a similar
>>>>>>>>>> performance improvement if we had an fstrim-like discard
>>>>>>>>>> option you could run before the new test.  There's a tracker
>>>>>>>>>> ticket for it, but afaik no one has actually implemented
>>>>>>>>>> anything yet:
>>>>>>>>>>
>>>>>>>>>> https://tracker.ceph.com/issues/38494
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Regarding whether it's safe to have (async) discard
>>>>>>>>>> enabled... Maybe? :)  We left it disabled by default because
>>>>>>>>>> we didn't want to deal with having to situationally disable
>>>>>>>>>> it for drives with buggy firmwares and some of the other
>>>>>>>>>> associated problems with online discard. Having said that, in
>>>>>>>>>> your case it sounds like enabling it is yielding good results
>>>>>>>>>> with the PM983 and your workload.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> There's a really good (but slightly old now) article on LWN
>>>>>>>>>> detailing the discussion the kernel engineers were having
>>>>>>>>>> regarding all of this at the LSFMM Summit a few years ago:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> https://lwn.net/Articles/787272/
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> In the comments, Chris Mason mentions the same delete issue
>>>>>>>>>> we probably need to tackle (see Igor's comment linked above):
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> "The XFS async trim implementation is pretty reasonable, and
>>>>>>>>>> it can be a big win in some workloads. Basically anything
>>>>>>>>>> that gets pushed out of the critical section of the
>>>>>>>>>> transaction commit can have a huge impact on performance. The
>>>>>>>>>> major thing it's missing is a way to throttle new deletes
>>>>>>>>>> from creating a never ending stream of discards, but I don't
>>>>>>>>>> think any of the filesystems are doing that yet."
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Mark
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 8/1/22 08:36, Zoltan Langi wrote:
>>>>>>>>>>> Hey Frank and Mark,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for your response and sorry about coming back a bit
>>>>>>>>>>> late, but I needed to test something that takes time.
>>>>>>>>>>>
>>>>>>>>>>> How I reproduced this issue: I created 100 volumes with
>>>>>>>>>>> ceph-csi, ran 3 sets of tests, let the volumes sit for 48
>>>>>>>>>>> hours, then deleted the volumes, recreated them and ran
>>>>>>>>>>> the tests 3x in a row.
>>>>>>>>>>>
>>>>>>>>>>> If you look at the picture:
>>>>>>>>>>>
>>>>>>>>>>> picture1
>>>>>>>>>>>
>>>>>>>>>>> The picture above clearly shows the performance degradation.
>>>>>>>>>>> We ran the first test (first read, then write) from 09:20 to
>>>>>>>>>>> 09:45. At 11:00 we ran the next test, finishing at 11:20, and it
>>>>>>>>>>> was already struggling: the write IOPS dropped a lot, while the
>>>>>>>>>>> read IOPS looked more like a sawtooth graph. At 11:40 I reran the
>>>>>>>>>>> test, and now the writes have normalised at a bad level; no more
>>>>>>>>>>> sawtooth pattern, the writes just stick to the bad level.
>>>>>>>>>>>
>>>>>>>>>>> Let's have a look at the bandwidth graph:
>>>>>>>>>>>
>>>>>>>>>>> picture2
>>>>>>>>>>>
>>>>>>>>>>> Compare the 09:40-10:05 part with the 12:00-12:25 part. Those
>>>>>>>>>>> are identical tests, and the bandwidth dropped a lot. The only way
>>>>>>>>>>> to recover from this state is to recreate the bluestore devices
>>>>>>>>>>> from scratch.
>>>>>>>>>>>
>>>>>>>>>>> We have enabled the following options in rook-ceph:
>>>>>>>>>>>
>>>>>>>>>>>     bdev_enable_discard = true
>>>>>>>>>>>     bdev_async_discard = true
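>>>>>>>>>>>
>>>>>>>>>>> For reference, one way to set these cluster-wide is roughly (a
>>>>>>>>>>> sketch; with rook this would typically go through the
>>>>>>>>>>> rook-config-override ConfigMap instead):
>>>>>>>>>>>
>>>>>>>>>>>     ceph config set osd bdev_enable_discard true
>>>>>>>>>>>     ceph config set osd bdev_async_discard true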
>>>>>>>>>>>
>>>>>>>>>>> Now let's have a look at the speed comparison:
>>>>>>>>>>>
>>>>>>>>>>> Data from last Friday, before the volumes sat for 48 hours:
>>>>>>>>>>>
>>>>>>>>>>> picture3
>>>>>>>>>>>
>>>>>>>>>>> picture4
>>>>>>>>>>>
>>>>>>>>>>> We see 3 tests. Test 1: 16:40-19:00 Test 2: 20:00-21:35 and
>>>>>>>>>>> Test 3: 21:40-23:30. We see slight write degradation, but it
>>>>>>>>>>> should stay the same for the rest of the time.
>>>>>>>>>>>
>>>>>>>>>>> Now let's see the test runs from today:
>>>>>>>>>>>
>>>>>>>>>>> picture5
>>>>>>>>>>>
>>>>>>>>>>> picture6
>>>>>>>>>>>
>>>>>>>>>>> We see 3 tests. Test 1: 09:20-11:00 Test 2: 11:05-12:40 Test
>>>>>>>>>>> 3: 13:10-14:40.
>>>>>>>>>>>
>>>>>>>>>>> As we see, after enabling these options, the system is
>>>>>>>>>>> delivering constant speeds without degradation and huge
>>>>>>>>>>> performance loss like before.
>>>>>>>>>>>
>>>>>>>>>>> Has anyone come across behaviour like this
>>>>>>>>>>> before? We haven't seen any mention of these options in the
>>>>>>>>>>> official docs, just in pull requests. Is it safe to use these
>>>>>>>>>>> options in production at all?
>>>>>>>>>>>
>>>>>>>>>>> Many thanks,
>>>>>>>>>>> Zoltan
>>>>>>>>>>>
>>>>>>>>>>> Am 25.07.22 um 21:42 schrieb Mark Nelson:
>>>>>>>>>>>> I don't think so if this is just plain old RBD.  RBD
>>>>>>>>>>>> shouldn't require a bunch of RocksDB iterator seeks in the
>>>>>>>>>>>> read/write hot path and writes should pretty quickly clear
>>>>>>>>>>>> out tombstones as part of the memtable flush and compaction
>>>>>>>>>>>> process even in the slow case. Maybe in some kind of
>>>>>>>>>>>> pathologically bad read-only corner case with no onode
>>>>>>>>>>>> cache but it would be bad for more reasons than what's
>>>>>>>>>>>> happening in that tracker ticket imho (even reading onodes
>>>>>>>>>>>> from rocksdb block cache is significantly slower than
>>>>>>>>>>>> BlueStore's onode cache).
>>>>>>>>>>>>
>>>>>>>>>>>> If RBD mirror (or snapshots) are involved that could be a
>>>>>>>>>>>> different story though.  I believe to deal with deletes in
>>>>>>>>>>>> that case we have to go through iteration/deletion loops
>>>>>>>>>>>> that have the same root issue as what's going on in the tracker
>>>>>>>>>>>> ticket, and it can end up impacting client IO. Gabi and Paul
>>>>>>>>>>>> are testing/reworking how the snapmapper works, and I've
>>>>>>>>>>>> started a sort of a catch-all PR for improving our RocksDB
>>>>>>>>>>>> tunings/glue here:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/ceph/ceph/pull/47221
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Mark
>>>>>>>>>>>>
>>>>>>>>>>>> On 7/25/22 12:48, Frank Schilder wrote:
>>>>>>>>>>>>> Could it be related to this performance death trap:
>>>>>>>>>>>>> https://tracker.ceph.com/issues/55324 ?
>>>>>>>>>>>>> =================
>>>>>>>>>>>>> Frank Schilder
>>>>>>>>>>>>> AIT Risø Campus
>>>>>>>>>>>>> Bygning 109, rum S14
>>>>>>>>>>>>>
>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>> From: Mark Nelson <mnelson@xxxxxxxxxx>
>>>>>>>>>>>>> Sent: 25 July 2022 18:50
>>>>>>>>>>>>> To: ceph-users@xxxxxxx
>>>>>>>>>>>>> Subject:  Re: weird performance issue on ceph
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Zoltan,
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> We have a very similar setup with one of our upstream
>>>>>>>>>>>>> community
>>>>>>>>>>>>> performance test clusters.  60 4TB PM983 drives spread
>>>>>>>>>>>>> across 10 nodes.
>>>>>>>>>>>>> We get similar numbers to what you are initially seeing
>>>>>>>>>>>>> (scaled down to
>>>>>>>>>>>>> 60 drives) though with somewhat lower random read IOPS (we
>>>>>>>>>>>>> tend to max
>>>>>>>>>>>>> out at around 2M with 60 drives on this HW). I haven't
>>>>>>>>>>>>> seen any issues
>>>>>>>>>>>>> with quincy like what you are describing, but on this
>>>>>>>>>>>>> cluster most of
>>>>>>>>>>>>> the tests have been on bare metal.  One issue we have
>>>>>>>>>>>>> noticed with the
>>>>>>>>>>>>> PM983 drives is that they may be more susceptible to
>>>>>>>>>>>>> non-optimal write
>>>>>>>>>>>>> patterns causing slowdowns vs other NVMe drives in the
>>>>>>>>>>>>> lab. We actually
>>>>>>>>>>>>> had to issue a last minute PR for quincy to change the
>>>>>>>>>>>>> disk allocation
>>>>>>>>>>>>> behavior to deal with it.  See:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://github.com/ceph/ceph/pull/45771
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://github.com/ceph/ceph/pull/45884
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't *think* this is the issue you are hitting since
>>>>>>>>>>>>> the fix in
>>>>>>>>>>>>> #45884 should have taken care of it, but it might be
>>>>>>>>>>>>> something to keep
>>>>>>>>>>>>> in the back of your mind.  Otherwise, the fact that you
>>>>>>>>>>>>> are seeing such
>>>>>>>>>>>>> a dramatic difference across both small and large
>>>>>>>>>>>>> read/write benchmarks
>>>>>>>>>>>>> makes me think there is something else going on. Is there
>>>>>>>>>>>>> any chance
>>>>>>>>>>>>> that some other bottleneck is being imposed when the pods
>>>>>>>>>>>>> and volumes
>>>>>>>>>>>>> are deleted and recreated? Might be worth looking at
>>>>>>>>>>>>> memory and CPU
>>>>>>>>>>>>> usage of the OSDs in all of the cases and RocksDB
>>>>>>>>>>>>> flushing/compaction
>>>>>>>>>>>>> stats from the OSD logs.  Also a quick check with
>>>>>>>>>>>>> collectl/iostat/sar
>>>>>>>>>>>>> during the slow case to make sure none of the drives are
>>>>>>>>>>>>> showing latency
>>>>>>>>>>>>> and built up IOs in the device queues.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you want to go deeper down the rabbit hole you can try
>>>>>>>>>>>>> running my
>>>>>>>>>>>>> wallclock profiler against one of your OSDs in the
>>>>>>>>>>>>> fast/slow cases, but
>>>>>>>>>>>>> you'll have to make sure it has access to debug symbols:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://github.com/markhpc/uwpmp.git
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> run it like:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> ./uwpmp -n 10000 -p <pid of ceph-osd> -b libdw > output.txt
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> If the libdw backend is having problems you can use -b
>>>>>>>>>>>>> libdwarf instead,
>>>>>>>>>>>>> but it's much slower and takes longer to collect as many
>>>>>>>>>>>>> samples (you
>>>>>>>>>>>>> might want to do -n 1000 instead).
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mark
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/25/22 11:17, Zoltan Langi wrote:
>>>>>>>>>>>>>> Hi people, we've got an interesting issue here and I would
>>>>>>>>>>>>>> like to ask if
>>>>>>>>>>>>>> anyone has seen anything like this before.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> First: our system:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The ceph version is 17.2.1, but we have also seen the same
>>>>>>>>>>>>>> behaviour on 16.2.9.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Our kernel version is 5.13.0-51 and our NVMe disks are
>>>>>>>>>>>>>> Samsung PM983.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In our deployment we have 12 nodes in total and 72 disks;
>>>>>>>>>>>>>> 2 OSDs per
>>>>>>>>>>>>>> disk makes 144 OSDs in total.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The deployment was done by ceph-rook with default values:
>>>>>>>>>>>>>> 6 CPU cores
>>>>>>>>>>>>>> and 4GB of memory allocated to
>>>>>>>>>>>>>> each OSD.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The issue we are experiencing: We create for example 100
>>>>>>>>>>>>>> volumes via
>>>>>>>>>>>>>> ceph-csi and attach them to kubernetes pods via rbd. We
>>>>>>>>>>>>>> talk about 100
>>>>>>>>>>>>>> volumes in total, 2GB each. We run fio performance tests
>>>>>>>>>>>>>> (read, write,
>>>>>>>>>>>>>> mixed) on them so the volumes are being used heavily.
>>>>>>>>>>>>>> Ceph delivers
>>>>>>>>>>>>>> good performance, no problems at all.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Performance we get for example: read iops 3371027 write
>>>>>>>>>>>>>> iops: 727714
>>>>>>>>>>>>>> read bw: 79.9 GB/s write bw: 31.2 GB/s
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> After the tests are complete, these volumes just sit
>>>>>>>>>>>>>> there doing
>>>>>>>>>>>>>> nothing for a longer period of time, for example 48 hours.
>>>>>>>>>>>>>> After that,
>>>>>>>>>>>>>> we clean up the pods, clean up the volumes and delete them.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We recreate the volumes and pods once more, same spec (2GB
>>>>>>>>>>>>>> each, 100 pods),
>>>>>>>>>>>>>> then run the same tests once again. We don't even get
>>>>>>>>>>>>>> half the
>>>>>>>>>>>>>> performance that we measured before leaving the
>>>>>>>>>>>>>> pods sitting
>>>>>>>>>>>>>> there doing nothing for 2 days.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Performance we get after deleting the volumes and
>>>>>>>>>>>>>> recreating them,
>>>>>>>>>>>>>> rerun the tests: read iops: 1716239 write iops: 370631
>>>>>>>>>>>>>> read bw: 37.8
>>>>>>>>>>>>>> GB/s write bw: 7.47 GB/s
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We can clearly see that it's a big performance loss.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If we clean up the ceph deployment, wipe the disks out
>>>>>>>>>>>>>> completely and
>>>>>>>>>>>>>> redeploy, the cluster once again delivers great
>>>>>>>>>>>>>> performance.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We haven't seen such behaviour with ceph version 14.x.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Has anyone seen such a thing? Thanks in advance!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Zoltan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



