Hi Everyone,
David Orman mentioned in the CLT meeting this morning that there are a
number of people on the mailing list asking about performance
regressions in Pacific+ vs older releases. I want to document a couple
of the bigger ones that we know about for the community's benefit. I
want to be clear that Pacific does have a number of performance
improvements over previous releases, and we do have tests showing
improvement relative to Nautilus (especially RBD on NVMe drives). Some
of these regressions are going to have a bigger effect for some users
than others. Having said that, let's get into them.
********** Regression #1: RocksDB Log File Recycling **********
Effects: More metadata updates to the underlying FS, higher
write-amplification (observed by Digital Ocean), and slower performance,
especially when the WAL device is saturated.
When bluestore was created back in 2015 Sage implemented an optimization
in RocksDB that allowed WAL log files to be recycled. The idea is that
instead of deleting logs when they are flushed, rocksdb can simply reuse
them. The benefit here is that it allows records to be written and
fdatasync can be called without touching the inode for every IO. Sage
did a pretty good job of explaining the benefit in the PR available here:
https://github.com/facebook/rocksdb/pull/746
After much discussion, that PR was merged and received a couple of bug
fixes over the years:
Locking bug fix from Somnath back in 2016:
https://github.com/facebook/rocksdb/pull/1313
Another bug fix from ajkr in 2020:
https://github.com/facebook/rocksdb/pull/5900
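To make the recycling idea above a bit more concrete, here is a minimal
sketch (Linux only, and not how RocksDB or BlueFS actually implement it).
It contrasts appending to a fresh log file, where every write changes the
file size, with overwriting a file that is already allocated, where the
size never changes and fdatasync only has to flush data. The record size,
log size, and function names are all made up for illustration.

```python
import os
import tempfile

RECORD = b"x" * 4096          # one 4 KiB log record (size is an assumption)
LOG_SIZE = 16 * len(RECORD)   # hypothetical fixed size of a recycled log


def append_to_new_log(path, n_records):
    """Append records to a fresh file; every write extends the file."""
    sizes = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(n_records):
            os.write(fd, RECORD)
            os.fdatasync(fd)  # size changed, so size metadata is flushed too
            sizes.append(os.fstat(fd).st_size)
    finally:
        os.close(fd)
    return sizes


def overwrite_recycled_log(path, n_records):
    """Reuse an already-allocated file; writes land inside the old extent."""
    sizes = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # Stand-in for a log file left over from a previous pass.
        os.write(fd, b"\0" * LOG_SIZE)
        os.fsync(fd)
        for i in range(n_records):
            os.pwrite(fd, RECORD, i * len(RECORD))
            os.fdatasync(fd)  # size unchanged, so only data needs flushing
            sizes.append(os.fstat(fd).st_size)
    finally:
        os.close(fd)
    return sizes


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(append_to_new_log(os.path.join(tmp, "fresh.log"), 3))
        print(overwrite_recycled_log(os.path.join(tmp, "recycled.log"), 3))
```

The file size is only a proxy here: the point is that in the recycled case
nothing about the inode has to change on each write, which is where the
savings come from.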
In 2020, the RocksDB folks discovered there is a fundamental flaw in the
way that the original PR works. It turns out that the feature to
recycle log files is incompatible with RocksDB's kPointInTimeRecovery,
kAbsoluteConsistency, and kTolerateCorruptedTailRecords recovery modes.
One of the later PRs included a very good and concise description of
the problem:
"The two features are naturally incompatible. WAL recycling expects the
recovery to succeed upon encountering a corrupt record at the point
where new data ends and recycled data remains at the tail. However,
WALRecoveryMode::kTolerateCorruptedTailRecords must fail upon
encountering any such corrupt record, as it cannot differentiate between
this and a real corruption, which would cause committed updates to be
truncated."
More background discussion on the RocksDB side available in these PRs
and comments:
https://github.com/facebook/rocksdb/pull/6351
https://github.com/facebook/rocksdb/pull/6351#issuecomment-672838284
https://github.com/facebook/rocksdb/pull/7252
https://github.com/facebook/rocksdb/pull/7271
On the Ceph side, there was a PR to try to re-enable the old behavior
which we rejected as unsafe based on the analysis by the RocksDB folks
(which we agree with):
https://github.com/ceph/ceph/pull/36579
Sage also commented about a potential way forward:
https://github.com/ceph/ceph/pull/36579#issuecomment-870884583
"tbh I think the best approach would be to create a new WAL file format
that (1) is 4k block aligned and (2) has a header for each block that
indicates the generation # for that log file (so we can see whether what
we read is from a previous pass or corruption). That would be a fair bit
of effort, though."
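As a rough illustration of Sage's suggestion (a toy sketch only, not a
proposed on-disk format), the snippet below tags every 4k block with a
checksum and a generation number. On replay, a block with a valid
checksum but a different generation is recognized as leftover data from
a previous pass, while a checksum mismatch is reported as corruption.
The header layout, field sizes, and function names are all assumptions
made for this example, and it ignores torn writes and never-written
blocks.

```python
import struct
import zlib

BLOCK_SIZE = 4096
HEADER_SIZE = 16                      # crc32 (4) + generation (8) + length (4)
MAX_PAYLOAD = BLOCK_SIZE - HEADER_SIZE


def encode_block(generation, payload):
    """Build one 4k block: crc32 | generation | length | payload | padding."""
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload does not fit in one block")
    body = struct.pack("<QI", generation, len(payload)) + payload
    body = body.ljust(BLOCK_SIZE - 4, b"\0")
    return struct.pack("<I", zlib.crc32(body)) + body


def replay(log, generation):
    """Return (payloads, reason) where reason says why replay stopped."""
    payloads = []
    for off in range(0, len(log), BLOCK_SIZE):
        block = bytes(log[off:off + BLOCK_SIZE])
        (crc,) = struct.unpack_from("<I", block, 0)
        if zlib.crc32(block[4:]) != crc:
            return payloads, "corruption"
        gen, length = struct.unpack_from("<QI", block, 4)
        if gen != generation:
            # Valid block from an earlier pass: a clean end of the new log.
            return payloads, "stale"
        payloads.append(block[HEADER_SIZE:HEADER_SIZE + length])
    return payloads, "end"


if __name__ == "__main__":
    # Generation 1 fills the file, then generation 2 reuses the first blocks.
    log = bytearray(b"".join(encode_block(1, b"old-%d" % i) for i in range(4)))
    for i in range(2):
        log[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE] = encode_block(2, b"new-%d" % i)
    print(replay(log, 2))

    log[BLOCK_SIZE + 20] ^= 0xFF      # damage the second new block
    print(replay(log, 2))
```

With a per-block generation, "old data at the tail" and "real corruption"
become distinguishable, which is exactly what the current recycled WAL
format cannot do.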
On a side note, Igor also tried to disable WAL file recycling in a
backport to Octopus but was thwarted by a BlueFS bug. That PR was
eventually reverted, leaving the old (dangerous!) behavior in place:
https://github.com/ceph/ceph/pull/45040
https://github.com/ceph/ceph/pull/47053
The gist of it is that releases of Ceph older than Pacific are
benefiting from the speed improvement of log file recycling but may be
vulnerable to the issue as described above. This is likely one of the
more impactful regressions that people upgrading to Pacific or later
releases are seeing.
Josh Baergen from Digital Ocean followed up that there is a slew of
additional information on this issue in the following tracker as well:
https://tracker.ceph.com/issues/58530
********** Regression #1 Potential Fixes **********
Josh Baergen also mentioned that the write-amplification effect that was
observed due to this issue is mitigated by
https://github.com/ceph/ceph/pull/48915 which was merged into 16.2.11
back in December. That, however, does not improve write IOPS amplification.
Beyond that, we could follow Sage's idea and try to implement a new WAL
file format. The risks here are that it could be a lot of work and we
don't know if there is really any appetite on the RocksDB side to merge
something like this upstream. My personal take is that we're already
kind of abusing the RocksDB WAL for short lived PG log updates and I'm
not thrilled about trying to add further code into RocksDB to try and
support our use cases (though there is benefit here that goes beyond
Ceph). We already maintain a custom version of RocksDB's LRU cache in
our code to tie into our memory autotuning system but it would be really
nice to avoid custom code like that in the future.
One alternative: Igor Fedotov implemented a prototype WAL inside
bluestore itself and we saw very good initial results from it with the
RocksDB WAL disabled. These can be seen on slide 24 of my performance
deck from Cephalocon 2023:
https://www.linkedin.com/in/markhpc/overlay/experience/2113859303/multiple-media-viewer/?profileId=ACoAAAHzuIEB_T2FuVPM2terPw14ffzShLXPbbo&treasuryMediaId=1635524697350
If Igor (or others) want to continue this work, I personally would be in
favor of trying to move the WAL into Bluestore itself. I suspect we can
make better decisions about PG log life cycles and have better BlueFS
integration than what RocksDB provides us. Igor probably has a better
idea of the pitfalls here though so I think we should hear out his
thoughts on whether this is the right path forward. Igor also mentioned
that he is continuing to work on his Bluestore WAL prototype with
promising results, but that PG Log will (as expected) likely require a
different solution that looks more like a specialized ring buffer. I
think moving the WAL out of RocksDB is a good step toward that eventual
goal.
********** Regression #2: (re-)Enabling BlueFS Buffered IO **********
Effects: Works around unexpected readahead behavior in RocksDB by
utilizing the underlying kernel page cache. Hurts write performance on fast
devices.
We're stuck between a bit of a rock and a hard place here. Over the
years we have see-sawed back and forth regarding when we should or
should not use buffered IO:
https://github.com/ceph/ceph/pull/11012
https://github.com/ceph/ceph/pull/11059
https://github.com/ceph/ceph/pull/18172
https://github.com/ceph/ceph/pull/20542
https://github.com/ceph/ceph/pull/34224
https://github.com/ceph/ceph/pull/38044 <-- lots of discussion here
The gist of it is that there are upsides and downsides to having
bluefs_buffered_io=true. Direct IO is faster in some scenarios,
especially more recent write tests on NVMe drives. The trade-off is
that RocksDB really seems to benefit from kernel buffer cache and there
are other scenarios where bluefs_buffered_io is a big win. Two years ago
Adam and I did a walkthrough of the RocksDB code to try to understand
the behavior regarding RocksDB readahead and we couldn't understand why
it was re-reading data from the file system so often (or in the case of
buffered IO the page cache!). I wrote up our walkthrough of the code here:
https://github.com/ceph/ceph/pull/38044#issuecomment-790157415
********** Regression #2 Potential Fixes **********
In a recent discussion with Mark Callaghan (of MyRocks/RocksDB
performance tuning fame), he pointed out that RocksDB has an option to
pre-populate the block cache with the data from SSTs created by memtable
flush and that might help when O_DIRECT is used:
https://github.com/facebook/rocksdb/blob/main/include/rocksdb/table.h#L600
We may want to experiment to see if this helps keep the block cache
pre-populated after compaction and avoid (re)reads from the disk during
iteration. We also might want to revisit this topic in general with the
compact-on-iteration feature that was recently added and backported to
Pacific in 16.2.13. I'm still a little concerned, however, that we were
seeing repeated overlapping reads for the same ranges during iteration
that I would have expected to be cached by RocksDB on a previous read.
Ultimately I think many of us would prefer to move entirely to direct IO
but there's more work to do to figure this one out.
Josh Baergen provided further advice here: They have had good luck
enabling buffered IO for RGW bucket index OSDs and disabling it
everywhere else. This assumes that bucket indexes are on their own
dedicated OSDs though, and personally I am a bit wary of hitting slow
cases in RocksDB even on "regular" OSDs, but this might be something to
consider as they've had good luck with this configuration for over a year.
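For anyone who wants to try the split configuration Josh describes, here
is a small sketch that just builds the "ceph config set" commands: direct
IO as the default for all OSDs, with buffered IO overridden per OSD for
the bucket index ones. It only prints the commands, it does not run them.
The OSD IDs are hypothetical, and whether an OSD restart is needed for the
change to take effect is something to check for your release.

```python
def buffered_io_commands(index_osd_ids):
    """Build commands that disable buffered IO globally for OSDs and
    re-enable it only on the given (hypothetical) bucket index OSDs."""
    commands = ["ceph config set osd bluefs_buffered_io false"]
    for osd_id in sorted(set(index_osd_ids)):
        commands.append(
            f"ceph config set osd.{osd_id} bluefs_buffered_io true"
        )
    return commands


if __name__ == "__main__":
    # Example: OSDs 10 and 12 are assumed to be dedicated bucket index OSDs.
    for command in buffered_io_commands([12, 10]):
        print(command)
```

Review the generated commands before applying them, since this only makes
sense if the bucket indexes really do live on their own dedicated OSDs.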
********** Regression #3: RadosGW Coroutine and Request Timeout Changes **********
Effects: Higher RadosGW CPU usage, lower performance, especially for
small object workloads
Back when Pacific was released it was observed that RadosGW was showing
much higher CPU usage and lower performance vs Nautilus for small (4KB)
objects. It's likely that larger objects are affected as well, though to
a lesser degree. A git bisection was performed and the results are
summarized in the introduction section of the following RGW performance
analysis blog post:
https://ceph.io/en/news/blog/2023/reef-freeze-rgw-performance/
The bisection uncovered two primary PRs that were causing performance
regression:
https://github.com/ceph/ceph/pull/31580
https://github.com/ceph/ceph/pull/35355
The good news is that once those PRs were identified, the RGW team
started working to improve things, especially for #35355:
https://github.com/ceph/ceph/pull/43761 <-- Fixes issues introduced in
#35355, backported to Pacific in 2022
********** Regression #3 Potential Fixes **********
Quincy (and due to the backport likely Pacific) is showing significantly
better behavior in recent tests due to PR #43761. The effects of #31580
are still present, but are considered a necessary trade-off. Other
improvements since then may be helping, but we'll need to continue to
make up the difference in other areas and start really investigating
where we are spending cycles/time, especially in Reef.
********** Regression #4: Gradually slowing down OSDs **********
Effects: Significant slowdown after 1-2 weeks of OSD runtime
Igor Fedotov pointed this one out in discussion earlier today:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/OXAUOK7CQXWYGQNT7LWHMLPRB4KNIFXT/
This one is pretty new and there is not much there yet other than
perhaps low memory (and cache?) usage despite regular IO workload.
Onode misses can absolutely cause performance degradation, but it's not
clear yet whether this is a memory-related issue or something else. More
investigation needed. Hopefully we'll get perf data from the users who
encountered it to help diagnose what's going on here.
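If you want a quick look at onode cache behavior on an affected OSD, the
sketch below computes an onode hit ratio from the JSON produced by
"ceph daemon osd.<id> perf dump". Treat the counter names as an
assumption: I believe they are bluestore_onode_hits and
bluestore_onode_misses in Pacific and onode_hits and onode_misses in
later releases, so the code tries both, but check the output on your
own cluster. The sample numbers in the example are invented.

```python
import json

# Assumed counter names; verify against your own perf dump output.
HIT_KEYS = ("bluestore_onode_hits", "onode_hits")
MISS_KEYS = ("bluestore_onode_misses", "onode_misses")


def _first_present(section, keys):
    """Return the value of the first key found in the section."""
    for key in keys:
        if key in section:
            return section[key]
    raise KeyError(f"none of {keys} found in perf dump section")


def onode_hit_ratio(perf_dump):
    """Return hits / (hits + misses), or None if there were no lookups."""
    bluestore = perf_dump["bluestore"]
    hits = _first_present(bluestore, HIT_KEYS)
    misses = _first_present(bluestore, MISS_KEYS)
    total = hits + misses
    return hits / total if total else None


if __name__ == "__main__":
    # Invented sample input, shaped like a trimmed-down perf dump.
    sample = json.loads(
        '{"bluestore": {"bluestore_onode_hits": 900,'
        ' "bluestore_onode_misses": 100}}'
    )
    print(onode_hit_ratio(sample))
```

Sampling this periodically over the 1-2 weeks it takes for the slowdown to
show up would make it easier to tell whether onode misses track the
degradation or not.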
********** Conclusion **********
There may be other performance issues that I'm not remembering, but
these are the big ones I can think of off the top of my head at the
moment. Hopefully this helps clarify what's going on if people are
seeing a regression, what to look for, and if they are hitting it, the
why behind it.
Thanks,
Mark
--
Best Regards,
Mark Nelson
Head of R&D (USA)
Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: mark.nelson@xxxxxxxxx