Hi Maged,
1) Good question. Our CMake setup is complex enough that I suspect it's
hard to answer that definitively without auditing every sub-project
under each build type. My instinct was to explicitly set
CMAKE_BUILD_TYPE to RelWithDebInfo in the rules file (this is what
Canonical does; see the sketch below), but that's not the direction we
ended up going.
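For reference, that approach would look roughly like this (a sketch
only, not the change that was merged; dh_auto_configure is the
debhelper command that drives the configure step for a CMake package):

  # Inside an override_dh_auto_configure target in debian/rules;
  # everything after "--" is handed through to the cmake configure step:
  dh_auto_configure -- -DCMAKE_BUILD_TYPE=RelWithDebInfo

Note that this only pins the top-level build type. Any external project
that runs its own cmake, RocksDB being the case at hand, still needs
the value forwarded to it explicitly, which is what the fix in main
does.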
2) I didn't create the submit latency chart, so I can't say what
conditions were present on that cluster when it was captured. I can
say, though, that those results are consistent with my expectations
based on the testing I did under heavy write I/O load.
Mark
On 2/9/24 07:18, Maged Mokhtar wrote:
Hi Mark,
Thanks a lot for highlighting this issue. I have two questions:
1) In the patch comments:
/"but we fail to populate this setting down when building external
projects. this is important when it comes to the projects which is
critical to the performance. RocksDB is one of them."/
Do we have similar issues with other sub-projects? Boost? SPDK?
2) The chart showing "rocksdb submit latency" going from over 10 ms to
below 5 ms: is this during write I/O under heavy load?
Maged
On 08/02/2024 20:04, Mark Nelson wrote:
Hi Folks,
Recently we discovered a flaw in how the upstream Ubuntu and Debian
builds of Ceph compile RocksDB. It causes a variety of performance
issues, including slower-than-expected write performance, 3x longer
compaction times, and significantly higher-than-expected CPU
utilization when RocksDB is under heavy load. The issue has now been
fixed in main. During today's performance meeting, however, Igor
Fedotov observed that no backports of the fix were in place. He also
rightly pointed out that, given its severity for affected users, it
would be helpful to announce the issue. I wanted to give a bit more
background and make sure people are aware of and understand what's
going on.
1) Who's affected?
Anyone running an upstream Ubuntu/Debian build of Ceph from the last
several years. External builds from Canonical and Gentoo suffered
from this issue as well, but were fixed independently.
2) How can you check?
There's no easy way to tell at the moment. We are investigating whether
running "strings" on the OSD executable may provide a clue; a sketch of
that kind of check follows below. For now, assume that you are affected
if you are using our Debian/Ubuntu builds in a non-container
configuration. Proxmox, for instance, was affected prior to adopting
the fix.
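For anyone who wants to experiment, the kind of check we have in mind
looks roughly like the following. The binary path and the grep pattern
are illustrative only; which strings (if any) reliably distinguish an
affected build is exactly what we are still investigating:

  # Dump the printable strings from the packaged OSD binary and scan
  # for anything hinting at how RocksDB was built; treat any hit as a
  # clue, not a verdict:
  strings /usr/bin/ceph-osd | grep -iE 'build_type|rocksdb'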
3) Are Cephadm deployments affected?
Not as far as we know. Ceph container builds are compiled slightly
differently from stand-alone Debian builds. They do not appear to
suffer from the bug.
4) What versions of Ceph will get the fix?
Casey Bodley kindly offered to backport the fix to both Reef and
Quincy. He also verified that the fix builds properly with Pacific.
We now have three separate backport PRs for these releases:
https://github.com/ceph/ceph/pull/55500
https://github.com/ceph/ceph/pull/55501
https://github.com/ceph/ceph/pull/55502
Please feel free to reply if you have any questions!
Thanks,
Mark