Re: [ceph-users] PSA: Long Standing Debian/Ubuntu build performance issue (fixed, backports in progress)

Hi Maged,


1) Good question.  Our CMake setup is complex enough that I suspect it's hard to answer that question definitively without auditing every sub-project for each build type.  My instinct was to explicitly set CMAKE_BUILD_TYPE to RelWithDebInfo in the rules file (this is what Canonical does), but that's not the direction we ended up going.
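For anyone curious about the mechanics, here's a minimal sketch of the failure mode (illustrative only, not the actual Ceph patch; the target and path names are made up).  A sub-build driven through CMake's ExternalProject_Add does not inherit the parent's CMAKE_BUILD_TYPE; it has to be forwarded explicitly, or the sub-build falls back to its own default, typically an empty build type with no optimization flags:

    include(ExternalProject)

    # Illustrative external RocksDB build.  If the CMAKE_ARGS line
    # below is missing, the sub-build silently uses an empty build
    # type and compiles without -O2, no matter how the parent
    # project itself was configured.
    ExternalProject_Add(rocksdb_ext
      SOURCE_DIR ${CMAKE_SOURCE_DIR}/src/rocksdb
      CMAKE_ARGS -DCMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE}
    )

The same reasoning applies to any sub-project built this way, which is why a definitive answer really does require auditing each one.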

2) I didn't create the submit latency chart, so I can't say what conditions were present on that cluster when it was taken.  I can say, though, that those results are consistent with my expectations based on the testing I did under heavy write I/O load.


Mark


On 2/9/24 07:18, Maged Mokhtar wrote:

Hi Mark,

Thanks a lot for highlighting this issue... I have two questions:

1) In the patch comments:

"but we fail to populate this setting down when building external projects. this is important when it comes to the projects which is critical to the performance. RocksDB is one of them."

Do we have similar issues with other sub-projects? boost? spdk?

2) The chart shown for "rocksdb submit latency", going from over 10 ms to below 5 ms: is this during write I/O under heavy load?

/Maged


On 08/02/2024 20:04, Mark Nelson wrote:
Hi Folks,

Recently we discovered a flaw in how the upstream Ubuntu and Debian builds of Ceph compile RocksDB.  It causes a variety of performance issues, including slower-than-expected write performance, 3X longer compaction times, and significantly higher-than-expected CPU utilization when RocksDB is heavily used.  The issue has now been fixed in main.  During the performance meeting today, however, Igor Fedotov observed that there were no backports for the fix in place.  He also rightly pointed out that, given the severity for affected users, it would be helpful to make an announcement about the issue.  I wanted to give a bit more background and make sure people are aware of what's going on and understand it.

1) Who's affected?

Anyone running an upstream Ubuntu/Debian build of Ceph from the last several years.  External builds from Canonical and Gentoo suffered from this issue as well, but were fixed independently.

2) How can you check?

There's no easy way to tell at the moment.  We are investigating whether running "strings" on the OSD executable may provide a clue.  For now, assume that if you are using our Debian/Ubuntu builds in a non-container configuration, you are affected.  Proxmox, for instance, was affected prior to adopting the fix.

3) Are Cephadm deployments affected?

Not as far as we know.  Ceph container builds are compiled slightly differently from stand-alone Debian builds.  They do not appear to suffer from the bug.

4) What versions of Ceph will get the fix?

Casey Bodley kindly offered to backport the fix to both Reef and Quincy.  He also verified that the fix builds properly with Pacific.  We now have three separate backport PRs for these releases:

https://github.com/ceph/ceph/pull/55500
https://github.com/ceph/ceph/pull/55501
https://github.com/ceph/ceph/pull/55502


Please feel free to reply if you have any questions!

Thanks,
Mark

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
