Re: Recent ceph.io Performance Blog Posts

Hi, Mark,

Thanks a lot for your insightful blog posts on Ceph performance. They are really informative and interesting.

When I read the Ceph OSD CPU Scaling post, I wondered how you scale the CPU cores per OSD; in other words, how do you allocate a specific number of cores to each OSD? I am interested in repeating some of your tests in hybrid NVMe/HDD environments.
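
My naive guess, which is purely an assumption on my part and not
something taken from the article, would be to pin each OSD process to a
fixed set of cores from outside Ceph, e.g. with a systemd drop-in (the
unit name and core IDs below are placeholders for my own setup):

# /etc/systemd/system/ceph-osd@0.service.d/cpuaffinity.conf
[Service]
CPUAffinity=0 1 2 3

followed by a daemon-reload and an OSD restart. If you instead used
cgroups, taskset, numactl, or simply varied the number of OSDs per box,
I would be glad to know.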

best regards,

samuel



huxiaoyu@xxxxxxxxxxxx
 
From: Mark Nelson
Date: 2022-11-10 18:24
To: Eshcar Hillel; ceph-users@xxxxxxx
Subject:  Re: Recent ceph.io Performance Blog Posts
Interesting, I see all of the usual suspects here with InlineSkipList 
KeyComparator being the big one.  I've rarely seen it this bad though.  
What model CPU are you running on?
 
 
There's a very good chance that you would benefit from the new 
(experimental) tuning in the RocksDB article.  Smaller buffers mean 
fewer key comparisons, and that's primarily what's holding you back based 
on this trace.  The trade-off used to be higher write amplification, but 
making sure that L0 and L1 are properly sized seems to help avoid that.  
We can't support that tuning yet, but if you have a test cluster it 
might be something worth trying out.
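 
To make that concrete, here is a rough sketch of the kind of options 
string I mean.  The values below are illustrative placeholders rather 
than the exact settings from the article; the idea is just that smaller 
memtables keep L0 small and that L1 (max_bytes_for_level_base) is sized 
to match it.  Everything goes through bluestore_rocksdb_options and 
requires an OSD restart to take effect:

[osd]
# Illustrative values only.  L0 grows to roughly write_buffer_size *
# level0_file_num_compaction_trigger = 512MB before compacting, and
# max_bytes_for_level_base (L1) is sized to match that.
bluestore_rocksdb_options = compression=kNoCompression,write_buffer_size=67108864,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,level0_file_num_compaction_trigger=8,max_bytes_for_level_base=536870912

The blog post has the settings we actually tested and the measured 
trade-offs.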
 
 
Mark
 
 
On 11/10/22 10:54, Eshcar Hillel wrote:
> I am attaching the profiling results of one of the OSDs; you are 
> invited to take a look at them.
> We already run multiple OSDs on a single NVMe drive: we have 3 nodes, 
> each with 4 NVMe SSDs, and we run 8 OSDs on each drive, for a total of 
> 96 OSDs.
> Indeed, the amount of data in each RocksDB instance is very small, and 
> the reported compaction write amplification is ~1.4 for each column 
> family (CF).
>
> My suggestion is not to shard across KeyValueDBs.
> Instead, I suggest having multiple bstore_kv_sync threads accessing the 
> same CFs concurrently.
> RocksDB has a feature that allows concurrent writes: 
> https://www.youtube.com/watch?v=_OBCvU-DECk
> It is disabled by default, but it is there.
> The assumption is that, due to internal sharding, most operations would 
> access different CFs, so in practice there would be no conflicts and no 
> concurrency overhead.
> Nevertheless, the concurrent-write flag allows those operations that do 
> need to share the same CF to do so safely.
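>
> To make it concrete: the option names below are from the RocksDB 
> documentation as I remember them, so please double-check them, but in 
> principle the flag should be settable through bluestore_rocksdb_options 
> like any other RocksDB option:
>
> [osd]
> # hypothetical sketch: enable RocksDB's concurrent memtable writes
> # (allow_concurrent_memtable_write requires a memtable implementation
> # that supports it; the default skiplist memtable does)
> bluestore_rocksdb_options = allow_concurrent_memtable_write=true,enable_write_thread_adaptive_yield=true
>
> On its own this changes nothing, of course, as long as there is only a 
> single bstore_kv_sync writer thread.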
>
> Was this option ever considered?
>
> ------------------------------------------------------------------------
> *From:* Mark Nelson <mnelson@xxxxxxxxxx>
> *Sent:* Wednesday, November 9, 2022 3:09 PM
> *To:* Eshcar Hillel <eshcarh@xxxxxxxxxx>; ceph-users@xxxxxxx 
> <ceph-users@xxxxxxx>
> *Subject:* Re:  Recent ceph.io Performance Blog Posts
> CAUTION: External Sender
> On 11/9/22 6:03 AM, Eshcar Hillel wrote:
>> Hi Mark,
>>
>> Thanks for posting these blogs. They are very interesting to read.
>> Maybe you have an answer to a question I asked in the dev list:
>>
>> We run a fio benchmark against a 3-node Ceph cluster with 96 OSDs. 
>> Objects are 4 KB. We use the gdbpmp profiler 
>> (https://github.com/markhpc/gdbpmp) to analyze the threads' behavior.
>> We discovered that the bstore_kv_sync thread is always busy, while all 
>> 16 tp_osd_tp threads are idle most of the time (waiting on a 
>> condition variable or a lock).
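>> For reference, we collect and view the traces roughly like this (the 
>> exact flags are from memory, so please check the gdbpmp README rather 
>> than trusting them blindly):
>>
>> # sample a running OSD; the pid and sample count are placeholders
>> ./gdbpmp.py -p <osd_pid> -n 1000 -o osd.gdbpmp
>> # load the saved trace and print the aggregated call tree
>> ./gdbpmp.py -i osd.gdbpmp
>>
>> We trace one OSD at a time.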
>> Given that the RocksDB CFs are sharded three ways, and that the 
>> sharding is configurable, why not run multiple (3) bstore_kv_sync 
>> threads? They would not conflict most of the time.
>> This has the potential to remove the RocksDB bottleneck and increase 
>> IOPS.
>>
>> Can you explain this design choice?
>
>
> You are absolutely correct that the bstore_kv_sync thread can often be 
> a bottleneck during 4K random writes.  Typically it's not so bad that 
> the tp_osd_tp threads are mostly blocked though (feel free to send me 
> a copy of the trace, I would be interested in seeing it). Years ago I 
> advocated for the same approach you are suggesting here.  The fear at 
> the time was that the changes inside bluestore would be too 
> disruptive.  The column family sharding approach could be (and was) 
> mostly contained to the KeyValueDB glue code.  Column family sharding 
> has been a win from the standpoint that it helps us avoid really deep 
> LSM hierarchies in RocksDB.  We tend to see faster compaction times 
> and are more likely to keep full levels on the fast device.  Sadly it 
> doesn't really help with improving metadata throughput and may even 
> introduce a small amount of overhead during the WAL flush process.  
> FWIW, slow bstore_kv_sync is one of the reasons that people will 
> sometimes run multiple OSDs on one NVMe drive (sometimes it's faster, 
> sometimes it's not).
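>
> For anyone following along, that multi-OSD-per-NVMe split is normally 
> done at deployment time with ceph-volume; a rough sketch (the device 
> path and OSD count below are just examples):
>
> # carve a single NVMe device into two OSDs
> ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1
>
> Whether it actually pays off depends heavily on the drive and on the 
> available CPU, which is exactly the "sometimes it's faster, sometimes 
> it's not" part.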
>
>
> Maybe a year ago I tried to sort of map out the changes that I thought 
> would be necessary to shard across KeyValueDBs inside bluestore 
> itself.  It didn't look impossible, but would require quite a bit of 
> work (and a bit of finesse to restructure the data path).  There's a 
> legitimate question of whether it's worth it now to make those 
> kinds of changes to bluestore or invest in crimson and seastore at 
> this point.  We ended up deciding not to pursue the changes back 
> then.  I think if we changed our minds it would most likely go into 
> some kind of experimental bluestore v2 project (along with other 
> things like hierarchical storage) so we don't screw up the existing 
> code base.
>
>
>>
>> ------------------------------------------------------------------------
>> *From:* Mark Nelson <mnelson@xxxxxxxxxx>
>> *Sent:* Tuesday, November 8, 2022 10:20 PM
>> *To:* ceph-users@xxxxxxx
>> *Subject:*  Recent ceph.io Performance Blog Posts
>> CAUTION: External Sender
>>
>> Hi Folks,
>>
>> I thought I would mention that I've released a couple of performance
>> articles on the Ceph blog recently that might be of interest to people:
>>
>>  1. https://ceph.io/en/news/blog/2022/rocksdb-tuning-deep-dive/
>>  2. https://ceph.io/en/news/blog/2022/qemu-kvm-tuning/
>>  3. https://ceph.io/en/news/blog/2022/ceph-osd-cpu-scaling/
>>
>> The first covers RocksDB tuning: how we arrived at our defaults, an
>> analysis of some common settings that have been floating around on the
>> mailing list, and potential new settings that we are considering making
>> default in the future.
>>
>> The second covers how to tune QEMU/KVM with librbd to achieve high
>> single-client performance on a small (30 OSD) NVMe backed cluster. This
>> article also covers the performance impact of enabling 128-bit AES
>> over-the-wire encryption.
>>
>> The third covers per-OSD CPU/Core scaling and the kind of IOPS/core and
>> IOPS/NVMe numbers that are achievable both on a single OSD and on a
>> larger (60 OSD) NVMe cluster. In this case there are enough clients and
>> a high enough per-client iodepth to saturate the OSD(s).
>>
>> I hope these are helpful or at least interesting!
>>
>> Thanks,
>> Mark
>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
 
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



