Re: block.db/block.wal device performance dropped after upgrade to 14.2.10

Yeah, there are cases where enabling it will improve performance, as rocksdb can then use the page cache as a (potentially large) secondary cache beyond the block cache and avoid hitting the underlying devices for reads.  Do you have a lot of spare memory for page cache on your OSD nodes?  You may be able to improve the situation with bluefs_buffered_io=false by increasing osd_memory_target, which should give the rocksdb block cache more memory to work with directly.  One downside is that we currently double-cache onodes in both the rocksdb cache and the bluestore onode cache, which hurts us when memory is limited.  We have some experimental work that might help in this area by better balancing the bluestore onode and rocksdb block caches, but it needs to be rebased after Adam's column family sharding work.
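(For illustration, a minimal sketch of raising the target via the centralized config database available in Nautilus; the 8 GiB value here is purely an example and should be sized to the memory actually available on your nodes:)

    # give each OSD a larger memory target so the rocksdb block
    # cache has more memory to work with (example value: 8 GiB)
    ceph config set osd osd_memory_target 8589934592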

The reason we had to disable bluefs_buffered_io again was that we had users with certain RGW workloads where the kernel started swapping large amounts of memory on the OSD nodes despite seemingly having free memory available.  This caused huge latency spikes and IO slowdowns (even stalls).  We never noticed it in our QA test suites, and it doesn't appear to happen with RBD workloads as far as I can tell, but when it does happen it's really painful.
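(A generic way to spot this pattern on an OSD node, not specific to Ceph, is to watch swap traffic next to free memory:)

    # si/so columns show pages swapped in/out per second; sustained
    # nonzero values while free memory is still reported would match
    # the behavior described above
    vmstat 5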


Mark


On 8/6/20 6:53 AM, Manuel Lausch wrote:
Hi,

I found the reason for this behavior change.
With 14.2.10 the default value of "bluefs_buffered_io" was changed from
true to false.
https://tracker.ceph.com/issues/44818

Configuring this to true seems to solve my problems.
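(For anyone else hitting this, a minimal sketch of flipping the option back via the centralized config database; as far as I know the OSDs need to be restarted for it to take effect:)

    # restore the pre-14.2.10 default behavior
    ceph config set osd bluefs_buffered_io true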

Regards
Manuel

On Wed, 5 Aug 2020 13:30:45 +0200
Manuel Lausch <manuel.lausch@xxxxxxxx> wrote:

Hello Vladimir,

I just tested this with a single-node test cluster with 60 HDDs (3 of
them with bluestore without separate wal and db).

With 14.2.10, I see a lot of read IOPS on the bluestore OSDs while
snaptrimming. With 14.2.9 this was not an issue.
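(A simple way to see this, assuming the standard sysstat tooling is installed on the OSD node; the r/s column shows per-device read IOPS while snaptrimming runs:)

    # watch extended per-device stats every 5 seconds
    iostat -x 5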

I wonder if this would explain the huge number of slow ops on my big
test cluster (44 nodes, 1056 OSDs) while snaptrimming. I
cannot test a downgrade there, because no packages of older
releases are available for CentOS 8.

Regards
Manuel

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx




