Re: block.db/block.wal device performance dropped after upgrade to 14.2.10

A 2GB memory target will absolutely starve the OSDs of memory for the rocksdb block cache, which probably explains why you are hitting the disk for reads and why a shared page cache is helping so much. It's definitely more memory efficient to have a shared page cache than to give each OSD a larger private cache, but on NVMe drives you can end up with more contention and overhead.  For older systems with slower devices and less memory, the page cache is probably a win.  FWIW, with a 4GB+ memory target I suspect you would see far fewer cache-miss reads (but obviously you can't do that on your nodes).
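
For reference, a rough sketch of how the target could be raised where the
RAM headroom exists (the value below is only an example, and either form
should do):

    # cluster-wide via the config database
    ceph config set osd osd_memory_target 4294967296    # 4 GiB

    # or statically in ceph.conf on the OSD hosts
    [osd]
        osd_memory_target = 4294967296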


Mark


On 8/6/20 1:47 PM, Vladimir Prokofev wrote:
In my case I only have 16GB RAM per node with 5 OSDs on each of them, so I
actually have to tune osd_memory_target=2147483648, because with the default
value of 4GB my OSD processes tend to get killed by the OOM killer.
That is what I was looking into before the correct solution. I
lifted the osd_memory_target limitation, essentially setting it back to the
default 4GB. It helped in the sense that the workload on the block.db device
dropped significantly, but the overall pattern was not the same - for example,
there were still no merges on the block.db device. It all came back to the
usual pattern with bluefs_buffered_io=true.
The osd_memory_target limitation was implemented somewhere around the 10 -> 12
release upgrade, I think, before the memory auto-scaling feature for bluestore
was introduced - that's when my OSDs started to get OOM-killed. They worked
fine before that.
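
In case it is useful to anyone, this is roughly how I check what a given OSD
is actually running with and where its memory goes (osd.0 is just an example
id; run the daemon command on the OSD's host):

    # effective value of the option for one OSD
    ceph config show osd.0 | grep osd_memory_target

    # mempool usage inside the OSD (bluestore caches, pglog, etc.)
    ceph daemon osd.0 dump_mempools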

On Thu, 6 Aug 2020 at 20:28, Mark Nelson <mnelson@xxxxxxxxxx> wrote:

Yeah, there are cases where enabling it will improve performance, as
rocksdb can then use the page cache as a (potentially large) secondary
cache beyond the block cache and avoid hitting the underlying devices
for reads.  Do you have a lot of spare memory for page cache on your OSD
nodes? You may be able to improve the situation with
bluefs_buffered_io=false by increasing the osd_memory_target, which
should give the rocksdb block cache more memory to work with directly.
One downside is that we currently double-cache onodes in both the
rocksdb cache and the bluestore onode cache, which hurts us when memory
limited.  We have some experimental work that might help in this area by
better balancing the bluestore onode and rocksdb block caches, but it
needs to be rebased after Adam's column family sharding work.
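
If you want to check whether reads are being served from cache or are going
down to the device, the OSD perf counters are one place to look (counter
names differ a bit between releases, so treat this only as a pointer):

    # dump all perf counters for one OSD on its host and look at the
    # bluefs / rocksdb / bluestore read and cache counters
    ceph daemon osd.0 perf dump | less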

The reason we had to disable bluefs_buffered_io again was that we had
users with certain RGW workloads where the kernel started swapping large
amounts of memory on the OSD nodes despite seemingly having free memory
available.  This caused huge latency spikes and IO slowdowns (even
stalls).  We never noticed it in our QA test suites, and it doesn't
appear to happen with RBD workloads as far as I can tell, but when it
does happen it's really painful.
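
For anyone trying to spot this, watching swap activity while the workload
runs is usually enough (nothing Ceph-specific here):

    # si/so columns show pages swapped in/out per second
    vmstat 1

    # swap used by one ceph-osd process (picks an arbitrary OSD PID)
    grep VmSwap /proc/$(pidof -s ceph-osd)/status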


Mark


On 8/6/20 6:53 AM, Manuel Lausch wrote:
Hi,

I found the reason for this behavior change.
With 14.2.10 the default value of "bluefs_buffered_io" was changed from
true to false.
https://tracker.ceph.com/issues/44818

Configuring this to true seems to solve my problems.
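
For anyone who wants to try the same, a minimal sketch (either form should
work; I'm assuming the OSDs need a restart to reliably pick up the new value):

    # config database form
    ceph config set osd bluefs_buffered_io true

    # or in ceph.conf on the OSD hosts
    [osd]
        bluefs_buffered_io = true

    # restart the OSDs afterwards
    systemctl restart ceph-osd.target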

Regards
Manuel

On Wed, 5 Aug 2020 13:30:45 +0200
Manuel Lausch <manuel.lausch@xxxxxxxx> wrote:

Hello Vladimir,

I just tested this with a single-node test cluster with 60 HDDs (3 of
them with bluestore without a separate wal and db).

With 14.2.10, I see a lot of read IOPS on the bluestore OSDs while
snaptrimming. With 14.2.9 this was not an issue.
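
The extra read load is easy to see directly on the devices while a snaptrim
is running, e.g. with iostat (the device name is just an example):

    # 1-second interval, extended stats; watch r/s and rkB/s on the OSD devices
    iostat -x 1 /dev/sdb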

I wonder if this would explain the huge amount of slow ops on my big
test cluster (44 nodes, 1056 OSDs) while snaptrimming. I
cannot test a downgrade there, because there are no packages of older
releases available for CentOS 8.

Regards
Manuel

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



