Re: block.db/block.wal device performance dropped after upgrade to 14.2.10


 



It's quite possible that the issue is really about rocksdb living on top of bluefs with bluefs_buffered_io enabled and rgw causing a ton of OMAP traffic.  rgw is the only case so far where the issue has shown up, but it was significant enough that we didn't feel like we could leave bluefs_buffered_io enabled.  In your case with a 14GB target per OSD, do you still see significantly increased disk reads with bluefs_buffered_io=false?
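In case it helps, a rough sketch for checking and flipping the option (osd.0 is just an example daemon ID, and depending on the release the change may only take effect after an OSD restart):

    # what a running OSD is actually using right now
    ceph daemon osd.0 config get bluefs_buffered_io

    # set it for all OSDs via the config database
    ceph config set osd bluefs_buffered_io false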


Mark


On 8/7/20 2:27 AM, Manuel Lausch wrote:
I cannot confirm that a larger memory target will solve the problem
completely. In my case the OSDs have a 14GB memory target and I still had a
huge user IO impact during snaptrim (many slow ops the whole time). Since
I set bluefs_buffered_io=true it seems to work without issues.
In my cluster I don't use rgw, but I don't see why
different types of access to the cluster should affect the way the kernel
manages its memory. In my experience the kernel usually begins to swap
because of NUMA effects and/or memory fragmentation.
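A few quick checks for that on the OSD nodes (standard Linux tools, nothing Ceph-specific):

    # swap-in/swap-out activity shows up in the si/so columns
    vmstat 1

    # per-NUMA-node memory usage (from the numactl package)
    numastat -m

    # free pages per order; few high-order pages points at fragmentation
    cat /proc/buddyinfo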


Manuel

On Thu, 6 Aug 2020 15:06:49 -0500
Mark Nelson <mnelson@xxxxxxxxxx> wrote:

A 2GB memory target will absolutely starve the OSDs of memory for the
rocksdb block cache, which probably explains why you are hitting the
disk for reads and why a shared page cache is helping so much. It's
definitely more memory efficient to have a page cache scheme rather
than more cache for each OSD, but for NVMe drives you can end
up with more contention and overhead.  For older systems with
slower devices and lower amounts of memory the page cache is probably
a win.  FWIW, with a 4GB+ memory target I suspect you would see far
fewer cache-miss reads (but obviously you can't do that on your
nodes).
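For reference, raising the target could look roughly like this (4 GiB is just an example value and only makes sense if the node has the memory to spare):

    # 4 GiB for all OSDs via the config database
    ceph config set osd osd_memory_target 4294967296

    # or for a single OSD, e.g. osd.5
    ceph config set osd.5 osd_memory_target 4294967296

    # check what a running daemon picked up
    ceph daemon osd.5 config get osd_memory_target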


Mark


On 8/6/20 1:47 PM, Vladimir Prokofev wrote:
In my case I only have 16GB RAM per node with 5 OSDs on each of
them, so I actually have to set osd_memory_target=2147483648,
because with the default value of 4GB my OSD processes tend to get
killed by the OOM killer. That is what I was looking into before finding
the correct solution. I removed the osd_memory_target limitation,
essentially setting it back to the default 4GB
- it helped in the sense that the workload on the block.db device
dropped significantly, but the overall pattern was still not the same - for
example, there were still no merges on the block.db device. Everything
came back to the usual pattern with bluefs_buffered_io=true.
The osd_memory_target limitation was implemented somewhere around the
10 -> 12 release upgrade, I think, before the memory auto-scaling feature
for bluestore was introduced - that's when my OSDs started to get
OOM-killed. They worked fine before that.
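As an aside, the read load and the request merges on the block.db device can be watched with iostat; the device name below is just a placeholder:

    # extended device stats every 5 seconds
    # r/s and rkB/s show the read load, rrqm/s the merged read requests
    iostat -x 5 /dev/nvme0n1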

On Thu, 6 Aug 2020 at 20:28, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
Yeah, there are cases where enabling it will improve performance, as
rocksdb can then use the page cache as a (potentially large)
secondary cache beyond the block cache and avoid hitting the
underlying devices for reads.  Do you have a lot of spare memory
for page cache on your OSD nodes? You may be able to improve the
situation with bluefs_buffered_io=false by increasing the
osd_memory_target, which should give the rocksdb block cache more
memory to work with directly. One downside is that we currently
double-cache onodes in both the rocksdb block cache and the bluestore
onode cache, which hurts us when memory is limited.  We have some
experimental work that might help in this area by better balancing the
bluestore onode and rocksdb block caches, but it needs to be
rebased after Adam's column family sharding work.
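If it's useful, how the memory is actually being split up on a running OSD can be inspected via the admin socket (the exact pool and counter names vary a bit between releases):

    # per-mempool accounting, including the bluestore caches
    ceph daemon osd.0 dump_mempools

    # broader perf counters for the OSD, cache stats included
    ceph daemon osd.0 perf dump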

The reason we had to disable bluefs_buffered_io again was that we
had users with certain RGW workloads where the kernel started
swapping large amounts of memory on the OSD nodes despite
seemingly having free memory available.  This caused huge latency
spikes and IO slowdowns (even stalls).  We never noticed it in our
QA test suites and it doesn't appear to happen with RBD workloads
as far as I can tell, but when it does happen it's really painful.


Mark
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



