I cannot confirm that a larger memory target will solve the problem completely. In my case the OSDs have a 14GB memory target and I still saw a huge user IO impact during snaptrim (many slow ops the whole time). Since I set bluefs_buffered_io=true it seems to work without issue. In my cluster I don't use RGW, but I don't see why the type of access to the cluster should affect the way the kernel manages its memory. In my experience the kernel usually starts swapping because of NUMA effects and/or memory fragmentation.

Manuel

On Thu, 6 Aug 2020 15:06:49 -0500 Mark Nelson <mnelson@xxxxxxxxxx> wrote:

> a 2GB memory target will absolutely starve the OSDs of memory for
> rocksdb block cache, which probably explains why you are hitting the
> disk for reads and a shared page cache is helping so much. It's
> definitely more memory efficient to have a page cache scheme rather
> than having more cache for each OSD, but for NVMe drives you can end
> up having more contention and overhead. For older systems with
> slower devices and lower amounts of memory the page cache is probably
> a win. FWIW, with a 4GB+ memory target I suspect you would see far
> fewer cache miss reads (but obviously you can't do that on your
> nodes).
>
>
> Mark
>
>
> On 8/6/20 1:47 PM, Vladimir Prokofev wrote:
> > In my case I only have 16GB RAM per node with 5 OSDs on each of
> > them, so I actually have to tune osd_memory_target=2147483648,
> > because with the default value of 4GB my OSD processes tend to get
> > killed by OOM. That is what I was looking into before the correct
> > solution. I disabled the osd_memory_target limitation, essentially
> > setting it back to the default 4GB - it helped in the sense that the
> > workload on the block.db device dropped significantly, but the
> > overall pattern was not the same - for example, there were still no
> > merges on the block.db device. It all came back to the usual pattern
> > with bluefs_buffered_io=true.
> > The osd_memory_target limitation was implemented somewhere around
> > the 10 -> 12 release upgrade, I think, before the memory auto-scaling
> > feature for bluestore was introduced - that's when my OSDs started
> > getting OOM-killed. They worked fine before that.
> >
> > On Thu, 6 Aug 2020 at 20:28, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> >
> >> Yeah, there are cases where enabling it will improve performance,
> >> as rocksdb can then use the page cache as a (potentially large)
> >> secondary cache beyond the block cache and avoid hitting the
> >> underlying devices for reads. Do you have a lot of spare memory
> >> for page cache on your OSD nodes? You may be able to improve the
> >> situation with bluefs_buffered_io=false by increasing the
> >> osd_memory_target, which should give the rocksdb block cache more
> >> memory to work with directly. One downside is that we currently
> >> double-cache onodes in both the rocksdb cache and the bluestore
> >> onode cache, which hurts us when memory limited. We have some
> >> experimental work that might help in this area by better balancing
> >> the bluestore onode and rocksdb block caches, but it needs to be
> >> rebased after Adam's column family sharding work.
> >>
> >> The reason we had to disable bluefs_buffered_io again was that we
> >> had users with certain RGW workloads where the kernel started
> >> swapping large amounts of memory on the OSD nodes despite
> >> seemingly having free memory available. This caused huge latency
> >> spikes and IO slowdowns (even stalls).
> >> We never noticed it in our QA test suites and it doesn't appear to
> >> happen with RBD workloads as far as I can tell, but when it does
> >> happen it's really painful.
> >>
> >>
> >> Mark
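
For reference, this is roughly how the two settings discussed in this thread are applied through the monitor config store. The values are examples only (a 4 GiB target, and osd.0 as a placeholder OSD id) - size the target to the RAM actually available on your nodes:

    # raise the per-OSD memory target (value is in bytes, ~4 GiB here)
    ceph config set osd osd_memory_target 4294967296

    # let BlueFS/rocksdb reads go through the kernel page cache
    ceph config set osd bluefs_buffered_io true

    # afterwards, check how an OSD is spending its memory budget
    # (run on the host where that OSD lives)
    ceph daemon osd.0 dump_mempools

Keep in mind that osd_memory_target is a target rather than a hard limit, so leave some headroom for the kernel and page cache, and the bluefs_buffered_io change may only take effect once the OSDs have been restarted.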