In my case I only have 16GB of RAM per node with 5 OSDs on each of them, so I actually have to tune osd_memory_target=2147483648, because with the default value of 4GB my OSD processes tend to get killed by the OOM killer. That is what I was looking into before finding the correct solution: I disabled the osd_memory_target limitation, essentially setting it back to the default of 4GB. It helped in the sense that the workload on the block.db device dropped significantly, but the overall pattern was still not the same - for example, there were still no merges on the block.db device. It all came back to the usual pattern with bluefs_buffered_io=true.

The osd_memory_target limitation was put in place somewhere around the 10 -> 12 release upgrade, I think, before the memory auto-scaling feature for bluestore was introduced - that's when my OSDs started to get OOM-killed. They worked fine before that.
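For reference, this is roughly what the relevant settings look like - a sketch only, either in ceph.conf or via the mon config database. The 2 GiB value is simply what fits my 16GB-RAM / 5-OSD nodes, not a general recommendation, and bluefs_buffered_io may need an OSD restart to take effect:

    # ceph.conf on the OSD nodes
    [osd]
        # roughly 2 GiB per OSD so that 5 OSDs plus page cache fit into 16GB of RAM
        osd_memory_target = 2147483648
        # back to the pre-14.2.10 default so rocksdb reads can hit the page cache
        bluefs_buffered_io = true

    # or the equivalent via the config database:
    ceph config set osd osd_memory_target 2147483648
    ceph config set osd bluefs_buffered_io true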
On Thu, 6 Aug 2020 at 20:28, Mark Nelson <mnelson@xxxxxxxxxx> wrote:

> Yeah, there are cases where enabling it will improve performance, as
> rocksdb can then use the page cache as a (potentially large) secondary
> cache beyond the block cache and avoid hitting the underlying devices
> for reads. Do you have a lot of spare memory for page cache on your OSD
> nodes? You may be able to improve the situation with
> bluefs_buffered_io=false by increasing the osd_memory_target, which
> should give the rocksdb block cache more memory to work with directly.
> One downside is that we currently double-cache onodes in both the
> rocksdb cache and the bluestore onode cache, which hurts us when memory
> limited. We have some experimental work that might help in this area by
> better balancing the bluestore onode and rocksdb block caches, but it
> needs to be rebased after Adam's column family sharding work.
>
> The reason we had to disable bluefs_buffered_io again was that we had
> users with certain RGW workloads where the kernel started swapping large
> amounts of memory on the OSD nodes despite seemingly having free memory
> available. This caused huge latency spikes and IO slowdowns (even
> stalls). We never noticed it in our QA test suites and it doesn't
> appear to happen with RBD workloads as far as I can tell, but when it
> does happen it's really painful.
>
>
> Mark
>
>
> On 8/6/20 6:53 AM, Manuel Lausch wrote:
> > Hi,
> >
> > I found the reason for this behavior change.
> > With 14.2.10 the default value of "bluefs_buffered_io" was changed from
> > true to false.
> > https://tracker.ceph.com/issues/44818
> >
> > Configuring this to true, my problems seem to be solved.
> >
> > Regards
> > Manuel
> >
> > On Wed, 5 Aug 2020 13:30:45 +0200
> > Manuel Lausch <manuel.lausch@xxxxxxxx> wrote:
> >
> >> Hello Vladimir,
> >>
> >> I just tested this with a single-node test cluster with 60 HDDs (3 of
> >> them with bluestore, without separate wal and db).
> >>
> >> With 14.2.10 I see a lot of read IOPS on the bluestore OSDs while
> >> snaptrimming. With 14.2.9 this was not an issue.
> >>
> >> I wonder if this would explain the huge amount of slow ops on my big
> >> test cluster (44 nodes, 1056 OSDs) while snaptrimming. I cannot test
> >> a downgrade there, because there are no packages of older releases
> >> for CentOS 8 available.
> >>
> >> Regards
> >> Manuel
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx