I cannot confirm that a larger memory target will solve the problem completely. In my case the OSDs have a 14GB memory target and I still saw a huge user IO impact during snaptrim (many slow ops the whole time). Since I set bluefs_buffered_io=true it seems to work without issue. In my cluster I don't use RGW, but I don't see why the type of access to the cluster should affect the way the kernel manages its memory. In my experience the kernel usually starts swapping because of NUMA effects and/or memory fragmentation.

Manuel

On Thu, 6 Aug 2020 15:06:49 -0500 Mark Nelson <mnelson@xxxxxxxxxx> wrote:

> a 2GB memory target will absolutely starve the OSDs of memory for
> rocksdb block cache, which probably explains why you are hitting the
> disk for reads and a shared page cache is helping so much. It's
> definitely more memory efficient to have a page cache scheme rather
> than having more cache for each OSD, but for NVMe drives you can end
> up having more contention and overhead. For older systems with
> slower devices and lower amounts of memory the page cache is probably
> a win. FWIW, with a 4GB+ memory target I suspect you would see far
> fewer cache miss reads (but obviously you can't do that on your
> nodes).
>
>
> Mark
>
>
> On 8/6/20 1:47 PM, Vladimir Prokofev wrote:
> > In my case I only have 16GB RAM per node with 5 OSDs on each of
> > them, so I actually have to tune osd_memory_target=2147483648,
> > because with the default value of 4GB my OSD processes tend to get
> > killed by OOM. That is what I was looking into before the correct
> > solution. I disabled the osd_memory_target limitation, essentially
> > setting it back to the default 4GB - it helped in the sense that the
> > workload on the block.db device dropped significantly, but the
> > overall pattern was not the same - for example, there were still no
> > merges on the block.db device. It all came back to the usual pattern
> > with bluefs_buffered_io=true.
> > The osd_memory_target limitation was implemented somewhere around
> > the 10 -> 12 release upgrade, I think, before the memory auto-scaling
> > feature for bluestore was introduced - that's when my OSDs started
> > getting OOM-killed. They worked fine before that.
> >
> > On Thu, 6 Aug 2020 at 20:28, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> >
> >> Yeah, there are cases where enabling it will improve performance,
> >> as rocksdb can then use the page cache as a (potentially large)
> >> secondary cache beyond the block cache and avoid hitting the
> >> underlying devices for reads. Do you have a lot of spare memory
> >> for page cache on your OSD nodes? You may be able to improve the
> >> situation with bluefs_buffered_io=false by increasing the
> >> osd_memory_target, which should give the rocksdb block cache more
> >> memory to work with directly. One downside is that we currently
> >> double-cache onodes in both the rocksdb cache and the bluestore
> >> onode cache, which hurts us when memory limited. We have some
> >> experimental work that might help in this area by better balancing
> >> the bluestore onode and rocksdb block caches, but it needs to be
> >> rebased after Adam's column family sharding work.
> >>
> >> The reason we had to disable bluefs_buffered_io again was that we
> >> had users with certain RGW workloads where the kernel started
> >> swapping large amounts of memory on the OSD nodes despite
> >> seemingly having free memory available. This caused huge latency
> >> spikes and IO slowdowns (even stalls).
> >> We never noticed it in our QA test suites and it doesn't appear to
> >> happen with RBD workloads as far as I can tell, but when it does
> >> happen it's really painful.
> >>
> >>
> >> Mark
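
For reference, this is roughly how the two settings discussed in this thread are applied through the monitor config store. The values are examples only (a 4 GiB target, and osd.0 as a placeholder OSD id) - size the target to the RAM actually available on your nodes:

    # raise the per-OSD memory target (value is in bytes, ~4 GiB here)
    ceph config set osd osd_memory_target 4294967296

    # let BlueFS/rocksdb reads go through the kernel page cache
    ceph config set osd bluefs_buffered_io true

    # afterwards, check how an OSD is spending its memory budget
    # (run on the host where that OSD lives)
    ceph daemon osd.0 dump_mempools

Keep in mind that osd_memory_target is a target rather than a hard limit, so leave some headroom for the kernel and page cache, and the bluefs_buffered_io change may only take effect once the OSDs have been restarted.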