Re: bluefs_buffered_io=false performance regression

Hi Robert,


We are definitely aware of this issue.  It often appears to be related to snap trimming, and we believe it is possibly caused by excessive thrashing of the rocksdb block cache.  I suspect that enabling bluefs_buffered_io hides the issue so people don't notice the problem, but the extra buffering may also be related to the kernel swap issue we see with RGW workloads.  My recommendation: if you didn't see problems while bluefs_buffered_io was enabled, you can re-enable it and periodically check that you aren't hitting excessive kernel swap.  Unfortunately we are somewhat between a rock and a hard place on this one until we solve the root cause.
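
If it helps, a minimal sketch of re-enabling it cluster-wide via the mon config database (the OSDs still need to be restarted afterwards for the change to take effect) would look something like this:

    # store the override in the cluster configuration database
    ceph config set osd bluefs_buffered_io true

    # confirm what is stored and what a running daemon is actually using
    ceph config get osd bluefs_buffered_io
    ceph config show osd.0 | grep bluefs_buffered_io   # osd.0 is just an example daemon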


Right now we're looking at reducing thrashing in the rocksdb block cache(s) by splitting the onode and omap (and potentially pglog and allocator) data into their own distinct block caches.  My hope is that we can finesse the situation so that the system page cache is no longer required to avoid excessive reads, assuming enough memory has been assigned to the osd_memory_target.
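
For what it's worth, checking and raising the per-OSD memory budget is just a config change; the 8 GiB below is only an example figure, not a recommendation:

    # current target (4 GiB by default)
    ceph config get osd osd_memory_target

    # example only: give each OSD an 8 GiB target (value in bytes)
    ceph config set osd osd_memory_target 8589934592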


Mark


On 1/11/21 9:47 AM, Robert Sander wrote:
Hi,

bluefs_buffered_io was disabled by default in Ceph version 14.2.11.

The cluster started last year with 14.2.5 and was upgraded over the course of the year; it is now running 14.2.16.

Performance was OK at first but became abysmally bad at the end of 2020.

We checked the components, and the HDDs and SSDs seem to be fine. Single-disk benchmarks showed performance in line with the specs.
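
(The single-disk tests are not shown here; a check along the lines of the fio runs below would be one way to reproduce them. The device path, block size and runtime are placeholders, and writing to the raw device destroys any data on it.)

    # DESTRUCTIVE: writes directly to the raw device
    fio --name=seq-write --filename=/dev/sdX --rw=write --bs=4M \
        --direct=1 --ioengine=libaio --iodepth=16 --runtime=60 --time_based

    fio --name=seq-read --filename=/dev/sdX --rw=read --bs=4M \
        --direct=1 --ioengine=libaio --iodepth=16 --runtime=60 --time_based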

Today we (re-)enabled bluefs_buffered_io and restarted all OSD processes on 248 HDDs distributed over 12 nodes.
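
(A typical rolling-restart sequence for that, not necessarily exactly what we ran, and with unit names depending on the deployment, is:)

    # avoid rebalancing while OSDs are restarted
    ceph osd set noout

    # on each node in turn; wait until "ceph -s" shows all PGs
    # active+clean again before moving on to the next node
    systemctl restart ceph-osd.target

    ceph osd unset noout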

Now the benchmarks are fine again: 434 MB/s write instead of 60 MB/s, and 960 MB/s read instead of 123 MB/s.

This setting was disabled in 14.2.11 because "in some test cases it appears to cause excessive swap utilization by the linux kernel and a large negative performance impact after several hours of run time."
We will have to monitor whether this happens in our cluster. Is there any other negative side effect currently known?
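
A simple periodic check on the OSD nodes should be enough to catch the swap problem early, for example:

    # overall swap usage
    free -m

    # the si/so columns show pages swapped in/out per second
    vmstat 5 3

    # or read it straight from the kernel
    grep -E 'SwapTotal|SwapFree' /proc/meminfo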

Here are the rados bench values, first with bluefs_buffered_io=false, then with bluefs_buffered_io=true:

Metric                   false write  false seq   false rand  true write  true seq    true rand
Total time run (s)       33.081       15.8226     38.2615     30.4612     13.7628     30.1007
Total writes/reads made  490          490         2131        3308        3308        8247
Write/read size (bytes)  4194304      4194304     4194304     4194304     4194304     4194304
Object size (bytes)      4194304      4194304     4194304     4194304     4194304     4194304
Bandwidth (MB/sec)       59.2485      123.874     222.782     434.389     961.429     1095.92
Stddev bandwidth         71.3829      -           -           26.0323     -           -
Max bandwidth            264          -           -           480         -           -
Min bandwidth            0            -           -           376         -           -
Average IOPS             14           30          55          108         240         273
Stddev IOPS              17.8702      46.8659     109.374     6.50809     22.544      25.5066
Max IOPS                 66           174         415         120         280         313
Min IOPS                 0            0           0           94          184         213
Average latency (s)      1.07362      0.51453     0.28191     0.14683     0.06528     0.05719
Stddev latency (s)       2.83017      -           -           0.07368     -           -
Max latency (s)          20.71        9.53873     12.1039     0.99791     0.88676     0.99140
Min latency (s)          0.0741089    0.00343417  0.00327948  0.0751249   0.00338191  0.00325295

(- = not reported by rados bench for read benchmarks)
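
(The exact invocations are not shown above; the columns correspond to the summary that a sequence like the following prints. The pool name and the 30-second runtime are assumptions.)

    rados bench -p testpool 30 write --no-cleanup   # "testpool" is a placeholder
    rados bench -p testpool 30 seq
    rados bench -p testpool 30 rand
    rados -p testpool cleanup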

Regards

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx