Re: Poor performance for 512b aligned "partial" writes from Windows guests in OpenStack + potential fix

Hmm, it looks like diskpart is off too; it reports the same about a volume 
for which fsutil fsinfo ntfsinfo c: reports 512 (in this case correctly, 
because it is on an SSD).
Does anyone know how to use fsutil with a disk mounted on a path (without 
a drive letter)?


-----Original Message-----
From: Marc Roos 
Sent: donderdag 16 mei 2019 13:46
To: aderumier; trent.lloyd
Cc: ceph-users
Subject: Re:  Poor performance for 512b aligned "partial" 
writes from Windows guests in OpenStack + potential fix


I am not sure if it is possible to run fsutil on a disk that has no drive 
letter but is mounted on a path. 
So I used:
diskpart
select volume 3
Filesystems

That gives me this: 
Current File System

  Type                 : NTFS
  Allocation Unit Size : 4096
  Flags : 00000000

File Systems Supported for Formatting

  Type                 : NTFS (Default)
  Allocation Unit Sizes: 512, 1024, 2048, 4096 (Default), 8192, 16K, 32K, 64K

  Type                 : FAT32
  Allocation Unit Sizes: 4096, 8192 (Default), 16K, 32K, 64K

  Type                 : REFS
  Allocation Unit Sizes: 4096 (Default), 64K

So it looks like it detects 4k correctly? But I do not have the <blockio 
logical_block_size='512' physical_block_size='4096'/> element in the 
libvirt disk definition, and the WD drive is 512e:

[@c01 ~]# smartctl -a /dev/sdb | grep 'Sector Size'
Sector Sizes:     512 bytes logical, 4096 bytes physical

CentOS Linux release 7.6.1810 (Core)
ceph version 12.2.12
libvirt-4.5.0



-----Original Message-----
From: Trent Lloyd [mailto:trent.lloyd@xxxxxxxxxxxxx]
Sent: donderdag 16 mei 2019 9:57
To: Alexandre DERUMIER
Cc: ceph-users
Subject: Re:  Poor performance for 512b aligned "partial" 
writes from Windows guests in OpenStack + potential fix

For libvirt VMs, first you need to add "<blockio 
logical_block_size='512' physical_block_size='4096'/>" to the relevant 
<disk> sections, and then stop/start the VM to apply the change.
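For reference, a sketch of what the relevant <disk> section might look 
like with the hint applied (the RBD pool/volume name and monitor host 
below are placeholders, not values from this thread):

```xml
<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='writeback'/>
  <!-- placeholder pool/volume and monitor host -->
  <source protocol='rbd' name='volumes/volume-0001'>
    <host name='mon1.example.com' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
  <!-- the hint discussed above: 512-byte logical (512e), 4K physical -->
  <blockio logical_block_size='512' physical_block_size='4096'/>
</disk>
```

You can add this via "virsh edit <domain>"; note that a full domain 
stop/start is needed, since a reboot initiated from inside the guest 
does not re-read the XML.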

Then you need to make sure your VirtIO drivers (the Fedora/Red Hat 
variety anyway) are from late 2018 or so. A bug fixed around July 2018 
meant that, before that date, the physical_block_size=4096 parameter was 
not used by the Windows VirtIO driver (it was supposed to be, but did 
not work).

Relevant links:
https://bugzilla.redhat.com/show_bug.cgi?id=1428641
https://github.com/virtio-win/kvm-guest-drivers-windows/pull/312 

After that, you can check whether Windows is correctly recognizing the 
physical block size: start cmd.exe with "Run as administrator", then run 
fsutil fsinfo ntfsinfo c:

It should show "Bytes Per Physical Sector : 4096"



Lastly, at least for Windows itself, this makes it do 4096-byte writes 
"most of the time"; however, some applications, including Exchange, have 
special handling of the sector size. I'm not really sure how MSSQL 
handles it, for example; it may or may not work correctly if you switch 
to 4096 bytes after installation - you may have to create new data files 
or something for it to do 4k segments - or not. Hopefully the MSSQL 
documentation has some information about that.

It is also possible to set logical_block_size=4096 as well as 
physical_block_size=4096 ("4k native"); however, this absolutely causes 
problems with some software (e.g. Exchange) if you convert an existing 
installation between the two. If you try to use 4k native mode, ideally 
you would do a fresh install to avoid any such issues. Or again, refer 
to the docs and test it. Just beware that it may cause issues if you try 
to switch to 4k native.

As a final note, you can use this tool to process an OSD log with "debug 
filestore = 10" enabled; it will print out how many of the operations 
were unaligned:
https://github.com/lathiat/ceph-tools/blob/master/fstore_op_latency.rb


You can just enable debug filestore = 10 dynamically on one OSD for about 
5 minutes, turn it off, and process the log - and you could compare 
before/after. Unfortunately I haven't written an equivalent tool for 
BlueStore, if you are already in the modern world :) I also didn't 
check whether debug osd or something similar also logs the writes and 
offsets, which would let me write a generic tool to cover both cases, 
but I have not done that either.



Hope that helps.

Regards,
Trent

On Thu, 16 May 2019 at 14:52, Alexandre DERUMIER <aderumier@xxxxxxxxx>
wrote:


	Many thanks for the analysis !
	
	
	I'm going to test with 4K on heavy mssql database to see if I'm 
seeing improvement on ios/latency.
	I'll report results in this thread.
	
	
	----- Mail original -----
	De: "Trent Lloyd" <trent.lloyd@xxxxxxxxxxxxx>
	À: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
	Envoyé: Vendredi 10 Mai 2019 09:59:39
	Objet:  Poor performance for 512b aligned "partial" 
writes from Windows guests in OpenStack + potential fix
	
	I recently was investigating a performance problem for a reasonably 
sized OpenStack deployment with around 220 OSDs (3.5" 7200 RPM SAS 
HDDs) with NVMe journals. The primary workload is Windows guests backed 
by Cinder RBD volumes. 
	This specific deployment runs Ceph Jewel (FileStore + 
SimpleMessenger); while that is EOL, the issue is reproducible on 
current versions, and also on BlueStore, although for different reasons 
than on FileStore. 
	
	Generally the Ceph cluster was suffering from very poor outlier 
performance. The numbers change a little bit depending on the exact 
situation, but roughly 80% of I/O was happening in a "reasonable" time 
of 0-200ms, while 5-20% of I/O operations were taking excessively long, 
anywhere from 500ms up to 10-20+ seconds. However, the commit and apply 
latency metrics were normal, and in fact this latency was hard to spot 
in the performance metrics available in Jewel. 
	
	Previously I had, somewhat simplistically, considered FileStore to 
have a "commit" (to journal) stage, where the write goes to the journal 
and it is OK to return to the client, and then an "apply" (to disk) 
stage, where it is flushed to disk and confirmed so that the data can 
be purged from the journal. However, there is really a third stage in 
the middle, where FileStore submits the I/O to the operating system, 
and this is done before the lock on the object is released. Until that 
succeeds, another operation cannot write to the same object (generally 
a 4MB area of the disk). 
	
	I found that the fstore_op threads would get stuck for hundreds of 
milliseconds or more inside pwritev(), blocking inside the kernel. 
Normally we expect pwritev() to be buffered I/O into the page cache 
that returns quite fast; however, in this case the kernel was, in a few 
percent of cases, blocking with the stack trace included at the end of 
the e-mail [1]. My finding from that stack is that inside 
__block_write_begin_int we see a call to out_of_line_wait_on_bit, which 
is really an inlined call to wait_on_buffer, occurring in 
linux/fs/buffer.c around lines 2000-2024 with the comment "If we issued 
read requests - let them complete." 
(https://github.com/torvalds/linux/blob/a2d635decbfa9c1e4ae15cb05b68b2559f7f827c/fs/buffer.c#L2002) 
	
	My interpretation of that code is that for Linux to store a write 
in the page cache, it has to have the entire 4K page, as that is the 
granularity at which it tracks the dirty state, and it needs the entire 
4K page to later submit back to the disk. Since we wrote only part of 
the page, and the page wasn't already in the cache, it has to fetch the 
remainder of the page from the disk. When this happens, it blocks 
waiting for this read to complete before returning from the pwritev() 
call - hence our normally buffered write blocks. This holds up the 
tp_fstore_op thread, of which there are (by default) only 2-4, trying 
to process several hundred operations per second. Additionally, the 
size of the osd_op_queue is bounded, and operations do not clear out of 
this queue until the tp_fstore_op thread is done - which ultimately 
means that not only are these partial writes delayed, but they knock on 
to delay other writes behind them because of the constrained thread 
pools. 
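	The read-modify-write condition described above is easy to state 
precisely. A minimal sketch (illustrative only, not Ceph or kernel 
code), assuming the kernel's 4K page granularity:

```python
PAGE_SIZE = 4096  # granularity at which Linux tracks dirty page state


def is_partial(offset, length):
    """True if a write does not cover whole 4K pages, so any touched page
    not already cached must first be read from disk (read-modify-write),
    and pwritev() blocks until that read completes."""
    return offset % PAGE_SIZE != 0 or length % PAGE_SIZE != 0


# A 512-byte write from a 512e Windows guest is always partial:
assert is_partial(512, 512)
# A 4K-aligned, 4K-sized write dirties whole pages and needs no read:
assert not is_partial(8192, 4096)
```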
	
	What was further confusing is that I could easily reproduce this 
in a test deployment using an rbd benchmark that was writing to a total 
disk size of only 256MB, which I would easily have expected to fit in 
the page cache: 
	rbd create -p rbd --size=256M bench2 
	rbd bench-write -p rbd bench2 --io-size 512 --io-threads 256 
--io-total 256M --io-pattern rand 
	
	This is explained by the fact that on secondary OSDs (at least - 
there was some refactoring of fadvise which I have not fully understood 
yet), FileStore issues fadvise FADV_DONTNEED on the objects after a 
write, which causes the kernel to immediately discard them from the 
page cache without any regard to how recently or frequently they were 
used. The motivation for this appears to be that on a secondary OSD we 
do not service reads (only writes), and so we can optimize memory usage 
by throwing away this object, in theory leaving more room in the page 
cache for objects which we are primary for and expect to actually 
service client reads from. Unfortunately this behavior does not take 
partial writes into account: we now pathologically throw away the 
cached copy instantly, so that a write even 1 second later has to fetch 
the page from disk again. I also found that this FADV_DONTNEED is 
issued not only during filestore sync but also by the WBThrottle - 
which, as this cluster was quite busy, was constantly flushing writes, 
leading to the cache being discarded almost instantly. 
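	The kernel interface involved can be demonstrated from Python; 
this is just a sketch of the fadvise call itself, not FileStore's 
actual code path:

```python
import os
import tempfile

# Write a small file, then advise the kernel to drop it from the page
# cache - roughly what FileStore does to objects on secondary OSDs.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"x" * 8192)
    os.fsync(fd)  # DONTNEED only drops clean pages, so flush dirty data first
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)  # len 0 = whole file
    # A subsequent sub-page write to this file must now re-read the
    # affected page from disk before it can complete.
finally:
    os.close(fd)
    os.remove(path)
```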
	
	Changing filestore_fadvise to false on this cluster led to a 
significant performance increase, as pages could now be cached in 
memory in many cases. The number of reads from disk was reduced from 
around 40/second to 2/second, and the number of slow write operations 
(>200ms) was reduced by 75%. 
	
	I wrote a script to parse ceph-osd logs with debug_filestore=10 or 
15 to report the time spent inside write(), as well as to count and 
report the operations that are unaligned and also slow. It's a bit 
rough, but you can find it here: 
https://github.com/lathiat/ceph-tools/blob/master/fstore_op_latency.rb 

	
	It does not solve the problem entirely, in that a filestore 
thread can still be blocked when a page is not cached - but at least 
the pathological case of never having it in the cache is removed. 
Understanding this problem, I looked at the situation for BlueStore. 
BlueStore suffers from a similar issue, in that the performance is 
quite poor due both to fadvise and to the fact that it checksums the 
data in 4k blocks, so it needs to read the rest of the block in, 
despite not having the limitations of the Linux page cache to deal 
with. I have not yet fully investigated the BlueStore implementation, 
other than to note the following doc talking about how such writes are 
handled and a possible future improvement to submit partial writes into 
the WAL before reading the rest of the block, which is apparently not 
done currently (and would be a great optimization): 
http://docs.ceph.com/docs/mimic/dev/bluestore/ 
	
	
	Moving on to a full solution for this issue: we can tell Windows 
guests to send 4k-aligned I/O where possible by setting the 
physical_block_size hint on the disk. This support was added mainly for 
the incoming new series of hard drives which also have 4k blocks 
internally and need to do a similar read-modify-write operation when a 
smaller write is done. In this case Windows tries to align the I/O to 
4k as much as possible; at the most basic level, for example, when a 
new file is created it will pad the write out to the nearest 4k block 
boundary. You can read more about support for that here: 
	https://support.microsoft.com/en-au/help/2510009/microsoft-support-policy-for-4k-sector-hard-drives-in-windows 
	
	In a basic test - booting a Windows 2016 instance and then 
installing several months of Windows updates - the number of partial 
writes was reduced from 23% (753090 / 3229597) to 1.8% (54535 / 
2880217), and many of the remaining partial writes happened during 
early boot and do not re-occur once the VM is running. 
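	The percentages quoted above can be checked directly from the raw 
counts:

```python
# Fraction of partial (unaligned) writes, from the counts in the test above.
before = 753090 / 3229597  # before the physical_block_size hint
after = 54535 / 2880217    # after physical_block_size=4096

assert 0.23 < before < 0.24   # ~23%
assert 0.018 < after < 0.020  # ~1.8-1.9%
```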
	
	I have submitted a patch to the OpenStack Cinder RBD driver to 
support setting this parameter. You can find that here: 
	https://review.opendev.org/#/c/658283/ 
	
	I did not have much luck finding information about any of this 
online when I searched, so this e-mail serves largely to document my 
findings for others. But I am also looking for input from anyone on 
anything I may have missed, confirmation that my analysis is sound, 
review of my Cinder patch, etc. 
	
	There is also likely scope to make this same patch to report a 
physical_block_size=4096 on other Ceph consumers such as the new(ish) 
iSCSI gateway, etc. 
	
	Regards, 
	Trent 
	
	
	[1] fstore_op pwritev blocking stack trace - if anyone is 
interested in the perf data, flamegraph, etc - I'd be happy to share. 
	
	tp_fstore_op 
	ceph::buffer::list::write_fd 
	pwritev64 
	entry_SYSCALL_64_after_hwframe 
	do_syscall_64 
	sys_pwritev 
	do_pwritev 
	vfs_writev 
	do_iter_write 
	do_iter_readv_writev 
	xfs_file_write_iter 
	xfs_file_buffered_aio_write 
	iomap_file_buffered_write 
	iomap_apply 
	iomap_write_actor 
	iomap_write_begin.constprop.18 
	__block_write_begin_int 
	out_of_line_wait_on_bit 
	__wait_on_bit 
	bit_wait_io 
	io_schedule 
	schedule 
	__schedule 
	finish_task_switch 
	
	_______________________________________________ 
	ceph-users mailing list 
	ceph-users@xxxxxxxxxxxxxx 
	http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
	
	





