Re: Poor performance for 512b aligned "partial" writes from Windows guests in OpenStack + potential fix

Yes, we recommend this as a precaution to get the best possible I/O performance across all workloads and usage scenarios. 512e brings no advantage and in some cases can even be a performance disadvantage. By the way, 4kN and 512e drives cost exactly the same at our dealers.

Whether the underlying physical disks really make a difference for virtual disks in any individual case, I can't say.

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.verges@xxxxxxxx
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Fri, 10 May 2019 at 10:54, Trent Lloyd <trent.lloyd@xxxxxxxxxxxxx> wrote:
Note that the issue I am talking about here is how a "virtual" Ceph RBD disk is presented to a virtual guest, and specifically for Windows guests (Linux guests are not affected). I am not at all talking about how the physical disks are presented to Ceph itself (although Martin was; it wasn't clear whether changing those underlying physical disks to 4kN was for Ceph or for other environments).

I would not expect presenting your underlying physical disks to Ceph itself as 512b/512e or 4kN to have a significant impact on performance, for the reason that Linux systems generally send 4k-aligned I/O anyway (regardless of what the underlying disk reports for physical_block_size). There may be some exceptions to that, such as applications performing direct I/O to the disk. If anyone knows otherwise, it would be great to hear specific details.
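
For reference, a Linux system (guest or host) can check the block size hints a disk is reporting via sysfs; a quick sketch, with the device name purely illustrative:

cat /sys/block/vda/queue/logical_block_size
cat /sys/block/vda/queue/physical_block_size
cat /sys/block/vda/queue/minimum_io_size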

Regards,
Trent

On Fri, May 10, 2019 at 4:40 PM Marc Roos <M.Roos@xxxxxxxxxxxxxxxxx> wrote:
 
Hmmm, so if I have (WD) drives that list this in smartctl output, I
should try to reformat them to 4k, which will give me better
performance?

Sector Sizes:     512 bytes logical, 4096 bytes physical

Do you have a link to this download? I can only find some .cz site with
the RPMs.


-----Original Message-----
From: Martin Verges [mailto:martin.verges@xxxxxxxx]
Sent: Friday, 10 May 2019 10:21
To: Trent Lloyd
Cc: ceph-users
Subject: Re: Poor performance for 512b aligned "partial"
writes from Windows guests in OpenStack + potential fix

Hello Trent,

Many thanks for the insights. We always suggest that our users use
4kN over 512e HDDs.

As we recently found out, WD Support offers a tool called HUGO that
can reformat 512e drives to 4kN in seconds with "hugo format -m
<model_number> -n max --fastformat -b 4096".
Maybe that helps someone who has bought the wrong disk.
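
After such a reformat, the new sector sizes should be visible again in
smartctl or lsblk; a quick check, with the device name purely
illustrative:

lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sdX
smartctl -i /dev/sdX | grep 'Sector Size'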

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.verges@xxxxxxxx
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492 Com. register: Amtsgericht
Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx



On Fri, 10 May 2019 at 10:00, Trent Lloyd
<trent.lloyd@xxxxxxxxxxxxx> wrote:


        I was recently investigating a performance problem for a reasonably
sized OpenStack deployment with around 220 OSDs (3.5" 7200 RPM SAS
HDD) with NVMe journals. The primary workload is Windows guests backed
by Cinder RBD volumes.
        This specific deployment is Ceph Jewel (FileStore +
SimpleMessenger); while that is EOL, the issue is reproducible on
current versions and also on BlueStore, although for different reasons
than on FileStore.


        Generally the Ceph cluster was suffering from very poor outlier
performance. The numbers change a little depending on the exact
situation, but roughly 80% of I/O was happening in a "reasonable" time
of 0-200ms, while 5-20% of I/O operations were taking excessively long,
anywhere from 500ms through to 10-20+ seconds. However, the commit and
apply latency metrics looked normal, and in fact this latency was hard
to spot in the performance metrics available in Jewel.

        Previously I had a simpler model of FileStore: a "commit" (to
journal) stage, where the write goes to the journal and it is OK to
return to the client, and then an "apply" (to disk) stage, where it is
flushed to disk and confirmed so that the data can be purged from the
journal. However, there is really a third stage in the middle, where
FileStore submits the I/O to the operating system, and this is done
before the lock on the object is released. Until that succeeds, another
operation cannot write to the same object (generally a 4MB area of the
disk).

        I found that the fstore_op threads would get stuck for hundreds of
milliseconds or more inside pwritev(), which was blocking inside the
kernel. Normally we expect pwritev() to be buffered I/O into the page
cache and to return quite fast; however, in this case the kernel was in
a few percent of cases blocking, with the stack trace included at the
end of the e-mail [1]. My finding from that stack is that inside
__block_write_begin_int we see a call to out_of_line_wait_on_bit, which
is really an inlined call to wait_on_buffer and occurs in
linux/fs/buffer.c in the section around line 2000-2024 with the comment
"If we issued read requests - let them complete."
(https://github.com/torvalds/linux/blob/a2d635decbfa9c1e4ae15cb05b68b2559f7f827c/fs/buffer.c#L2002)

        My interpretation of that code is that for Linux to store a write
in the page cache, it has to have the entire 4K page, as that is the
granularity at which it tracks the dirty state and it needs the entire
4K page to later submit back to the disk. Since we wrote only part of
the page, and the page wasn't already in the cache, it has to fetch the
remainder of the page from the disk. When this happens, it blocks
waiting for this read to complete before returning from the pwritev()
call - hence our normally buffered write blocks. This holds up the
tp_fstore_op thread, of which there are (by default) only 2-4 such
threads trying to process several hundred operations per second.
Additionally, the size of the osd_op_queue is bounded, and operations
do not clear out of this queue until the tp_fstore_op thread is done,
which ultimately means that not only are these partial writes delayed,
but they knock on to delay other writes behind them because of the
constrained thread pools.

        What was further confusing is that I could easily reproduce this in
a test deployment using an rbd benchmark that was only writing to a
total disk size of 256MB, which I would easily have expected to fit in
the page cache:

        rbd create -p rbd --size=256M bench2
        rbd bench-write -p rbd bench2 --io-size 512 --io-threads 256
--io-total 256M --io-pattern rand
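
        One way to see this in action (a sketch only - which disks to watch
depends on your deployment) is to run iostat against the OSD data disks
while the above write-only benchmark runs; the unexpected reads caused
by the partial writes should show up on disks that are only receiving
writes:

        iostat -x 1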

        This is explained by the fact that on secondary OSDs (at least;
there was some refactoring of fadvise which I have not fully understood
as of yet), FileStore uses fadvise FADVISE_DONTNEED on the objects
after a write, which causes the kernel to immediately discard them from
the page cache without any regard to their statistics of being
recently/frequently used. The motivation for this addition appears to
be that on a secondary OSD we don't service reads (only writes), and so
therefore we can optimize memory usage by throwing away this object,
in theory leaving more room in the page cache for objects which we are
primary for and expect to actually service reads from a client for.
Unfortunately this behavior does not take partial writes into account:
we now pathologically throw away the cached copy instantly, such that
a write even 1 second later will have to fetch the page from disk
again. I also found that this FADVISE_DONTNEED is issued not only
during FileStore sync but also by the WBThrottle - which, as this
cluster was quite busy, was constantly flushing writes, leading to the
cache being discarded almost instantly.
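
        A rough way to observe this (a sketch only - the OSD path is
illustrative, and fincore needs a reasonably recent util-linux) is to
check page-cache residency of recently written object files on a
secondary OSD; with fadvise active I would expect most of them to show
close to zero resident pages even though they were just written:

        find /var/lib/ceph/osd/ceph-2/current -type f -mmin -1 | xargs -r fincore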

        Changing filestore_fadvise to false on this cluster led to a
significant performance increase, as it could now cache the pages in
memory in many cases. The number of reads from disk was reduced from
around 40/second to 2/second, and the number of slow write operations
(>200ms) was reduced by 75%.
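
        For anyone wanting to experiment with the same change, this is
roughly how it can be set (a sketch; depending on the release, some
options only take effect after an OSD restart):

        # ceph.conf on the OSD hosts
        [osd]
        filestore fadvise = false

        # or at runtime
        ceph tell osd.* injectargs '--filestore_fadvise false'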

        I wrote a script to parse ceph-osd logs with debug_filestore=10 or
15 to report the time spent inside write() as well as to count and
report the number of operations that are unaligned and also slow.
It's a bit rough but you can find it here:
https://github.com/lathiat/ceph-tools/blob/master/fstore_op_latency.rb
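
        Typical usage would be something along these lines (the exact
invocation of the script is illustrative):

        ceph tell osd.0 injectargs '--debug_filestore 10'
        # ... let the workload run for a while, then:
        ./fstore_op_latency.rb /var/log/ceph/ceph-osd.0.log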

        It does not solve the problem entirely, in that a filestore thread
can still be blocked in the case where the page is not cached - but at
least the pathological case of never having it in the cache is removed.

        With this problem understood, I looked at the situation for
BlueStore. BlueStore suffers from a similar issue in that performance
is quite poor, due both to fadvise and to the fact that it checksums
the data in 4k blocks and so needs to read the rest of the block in,
despite not having the limitations of the Linux page cache to deal
with. I have not yet fully investigated the BlueStore implementation,
other than to note the following doc talking about how such writes are
handled and a possible future improvement to submit partial writes into
the WAL before reading the rest of the block, which is apparently not
done currently (and would be a great optimization):
http://docs.ceph.com/docs/mimic/dev/bluestore/



        Moving on to a full solution for this issue: we can tell Windows
guests to send 4k-aligned I/O where possible by setting the
physical_block_size hint on the disk. This support was added mainly for
the incoming new generations of hard drives which also have 4k blocks
internally, and which need to do a similar read-modify-write operation
when a smaller write is done. In this case Windows tries to align the
I/O to 4k as much as possible; at the most basic level, for example,
when a new file is created it will pad out the write to the nearest 4k
block. You can read more about support for that here:
        https://support.microsoft.com/en-au/help/2510009/microsoft-support-policy-for-4k-sector-hard-drives-in-windows
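
        For context, this is roughly what that hint ends up looking like in
the libvirt domain XML for the guest's disk (a sketch only - the disk
definition is abbreviated and the names are placeholders):

        <disk type='network' device='disk'>
          <driver name='qemu' type='raw' cache='writeback'/>
          <source protocol='rbd' name='volumes/volume-XXXX'/>
          <target dev='vdb' bus='virtio'/>
          <blockio logical_block_size='512' physical_block_size='4096'/>
        </disk>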


        On a basic test, booting a Windows 2016 instance and then
installing several months of Windows Updates, the number of partial
writes was reduced from 23% (753090 / 3229597) to 1.8% (54535 /
2880217) - many of which were during early boot and don't recur once
the VM is running.

        I have submitted a patch to the OpenStack Cinder RBD driver to
support setting this parameter. You can find that here:
        https://review.opendev.org/#/c/658283/


        I did not have much luck finding information about any of this
online when I searched, so this e-mail serves largely to document my
findings for others. But I am also looking for input from anyone on
anything I have missed, confirmation that my analysis is sound, review
of my Cinder patch, etc.

        There is also likely scope to make the same change, reporting
physical_block_size=4096, in other Ceph consumers such as the new(ish)
iSCSI gateway, etc.

        Regards,
        Trent


        [1] fstore_op pwritev blocking stack trace - if anyone is
interested in the perf data, flamegraph, etc., I'd be happy to share.

        tp_fstore_op

        ceph::buffer::list::write_fd
        pwritev64
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        sys_pwritev
        do_pwritev
        vfs_writev
        do_iter_write
        do_iter_readv_writev
        xfs_file_write_iter
        xfs_file_buffered_aio_write
        iomap_file_buffered_write
        iomap_apply
        iomap_write_actor
        iomap_write_begin.constprop.18
        __block_write_begin_int
        out_of_line_wait_on_bit
        __wait_on_bit
        bit_wait_io
        io_schedule
        schedule
        __schedule
        finish_task_switch



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
