Re: Very HIGH Disk I/O latency on instances

On Thu, Jun 29, 2017 at 12:16 AM Peter Maloney <peter.maloney@xxxxxxxxxxxxxxxxxxxx> wrote:
On 06/28/17 21:57, Gregory Farnum wrote:
On Wed, Jun 28, 2017 at 9:17 AM Peter Maloney <peter.maloney@xxxxxxxxxxxxxxxxxxxx> wrote:
On 06/28/17 16:52, Keynes_Lee@xxxxxxxxxxx wrote:
[...] backup VMs is to create a snapshot with Ceph commands (rbd snapshot) and then download it (rbd export).

 

We found very high disk read/write latency while creating/deleting snapshots; it can exceed 10000 ms.

 

Even outside of backup jobs, we often see latency of more than 4000 ms.

 

Users are starting to complain.

Could you please help us with how to start troubleshooting?
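
For reference, that snapshot-then-export backup boils down to something like the following (pool, image, and snapshot names here are placeholders):

    rbd snap create rbd/vm-disk@backup-20170628                        # take a point-in-time snapshot
    rbd export rbd/vm-disk@backup-20170628 /backup/vm-disk-20170628.img  # download it to a file
    rbd snap rm rbd/vm-disk@backup-20170628                            # remove the snapshot afterwards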

 

For creating snaps and keeping them, this was marked wontfix http://tracker.ceph.com/issues/10823

For deleting, see the recent "Snapshot removed, cluster thrashed" thread for some config to try.
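
The settings discussed in that thread are the OSD snap-trim throttles; option names and sensible values vary by release, so the lines below are only an illustrative sketch, not recommendations:

    ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.1'     # pause between trim operations to throttle snapshot deletion
    ceph tell osd.* injectargs '--osd_snap_trim_priority 1'    # lower the priority of trimming relative to client I/O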

Given he says he's seeing 4 second IOs even without snapshot involvement, I think Keynes must be seeing something else in his cluster.

If you have few enough OSDs and slow enough journals that things only seem OK without snaps, then with snaps it can be much worse than 4 s IOs if you have any sync-heavy clients, like ganglia.

I spent months testing many things before I figured out that it was exclusive-lock causing the VMs to hang. Also, people in the freenode IRC ##proxmox channel with cheap home Ceph setups often complain about such things.
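
If you want to rule out exclusive-lock yourself, something along these lines works (image names are placeholders; object-map and fast-diff depend on exclusive-lock, so drop them from the command if the image doesn't have them enabled):

    rbd info rbd/vm-disk                                                   # check which features are enabled
    rbd feature disable rbd/vm-disk object-map fast-diff exclusive-lock   # disable exclusive-lock and its dependents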




https://storageswiss.com/2016/04/01/snapshot-101-copy-on-write-vs-redirect-on-write/
Consider a copy-on-write system, which copies any blocks before they are overwritten with new information (i.e. it copies on writes). In other words, if a block in a protected entity is to be modified, the system will copy that block to a separate snapshot area before it is overwritten with the new information. This approach requires three I/O operations for each write: one read and two writes. [...] This decision process for each block also comes with some computational overhead.

A redirect-on-write system uses pointers to represent all protected entities. If a block needs modification, the storage system merely redirects the pointer for that block to another block and writes the data there. [...] There is zero computational overhead of reading a snapshot in a redirect-on-write system.

The redirect-on-write system uses 1/3 the number of I/O operations when modifying a protected block, and it uses no extra computational overhead reading a snapshot. Copy-on-write systems can therefore have a big impact on the performance of the protected entity. The more snapshots are created and the longer they are stored, the greater the impact to performance on the protected entity.

I wouldn't consider that a very realistic depiction of the tradeoffs involved in different snapshotting strategies[1], but BlueStore uses "redirect-on-write" under the formulation presented in those quotes. RBD clones of protected images will remain copy-on-write forever, I imagine.
-Greg
It was simply the first link I found that I could quote, and I didn't find it too bad... it just describes things as if all implementations were the same.


[1]: There's no reason to expect a copy-on-write system will first copy the original data and then overwrite it with the new data when it can simply inject the new data along the way. *Some* systems will copy the "old" block into a new location and then overwrite in the existing location (it helps prevent fragmentation), but many don't. And a "redirect-on-write" system needs to persist all those block metadata pointers, which may be much cheaper or much, much more expensive than just duplicating the blocks.

After a snap is unprotected, will the clones be redirect-on-write? Or after the image is flattened (like dd if=/dev/zero to the whole disk)?

Are there other cases where you get a copy-on-write behavior?
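
For context, the clone lifecycle behind those questions looks roughly like this (names are placeholders):

    rbd snap create rbd/parent@base        # snapshot the parent image
    rbd snap protect rbd/parent@base       # protect the snap so it can be cloned
    rbd clone rbd/parent@base rbd/child    # the child initially references the parent's objects
    rbd flatten rbd/child                  # copy parent data into the child, removing the dependency
    rbd snap unprotect rbd/parent@base     # allowed once no clones depend on the snap
    rbd snap rm rbd/parent@base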

Glad to hear BlueStore has something better. Is that available and the default behavior on Kraken? (I tested Kraken and it didn't seem to be fixed there, although all storage backends were less prone to blocking on Kraken.)

If it were a true redirect-on-write system, I would expect that making a snap costs just the overhead of organizing some metadata, and that after that any write simply goes to a new place as normal, without the old data having to be copied, ideally none of it, not even for partially written objects. I don't think I saw that behavior in my Kraken tests, although performance was better (no blocked requests, but peak IOPS were basically the same; I didn't measure total I/O or anything else more reliable, I just looked at the performance effects and blocking).
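
A minimal way to reproduce that kind of test, assuming a throwaway image and the pre-Luminous rbd bench-write subcommand:

    rbd snap create rbd/testimg@bench      # snapshot so subsequent writes hit the copy-on-write path
    rbd bench-write rbd/testimg            # generate write load against the snapshotted image
    ceph health detail | grep -i blocked   # in another shell: look for blocked/slow requests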


Bluestore was available for dev/testing in Kraken, but not the default. I think it's going to be the default in Luminous, and yes, it's "just metadata" with new block locations for updates.

Anything involving RBD clones is fundamentally different from "normal" snapshots, though — when you clone an RBD volume, you are writing data to a completely new location so the object has to be copied when you modify that object. (The only alternative would be to keep a per-block bitmap — ie, to keep in memory a data structure roughly 1/1000 the size of your volume for every layer of cloning you have, to indicate if it's in the new overwrite location or in the parent image.)
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
