On 06/28/17 21:57, Gregory Farnum wrote:
[...] backup VMs is to create a snapshot with Ceph commands (rbd snapshot) and then download it (rbd export).
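The commands behind that are roughly the following (pool, image, and snapshot names are only examples):

  rbd snap create rbd/vm-disk-1@backup-20170628
  rbd export rbd/vm-disk-1@backup-20170628 /backup/vm-disk-1-20170628.img
  rbd snap rm rbd/vm-disk-1@backup-20170628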
We found very high disk read/write latency during snapshot creation/deletion; it can be higher than 10000 ms. Even when no backup job is running, we often see latency of more than 4000 ms. Users are starting to complain.

Could you please help us figure out how to start troubleshooting?
For creating snaps and keeping them, this was marked wontfix: http://tracker.ceph.com/issues/10823

For deleting, see the recent "Snapshot removed, cluster thrashed" thread for some config to try.
Given he says he's seeing 4 second IOs even without snapshot involvement, I think Keynes must be seeing something else in his cluster.
If you have few enough OSDs and slow enough journals that things seem OK without snaps, then with snaps it can be much worse than 4 s IOs if you have any sync-heavy clients, like ganglia.

I spent months testing many things before I figured out that it was exclusive-lock causing VMs to hang. Also, some people in the freenode IRC ##proxmox channel with cheap home Ceph setups often complain about such things.
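(If that turns out to be the problem, the feature can be turned off per image, e.g.

  rbd feature disable rbd/vm-disk-1 exclusive-lock

with the image name only an example; if object-map or fast-diff are enabled they have to be disabled first, since they depend on exclusive-lock.)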
https://storageswiss.com/2016/04/01/snapshot-101-copy-on-write-vs-redirect-on-write/
Consider a copy-on-write system, which copies any blocks before they are overwritten with new information (i.e. it copies on writes). In other words, if a block in a protected entity is to be modified, the system will copy that block to a separate snapshot area before it is overwritten with the new information. This approach requires three I/O operations for each write: one read and two writes. [...] This decision process for each block also comes with some computational overhead.

A redirect-on-write system uses pointers to represent all protected entities. If a block needs modification, the storage system merely redirects the pointer for that block to another block and writes the data there. [...] There is zero computational overhead of reading a snapshot in a redirect-on-write system.

The redirect-on-write system uses 1/3 the number of I/O operations when modifying a protected block, and it uses no extra computational overhead reading a snapshot. Copy-on-write systems can therefore have a big impact on the performance of the protected entity. The more snapshots are created and the longer they are stored, the greater the impact to performance on the protected entity.
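To put rough numbers on the article's formulation: modifying 1000 blocks of a snapshotted volume would cost about 3000 I/Os under copy-on-write (1000 reads plus 2000 writes), versus about 1000 data writes plus pointer updates under redirect-on-write.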
I wouldn't consider that a very realistic depiction of the tradeoffs involved in different snapshotting strategies [1], but BlueStore uses "redirect-on-write" under the formulation presented in those quotes. RBD clones of protected images will remain copy-on-write forever, I imagine.
-Greg
It was simply the first link I found which I could quote, but I didn't find it too bad... it just describes things as if all implementations were the same.
[1]: There's no reason to expect a copy-on-write system will first copy the original data and then overwrite it with the new data when it can simply inject the new data along the way. *Some* systems will copy the "old" block into a new location and then overwrite in the existing location (it helps prevent fragmentation), but many don't. And a "redirect-on-write" system needs to persist all those block metadata pointers, which may be much cheaper or much, much more expensive than just duplicating the blocks.
After a snap is unprotected, will the clones be redirect-on-write? Or after the image is flattened (like dd if=/dev/zero to the whole disk)?
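(By "flattened" I mean either the RBD-level flatten of a clone, e.g. rbd flatten rbd/cloned-disk, the image name being just an example, or the crude in-guest equivalent of overwriting every block with dd.)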
Are there other cases where you get copy-on-write behavior?
Glad to hear BlueStore has something better. Is that available and the default behavior on kraken (which I tested, but where it didn't seem to be fixed, although all storage backends were less prone to blocked requests on kraken)?
If it were a true redirect-on-write system, I would expect that making a snap carries just the overhead of organizing some metadata, and that after that any write simply goes to a new place as normal, without requiring the old data to be copied, ideally none of it, not even for partially written objects. I don't think I saw that behavior in my kraken tests, although performance was better (there were no blocked requests, but peak IOPS was basically the same; and I didn't measure total I/O or anything more reliable, I just looked at performance effects and blocking).