Re: Very HIGH Disk I/O latency on instances

On 06/28/17 16:52, Keynes_Lee@xxxxxxxxxxx wrote:

We are using HP Helion 2.1.5 (OpenStack + Ceph).

The OpenStack version is Kilo and the Ceph version is Firefly.

The way we back up VMs is to create a snapshot with Ceph commands (rbd snap create) and then download it (rbd export).
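
For reference, that workflow with the rbd CLI looks something like this (the pool and image names here are placeholders):

    # Create a snapshot, export it to a local file, then drop the snapshot:
    rbd snap create volumes/vm-disk@backup-20170628
    rbd export volumes/vm-disk@backup-20170628 /backup/vm-disk-20170628.img
    rbd snap rm volumes/vm-disk@backup-20170628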

 

We found very high disk read/write latency while creating or deleting snapshots; it can go higher than 10,000 ms.

Even when no backup job is running, we often see latency of more than 4,000 ms.

 

Users have started to complain.

Could you please advise us on how to start troubleshooting?

 

For creating snaps and keeping them, this was marked wontfix: http://tracker.ceph.com/issues/10823

For deleting, see the recent "Snapshot removed, cluster thrashed" thread for some config to try.
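
If the config in question is the snap-trim throttling, a sketch of trying it would be something like the following (osd_snap_trim_sleep exists in Firefly; 0.1 is only a starting value, so check your release's docs):

    # At runtime, pause between snap-trim operations on every OSD:
    ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.1'

    # To make it persistent, add to the [osd] section of ceph.conf:
    [osd]
    osd snap trim sleep = 0.1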

I find this to be a very severe problem, and you haven't even seen the worst of it: make more snapshots and many operations (resize, clone, snap revert, etc.) get slower and slower. (A fully flattened image, as seen by a client, usually seems about as fast as normal, though.)
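
That also suggests a workaround when a chain of clones is the problem: flatten the clone so the client no longer traverses the parent chain. For example (the image name is a placeholder):

    # A "parent:" line in the output means the image is still a clone:
    rbd info volumes/vm-disk-clone

    # Copy all blocks up from the parent and detach the clone:
    rbd flatten volumes/vm-disk-clone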

Let's pool some money together as a reward for making snapshots work properly, the modern way, like on ZFS and Btrfs, where they don't have to copy so much: they "redirect on write" rather than literally "copy on write". (What would be a good way to pool money like that?) If others are interested, I certainly am, but I would have to ask the boss about the money. Even if it's only for BlueStore, and therefore only for future releases, that's OK with me. And if it keeps the copy on the same OSD/filesystem as the original, that is acceptable too.


Quoting https://storageswiss.com/2016/04/01/snapshot-101-copy-on-write-vs-redirect-on-write/:
Consider a copy-on-write system, which copies any blocks before they are overwritten with new information (i.e. it copies on writes). In other words, if a block in a protected entity is to be modified, the system will copy that block to a separate snapshot area before it is overwritten with the new information. This approach requires three I/O operations for each write: one read and two writes. [...] This decision process for each block also comes with some computational overhead.

A redirect-on-write system uses pointers to represent all protected entities. If a block needs modification, the storage system merely redirects the pointer for that block to another block and writes the data there. [...] There is zero computational overhead of reading a snapshot in a redirect-on-write system.

The redirect-on-write system uses 1/3 the number of I/O operations when modifying a protected block, and it uses no extra computational overhead reading a snapshot. Copy-on-write systems can therefore have a big impact on the performance of the protected entity. The more snapshots are created and the longer they are stored, the greater the impact to performance on the protected entity.
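
On RBD you can see that first-write penalty directly: benchmark an image, snapshot it, and benchmark again; the first write to each object after the snapshot has to copy the old data up. A rough sketch (the image name is a placeholder, and rbd bench-write options vary somewhat between releases):

    # Baseline: small sequential writes to a scratch image.
    rbd bench-write volumes/latency-test --io-size 4096 --io-total 104857600

    # Snapshot it, then rerun: the first write to each object now
    # pays the copy-on-write cost, so latency jumps.
    rbd snap create volumes/latency-test@cow-test
    rbd bench-write volumes/latency-test --io-size 4096 --io-total 104857600

    # Clean up the test snapshot.
    rbd snap rm volumes/latency-test@cow-test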
