* Requirements

There are two distinct requirements for zeroing applicable to the
thin-provisioning target:

- Avoid data leaks (DATA_LEAK)

  Consumers of thin devices may be using the same pool, eg, two
  different vm guests provisioned from the same pool.  We must ensure
  that no data from one thin device appears on another in newly
  provisioned areas.

- Provide guarantees about the presence/absence of sensitive data (ERASE)

  eg, when decommissioning a guest vm, the host (running the pool)
  wishes to guarantee that no data from that guest remains on the data
  device.

* Implementing DATA_LEAK

Currently the DATA_LEAK requirement is enforced by zeroing every newly
provisioned thin device block.  This zeroing can often be elided if the
write io that triggers the provisioning completely covers the block in
question.  Zeroing can be turned off at the pool level if data leaks
are not a concern (eg, a desktop system).

Already upstream.

* Implementing ERASE

** Erase on deallocation

The ERASE requirement is more difficult.  Zeroing data blocks when they
are deallocated (ie. their reference count drops to zero after deleting
the device) sounds like a good approach, but it introduces some
difficulties, mainly:

- To retain our crash recovery properties, the zeroing cannot occur
  until after the next commit.  Extra on-disk metadata would need to be
  stored to keep track of the blocks that still need zeroing.

- A commit would trigger a storm of io; currently the cost of a
  copy-on-write exception is paid immediately by the io that triggers
  it.  Building up delayed work like this makes it very hard to give
  performance estimates.

** Erasing from userland

Zeroing all unshared blocks when deleting a thin device will create a
lot of io; I'd much rather this was managed by userland.  I really
don't want message ioctls that take many minutes to complete.

The 'held_root' facility allows userland to read a snapshot of the
metadata in a live pool.  [Note: this is a snapshot of the metadata,
not a snapshot of the data.]

The following will implement ERASE in userland:

- Deactivate all thin volumes that you wish to erase.  Failure to
  deactivate would mean the mappings for the thins could be out of date
  by the time userland reads them.  There is no mechanism for enforcing
  this at the device-mapper level, but userland can easily do it (eg,
  lvm2 already has a comprehensive locking scheme that will handle
  this).  It should also be pointed out that if you try to erase a
  volume while you're still using it, you are an idiot.

- Grab a 'held_root'.

- Read the mappings for all of the thins you wish to erase.

- Work out which data blocks are used exclusively by this subset of
  thins.

- Write zeroes across those blocks.

- Send a thin-delete message to the pool for each thin.

** Crash recovery

If we crash during the copy-on-write or provision operation, the
recovery process needs to zero those new, but not yet committed,
blocks.  This requires the introduction of an 'erase log' to the
metadata format.  The log would need to be committed *before* the
copy/overwrite operation could proceed.

I've implemented such an erase log [see patch] to get an idea of the
performance overhead.  Testing in ideal conditions (ie. large writes
that trigger many provision/copy operations, so costs can be
amortised), we see a 25% slowdown in the throughput of provision/copy
operations.  Better than I feared.

We can improve performance significantly by observing that it's
harmless to do too much zeroing of unprovisioned blocks on recovery.
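For illustration only, here is a minimal userland-style sketch of what
a recovery pass over such a per-block erase log could look like.  The
log representation (a flat array of block numbers), the block size and
the zero_block() helper are all invented for the example; the log in
the patch actually lives in the pool metadata.

    /*
     * Sketch only: zero the data blocks recorded in a committed erase log.
     * Log format and block size are assumptions for the example.
     */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define DATA_BLOCK_SIZE (64 * 1024)	/* assumed pool block size */

    static int zero_block(int data_fd, uint64_t block)
    {
    	static char zeroes[DATA_BLOCK_SIZE];	/* zero filled by the C runtime */
    	off_t offset = (off_t) block * DATA_BLOCK_SIZE;

    	if (pwrite(data_fd, zeroes, DATA_BLOCK_SIZE, offset) != DATA_BLOCK_SIZE) {
    		perror("pwrite");
    		return -1;
    	}

    	return 0;
    }

    /*
     * Over-zeroing is harmless: a logged block that never actually got
     * provisioned before the crash just gets zeroed redundantly.
     */
    int replay_erase_log(int data_fd, const uint64_t *log, unsigned count)
    {
    	unsigned i;

    	for (i = 0; i < count; i++)
    		if (zero_block(data_fd, log[i]))
    			return -1;

    	return fsync(data_fd);
    }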
That observation suggests a scheme similar to a mirror log, where we
mark regions of the data volume that have pending provision/copy
operations.  On recovery we just zero *all* unallocated blocks in the
marked regions.  This will result in fewer commits, since newly
allocated blocks will commonly come from the same region and so avoid
the need for a fresh commit.

[TODO: get a proof of concept patch together]

** Discards

DISCARDs *must* result in data being zeroed.  Some devices set the
discard_zeroes_data flag.  This is not good enough; you cannot use this
flag as a guarantee that the data no longer exists on the disk.  So
real zeroing must occur.

I suggest we write a separate target that zeroes data just before
discarding it, and stack it under the thin-pool.  The performance
impact of this will be significant; to the point that we may wish to
turn discard off within the fs, doing periodic tidy-ups instead.

** Avoid redundant copying

The calculation that says whether a block is shared or not (and thus
liable to suffer a copy-on-write exception) is an approximation.  It
sometimes says something is shared when it isn't, which causes us a
problem wrt ERASE.  To avoid leaving orphaned copies of data, we must
either tighten up the sharing detection [patch in the works], or zero
the old block (via discard).

** Summary of work items [0/6]

Too much for the linux 3.4 timeframe.

- [ ] Change the shared block detection [1 day, worth doing anyway]
- [ ] Bitmap based erase log [1 week]
- [ ] Recovery tool that zeroes unallocated blocks in dirty regions [1 week]
- [ ] Implement the discard-really-zeroes target [1 month]
- [ ] Write thin_erase userland tool [1 week]
- [ ] Update lvm2 tools [3 months]

Rough sketches of the last three items follow.
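First, the recovery tool for the region-based scheme.  This is a hedged
sketch only: the region size, the two bitmaps and the reuse of the
zero_block() helper from the earlier sketch are assumptions; in
practice the dirty-region bitmap would come from the on-disk log and
the allocation bitmap from the pool metadata's space maps.

    /*
     * Sketch only: zero every data block that lies in a dirty region and
     * is not allocated according to the metadata.  Over-zeroing
     * unallocated blocks is harmless, so the coarse region granularity
     * costs nothing in correctness.
     */
    #include <stdint.h>
    #include <unistd.h>

    #define BLOCKS_PER_REGION 1024	/* assumed tunable of the log format */

    int zero_block(int data_fd, uint64_t block);	/* as in the previous sketch */

    static int test_bit(const uint64_t *bits, uint64_t n)
    {
    	return (bits[n / 64] >> (n % 64)) & 1;
    }

    /*
     * dirty_regions: one bit per region, set if a provision/copy was pending.
     * allocated:     one bit per data block, from the pool's space map.
     */
    int zero_dirty_regions(int data_fd,
    		       const uint64_t *dirty_regions, uint64_t nr_regions,
    		       const uint64_t *allocated, uint64_t nr_data_blocks)
    {
    	uint64_t r, b;

    	for (r = 0; r < nr_regions; r++) {
    		if (!test_bit(dirty_regions, r))
    			continue;

    		for (b = r * BLOCKS_PER_REGION;
    		     b < (r + 1) * BLOCKS_PER_REGION && b < nr_data_blocks; b++)
    			if (!test_bit(allocated, b))
    				if (zero_block(data_fd, b))
    					return -1;
    	}

    	return fsync(data_fd);
    }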
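The discard-really-zeroes target itself is kernel code stacked under
the thin-pool; as a rough userland illustration of the semantics it
would provide (zero first, then pass the discard down), here is a
sketch using the BLKDISCARD ioctl.  The chunk size is arbitrary and the
function name is made up for the example.

    /*
     * Sketch only: userland illustration of "zero, then discard".
     * The proposed target would do the equivalent in the bio path.
     */
    #include <fcntl.h>
    #include <linux/fs.h>		/* BLKDISCARD */
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    #define ZERO_CHUNK (1024 * 1024)	/* arbitrary write size for the example */

    int zero_then_discard(int fd, uint64_t start, uint64_t len)
    {
    	static char zeroes[ZERO_CHUNK];
    	uint64_t range[2] = { start, len };
    	uint64_t off, n;

    	/* The zeroing is what actually guarantees the data is gone... */
    	for (off = 0; off < len; off += n) {
    		n = len - off > ZERO_CHUNK ? ZERO_CHUNK : len - off;
    		if (pwrite(fd, zeroes, n, start + off) != (ssize_t) n)
    			return -1;
    	}
    	if (fsync(fd))
    		return -1;

    	/* ...the discard just hands the space back to the device. */
    	return ioctl(fd, BLKDISCARD, &range);
    }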
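Finally, the core of a thin_erase tool is just a set calculation over
the mappings read through the held_root: a data block may be zeroed
only if it is mapped by one of the thins being erased and by nothing
else.  A hedged sketch, with the mapping representation (flat arrays of
data block numbers per device) invented for the example:

    /*
     * Sketch only: work out which data blocks are exclusive to the thins
     * being erased.  The mappings would really come from the held_root
     * metadata snapshot.
     */
    #include <stdint.h>
    #include <stdlib.h>

    struct thin_mapping {
    	const uint64_t *data_blocks;
    	uint64_t nr_blocks;
    };

    static void set_bit(uint64_t *bits, uint64_t n)
    {
    	bits[n / 64] |= 1ull << (n % 64);
    }

    static void clear_bit(uint64_t *bits, uint64_t n)
    {
    	bits[n / 64] &= ~(1ull << (n % 64));
    }

    /*
     * Returns a bitmap (one bit per pool data block) of blocks used only
     * by the victims; these are the blocks thin_erase must zero before
     * sending the delete messages.  Caller frees.
     */
    uint64_t *exclusive_blocks(const struct thin_mapping *victims, unsigned nr_victims,
    			   const struct thin_mapping *others, unsigned nr_others,
    			   uint64_t nr_data_blocks)
    {
    	uint64_t words = (nr_data_blocks + 63) / 64;
    	uint64_t *exclusive = calloc(words, sizeof(uint64_t));
    	unsigned i;
    	uint64_t j;

    	if (!exclusive)
    		return NULL;

    	/* Every block mapped by a thin we are erasing... */
    	for (i = 0; i < nr_victims; i++)
    		for (j = 0; j < victims[i].nr_blocks; j++)
    			set_bit(exclusive, victims[i].data_blocks[j]);

    	/* ...minus anything still mapped by a surviving thin. */
    	for (i = 0; i < nr_others; i++)
    		for (j = 0; j < others[i].nr_blocks; j++)
    			clear_bit(exclusive, others[i].data_blocks[j]);

    	return exclusive;
    }

The zeroing of those blocks and the per-thin delete messages then
follow the userland steps listed earlier.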