As Joe mentioned last week, we've been tossing around some ideas for changes and bug-fixes to the DM snapshot code. So here's a first crack at a to-do list. Perhaps Joe can put a copy of this on his web site. Anyone else with comments or ideas, feel free to add to this list.

Bug Fixes
=========

1. Reads to the snapshot

Currently, a read for the snapshot is only submitted to the cow device when there's a completed-exception. If there's a pending-exception, the request is still sent to the origin device. Instead, the request should be queued on the pending-exception, just like the write requests.

2. Registering the snapshot

A snapshot is "registered" when it is added to the list of snapshots for the desired origin device. Currently, this happens during snapshot_ctr(). However, the cow metadata isn't read from disk until snapshot_resume(). We should move the registration to snapshot_resume(), after the metadata is read.

3. Multiple snapshots

There are a couple of very subtle race conditions involving the origin_bhs queue in the pending-exceptions. When you have multiple snapshots of one origin and a new exception is triggered by a write to the origin, each snapshot gets its own pending-exception. However, the origin write request must be queued on only one of those PEs. Depending on the order in which the PEs complete, the origin write requests may have to be moved from one PE to another. Currently, there isn't any locking around this moving of queues.

The problem is: where does the locking belong? Each PE is basically independent, so under the current method you'd need to lock multiple PEs, which could introduce possible deadlocks. You might be able to avoid the race with one global spinlock, but that's just gross. Ideally, we need a data structure to represent the origin device, where we can put a queue for origin I/O requests.

4. Reads before writes

Currently, we don't track reads on the snapshot that are submitted to the origin device.
This leads to a theoretical race condition if an origin-write-request is received after a snapshot-read has already been submitted to the origin device. The fix would be to delay starting any new pending-exceptions until all outstanding reads are complete. However, it seems unlikely that this race condition would occur in practice. In order for it to happen, the snapshot-read would have to be delayed through the entire process of copying a chunk from the origin to the cow device and updating the cow metadata.

Architecture Changes
====================

1. Exception-handling code

It would be nice to separate the exception-handling code from the snapshot-specific code. The exception-handling code is what maintains the tables of pending and completed exceptions, and does the high-level work of processing a pending-exception. The snapshot-specific code simply uses the exception table to determine which device (origin or cow) to submit a request to, and whether to create a new exception.

The exception-handling code could be re-used by a bad-block module. All the code that manages the tables and performs the copies is basically the same as for snapshotting. The only differences are how to decide when a new exception needs to be created, and what to do with an I/O request for a remapped chunk.

I started working on these changes back in September, and I had gotten as far as running some simple tests on the new snapshot code. However, I got busy with some other stuff and haven't looked at it in a couple of months. I'm hoping to get back to it in the near future.

2. Exception table cache

The in-memory exception tables that track which chunks have been remapped from the origin to the snapshot can get quite large. Each entry in the table requires 24 bytes of kernel memory (assuming 64-bit sector values).
Using an example of a 100 GB origin volume, a 5 GB cow device (5% of the origin size), and a chunk-size of 16 kB: the snapshot can hold 327680 chunks, for a total of 7.5 MB of kernel memory by the time the snapshot fills up. It would be nice to create a cache for the exception table, so the entire table doesn't have to be kept in memory at all times.

-- 
Kevin Corry
kevcorry@xxxxxxxxxx
http://evms.sourceforge.net/