As Joe mentioned last week, we've been tossing around some ideas for changes and bug-fixes to the DM snapshot code. So here's a first crack at a to-do list. Perhaps Joe can put a copy of this on his web site. Anyone else with comments or ideas, feel free to add to this list.

Bug Fixes
=========

1. Reads to the snapshot

Currently, a read for the snapshot is only submitted to the cow device when there's a completed-exception. If there's a pending-exception, the request is still sent to the origin device. Instead, the request should be queued on the pending-exception, just like the write requests.

2. Registering the snapshot

A snapshot is "registered" when it is added to the list of snapshots for the desired origin device. Currently, this happens during snapshot_ctr(). However, the cow metadata isn't read from disk until snapshot_resume(). We should move the registration to snapshot_resume(), after the metadata is read.

3. Multiple snapshots

There are a couple of very subtle race conditions involving the origin_bhs queue in the pending-exceptions. When you have multiple snapshots of one origin and a new exception is triggered by a write to the origin, each snapshot gets its own pending-exception. However, the origin write request must be queued on only one of those PEs. Depending on the order in which the PEs complete, the origin write requests may have to be moved from one PE to another. Currently, there isn't any locking around this moving of queues.

The problem is: where does the locking belong? Each PE is basically independent, so under the current method you'd need to lock multiple PEs, which could introduce possible deadlocks. You might be able to avoid the race with one global spinlock, but that's just gross. Ideally, we need a data structure to represent the origin device, where we can put a queue for origin I/O requests.

4. Reads before writes

Currently, we don't track reads on the snapshot that are submitted to the origin device.
This leads to a theoretical race condition if an origin-write-request is received after a snapshot-read has already been submitted to the origin device. The fix would be to delay starting any new pending-exceptions until all outstanding reads are complete. However, it seems unlikely that this race condition would occur in practice. In order for it to happen, the snapshot-read would have to be delayed through the entire process of copying a chunk from the origin to the cow device and updating the cow metadata.

Architecture Changes
====================

1. Exception-handling code

It would be nice to separate the exception-handling code from the snapshot-specific code. The exception-handling code is what maintains the tables of pending and completed exceptions, and does the high-level work of processing a pending-exception. The snapshot-specific code simply uses the exception table to determine which device (origin or cow) to submit a request to, and whether to create a new exception.

The exception-handling code could be re-used by a bad-block module. All the code that manages the tables and performs the copies is basically the same as for snapshotting. The only differences are how to decide when a new exception needs to be created, and what to do with an I/O request for a remapped chunk.

I started working on these changes back in September, and I had gotten as far as running some simple tests on the new snapshot code. However, I got busy with some other stuff and haven't looked at it in a couple of months. I'm hoping to get back to it in the near future.

2. Exception table cache

The in-memory exception tables that track which chunks have been remapped from the origin to the snapshot can get quite large. Each entry in the table requires 24 bytes of kernel memory (assuming 64-bit sector values).
Using an example of a 100 GB origin volume, a 5 GB cow device (5% of the origin size), and a chunk-size of 16 kB: the snapshot can hold 327680 chunks, for a total of 7.5 MB of kernel memory by the time the snapshot fills up. It would be nice to create a cache for the exception table, so the entire table doesn't have to be kept in memory at all times.

-- 
Kevin Corry
kevcorry@xxxxxxxxxx
http://evms.sourceforge.net/