G'day,

A number of people have already seen this; I'm posting for wider
comment and to move some interesting discussion to a public list.
I'll apologise in advance for the talk about SGI technologies
(including proprietary ones), but all the problems mentioned apply to
in-tree technologies too.

This proposal seeks to solve three problems in our NAS server product
due to the interaction of the filesystem (XFS) and the block-based
snapshot feature (XVM snapshot).  It's based on discussions held with
various people over the last few weeks, including Roger Strassburg,
Christoph Hellwig, David Chinner, and Donald Douwsma.

a) The first problem is the server's behaviour when a filesystem
which is subject to snapshot is written to and the snapshot
repository runs out of room.  The failure mode can be quite severe.
XFS issues a metadata write to the block device, triggering a
Copy-On-Write operation in the XVM snapshot element, which fails
with EIO because the repository is full.  When XFS sees the failure
it shuts down the filesystem.  All subsequent attempts to perform IO
to the filesystem block indefinitely.  In particular, any NFS server
thread will block and never reply to the NFS client.  The NFS client
will retry, causing another NFS server thread to block, and so on
until every NFS server thread is blocked.  At this point all NFS
service for all filesystems ceases.

See PV 958220 and PV 958140 for a description of this problem and
some of the approaches which have been discussed for resolving it.

b) The second problem is that certain common combinations of
filesystem operations can waste large amounts of space in the XVM
snapshot repository.  Examples include writing the same file twice
with dd, or writing a new file and then deleting it.  The cause is
the inability of the XVM snapshot code to free regions of the
snapshot repository that are no longer in use by the filesystem;
this information is simply not available within the block layer.

Note that problem b) also contributes to problem a) by increasing
repository usage and thus making it easier to encounter an
out-of-space condition on the repository.

c) The third problem is an unfortunate interaction between an XFS
internal log and block snapshots.  The log is a fixed region of the
block device which is written as a side effect of a great many
different filesystem operations.  The information written there has
no value and is not even read until and unless log recovery needs to
be performed after the server has crashed.  This means the log does
not need to be preserved by the block snapshot feature (because at
the point in time when the snapshot is taken, log recovery must
already have happened).  In fact the correct procedure when mounting
a read-only snapshot is to use the "norecovery" option to prevent
any attempt to read the log (although the NAS server software
doesn't actually do this).

However, because the block device layer doesn't have enough
information to know any better, the first pass of writes to the log
after a snapshot is taken is subjected to Copy-On-Write.  This has
two undesirable effects.  Firstly, it increases the amount of
snapshot repository space used by each snapshot, thus contributing
to problem a).  Secondly, it puts a significant performance penalty
on filesystem metadata operations for some time after each snapshot
is taken; given that the NAS server can be configured to take
regular frequent snapshots, this may mean all of the time.
An obvious solution is to use an external XFS log, but this is quite
inconvenient for the NAS server software to arrange.  For one thing,
we would need to construct a separate external log device for the
main filesystem and one for each mounted snapshot.

Note that these problems are not specific to XVM but will be
encountered by any Linux block-COWing snapshot implementation.  For
example, the DM snapshot implementation is documented to suffer from
problem a).  From linux/Documentation/device-mapper/snapshot.txt:

> <COW device> will often be smaller than the origin and if it
> fills up the snapshot will become useless and be disabled,
> returning errors.  So it is important to monitor the amount of
> free space and expand the <COW device> before it fills up.

During discussions, it became clear that we could solve all three of
these problems by improving the block device interface to allow a
filesystem to provide the block device with dynamic block usage
hints.  For example, when unlinking a file the filesystem could give
the block device a hint of the form "I'm about to stop using these
blocks".  Most block devices would silently ignore such hints, but a
snapshot COW implementation (the "copy-on-write" XVM element or the
"snapshot-origin" dm target) could use them to help avoid these
problems.  For example, its response to the "I'm about to stop using
these blocks" hint could be to free the space used in the snapshot
repository for now-unnecessary copies of those blocks.

Of course, snapshot COW elements may be part of more generic element
trees.  In general there may be more than one consumer of block
usage hints in a given filesystem's element tree, and their
locations in that tree are not predictable.  This means the block
extents mentioned in the usage hints need to be subject to the block
mapping algorithms provided by the element tree.  As those
algorithms are currently implemented using bio mapping and
splitting, the easiest and simplest way to reuse them is to add new
bio flags.

First we need a mechanism to indicate that a bio is a hint rather
than a real IO.  Perhaps the easiest way is to add a new flag to the
bi_rw field:

#define BIO_RW_HINT	5	/* bio is a hint not a real io; no pages */

We'll also need a field to tell us which kind of hint the bio
represents.  Perhaps a new field could be added, or perhaps the top
16 bits of bi_rw (currently used to encode the bio's priority, which
has no meaning for hint bios) could be reused.  The latter approach
may allow hints to be used without modifying the bio structure or
any code that uses it, other than the filesystem and the snapshot
implementation.  Such a property would have obvious advantages for
our NAS server software, where the XFS and XVM modules are provided
by SGI but the other users of struct bio are stock SLES code.

Next we'll need three bio hint types with the following semantics.

BIO_HINT_ALLOCATE
    The bio's block extent will soon be written by the filesystem,
    and any COW that may be necessary to achieve that should begin
    now.  If the COW is going to fail, the bio should fail.  Note
    that this provides a way for the filesystem to manage when and
    how failures to COW are reported.

BIO_HINT_RELEASE
    The bio's block extent is no longer in use by the filesystem and
    will not be read in the future.  Any storage used to back the
    extent may be released without any threat to filesystem or data
    integrity.

BIO_HINT_DONTCOW
    (the Bart Simpson BIO).  The bio's block extent is not needed in
    mounted snapshots and does not need to be subjected to COW.
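To make the bit layout concrete, here's a rough sketch of what the
definitions and accessors might look like, assuming the 2.6-era bio
interface.  None of these names exist in any kernel tree yet: the
BIO_HINT_* values, BIO_HINT_SHIFT, and the two helpers below are all
invented for illustration.

/*
 * Hypothetical sketch only.  The hint type lives in the top 16 bits
 * of bi_rw, which a hint bio would otherwise waste on an io
 * priority it has no use for.
 */
#define BIO_RW_HINT		5	/* bio is a hint not a real io; no pages */
#define BIO_HINT_SHIFT		(BITS_PER_LONG - 16)

#define BIO_HINT_ALLOCATE	1	/* extent will be written soon: COW now */
#define BIO_HINT_RELEASE	2	/* extent no longer used by the fs */
#define BIO_HINT_DONTCOW	3	/* extent never needs to be COWed */

static inline int bio_hint(struct bio *bio)
{
	if (!(bio->bi_rw & (1 << BIO_RW_HINT)))
		return 0;			/* a real io, not a hint */
	return (int)(bio->bi_rw >> BIO_HINT_SHIFT);
}

static inline void bio_set_hint(struct bio *bio, int hint)
{
	bio->bi_rw &= ~(~0UL << BIO_HINT_SHIFT);	/* clear priority bits */
	bio->bi_rw |= (1 << BIO_RW_HINT) |
		      ((unsigned long)hint << BIO_HINT_SHIFT);
}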
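On the filesystem side, issuing a hint could then be a small
variation on the usual pageless-bio pattern.  Again hypothetical:
blkdev_issue_hint() and hint_end_io() are names I've made up, the
synchronous completion convention is just one possible choice, and
generic_make_request would need to learn to pass pageless hint bios
down the stack.

#include <linux/bio.h>
#include <linux/completion.h>

/*
 * Sketch: submit a pageless hint bio for a block extent and wait.
 * For BIO_HINT_ALLOCATE, a failure here is where the filesystem
 * would learn of a full repository (ideally as ENOSPC rather than
 * the bare EIO shown below).
 */
static int hint_end_io(struct bio *bio, unsigned int bytes_done, int error)
{
	if (bio->bi_size)
		return 1;
	complete((struct completion *)bio->bi_private);
	return 0;
}

static int blkdev_issue_hint(struct block_device *bdev, sector_t sector,
			     unsigned int nsect, int hint)
{
	DECLARE_COMPLETION_ONSTACK(wait);
	struct bio *bio = bio_alloc(GFP_NOIO, 0);
	int err;

	bio->bi_bdev = bdev;
	bio->bi_sector = sector;
	bio->bi_size = nsect << 9;	/* extent length; no pages attached */
	bio->bi_end_io = hint_end_io;
	bio->bi_private = &wait;
	bio_set_hint(bio, hint);

	submit_bio(WRITE, bio);		/* hints travel down the write path */
	wait_for_completion(&wait);

	err = test_bit(BIO_UPTODATE, &bio->bi_flags) ? 0 : -EIO;
	bio_put(bio);
	return err;
}

So xfs_bunmapi() could call, say, blkdev_issue_hint(bdev, start,
len, BIO_HINT_RELEASE) as it unmaps extents, and mount could issue
BIO_HINT_DONTCOW for the internal log.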
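At the snapshot end, a COW element's mapping function would dispatch
on the hint type.  Once more a sketch only: struct snapshot and the
snap_* helpers are invented stand-ins for whatever XVM or
dm-snapshot would actually use.

/*
 * Hypothetical sketch of the consuming side.
 */
static int snapshot_map_hint(struct snapshot *s, struct bio *bio)
{
	sector_t sector = bio->bi_sector;
	unsigned int nsect = bio_sectors(bio);

	switch (bio_hint(bio)) {
	case BIO_HINT_ALLOCATE:
		/* Do any needed COW right now, so that a full
		 * repository is reported here, where the filesystem
		 * can fail gracefully, not during a later metadata
		 * write. */
		return snap_cow_now(s, sector, nsect);
	case BIO_HINT_RELEASE:
		/* The fs has stopped using these blocks; any copies
		 * of them held in the repository can be freed. */
		snap_release_copies(s, sector, nsect);
		return 0;
	case BIO_HINT_DONTCOW:
		/* Mark the extent (e.g. the XFS internal log) so
		 * that later writes bypass COW entirely. */
		snap_mark_nocow(s, sector, nsect);
		return 0;
	}
	return 0;	/* unknown hints are silently ignored */
}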
Here's how these proposed hints help solve the above-mentioned
problems.

Problem a): the filesystem gives the BIO_HINT_ALLOCATE hint to the
block device when preparing to write to blocks and when allocating
blocks.  The snapshot implementation checks whether a COW is
necessary, and if so performs it immediately.  If the COW fails due
to a lack of space in the snapshot repository, the bio fails.  This
can be caught in the filesystem and reported to userspace (or the
NFS server) as ENOSPC via the existing mechanisms.  Filesystem
shutdown is no longer necessary.

Problem b) is solved by the filesystem giving the BIO_HINT_RELEASE
hint to the block device every time it unmaps blocks in
xfs_bunmapi.  The snapshot implementation can then free unnecessary
copies of those blocks.

Problem c) is solved by the filesystem giving the block device a
BIO_HINT_DONTCOW hint describing the block extent of the internal
log, at filesystem mount time.  The snapshot implementation marks
that extent, and subsequent writes to those blocks do not cause COWs
but proceed directly to the origin filesystem.

Comments?

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
Apparently, I'm Bedevere.  Which MPHG character are you?
I don't speak for SGI.