On Wed, 1 Apr 2015 16:40:57 -0700 Shaohua Li <shli@xxxxxx> wrote:

> > Your code does avoid write-hole-protection for full-stripe-writes, and
> > this would greatly reduce the number of blocks that were written
> > multiple times.  However I'm not convinced that is correct.
> > A reasonable goal is that if the system crashes while writing to a
> > storage device, then reads should return the old data or the new data,
> > not anything else.  A crash in the middle of a full-stripe-write to a
> > degraded array could result in some block in the stripe appearing to
> > contain data that is different to both the old and the new.  If you are
> > going to close the hole, I think it should be done properly.
>
> I can do it simply.  But I don't think this assumption is true.  If you
> write to a disk range and there is a failure, there is nothing to
> guarantee that you can read either the old data or the new data.

If you write a range of blocks to a normal disk and crash during the write,
each block will contain either the old data or the new data.
If you write a range to a degraded RAID5 and crash during the write, you
cannot make that same guarantee.

I don't know how important this is, but then I don't really know how
important any of this is.

> > A combined log would "simply" involve writing every data block and
> > every computed parity block (with index information) to the log device.
> > Replaying the log would collect data blocks and flush out those in a
> > stripe once the parity block(s) for that stripe became available.
> >
> > I think this would actually turn into a fairly simple logging mechanism.
>
> It's not simple at all.  It's unlikely we write data and parity
> contiguously on disk and at the same time.  This will make log
> checkpointing fairly complex.

I don't see any cause for complexity.  Let me be more explicit.

I imagine that all data remains in the stripe cache, in memory, until it is
finally written to the RAID5.  So the stripe cache will need to be quite a
bit bigger.

Every time we get a block that we want to write, either a new data block or
a computed parity block, we queue it to the log.  The log works like this:

 - take the first (e.g.) 256 blocks in the queue, create a header to
   describe them, write the header with FUA, then write all the data
   blocks.  If there are fewer than 256, just write what we have.
 - when the header write completes, all blocks written *previously* are
   now safe and we can call bio_end_io on data or unlock the stripe for
   parity.
 - loop back and write some more blocks.  If there are no blocks to write,
   write a header which describes an empty set of blocks, and wait for more
   blocks to appear.

Each stripe_head needs to track (roughly) where the relevant blocks were
written so it can release them when the stripe is written.  I would
conceptually divide the log into 32 regions and keep a 32-bit number with
each stripe.  When a block is assigned to a region in the log, the relevant
bit is set for the stripe and a per-region counter is incremented.  When a
stripe completes its write to the array, the counter for each region whose
bit is set is decremented and the stripe's bits are cleared.  The log
cannot progress into a region which has a non-zero counter.

We choose the size of transactions so that the first block of each region
is a header block.  These contain a magic number, a sequence number, and a
checksum, together with the addresses of the data/parity blocks.  On
restart we read all 32 of these to find out where the log starts and ends.
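To make the header and region bookkeeping concrete, here is a rough sketch
in C of one way it could look.  This is purely illustrative; the structure
names, field sizes and on-disk layout are invented for the example, not
taken from any actual md code.

	/* Illustrative sketch only -- names, sizes and layout are invented. */
	#include <stdint.h>

	#define LOG_MAGIC          0x6c6f6735u
	#define LOG_REGIONS        32    /* log conceptually divided into 32 regions */
	#define BLOCKS_PER_HEADER  256   /* at most 256 data/parity blocks per header */

	/* One entry per data or parity block that follows the header in the log. */
	struct log_block_entry {
		uint64_t raid_sector;    /* where the block belongs on the array */
		uint32_t flags;          /* data or parity, target device, etc. */
	};

	/*
	 * Header block, written with FUA.  The first block of every region is
	 * a header, so on restart all 32 of them can be read to find where the
	 * log starts and ends (highest valid sequence number wins).
	 */
	struct log_header {
		uint32_t magic;          /* LOG_MAGIC */
		uint32_t checksum;       /* over the header contents */
		uint64_t seq;            /* monotonically increasing sequence number */
		uint32_t nr_blocks;      /* 0..BLOCKS_PER_HEADER; 0 = empty transaction */
		struct log_block_entry entries[BLOCKS_PER_HEADER];
	};

	/* Per-stripe state: which log regions still hold blocks for this stripe. */
	struct stripe_log_state {
		uint32_t region_bits;    /* bit i set => stripe has blocks in region i */
	};

	/* How many stripes still reference each region of the log. */
	static unsigned int region_pending[LOG_REGIONS];

	/* A block belonging to this stripe has been queued into log region 'r'. */
	static void log_block_queued(struct stripe_log_state *st, unsigned int r)
	{
		if (!(st->region_bits & (1u << r))) {
			st->region_bits |= 1u << r;
			region_pending[r]++;   /* first block of this stripe in r */
		}
	}

	/* The stripe has been written to the RAID5; its log space may be reused. */
	static void stripe_written_to_array(struct stripe_log_state *st)
	{
		for (unsigned int r = 0; r < LOG_REGIONS; r++)
			if (st->region_bits & (1u << r))
				region_pending[r]--;
		st->region_bits = 0;
	}

	/* The log head must not advance into a region that is still referenced. */
	static int region_reusable(unsigned int r)
	{
		return region_pending[r] == 0;
	}

Real code would of course tie this into the stripe_head and the bio
completion paths, handle log wrap-around, and protect the counters with a
lock.  On crash recovery we would read all 32 region headers, keep the ones
with a valid magic and checksum, and order them by sequence number to find
the live part of the log.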
Then we replay all the blocks into the stripe cache, discarding any that
don't come with the required parity blocks.

So it is a very simple log which is never read except on crash recovery.
It commits everything ASAP so that the writeout to the array can be lazy
and can gather related blocks and sort by address etc. with no impact on
filesystem latency.

Does that make sense?

NeilBrown