On Wed, Aug 05, 2015 at 01:07:36PM +1000, NeilBrown wrote: > On Wed, 29 Jul 2015 17:38:43 -0700 Shaohua Li <shli@xxxxxx> wrote: > > > This introduces a simple log for raid5. Data/parity writting to raid > > array first writes to the log, then write to raid array disks. If crash > > happens, we can recovery data from the log. This can speed up raid > > resync and fix write hole issue. > > > > The log structure is pretty simple. Data/meta data is stored in block > > unit, which is 4k generally. It has only one type of meta data block. > > The meta data block can track 3 types of data, stripe data, stripe > > parity and flush block. MD superblock will point to the last valid meta > > data block. Each meta data block has checksum/seq number, so recovery > > can scan the log correctly. We store a checksum of stripe data/parity to > > the metadata block, so meta data and stripe data/parity can be written > > to log disk together. otherwise, meta data write must wait till stripe > > data/parity is finished. > > > > For stripe data, meta data block will record stripe data sector and > > size. Currently the size is always 4k. This meta data record can be made > > simpler if we just fix write hole (eg, we can record data of a stripe's > > different disks together), but this format can be extended to support > > caching in the future, which must record data address/size. > > > > For stripe parity, meta data block will record stripe sector. It's size > > should be 4k (for raid5) or 8k (for raid6). We always store p parity > > first. This format should work for caching too. > > It feels a bit odd have a 8K parity block for RAID6 as it is really two > blocks: a parity block and a Q-syndrome block. What would you think of > introducing another block type for Q? So there are Data blocks, Parity > blocks, and Q blocks ??? > > Not a big issue, but I thought I would mention it. I'd prefer not adding the complexity, it's just a naming. > > > > flush block indicates a stripe is in raid array disks. Fixing write hole > > doesn't need this type of meta data, it's for caching extention. > > > > We should be careful about deadlock. Appending stripe data/parity to log > > is done in raid5d. The append need log space, which can trigger log > > reclaim, which might wait for raid5d to run (to flush data from log to > > raid array disks). Here we always do the log write in a separate thread. > > I'm not convinced about the need for a separate thread. As > raid5d/handle_stripe works as a state machine, and as all IO is async, > we should at most need an extra state, not an extra thread. > > I think a key issue is that you don't call r5l_get_reserve() until you > are about to submit writes to the log. > If instead you reserved the space in r5l_write_stripe and delay the > stripe if space is not available, there there would be no deadlock. > Then when available space crosses some threshold, re-activate those > delayed stripes. ok, that way works too. As I said before, I really hate making the log/cache stuff tie tightly together with stripe cache state machine if possible (eg, adding new state/list/stripe analysis etc). I'd rather to keep current logic if you don't strongly object. > > +#include <linux/kernel.h> > > +#include <linux/wait.h> > > +#include <linux/blkdev.h> > > +#include <linux/slab.h> > > +#include <linux/raid/md_p.h> > > +#include <linux/crc32.h> > > +#include <linux/random.h> > > +#include "md.h" > > +#include "raid5.h" > > + > > +typedef u64 r5blk_t; /* log blocks, 512B - 4k */ > > I would much rather use sector_t throughout and keep all addresses as > sector addresses. Much less room for confusion that way. > > If we wanted to pack lots of addresses into limited space then using > block addresses might be justified (filesystems do that), but I don't > think it is called for here at all. The original idea is metadata block size could be changed between 512B to 4k. Sometimes the metadata block can't have enough data if we don't want to delay IO. When the block size isn't a sector, the type helps me avoid coding error (eg, remind it's a block instead of a sector). Do you prefer metadata block size one sector? > > +static void r5l_put_reserve(struct r5l_log *log, unsigned int reserved_blocks) > > +{ > > + BUG_ON(!mutex_is_locked(&log->io_mutex)); > > + > > + log->reserved_blocks -= reserved_blocks; > > + if (r5l_free_space(log) > 0) > > + wake_up(&log->space_waitq); > > +} > > + > > +static struct r5l_io_unit *r5l_alloc_io_unit(struct r5l_log *log) > > +{ > > + struct r5l_io_unit *io; > > + gfp_t gfp = GFP_NOIO | __GFP_NOFAIL; > > + > > + io = kmem_cache_zalloc(log->io_kc, gfp); > > + if (!io) > > + return NULL; > > + io->log = log; > > + io->meta_page = alloc_page(gfp | __GFP_ZERO); > > + if (!io->meta_page) { > > + kmem_cache_free(log->io_kc, io); > > + return NULL; > > + } > > This can return NULL, but the one place where you call it you do not > check for NULL. > Maybe a mempool would be appropriate, maybe something else - I haven't > examine the issue closely. I use mempool for the cache patch, but eventually choose to use NOFAIL allocation here, because don't know what the minimal mempool element size should be. incorrect mempool element size could introduce trouble, eg, the allocation expects we free an element, which might not possible without increasing complexity in reclaim. Maybe I should just ignore the NULL, it's a NOFAIL allocation. The log code isn't ready with allocation failure. > > +static int r5l_recovery_log(struct r5l_log *log) > > +{ > > + /* fake recovery */ > > + log->seq = log->last_cp_seq + 1; > > + log->log_start = r5l_ring_add(log, log->last_checkpoint, 1); > > + return 0; > > +} > > + > > +static void r5l_write_super(struct r5l_log *log, sector_t cp) > > +{ > > + log->rdev->recovery_offset = cp; > > + md_update_sb(log->mddev, 1); > > +} > > This is only called from run() when the log is first initialised. At > this point there is nothing useful in the log, so recording it's > location is pointless. At most you could set the MD_SB_DIRTY flag (or > whatever it is). > So it really doesn't need to be a separate function. We must write super here. If we start append data to the log and super doesn't point to correct postion (reclaim might not run once yet), recovery doesn't know where to find the log. > > +void r5l_exit_log(struct r5l_log *log) > > +{ > > + md_unregister_thread(&log->log_thread); > > + > > + kmem_cache_destroy(log->io_kc); > > + kfree(log); > > +} > > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c > > index 59e44e9..9608a44 100644 > > --- a/drivers/md/raid5.c > > +++ b/drivers/md/raid5.c > > @@ -899,6 +899,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s) > > > > might_sleep(); > > > > + if (!r5l_write_stripe(conf->log, sh)) > > + return; > > If no log is configured, r5l_write_stripe will return -EAGAIN, and so > ops_run_io will never submit any IO.... I think you read it wrong, we don't return in that case. > > diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h > > index 02c3bf8..a8daf39 100644 > > --- a/drivers/md/raid5.h > > +++ b/drivers/md/raid5.h > > @@ -223,6 +223,9 @@ struct stripe_head { > > struct stripe_head *batch_head; /* protected by stripe lock */ > > spinlock_t batch_lock; /* only header's lock is useful */ > > struct list_head batch_list; /* protected by head's batch lock*/ > > + > > + struct r5l_io_unit *log_io; > > + struct list_head log_list; > > /** > > * struct stripe_operations > > * @target - STRIPE_OP_COMPUTE_BLK target > > I wonder if we really need yet another 'list_head' in 'stripe_head'. > I guess one more is no great cost. I'm pretty sure using the lru list of stripe_head is broken, that's why I added the new list. I'll fix other issues or add more comments. Thanks, Shaohua -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html