Hi, Song Liu

I think there is a bug in the accounting of conf->pending_full_writes; I try
to fix it with the patch below.

There is another question on my side: the reclaim thread is woken up every
5 seconds and updates the super block no matter whether there has been any
IO activity or not. Should there be some check before updating the super
block?

commit 100c221268fd893bfb35273ef9ba14994f38bc6c
Author: ZhengYuan Liu <liuzhengyuan@xxxxxxxxxx>
Date:   Sat Sep 24 21:59:01 2016 +0800

    md/r5cache: decrease the counter after full-write stripe was reclaimed

    Once the data is written to the cache device, r5c_handle_cached_data_endio
    is called to set dev->written to NULL and return the bio to the upper
    layer. As a result, handle_stripe_clean_event gets no chance to be called
    to decrease conf->pending_full_writes when the stripe is written to the
    raid disks at the reclaim stage. We should test the STRIPE_FULL_WRITE bit
    and decrease the counter somewhere once the full-write stripe has been
    reclaimed; r5c_handle_stripe_written may be the right place to do that.

    Signed-off-by: ZhengYuan Liu <liuzhengyuan@xxxxxxxxxx>

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 3446bcc..326457d 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1955,6 +1955,10 @@ void r5c_handle_stripe_written(struct r5conf *conf,
 		list_del_init(&sh->r5c);
 		spin_unlock_irqrestore(&conf->log->stripe_in_cache_lock, flags);
 		sh->log_start = MaxSector;
+
+		if (test_and_clear_bit(STRIPE_FULL_WRITE, &sh->state))
+			if (atomic_dec_and_test(&conf->pending_full_writes))
+				md_wakeup_thread(conf->mddev->thread);
 	}

 	if (do_wakeup)

2016-09-01 6:18 GMT+08:00 Song Liu <songliubraving@xxxxxx>:
> This is the write part of r5cache. The cache is integrated with the
> stripe cache of raid456. It leverages the r5l_log code to write data
> to the journal device.
>
> r5cache splits the current write path into 2 parts: the write path
> and the reclaim path. The write path is as follows:
> 1. write data to the journal
>    (r5c_handle_stripe_dirtying, r5c_cache_data)
> 2. call bio_endio
>    (r5c_handle_data_cached, r5c_return_dev_pending_writes).
>
> The reclaim path is:
> 1. Freeze the stripe (r5c_freeze_stripe_for_reclaim)
> 2. Calculate parity (reconstruct or RMW)
> 3. Write parity (and maybe some other data) to the journal device
> 4. Write data and parity to the RAID disks
>
> Steps 3 and 4 of the reclaim path are very similar to the write path
> of the raid5 journal.
>
> With r5cache, a write operation does not wait for parity calculation
> and write-out, so the write latency is lower (1 write to the journal
> device vs. read and then write to the raid disks). Also, r5cache
> reduces RAID overhead (multiple IOs due to read-modify-write of
> parity) and provides more opportunities for full stripe writes.
>
> r5cache adds 2 flags to stripe_head.state: STRIPE_R5C_FROZEN and
> STRIPE_R5C_WRITTEN. The write path runs with STRIPE_R5C_FROZEN == 0.
> Cache writes start from r5c_handle_stripe_dirtying(), where the
> R5_Wantcache bit is set for devices with a bio in towrite. Then, the
> data is written to the journal through the r5l_log implementation.
> Once the data is in the journal, we set the R5_InCache bit and call
> bio_endio for these writes.
>
> The reclaim path starts by setting STRIPE_R5C_FROZEN. This moves the
> stripe into reclaim. If some write operation arrives at this time,
> it is handled as in the raid5 journal case (calculate parity, write
> to journal, write to disks, bio_endio).
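
[To check my understanding of the two paths, here is a rough userspace toy
model I wrote while reading the description above -- it is my own sketch,
not code from the patch; only the flag names and the ordering of the steps
come from the text, everything else (the toy_* names, the counter) is made
up for illustration:]

    /* toy_r5c.c -- gcc -o toy_r5c toy_r5c.c && ./toy_r5c */
    #include <stdio.h>

    #define TOY_FROZEN   (1UL << 0)  /* stands in for STRIPE_R5C_FROZEN */
    #define TOY_WRITTEN  (1UL << 1)  /* stands in for STRIPE_R5C_WRITTEN */

    struct toy_stripe {
            unsigned long state;
            int dev_in_cache;        /* devices with R5_InCache set */
    };

    /* write path: journal the data, then complete the bio early */
    static void toy_write(struct toy_stripe *sh)
    {
            if (sh->state & TOY_FROZEN) {
                    /* reclaim already started: plain raid5-journal handling */
                    printf("frozen: parity -> journal -> disks -> bio_endio\n");
                    return;
            }
            sh->dev_in_cache++;      /* R5_Wantcache -> journaled -> R5_InCache */
            printf("cached: data in journal, bio_endio right away\n");
    }

    /* reclaim path: freeze, compute parity, write out, release */
    static void toy_reclaim(struct toy_stripe *sh)
    {
            sh->state |= TOY_FROZEN;                  /* step 1 */
            /* steps 2-4: calculate parity, journal it, write to raid disks */
            sh->state |= TOY_WRITTEN;
            sh->dev_in_cache = 0;                     /* R5_InCache cleared */
            sh->state &= ~(TOY_FROZEN | TOY_WRITTEN); /* stripe clean again */
            printf("reclaimed\n");
    }

    int main(void)
    {
            struct toy_stripe sh = { 0, 0 };

            toy_write(&sh);    /* fast cached write */
            toy_reclaim(&sh);  /* background reclaim */
            toy_write(&sh);    /* cache usable again */
            return 0;
    }

[Please correct me if I have misread the ordering here.]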
>
> Once frozen, the stripe is sent back to the raid5 state machine,
> where handle_stripe_dirtying will evaluate the stripe for
> reconstruct writes or RMW writes (read data and calculate parity).
>
> For RMW, the code allocates an extra page for each data block
> being updated. This is stored in r5dev->page and the old data
> is read into it. Then the prexor calculation subtracts ->page
> from the parity block, and the reconstruct calculation adds the
> ->orig_page data back into the parity block.
>
> r5cache naturally excludes SkipCopy. With the R5_Wantcache bit set,
> async_copy_data will not skip the copy.
>
> Before writing data to the RAID disks, the r5l_log logic stores
> parity (and non-overwrite data) in the journal.
>
> Instead of the inactive_list, stripes with cached data are tracked in
> r5conf->r5c_cached_list. r5conf->r5c_cached_stripes tracks how
> many stripes have dirty data in the cache.
>
> There are some known limitations of the cache implementation:
>
> 1. The write cache only covers full page writes (R5_OVERWRITE). Writes
>    of smaller granularity are written through.
> 2. Only one log io (sh->log_io) for each stripe at any time. Later
>    writes for the same stripe have to wait. This can be improved by
>    moving log_io to r5dev.
> 3. When used with a bitmap, there is a deadlock in add_stripe_bio =>
>    bitmap_startwrite.
>
> Signed-off-by: Song Liu <songliubraving@xxxxxx>
> Signed-off-by: Shaohua Li <shli@xxxxxx>
> ---
>  drivers/md/raid5-cache.c | 271 +++++++++++++++++++++++++++++++++++++++++++++--
>  drivers/md/raid5.c       | 164 ++++++++++++++++++++++++----
>  drivers/md/raid5.h       |  25 ++++-
>  3 files changed, 429 insertions(+), 31 deletions(-)
>
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 1b1ab4a..cc7b80d 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -40,6 +40,13 @@
>   */
>  #define R5L_POOL_SIZE 4
>
> +enum r5c_state {
> +        R5C_STATE_NO_CACHE = 0,
> +        R5C_STATE_WRITE_THROUGH = 1,
> +        R5C_STATE_WRITE_BACK = 2,
> +        R5C_STATE_CACHE_BROKEN = 3,
> +};
> +
>  struct r5l_log {
>          struct md_rdev *rdev;
>
> @@ -96,6 +103,9 @@ struct r5l_log {
>          spinlock_t no_space_stripes_lock;
>
>          bool need_cache_flush;
> +
> +        /* for r5c_cache */
> +        enum r5c_state r5c_state;
>  };
>
>  /*
> @@ -168,12 +178,73 @@ static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
>          io->state = state;
>  }
>
> +/*
> + * Freeze the stripe, thus send the stripe into reclaim path.
> + *
> + * This function should only be called from raid5d that handling this stripe,
> + * or when the stripe is on r5c_cached_list (hold conf->device_lock)
> + */
> +void r5c_freeze_stripe_for_reclaim(struct stripe_head *sh)
> +{
> +        struct r5conf *conf = sh->raid_conf;
> +
> +        if (!conf->log)
> +                return;
> +
> +        WARN_ON(test_bit(STRIPE_R5C_FROZEN, &sh->state));
> +        set_bit(STRIPE_R5C_FROZEN, &sh->state);
> +
> +        if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> +                atomic_inc(&conf->preread_active_stripes);
> +        if (test_and_clear_bit(STRIPE_IN_R5C_CACHE, &sh->state)) {
> +                BUG_ON(atomic_read(&conf->r5c_cached_stripes) == 0);
> +                atomic_dec(&conf->r5c_cached_stripes);
> +        }
> +}
> +
> +static void r5c_handle_data_cached(struct stripe_head *sh)
> +{
> +        int i;
> +
> +        for (i = sh->disks; i--; )
> +                if (test_and_clear_bit(R5_Wantcache, &sh->dev[i].flags)) {
> +                        set_bit(R5_InCache, &sh->dev[i].flags);
> +                        clear_bit(R5_LOCKED, &sh->dev[i].flags);
> +                        atomic_inc(&sh->dev_in_cache);
> +                }
> +}
> +
> +/*
> + * this journal write must contain full parity,
> + * it may also contain some data pages
> + */
> +static void r5c_handle_parity_cached(struct stripe_head *sh)
> +{
> +        int i;
> +
> +        for (i = sh->disks; i--; )
> +                if (test_bit(R5_InCache, &sh->dev[i].flags))
> +                        set_bit(R5_Wantwrite, &sh->dev[i].flags);
> +        set_bit(STRIPE_R5C_WRITTEN, &sh->state);
> +}
> +
> +static void r5c_finish_cache_stripe(struct stripe_head *sh)
> +{
> +        if (test_bit(STRIPE_R5C_FROZEN, &sh->state))
> +                r5c_handle_parity_cached(sh);
> +        else
> +                r5c_handle_data_cached(sh);
> +}
> +
>  static void r5l_io_run_stripes(struct r5l_io_unit *io)
>  {
>          struct stripe_head *sh, *next;
>
>          list_for_each_entry_safe(sh, next, &io->stripe_list, log_list) {
>                  list_del_init(&sh->log_list);
> +
> +                r5c_finish_cache_stripe(sh);
> +
>                  set_bit(STRIPE_HANDLE, &sh->state);
>                  raid5_release_stripe(sh);
>          }
> @@ -402,7 +473,8 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
>          io = log->current_io;
>
>          for (i = 0; i < sh->disks; i++) {
> -                if (!test_bit(R5_Wantwrite, &sh->dev[i].flags))
> +                if (!test_bit(R5_Wantwrite, &sh->dev[i].flags) &&
> +                    !test_bit(R5_Wantcache, &sh->dev[i].flags))
>                          continue;
>                  if (i == sh->pd_idx || i == sh->qd_idx)
>                          continue;
> @@ -412,18 +484,19 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
>                  r5l_append_payload_page(log, sh->dev[i].page);
>          }
>
> -        if (sh->qd_idx >= 0) {
> +        if (parity_pages == 2) {
>                  r5l_append_payload_meta(log, R5LOG_PAYLOAD_PARITY,
>                                          sh->sector, sh->dev[sh->pd_idx].log_checksum,
>                                          sh->dev[sh->qd_idx].log_checksum, true);
>                  r5l_append_payload_page(log, sh->dev[sh->pd_idx].page);
>                  r5l_append_payload_page(log, sh->dev[sh->qd_idx].page);
> -        } else {
> +        } else if (parity_pages == 1) {
>                  r5l_append_payload_meta(log, R5LOG_PAYLOAD_PARITY,
>                                          sh->sector, sh->dev[sh->pd_idx].log_checksum,
>                                          0, false);
>                  r5l_append_payload_page(log, sh->dev[sh->pd_idx].page);
> -        }
> +        } else
> +                BUG_ON(parity_pages != 0);
>
>          list_add_tail(&sh->log_list, &io->stripe_list);
>          atomic_inc(&io->pending_stripe);
> @@ -432,7 +505,6 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
>          return 0;
>  }
>
> -static void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
>  /*
>   * running in raid5d, where reclaim could wait for raid5d too (when it flushes
>   * data from log to raid disks), so we shouldn't wait for reclaim here
> @@ -456,11 +528,17 @@ int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
>                  return -EAGAIN;
>          }
>
> +        WARN_ON(!test_bit(STRIPE_R5C_FROZEN, &sh->state));
> +
>          for (i = 0; i < sh->disks; i++) {
>                  void *addr;
>
>                  if (!test_bit(R5_Wantwrite, &sh->dev[i].flags))
>                          continue;
> +
> +                if (test_bit(R5_InCache, &sh->dev[i].flags))
> +                        continue;
> +
>                  write_disks++;
>                  /* checksum is already calculated in last run */
>                  if (test_bit(STRIPE_LOG_TRAPPED, &sh->state))
> @@ -473,6 +551,9 @@ int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
>          parity_pages = 1 + !!(sh->qd_idx >= 0);
>          data_pages = write_disks - parity_pages;
>
> +        pr_debug("%s: write %d data_pages and %d parity_pages\n",
> +                 __func__, data_pages, parity_pages);
> +
>          meta_size =
>                  ((sizeof(struct r5l_payload_data_parity) + sizeof(__le32))
>                   * data_pages) +
> @@ -735,7 +816,6 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
>          }
>  }
>
> -
>  static void r5l_do_reclaim(struct r5l_log *log)
>  {
>          sector_t reclaim_target = xchg(&log->reclaim_target, 0);
> @@ -798,7 +878,7 @@ static void r5l_reclaim_thread(struct md_thread *thread)
>          r5l_do_reclaim(log);
>  }
>
> -static void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
> +void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
>  {
>          unsigned long target;
>          unsigned long new = (unsigned long)space; /* overflow in theory */
> @@ -1111,6 +1191,180 @@ static void r5l_write_super(struct r5l_log *log, sector_t cp)
>          set_bit(MD_CHANGE_DEVS, &mddev->flags);
>  }
>
> +static void r5c_flush_stripe(struct r5conf *conf, struct stripe_head *sh)
> +{
> +        list_del_init(&sh->lru);
> +        r5c_freeze_stripe_for_reclaim(sh);
> +        atomic_inc(&conf->active_stripes);
> +        atomic_inc(&sh->count);
> +        set_bit(STRIPE_HANDLE, &sh->state);
> +        raid5_release_stripe(sh);
> +}
> +
> +int r5c_flush_cache(struct r5conf *conf)
> +{
> +        int count = 0;
> +
> +        assert_spin_locked(&conf->device_lock);
> +        if (!conf->log)
> +                return 0;
> +        while (!list_empty(&conf->r5c_cached_list)) {
> +                struct list_head *l = conf->r5c_cached_list.next;
> +                struct stripe_head *sh;
> +
> +                sh = list_entry(l, struct stripe_head, lru);
> +                r5c_flush_stripe(conf, sh);
> +                ++count;
> +        }
> +        return count;
> +}
> +
> +int r5c_handle_stripe_dirtying(struct r5conf *conf,
> +                               struct stripe_head *sh,
> +                               struct stripe_head_state *s,
> +                               int disks) {
> +        struct r5l_log *log = conf->log;
> +        int i;
> +        struct r5dev *dev;
> +
> +        if (!log || test_bit(STRIPE_R5C_FROZEN, &sh->state))
> +                return -EAGAIN;
> +
> +        if (conf->log->r5c_state == R5C_STATE_WRITE_THROUGH ||
> +            conf->quiesce != 0 || conf->mddev->degraded != 0) {
> +                /* write through mode */
> +                r5c_freeze_stripe_for_reclaim(sh);
> +                return -EAGAIN;
> +        }
> +
> +        s->to_cache = 0;
> +
> +        for (i = disks; i--; ) {
> +                dev = &sh->dev[i];
> +                /* if none-overwrite, use the reclaim path (write through) */
> +                if (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags) &&
> +                    !test_bit(R5_InCache, &dev->flags)) {
> +                        r5c_freeze_stripe_for_reclaim(sh);
> +                        return -EAGAIN;
> +                }
> +        }
> +
> +        for (i = disks; i--; ) {
> +                dev = &sh->dev[i];
> +                if (dev->towrite) {
> +                        set_bit(R5_Wantcache, &dev->flags);
> +                        set_bit(R5_Wantdrain, &dev->flags);
> +                        set_bit(R5_LOCKED, &dev->flags);
> +                        s->to_cache++;
> +                }
> +        }
> +
> +        if (s->to_cache)
> +                set_bit(STRIPE_OP_BIODRAIN, &s->ops_request);
> +
> +        return 0;
> +}
> +
> +void r5c_handle_stripe_written(struct r5conf *conf,
> +                               struct stripe_head *sh) {
> +        int i;
> +        int do_wakeup = 0;
> +
> +        if (test_and_clear_bit(STRIPE_R5C_WRITTEN, &sh->state)) {
> +                WARN_ON(!test_bit(STRIPE_R5C_FROZEN, &sh->state));
> +                clear_bit(STRIPE_R5C_FROZEN, &sh->state);
> +
> +                for (i = sh->disks; i--; ) {
> +                        if (test_and_clear_bit(R5_InCache, &sh->dev[i].flags))
> +                                atomic_dec(&sh->dev_in_cache);
> +                        clear_bit(R5_UPTODATE, &sh->dev[i].flags);
> +                        if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
> +                                do_wakeup = 1;
> +                }
> +        }
> +
> +        if (do_wakeup)
> +                wake_up(&conf->wait_for_overlap);
> +}
> +
> +int
> +r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
> +               struct stripe_head_state *s)
> +{
> +        int pages;
> +        int meta_size;
> +        int reserve;
> +        int i;
> +        int ret = 0;
> +        int page_count = 0;
> +
> +        BUG_ON(!log);
> +        BUG_ON(s->to_cache == 0);
> +
> +        for (i = 0; i < sh->disks; i++) {
> +                void *addr;
> +
> +                if (!test_bit(R5_Wantcache, &sh->dev[i].flags))
> +                        continue;
> +                addr = kmap_atomic(sh->dev[i].page);
> +                sh->dev[i].log_checksum = crc32c_le(log->uuid_checksum,
> +                                                    addr, PAGE_SIZE);
> +                kunmap_atomic(addr);
> +                page_count++;
> +        }
> +        WARN_ON(page_count != s->to_cache);
> +
> +        pages = s->to_cache;
> +
> +        meta_size =
> +                ((sizeof(struct r5l_payload_data_parity) + sizeof(__le32))
> +                 * pages);
> +        /* Doesn't work with very big raid array */
> +        if (meta_size + sizeof(struct r5l_meta_block) > PAGE_SIZE)
> +                return -EINVAL;
> +
> +        /*
> +         * The stripe must enter state machine again to call endio, so
> +         * don't delay.
> +         */
> +        clear_bit(STRIPE_DELAYED, &sh->state);
> +        atomic_inc(&sh->count);
> +
> +        mutex_lock(&log->io_mutex);
> +        /* meta + data */
> +        reserve = (1 + pages) << (PAGE_SHIFT - 9);
> +        if (!r5l_has_free_space(log, reserve)) {
> +                spin_lock(&log->no_space_stripes_lock);
> +                list_add_tail(&sh->log_list, &log->no_space_stripes);
> +                spin_unlock(&log->no_space_stripes_lock);
> +
> +                r5l_wake_reclaim(log, reserve);
> +        } else {
> +                ret = r5l_log_stripe(log, sh, pages, 0);
> +                if (ret) {
> +                        spin_lock_irq(&log->io_list_lock);
> +                        list_add_tail(&sh->log_list, &log->no_mem_stripes);
> +                        spin_unlock_irq(&log->io_list_lock);
> +                }
> +        }
> +
> +        mutex_unlock(&log->io_mutex);
> +        return 0;
> +}
> +
> +void r5c_do_reclaim(struct r5conf *conf)
> +{
> +        struct stripe_head *sh, *next;
> +        struct r5l_log *log = conf->log;
> +
> +        assert_spin_locked(&conf->device_lock);
> +
> +        if (!log)
> +                return;
> +        list_for_each_entry_safe(sh, next, &conf->r5c_cached_list, lru)
> +                r5c_flush_stripe(conf, sh);
> +}
> +
>  static int r5l_load_log(struct r5l_log *log)
>  {
>          struct md_rdev *rdev = log->rdev;
> @@ -1230,6 +1484,9 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
>          INIT_LIST_HEAD(&log->no_space_stripes);
>          spin_lock_init(&log->no_space_stripes_lock);
>
> +        /* flush full stripe */
> +        log->r5c_state = R5C_STATE_WRITE_BACK;
> +
>          if (r5l_load_log(log))
>                  goto error;
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index b95c54c..ed6efd0 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -316,8 +316,16 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
>                      < IO_THRESHOLD)
>                          md_wakeup_thread(conf->mddev->thread);
>          atomic_dec(&conf->active_stripes);
> -        if (!test_bit(STRIPE_EXPANDING, &sh->state))
> -                list_add_tail(&sh->lru, temp_inactive_list);
> +        if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
> +                if (atomic_read(&sh->dev_in_cache) == 0) {
> +                        list_add_tail(&sh->lru, temp_inactive_list);
> +                } else {
> +                        if (!test_and_set_bit(STRIPE_IN_R5C_CACHE,
> +                                              &sh->state))
> +                                atomic_inc(&conf->r5c_cached_stripes);
> +                        list_add_tail(&sh->lru, &conf->r5c_cached_list);
> +                }
> +        }
>          }
>  }
>
> @@ -901,6 +909,13 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
>
>          might_sleep();
>
> +        if (s->to_cache) {
> +                if (r5c_cache_data(conf->log, sh, s) == 0)
> +                        return;
> +                /* array is too big that meta data size > PAGE_SIZE */
> +                r5c_freeze_stripe_for_reclaim(sh);
> +        }
> +
>          if (r5l_write_stripe(conf->log, sh) == 0)
>                  return;
>          for (i = disks; i--; ) {
> @@ -1029,6 +1044,7 @@ again:
>
>                  if (test_bit(R5_SkipCopy, &sh->dev[i].flags))
>                          WARN_ON(test_bit(R5_UPTODATE, &sh->dev[i].flags));
> +
>                  sh->dev[i].vec.bv_page = sh->dev[i].page;
>                  bi->bi_vcnt = 1;
>                  bi->bi_io_vec[0].bv_len = STRIPE_SIZE;
> @@ -1115,7 +1131,7 @@ again:
>  static struct dma_async_tx_descriptor *
>  async_copy_data(int frombio, struct bio *bio, struct page **page,
>          sector_t sector, struct dma_async_tx_descriptor *tx,
> -        struct stripe_head *sh)
> +        struct stripe_head *sh, int no_skipcopy)
>  {
>          struct bio_vec bvl;
>          struct bvec_iter iter;
> @@ -1155,7 +1171,8 @@ async_copy_data(int frombio, struct bio *bio, struct page **page,
>                  if (frombio) {
>                          if (sh->raid_conf->skip_copy &&
>                              b_offset == 0 && page_offset == 0 &&
> -                            clen == STRIPE_SIZE)
> +                            clen == STRIPE_SIZE &&
> +                            !no_skipcopy)
>                                  *page = bio_page;
>                          else
>                                  tx = async_memcpy(*page, bio_page, page_offset,
> @@ -1237,7 +1254,7 @@ static void ops_run_biofill(struct stripe_head *sh)
>                          while (rbi && rbi->bi_iter.bi_sector <
>                                  dev->sector + STRIPE_SECTORS) {
>                                  tx = async_copy_data(0, rbi, &dev->page,
> -                                                     dev->sector, tx, sh);
> +                                                     dev->sector, tx, sh, 0);
>                                  rbi = r5_next_bio(rbi, dev->sector);
>                          }
>                  }
> @@ -1364,7 +1381,8 @@ static int set_syndrome_sources(struct page **srcs,
>                  if (i == sh->qd_idx || i == sh->pd_idx ||
>                      (srctype == SYNDROME_SRC_ALL) ||
>                      (srctype == SYNDROME_SRC_WANT_DRAIN &&
> -                     test_bit(R5_Wantdrain, &dev->flags)) ||
> +                     (test_bit(R5_Wantdrain, &dev->flags) ||
> +                      test_bit(R5_InCache, &dev->flags))) ||
>                      (srctype == SYNDROME_SRC_WRITTEN &&
>                       dev->written))
>                          srcs[slot] = sh->dev[i].page;
> @@ -1543,9 +1561,18 @@ ops_run_compute6_2(struct stripe_head *sh, struct raid5_percpu *percpu)
>  static void ops_complete_prexor(void *stripe_head_ref)
>  {
>          struct stripe_head *sh = stripe_head_ref;
> +        int i;
>
>          pr_debug("%s: stripe %llu\n", __func__,
>                  (unsigned long long)sh->sector);
> +
> +        for (i = sh->disks; i--; )
> +                if (sh->dev[i].page != sh->dev[i].orig_page) {
> +                        struct page *p = sh->dev[i].page;
> +
> +                        sh->dev[i].page = sh->dev[i].orig_page;
> +                        put_page(p);
> +                }
>  }
>
>  static struct dma_async_tx_descriptor *
> @@ -1567,7 +1594,8 @@ ops_run_prexor5(struct stripe_head *sh, struct raid5_percpu *percpu,
>          for (i = disks; i--; ) {
>                  struct r5dev *dev = &sh->dev[i];
>                  /* Only process blocks that are known to be uptodate */
> -                if (test_bit(R5_Wantdrain, &dev->flags))
> +                if (test_bit(R5_Wantdrain, &dev->flags) ||
> +                    test_bit(R5_InCache, &dev->flags))
>                          xor_srcs[count++] = dev->page;
>          }
>
> @@ -1618,6 +1646,10 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
>
>  again:
>                          dev = &sh->dev[i];
> +                        if (test_and_clear_bit(R5_InCache, &dev->flags)) {
> +                                BUG_ON(atomic_read(&sh->dev_in_cache) == 0);
> +                                atomic_dec(&sh->dev_in_cache);
> +                        }
>                          spin_lock_irq(&sh->stripe_lock);
>                          chosen = dev->towrite;
>                          dev->towrite = NULL;
> @@ -1625,7 +1657,8 @@ again:
>                          BUG_ON(dev->written);
>                          wbi = dev->written = chosen;
>                          spin_unlock_irq(&sh->stripe_lock);
> -                        WARN_ON(dev->page != dev->orig_page);
> +                        if (!test_bit(R5_Wantcache, &dev->flags))
> +                                WARN_ON(dev->page != dev->orig_page);
>
>                          while (wbi && wbi->bi_iter.bi_sector <
>                                  dev->sector + STRIPE_SECTORS) {
> @@ -1637,8 +1670,10 @@ again:
>                                  set_bit(R5_Discard, &dev->flags);
>                          else {
>                                  tx = async_copy_data(1, wbi, &dev->page,
> -                                                     dev->sector, tx, sh);
> -                                if (dev->page != dev->orig_page) {
> +                                                     dev->sector, tx, sh,
> +                                                     test_bit(R5_Wantcache, &dev->flags));
> +                                if (dev->page != dev->orig_page &&
> +                                    !test_bit(R5_Wantcache, &dev->flags)) {
>                                          set_bit(R5_SkipCopy, &dev->flags);
>                                          clear_bit(R5_UPTODATE, &dev->flags);
>                                          clear_bit(R5_OVERWRITE, &dev->flags);
> @@ -1746,7 +1781,8 @@ again:
>                  xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page;
>                  for (i = disks; i--; ) {
>                          struct r5dev *dev = &sh->dev[i];
> -                        if (head_sh->dev[i].written)
> +                        if (head_sh->dev[i].written ||
> +                            test_bit(R5_InCache, &head_sh->dev[i].flags))
>                                  xor_srcs[count++] = dev->page;
>                  }
>          } else {
> @@ -2001,6 +2037,7 @@ static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp,
>          INIT_LIST_HEAD(&sh->batch_list);
>          INIT_LIST_HEAD(&sh->lru);
>          atomic_set(&sh->count, 1);
> +        atomic_set(&sh->dev_in_cache, 0);
>          for (i = 0; i < disks; i++) {
>                  struct r5dev *dev = &sh->dev[i];
>
> @@ -2887,6 +2924,9 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
>                          if (!expand)
>                                  clear_bit(R5_UPTODATE, &dev->flags);
>                          s->locked++;
> +                } else if (test_bit(R5_InCache, &dev->flags)) {
> +                        set_bit(R5_LOCKED, &dev->flags);
> +                        s->locked++;
>                  }
>          }
>          /* if we are not expanding this is a proper write request, and
> @@ -2926,6 +2966,9 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
>                          set_bit(R5_LOCKED, &dev->flags);
>                          clear_bit(R5_UPTODATE, &dev->flags);
>                          s->locked++;
> +                } else if (test_bit(R5_InCache, &dev->flags)) {
> +                        set_bit(R5_LOCKED, &dev->flags);
> +                        s->locked++;
>                  }
>          }
>          if (!s->locked)
> @@ -3577,6 +3620,9 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>          int rmw = 0, rcw = 0, i;
>          sector_t recovery_cp = conf->mddev->recovery_cp;
>
> +        if (r5c_handle_stripe_dirtying(conf, sh, s, disks) == 0)
> +                return;
> +
>          /* Check whether resync is now happening or should start.
>           * If yes, then the array is dirty (after unclean shutdown or
>           * initial creation), so parity in some stripes might be inconsistent.
> @@ -3597,9 +3643,12 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>          } else for (i = disks; i--; ) {
>                  /* would I have to read this buffer for read_modify_write */
>                  struct r5dev *dev = &sh->dev[i];
> -                if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx) &&
> +                if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx ||
> +                     test_bit(R5_InCache, &dev->flags)) &&
>                      !test_bit(R5_LOCKED, &dev->flags) &&
> -                    !(test_bit(R5_UPTODATE, &dev->flags) ||
> +                    !((test_bit(R5_UPTODATE, &dev->flags) &&
> +                       (!test_bit(R5_InCache, &dev->flags) ||
> +                        dev->page != dev->orig_page)) ||
>                        test_bit(R5_Wantcompute, &dev->flags))) {
>                          if (test_bit(R5_Insync, &dev->flags))
>                                  rmw++;
> @@ -3611,13 +3660,15 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>                      i != sh->pd_idx && i != sh->qd_idx &&
>                      !test_bit(R5_LOCKED, &dev->flags) &&
>                      !(test_bit(R5_UPTODATE, &dev->flags) ||
> -                      test_bit(R5_Wantcompute, &dev->flags))) {
> +                      test_bit(R5_InCache, &dev->flags) ||
> +                      test_bit(R5_Wantcompute, &dev->flags))) {
>                          if (test_bit(R5_Insync, &dev->flags))
>                                  rcw++;
>                          else
>                                  rcw += 2*disks;
>                  }
>          }
> +
>          pr_debug("for sector %llu, rmw=%d rcw=%d\n",
>                  (unsigned long long)sh->sector, rmw, rcw);
>          set_bit(STRIPE_HANDLE, &sh->state);
> @@ -3629,10 +3680,18 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>                          (unsigned long long)sh->sector, rmw);
>                  for (i = disks; i--; ) {
>                          struct r5dev *dev = &sh->dev[i];
> -                        if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx) &&
> +                        if (test_bit(R5_InCache, &dev->flags) &&
> +                            dev->page == dev->orig_page)
> +                                dev->page = alloc_page(GFP_NOIO); /* prexor */
> +
> +                        if ((dev->towrite ||
> +                             i == sh->pd_idx || i == sh->qd_idx ||
> +                             test_bit(R5_InCache, &dev->flags)) &&
>                              !test_bit(R5_LOCKED, &dev->flags) &&
> -                            !(test_bit(R5_UPTODATE, &dev->flags) ||
> -                              test_bit(R5_Wantcompute, &dev->flags)) &&
> +                            !((test_bit(R5_UPTODATE, &dev->flags) &&
> +                               (!test_bit(R5_InCache, &dev->flags) ||
> +                                dev->page != dev->orig_page)) ||
> +                              test_bit(R5_Wantcompute, &dev->flags)) &&
>                              test_bit(R5_Insync, &dev->flags)) {
>                                  if (test_bit(STRIPE_PREREAD_ACTIVE,
>                                               &sh->state)) {
> @@ -3658,6 +3717,7 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>                      i != sh->pd_idx && i != sh->qd_idx &&
>                      !test_bit(R5_LOCKED, &dev->flags) &&
>                      !(test_bit(R5_UPTODATE, &dev->flags) ||
> +                      test_bit(R5_InCache, &dev->flags) ||
>                        test_bit(R5_Wantcompute, &dev->flags))) {
>                          rcw++;
>                          if (test_bit(R5_Insync, &dev->flags) &&
> @@ -3697,7 +3757,7 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>           */
>          if ((s->req_compute || !test_bit(STRIPE_COMPUTE_RUN, &sh->state)) &&
>              (s->locked == 0 && (rcw == 0 || rmw == 0) &&
> -            !test_bit(STRIPE_BIT_DELAY, &sh->state)))
> +             !test_bit(STRIPE_BIT_DELAY, &sh->state)))
>                  schedule_reconstruction(sh, s, rcw == 0, 0);
>  }
>
> @@ -4010,6 +4070,46 @@ static void handle_stripe_expansion(struct r5conf *conf, struct stripe_head *sh)
>          async_tx_quiesce(&tx);
>  }
>
> +static void
> +r5c_return_dev_pending_writes(struct r5conf *conf, struct r5dev *dev,
> +                              struct bio_list *return_bi)
> +{
> +        struct bio *wbi, *wbi2;
> +
> +        wbi = dev->written;
> +        dev->written = NULL;
> +        while (wbi && wbi->bi_iter.bi_sector <
> +               dev->sector + STRIPE_SECTORS) {
> +                wbi2 = r5_next_bio(wbi, dev->sector);
> +                if (!raid5_dec_bi_active_stripes(wbi)) {
> +                        md_write_end(conf->mddev);
> +                        bio_list_add(return_bi, wbi);
> +                }
> +                wbi = wbi2;
> +        }
> +}
> +
> +static void r5c_handle_cached_data_endio(struct r5conf *conf,
> +        struct stripe_head *sh, int disks, struct bio_list *return_bi)
> +{
> +        int i;
> +
> +        for (i = sh->disks; i--; ) {
> +                if (test_bit(R5_InCache, &sh->dev[i].flags) &&
> +                    sh->dev[i].written) {
> +                        set_bit(R5_UPTODATE, &sh->dev[i].flags);
> +                        r5c_return_dev_pending_writes(conf, &sh->dev[i],
> +                                                      return_bi);
> +                }
> +        }
> +        r5l_stripe_write_finished(sh);
> +
> +        bitmap_endwrite(conf->mddev->bitmap, sh->sector,
> +                        STRIPE_SECTORS,
> +                        !test_bit(STRIPE_DEGRADED, &sh->state),
> +                        0);
> +}
> +
>  /*
>   * handle_stripe - do things to a stripe.
>   *
> @@ -4188,6 +4288,8 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
>                          if (rdev && !test_bit(Faulty, &rdev->flags))
>                                  do_recovery = 1;
>                  }
> +                if (test_bit(R5_InCache, &dev->flags) && dev->written)
> +                        s->just_cached++;
>          }
>          if (test_bit(STRIPE_SYNCING, &sh->state)) {
>                  /* If there is a failed device being replaced,
> @@ -4416,7 +4518,7 @@ static void handle_stripe(struct stripe_head *sh)
>                          struct r5dev *dev = &sh->dev[i];
>                          if (test_bit(R5_LOCKED, &dev->flags) &&
>                                  (i == sh->pd_idx || i == sh->qd_idx ||
> -                                 dev->written)) {
> +                                 dev->written || test_bit(R5_InCache, &dev->flags))) {
>                                  pr_debug("Writing block %d\n", i);
>                                  set_bit(R5_Wantwrite, &dev->flags);
>                                  if (prexor)
> @@ -4456,6 +4558,12 @@ static void handle_stripe(struct stripe_head *sh)
>                                  test_bit(R5_Discard, &qdev->flags))))))
>                  handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
>
> +        if (s.just_cached)
> +                r5c_handle_cached_data_endio(conf, sh, disks, &s.return_bi);
> +
> +        if (test_bit(STRIPE_R5C_FROZEN, &sh->state))
> +                r5l_stripe_write_finished(sh);
> +
>          /* Now we might consider reading some blocks, either to check/generate
>           * parity, or to satisfy requests
>           * or to load a block that is being partially written.
> @@ -4467,13 +4575,17 @@ static void handle_stripe(struct stripe_head *sh)
>              || s.expanding)
>                  handle_stripe_fill(sh, &s, disks);
>
> -        /* Now to consider new write requests and what else, if anything
> -         * should be read. We do not handle new writes when:
> +        r5c_handle_stripe_written(conf, sh);
> +
> +        /* Now to consider new write requests, cache write back and what else,
> +         * if anything should be read. We do not handle new writes when:
>           * 1/ A 'write' operation (copy+xor) is already in flight.
>           * 2/ A 'check' operation is in flight, as it may clobber the parity
>           *    block.
> +         * 3/ A r5c cache log write is in flight.
>           */
> -        if (s.to_write && !sh->reconstruct_state && !sh->check_state)
> +        if ((s.to_write || test_bit(STRIPE_R5C_FROZEN, &sh->state)) &&
> +            !sh->reconstruct_state && !sh->check_state && !sh->log_io)
>                  handle_stripe_dirtying(conf, sh, &s, disks);
>
>          /* maybe we need to check and possibly fix the parity for this stripe
> @@ -5192,7 +5304,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
>           * later we might have to read it again in order to reconstruct
>           * data on failed drives.
>           */
> -        if (rw == READ && mddev->degraded == 0 &&
> +        if (rw == READ && mddev->degraded == 0 && conf->log == NULL &&
>              mddev->reshape_position == MaxSector) {
>                  bi = chunk_aligned_read(mddev, bi);
>                  if (!bi)
> @@ -5917,6 +6029,7 @@ static void raid5d(struct md_thread *thread)
>                          md_check_recovery(mddev);
>                          spin_lock_irq(&conf->device_lock);
>                  }
> +                r5c_do_reclaim(conf);
>          }
>          pr_debug("%d stripes handled\n", handled);
>
> @@ -6583,6 +6696,9 @@ static struct r5conf *setup_conf(struct mddev *mddev)
>          for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++)
>                  INIT_LIST_HEAD(conf->temp_inactive_list + i);
>
> +        atomic_set(&conf->r5c_cached_stripes, 0);
> +        INIT_LIST_HEAD(&conf->r5c_cached_list);
> +
>          conf->level = mddev->new_level;
>          conf->chunk_sectors = mddev->new_chunk_sectors;
>          if (raid5_alloc_percpu(conf) != 0)
> @@ -7655,8 +7771,10 @@ static void raid5_quiesce(struct mddev *mddev, int state)
>                  /* '2' tells resync/reshape to pause so that all
>                   * active stripes can drain
>                   */
> +                r5c_flush_cache(conf);
>                  conf->quiesce = 2;
>                  wait_event_cmd(conf->wait_for_quiescent,
> +                               atomic_read(&conf->r5c_cached_stripes) == 0 &&
>                                 atomic_read(&conf->active_stripes) == 0 &&
>                                 atomic_read(&conf->active_aligned_reads) == 0,
>                                 unlock_all_device_hash_locks_irq(conf),
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index 517d4b6..e301d0e 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -226,6 +226,7 @@ struct stripe_head {
>
>          struct r5l_io_unit *log_io;
>          struct list_head log_list;
> +        atomic_t dev_in_cache;
>          /**
>           * struct stripe_operations
>           * @target - STRIPE_OP_COMPUTE_BLK target
> @@ -263,6 +264,7 @@ struct stripe_head_state {
>           */
>          int syncing, expanding, expanded, replacing;
>          int locked, uptodate, to_read, to_write, failed, written;
> +        int to_cache, just_cached;
>          int to_fill, compute, req_compute, non_overwrite;
>          int failed_num[2];
>          int p_failed, q_failed;
> @@ -313,6 +315,8 @@ enum r5dev_flags {
>           */
>          R5_Discard,     /* Discard the stripe */
>          R5_SkipCopy,    /* Don't copy data from bio to stripe cache */
> +        R5_Wantcache,   /* Want write data to write cache */
> +        R5_InCache,     /* Data in cache */
>  };
>
>  /*
> @@ -345,7 +349,11 @@ enum {
>          STRIPE_BITMAP_PENDING,  /* Being added to bitmap, don't add
>                                   * to batch yet.
>                                   */
> -        STRIPE_LOG_TRAPPED,     /* trapped into log */
> +        STRIPE_LOG_TRAPPED,     /* trapped into log */
> +        STRIPE_IN_R5C_CACHE,    /* in r5c cache (to-be/being handled or
> +                                 * in conf->r5c_cached_list) */
> +        STRIPE_R5C_FROZEN,      /* r5c_cache frozen and being written out */
> +        STRIPE_R5C_WRITTEN,     /* ready for r5c_handle_stripe_written() */
>  };
>
>  #define STRIPE_EXPAND_SYNC_FLAGS \
> @@ -521,6 +529,8 @@ struct r5conf {
>           */
>          atomic_t                active_stripes;
>          struct list_head        inactive_list[NR_STRIPE_HASH_LOCKS];
> +        atomic_t                r5c_cached_stripes;
> +        struct list_head        r5c_cached_list;
>          atomic_t                empty_inactive_list_nr;
>          struct llist_head       released_stripes;
>          wait_queue_head_t       wait_for_quiescent;
> @@ -635,4 +645,17 @@ extern void r5l_stripe_write_finished(struct stripe_head *sh);
>  extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
>  extern void r5l_quiesce(struct r5l_log *log, int state);
>  extern bool r5l_log_disk_error(struct r5conf *conf);
> +extern void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
> +extern int
> +r5c_handle_stripe_dirtying(struct r5conf *conf, struct stripe_head *sh,
> +                           struct stripe_head_state *s, int disks);
> +extern int r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
> +                          struct stripe_head_state *s);
> +extern int r5c_cache_parity(struct r5l_log *log, struct stripe_head *sh,
> +                            struct stripe_head_state *s);
> +extern void
> +r5c_handle_stripe_written(struct r5conf *conf, struct stripe_head *sh);
> +extern void r5c_freeze_stripe_for_reclaim(struct stripe_head *sh);
> +extern void r5c_do_reclaim(struct r5conf *conf);
> +extern int r5c_flush_cache(struct r5conf *conf);
>  #endif
> --
> 2.8.0.rc2
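
Coming back to my fix at the top: the reason it tests test_and_clear_bit()
in front of atomic_dec_and_test() is to make the decrement fire exactly once
per stripe, even if the completion path is entered again. Here is a small
self-contained userspace model of that pattern (my own sketch; the toy_*
names merely mirror the kernel helpers and are not from the patch):

    /* toy_counter.c -- gcc -std=c11 -o toy_counter toy_counter.c */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define TOY_FULL_WRITE 0UL           /* stands in for STRIPE_FULL_WRITE */

    static atomic_ulong stripe_state = (1UL << TOY_FULL_WRITE);
    static atomic_int pending_full_writes = 1;

    /* like the kernel's test_and_clear_bit(): returns true only for the
     * caller that actually cleared the bit */
    static bool toy_test_and_clear_bit(unsigned long nr, atomic_ulong *addr)
    {
            unsigned long mask = 1UL << nr;

            return atomic_fetch_and(addr, ~mask) & mask;
    }

    /* models the hunk added to r5c_handle_stripe_written() above */
    static void toy_stripe_written(void)
    {
            if (toy_test_and_clear_bit(TOY_FULL_WRITE, &stripe_state) &&
                atomic_fetch_sub(&pending_full_writes, 1) == 1)
                    printf("counter hit zero -> md_wakeup_thread()\n");
    }

    int main(void)
    {
            toy_stripe_written();   /* decrements once, wakes the thread */
            toy_stripe_written();   /* bit already clear: no 2nd decrement */
            printf("pending_full_writes = %d\n",
                   atomic_load(&pending_full_writes));
            return 0;
    }

Running it prints the wakeup message once and leaves the counter at 0, even
though the completion handler runs twice.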