Re: [RFC PATCH v2 03/16] bcache: reload jouranl key information during journal replay

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 4/19/19 6:04 PM, Coly Li wrote:
When bcache journal initiates during running cache set, cache set
journal.blocks_free is initiated as 0. Then during journal replay if
journal_meta() is called and an empty jset is written to cache device,
journal_reclaim() is called. If there is available journal bucket to
reclaim, c->journal.blocks_free is set to numbers of blocks of a journal
bucket, which is c->sb.bucket_size >> c->block_bits.

Most of time the above process works correctly, expect the condtion
when journal space is almost full. "Almost full" means there is no free
journal bucket, but there are still free blocks in last available
bucket indexed by ja->cur_idx.

If system crashes or reboots when journal space is almost full, problem
comes. During cache set reload after the reboot, c->journal.blocks_free
is initialized as 0, when jouranl replay process writes bcache jouranl,
journal_reclaim() will be called to reclaim available journal bucket and
set c->journal.blocks_free to c->sb.bucket_size >> c->block_bits. But
there is no fully free bucket to reclaim in journal_reclaim(), so value
of c->journal.blocks_free will keep 0. If the first journal entry
processed by journal_replay() causes btree split and requires writing
journal space by journal_meta(), journal_meta() has to go into an
infinite loop to reclaim jouranl bucket, and blocks the whole cache set
to run.

Such buggy situation can be solved if we do following things before
journal replay starts,
- Recover previous value of c->journal.blocks_free in last run time,
   and set it to current c->journal.blocks_free as initial value.
- Recover previous value of ja->cur_idx in last run time, and set it to
   KEY_PTR of current c->journal.key as initial value.

After c->journal.blocks_free and c->journal.key are recovered, in
condition when jouranl space is almost full and cache set is reloaded,
meta journal entry from journal reply can be written into free blocks of
the last available journal bucket, then old jouranl entries can be
replayed and reclaimed for further journaling request.

This patch adds bch_journal_key_reload() to recover journal blocks_free
and key ptr value for above purpose. bch_journal_key_reload() is called
in bch_journal_read() before replying journal by bch_journal_replay().

Cc: stable@xxxxxxxxxxxxxxx
Signed-off-by: Coly Li <colyli@xxxxxxx>
---
  drivers/md/bcache/journal.c | 87 +++++++++++++++++++++++++++++++++++++++++++++
  1 file changed, 87 insertions(+)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 5180bed911ef..a6deb16c15c8 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -143,6 +143,89 @@ reread:		left = ca->sb.bucket_size - offset;
  	return ret;
  }
+static int bch_journal_key_reload(struct cache_set *c)
+{
+	struct cache *ca;
+	unsigned int iter, n = 0;
+	struct bkey *k = &c->journal.key;
+	int ret = 0;
+
+	for_each_cache(ca, c, iter) {
+		struct journal_device *ja = &ca->journal;
+		struct bio *bio = &ja->bio;
+		struct jset *j, *data = c->journal.w[0].data;
+		struct closure cl;
+		unsigned int len, left;
+		unsigned int offset = 0, used_blocks = 0;
+		sector_t bucket = bucket_to_sector(c, ca->sb.d[ja->cur_idx]);
+
+		closure_init_stack(&cl);
+
+		while (offset < ca->sb.bucket_size) {
+reread:			left = ca->sb.bucket_size - offset;

Please place the label on a line of its own.

+			len = min_t(unsigned int,
+				    left, PAGE_SECTORS << JSET_BITS);
+
+			bio_reset(bio);
+			bio->bi_iter.bi_sector = bucket + offset;
+			bio_set_dev(bio, ca->bdev);
+			bio->bi_iter.bi_size = len << 9;
+
+			bio->bi_end_io = journal_read_endio;
+			bio->bi_private = &cl;
+			bio_set_op_attrs(bio, REQ_OP_READ, 0);
+			bch_bio_map(bio, data);
+
+			closure_bio_submit(c, bio, &cl);
+			closure_sync(&cl);
+
+			j = data;
+			while (len) {
+				size_t blocks, bytes = set_bytes(j);
+
+				if (j->magic != jset_magic(&ca->sb))
+					goto out;
+
+				if (bytes > left << 9 ||
+				    bytes > PAGE_SIZE << JSET_BITS) {
+					pr_err("jset may be correpted: too big");
+					ret = -EIO;
+					goto err;
+				}
+
+				if (bytes > len << 9)
+					goto reread;
+
+				if (j->csum != csum_set(j)) {
+					pr_err("jset may be corrupted: bad csum");
+					ret = -EIO;
+					goto err;
+				}
+
+				blocks = set_blocks(j, block_bytes(c));
+				used_blocks += blocks;
+
+				offset	+= blocks * ca->sb.block_size;
+				len	-= blocks * ca->sb.block_size;
+				j = ((void *) j) + blocks * block_bytes(ca);
+			}
+		}
+out:
+		c->journal.blocks_free =
+			(c->sb.bucket_size >> c->block_bits) -
+			used_blocks;
+
+		k->ptr[n++] = MAKE_PTR(0, bucket, ca->sb.nr_this_dev);
+	}
+
+	BUG_ON(n == 0);
+	bkey_init(k);
+	SET_KEY_PTRS(k, n);
+
+err:
+	return ret;
+}
+
  int bch_journal_read(struct cache_set *c, struct list_head *list)
  {
  #define read_bucket(b)							\
This is a _quite_ convoluted nested loop.
It would be far better if it could be split into functions, so as to avoid the loop-within-loop constructs.
Oh, and some documentation (especially the 'reread' bit) would be nice.

Cheers,

Hannes
--
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@xxxxxxx			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)



[Index of Archives]     [Linux RAID]     [Linux SCSI]     [Linux ATA RAID]     [IDE]     [Linux Wireless]     [Linux Kernel]     [ATH6KL]     [Linux Bluetooth]     [Linux Netdev]     [Kernel Newbies]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Device Mapper]

  Powered by Linux