On Wed, 4 Jan 2017, Mike Snitzer wrote: > On Wed, Jan 04 2017 at 12:12am -0500, > NeilBrown <neilb@xxxxxxxx> wrote: > > > On Tue, Jan 03 2017, Jack Wang wrote: > > > > > 2016-12-23 12:45 GMT+01:00 Lars Ellenberg <lars.ellenberg@xxxxxxxxxx>: > > >> On Fri, Dec 23, 2016 at 09:49:53AM +0100, Michael Wang wrote: > > >>> Dear Maintainers > > >>> > > >>> I'd like to ask for the status of this patch since we hit the > > >>> issue too during our testing on md raid1. > > >>> > > >>> Split remainder bio_A was queued ahead, following by bio_B for > > >>> lower device, at this moment raid start freezing, the loop take > > >>> out bio_A firstly and deliver it, which will hung since raid is > > >>> freezing, while the freezing never end since it waiting for > > >>> bio_B to finish, and bio_B is still on the queue, waiting for > > >>> bio_A to finish... > > >>> > > >>> We're looking for a good solution and we found this patch > > >>> already progressed a lot, but we can't find it on linux-next, > > >>> so we'd like to ask are we still planning to have this fix > > >>> in upstream? > > >> > > >> I don't see why not, I'd even like to have it in older kernels, > > >> but did not have the time and energy to push it. > > >> > > >> Thanks for the bump. > > >> > > >> Lars > > >> > > > Hi folks, > > > > > > As Michael mentioned, we hit a bug this patch is trying to fix. > > > Neil suggested another way to fix it. I attached below. > > > I personal prefer Neil's version as it's less code change, and straight forward. > > > > > > Could you share your comments, we can get one fix into mainline. > > > > > > Thanks, > > > Jinpu > > > From 69a4829a55503e496ce9c730d2c8e3dd8a08874a Mon Sep 17 00:00:00 2001 > > > From: NeilBrown <neilb@xxxxxxxx> > > > Date: Wed, 14 Dec 2016 16:55:52 +0100 > > > Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier() > > > > > > When we call wait_barrier, we might have some bios waiting > > > in current->bio_list, which prevents the array_freeze call to > > > complete. Those can only be internal READs, which have already > > > passed the wait_barrier call (thus incrementing nr_pending), but > > > still were not submitted to the lower level, due to generic_make_request > > > logic to avoid recursive calls. In such case, we have a deadlock: > > > - array_frozen is already set to 1, so wait_barrier unconditionally waits, so > > > - internal READ bios will not be submitted, thus freeze_array will > > > never completes. > > > > > > To fix this, modify generic_make_request to always sort bio_list_on_stack > > > first with lowest level, then higher, until same level. > > > > > > Sent to linux-raid mail list: > > > https://marc.info/?l=linux-raid&m=148232453107685&w=2 > > > > > > > This should probably also have > > > > Inspired-by: Lars Ellenberg <lars.ellenberg@xxxxxxxxxx> > > > > or something that, as I was building on Lars' ideas when I wrote this. > > > > It would also be worth noting in the description that this addresses > > issues with dm and drbd as well as md. > > I never saw this patch but certainly like the relative simplicity of the > solution when compared with other approaches taken, e.g. (5 topmost > commits on this branch): > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip > > > In fact, I think that with this patch in place, much of the need for the > > rescue_workqueue won't exist any more. I cannot promise it can be > > removed completely, but it should be to hard to make it optional and > > only enabled for those few block devices that will still need it. > > The rescuer should only be needed for a bioset which can be allocated > > From twice in the one call the ->make_request_fn. This would include > > raid0 for example, though raid0_make_reqest could be re-written to not > > use a loop and to just call generic_make_request(bio) if bio != split. > > > > Suggested-by: NeilBrown <neilb@xxxxxxxx> > > > Signed-off-by: Jack Wang <jinpu.wang@xxxxxxxxxxxxxxxx> > > > --- > > > block/blk-core.c | 20 ++++++++++++++++++++ > > > 1 file changed, 20 insertions(+) > > > > > > diff --git a/block/blk-core.c b/block/blk-core.c > > > index 9e3ac56..47ef373 100644 > > > --- a/block/blk-core.c > > > +++ b/block/blk-core.c > > > @@ -2138,10 +2138,30 @@ blk_qc_t generic_make_request(struct bio *bio) > > > struct request_queue *q = bdev_get_queue(bio->bi_bdev); > > > > > > if (likely(blk_queue_enter(q, __GFP_DIRECT_RECLAIM) == 0)) { > > > + struct bio_list lower, same, hold; > > > + > > > + /* Create a fresh bio_list for all subordinate requests */ > > > + bio_list_init(&hold); > > > + bio_list_merge(&hold, &bio_list_on_stack); > > > + bio_list_init(&bio_list_on_stack); > > > > > > ret = q->make_request_fn(q, bio); > > > > > > blk_queue_exit(q); > > > + /* sort new bios into those for a lower level > > > + * and those for the same level > > > + */ > > > + bio_list_init(&lower); > > > + bio_list_init(&same); > > > + while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL) > > > + if (q == bdev_get_queue(bio->bi_bdev)) > > > + bio_list_add(&same, bio); > > > + else > > > + bio_list_add(&lower, bio); > > > + /* now assemble so we handle the lowest level first */ > > > + bio_list_merge(&bio_list_on_stack, &lower); > > > + bio_list_merge(&bio_list_on_stack, &same); > > > + bio_list_merge(&bio_list_on_stack, &hold); > > > > > > bio = bio_list_pop(current->bio_list); > > > } else { > > > -- > > > 2.7.4 > > Mikulas, would you be willing to try the below patch with the > dm-snapshot deadlock scenario and report back on whether it fixes that? > > Patch below looks to be the same as here: > https://marc.info/?l=linux-raid&m=148232453107685&q=p3 > > Neil and/or others if that isn't the patch that should be tested please > provide a pointer to the latest. > > Thanks, > Mike The bad news is that this doesn't fix the snapshot deadlock. I created a test program for the snapshot deadlock bug (it was originally created years ago to test for a different bug, so it contains some cruft). You also need to insert "if (ci->sector_count) msleep(100);" to the end of __split_and_process_non_flush to make the kernel sleep when splitting the bio. And with the above above patch, the snapshot deadlock bug still happens. Mikulas #define _XOPEN_SOURCE 500 #define _GNU_SOURCE #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <fcntl.h> #include <string.h> #include <errno.h> #include <malloc.h> #include <pthread.h> #include <asm/unistd.h> /* * Change "VG" symbol to a volume group name that you are using. * * You must apply this patch to the kernel to trigger the bug: * Index: linux-4.10-rc2/drivers/md/dm.c * =================================================================== * --- linux-4.10-rc2.orig/drivers/md/dm.c * +++ linux-4.10-rc2/drivers/md/dm.c * @@ -1223,6 +1223,9 @@ static int __split_and_process_non_flush * ci->sector += len; * ci->sector_count -= len; * * + if (ci->sector_count) * + msleep(100); * + * return 0; * } * */ #define VG "vg1" #define LV "test_lv" #define LV_SNAP "test_snap" #define MEGABYTES "12" #define SNAP_MEGABYTES "16" #define THREADS 1 #define BS 4096 #define SKEW 512 #define ORIG_PATTERN 'p' #define NEW_PATTERN 'n' enum { IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE, }; enum { IOPRIO_WHO_PROCESS = 1, IOPRIO_WHO_PGRP, IOPRIO_WHO_USER, }; #define IOPRIO_CLASS_SHIFT 13 static inline int ioprio_set(int which, int who, int ioprio) { return syscall(__NR_ioprio_set, which, who, ioprio); } static inline int ioprio_get(int which, int who) { return syscall(__NR_ioprio_get, which, who); } #define PRIO_READER ((IOPRIO_CLASS_IDLE << IOPRIO_CLASS_SHIFT) | 0xff) #define PRIO_WRITER (IOPRIO_CLASS_RT << IOPRIO_CLASS_SHIFT) static void do_cmd(char *cmd, int ign_err) { int r; fprintf(stderr, "* %s\n", cmd); r = system(cmd); if (r) { if (r == -1) { perror("system"); } else { if (ign_err) return; fprintf(stderr, "return code %x\n", r); } exit(1); } } static char pattern[BS]; static int h_orig, h_snap; static int n; static long long test_of; static pthread_rwlock_t rw_lock_1; static pthread_rwlock_t rw_lock_2; static pthread_rwlock_t rw_lock_3; static volatile int started = 0; static void pthread_error(int r) { fprintf(stderr, "pthread_error: %s\n", strerror(r)); exit(1); } static void *test_read(long long of) { int r; char *t = memalign(BS, BS); if (!t) perror("memalign"), exit(1); if ((r = pread(h_snap, t, BS, of)) != BS) { fprintf(stderr, "can't read (%d): %s\n", r, strerror(errno)); exit(1); } if (memcmp(pattern, t, BS)) { int i; for (i = 0; i < BS; i++) if (t[i] != pattern[i]) break; fprintf(stderr, "!!!! SNAPSHOT VOLUME DAMAGE AT BLOCK OFFSET %llX, BYTE OFFSET %X: %02x != %02x\n", of, i, (unsigned char)t[i], (unsigned char)pattern[i]); exit(2); } free(t); return NULL; } static void *test_thread(void *_) { int r; _ = _; //fprintf(stderr, "start\n"); if ((r = ioprio_set(IOPRIO_WHO_PROCESS, 0, PRIO_READER))) perror("ioprio_set"), exit(1); if ((r = pthread_rwlock_rdlock(&rw_lock_2))) pthread_error(r); started = 1; if ((r = ioprio_get(IOPRIO_WHO_PROCESS, 0)) != PRIO_READER) { if (r == -1) perror("ioprio_get"); else fprintf(stderr, "reader priority not set: %x\n", r); exit(1); } again: if ((r = pthread_rwlock_rdlock(&rw_lock_1))) pthread_error(r); if ((r = pthread_rwlock_unlock(&rw_lock_2))) pthread_error(r); if (test_of == -1) { if ((r = pthread_rwlock_unlock(&rw_lock_1))) pthread_error(r); //fprintf(stderr, "return\n"); return NULL; } //fprintf(stderr, "test(%lld)\n", test_of); test_read(test_of); if ((r = pthread_rwlock_rdlock(&rw_lock_3))) pthread_error(r); if ((r = pthread_rwlock_unlock(&rw_lock_1))) pthread_error(r); if ((r = pthread_rwlock_rdlock(&rw_lock_2))) pthread_error(r); if ((r = pthread_rwlock_unlock(&rw_lock_3))) pthread_error(r); goto again; } int main(void) { int i, j, r; char *np; pthread_t thr[THREADS]; memset(pattern, ORIG_PATTERN, sizeof pattern); do_cmd("lvremove -f "VG"/"LV_SNAP"", 1); do_cmd("lvremove -f "VG"/"LV"", 1); do_cmd("lvcreate -L "MEGABYTES" -n "LV" "VG"", 0); h_orig = open("/dev/mapper/"VG"-"LV"", O_RDWR); if (h_orig < 0) perror("open orig"), exit(1); if (lseek(h_orig, SKEW, SEEK_SET) == -1) perror("lseek"), exit(1); n = 0; while (write(h_orig, pattern, BS) == BS) { n++; fprintf(stderr, "creating %llx...\r", (long long)n * BS + SKEW); } if (fsync(h_orig)) perror("fsync"), exit(1); fprintf(stderr,"\n"); lseek(h_orig, 0, SEEK_SET); close(h_orig); do_cmd("lvcreate -L "SNAP_MEGABYTES" -n "LV_SNAP" -s "VG"/"LV"", 0); h_orig = open("/dev/mapper/"VG"-"LV"", O_RDWR | O_DIRECT); if (h_orig < 0) perror("open orig"), exit(1); h_snap = open("/dev/mapper/"VG"-"LV_SNAP"", O_RDONLY | O_DIRECT); if (h_snap < 0) perror("open snap"), exit(1); if ((r = pthread_rwlock_init(&rw_lock_1, NULL))) pthread_error(r); if ((r = pthread_rwlock_init(&rw_lock_2, NULL))) pthread_error(r); if ((r = pthread_rwlock_init(&rw_lock_3, NULL))) pthread_error(r); if ((r = pthread_rwlock_wrlock(&rw_lock_1))) pthread_error(r); if ((r = pthread_rwlock_wrlock(&rw_lock_3))) pthread_error(r); if ((r = ioprio_set(IOPRIO_WHO_PROCESS, 0, PRIO_WRITER))) perror("ioprio_set"), exit(1); for (j = 0; j < THREADS; j++) { if ((r = pthread_create(&thr[j], NULL, test_thread, NULL))) pthread_error(r); } while (!started) usleep(1000); if ((r = ioprio_get(IOPRIO_WHO_PROCESS, 0)) != PRIO_WRITER) { if (r == -1) perror("ioprio_get"); else fprintf(stderr, "writer priority not set: %x\n", r); exit(1); } np = memalign(BS, BS); if (!np) perror("memalign"), exit(1); memset(np, NEW_PATTERN, BS); for (i = 0; i < n; i++) { test_of = (off_t)i * BS + SKEW; fprintf(stderr, "testing %llx...\r", test_of); if ((r = pthread_rwlock_unlock(&rw_lock_1))) pthread_error(r); sched_yield(); if (pwrite(h_orig, np, BS, test_of) != BS) { fprintf(stderr, "can't write (%d): %s\n", r, strerror(errno)); exit(1); } if ((r = pthread_rwlock_wrlock(&rw_lock_2))) pthread_error(r); if ((r = pthread_rwlock_unlock(&rw_lock_3))) pthread_error(r); if ((r = pthread_rwlock_wrlock(&rw_lock_1))) pthread_error(r); if ((r = pthread_rwlock_unlock(&rw_lock_2))) pthread_error(r); if ((r = pthread_rwlock_wrlock(&rw_lock_3))) pthread_error(r); } fprintf(stderr,"\n"); test_of = -1; if ((r = pthread_rwlock_unlock(&rw_lock_1))) pthread_error(r); for (j = 0; j < THREADS; j++) { if ((r = pthread_join(thr[j], NULL))) pthread_error(r); } fprintf(stderr, "TEST PASSED OK.\n"); return 0; } -- To unsubscribe from this list: send the line "unsubscribe linux-bcache" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html