On Fri, Feb 17 2017, Shaohua Li wrote:

> On Thu, Feb 16, 2017 at 06:04:00PM +1100, NeilBrown wrote:
>> On Thu, Feb 16 2017, colyli@xxxxxxx wrote:
>>
>> > 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
>> > introduces a sliding resync window for raid1 I/O barrier, this idea limits
>> > I/O barriers to happen only inside a slidingresync window, for regular
>> > I/Os out of this resync window they don't need to wait for barrier any
>> > more. On large raid1 device, it helps a lot to improve parallel writing
>> > I/O throughput when there are background resync I/Os performing at
>> > same time.
>> >
>> > The idea of sliding resync widow is awesome, but code complexity is a
>> > challenge. Sliding resync window requires several veriables to work
>>
>> variables
>>
>> > collectively, this is complexed and very hard to make it work correctly.
>> > Just grep "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches
>> > to fix the original resync window patch. This is not the end, any further
>> > related modification may easily introduce more regreassion.
>> >
>> > Therefore I decide to implement a much simpler raid1 I/O barrier, by
>> > removing resync window code, I believe life will be much easier.
>> >
>> > The brief idea of the simpler barrier is,
>> > - Do not maintain a logbal unique resync window
>>
>> global
>>
>> > - Use multiple hash buckets to reduce I/O barrier conflictions, regular
>>
>> conflicts
>>
>> > I/O only has to wait for a resync I/O when both them have same barrier
>> > bucket index, vice versa.
>> > - I/O barrier can be recuded to an acceptable number if there are enought
>>
>> reduced
>> enough
>>
>> > barrier buckets
>> >
>> > Here I explain how the barrier buckets are designed,
>> > - BARRIER_UNIT_SECTOR_SIZE
>> > The whole LBA address space of a raid1 device is divided into multiple
>> > barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
>> > Bio request won't go across border of barrier unit size, that means
>>
>> requests
>>
>> > maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes.
>> > For random I/O 64MB is large enough for both read and write requests,
>> > for sequential I/O considering underlying block layer may merge them
>> > into larger requests, 64MB is still good enough.
>> > Neil also points out that for resync operation, "we want the resync to
>> > move from region to region fairly quickly so that the slowness caused
>> > by having to synchronize with the resync is averaged out over a fairly
>> > small time frame". For full speed resync, 64MB should take less then 1
>> > second. When resync is competing with other I/O, it could take up a few
>> > minutes. Therefore 64MB size is fairly good range for resync.
>> >
>> > - BARRIER_BUCKETS_NR
>> > There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
>> >         #define BARRIER_BUCKETS_NR_BITS   (PAGE_SHIFT - 2)
>> >         #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)
>> > this patch makes the bellowed members of struct r1conf from integer
>> > to array of integers,
>> >         -       int     nr_pending;
>> >         -       int     nr_waiting;
>> >         -       int     nr_queued;
>> >         -       int     barrier;
>> >         +       int     *nr_pending;
>> >         +       int     *nr_waiting;
>> >         +       int     *nr_queued;
>> >         +       int     *barrier;
>> > number of the array elements is defined as BARRIER_BUCKETS_NR. For 4KB
>> > kernel space page size, (PAGE_SHIFT - 2) indecates there are 1024 I/O
>> > barrier buckets, and each array of integers occupies single memory page.
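For illustration, the per-bucket counters described above could be defined and
allocated roughly as in the sketch below. The constants follow the description
(64MB barrier units, PAGE_SHIFT - 2 bucket bits); the allocation site
(e.g. setup_conf()) and the error handling are assumptions for the sketch,
not the patch's actual code:

    #define BARRIER_UNIT_SECTOR_BITS   17
    #define BARRIER_UNIT_SECTOR_SIZE   (1<<17)   /* 128K sectors == 64MB */
    #define BARRIER_BUCKETS_NR_BITS    (PAGE_SHIFT - 2)
    #define BARRIER_BUCKETS_NR         (1<<BARRIER_BUCKETS_NR_BITS)

    /* Hypothetical allocation in setup_conf(): each array holds
     * BARRIER_BUCKETS_NR ints, i.e. 1024 * sizeof(int) == 4KB, exactly
     * one page when PAGE_SIZE is 4KB. */
    conf->nr_pending = kcalloc(BARRIER_BUCKETS_NR, sizeof(int), GFP_KERNEL);
    conf->nr_waiting = kcalloc(BARRIER_BUCKETS_NR, sizeof(int), GFP_KERNEL);
    conf->nr_queued  = kcalloc(BARRIER_BUCKETS_NR, sizeof(int), GFP_KERNEL);
    conf->barrier    = kcalloc(BARRIER_BUCKETS_NR, sizeof(int), GFP_KERNEL);
    if (!conf->nr_pending || !conf->nr_waiting ||
        !conf->nr_queued || !conf->barrier)
            goto abort;          /* illustrative error path only */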
>> > 1024 means for a request which is smaller than the I/O barrier unit size
>> > has ~0.1% chance to wait for resync to pause, which is quite a small
>> > enough fraction. Also requesting single memory page is more friendly to
>> > kernel page allocator than larger memory size.
>> >
>> > - I/O barrier bucket is indexed by bio start sector
>> > If multiple I/O requests hit different I/O barrier units, they only need
>> > to compete I/O barrier with other I/Os which hit the same I/O barrier
>> > bucket index with each other. The index of a barrier bucket which a
>> > bio should look for is calculated by sector_to_idx() which is defined
>> > in raid1.h as an inline function,
>> >         static inline int sector_to_idx(sector_t sector)
>> >         {
>> >                 return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
>> >                                 BARRIER_BUCKETS_NR_BITS);
>> >         }
>> > Here sector_nr is the start sector number of a bio.
>>
>> "hash_long() is used so that sequential writes in a region of the
>> array which is not being resynced will not consistently align with
>> the buckets that are being sequentially resynced, as described below"
>>
>> >
>> > - Single bio won't go across boundary of a I/O barrier unit
>> > If a request goes across boundary of barrier unit, it will be split. A
>> > bio may be split in raid1_make_request() or raid1_sync_request(), if
>> > sectors returned by align_to_barrier_unit_end() is small than original
>>
>> smaller
>>
>> > bio size.
>> >
>> > Comparing to single sliding resync window,
>> > - Currently resync I/O grows linearly, therefore regular and resync I/O
>> > will have confliction within a single barrier units. So the I/O
>>
>> ... will conflict within ...
>>
>> > behavior is similar to single sliding resync window.
>> > - But a barrier unit bucket is shared by all barrier units with identical
>> > barrier uinit index, the probability of confliction might be higher
>> > than single sliding resync window, in condition that writing I/Os
>> > always hit barrier units which have identical barrier bucket indexs with
>> > the resync I/Os. This is a very rare condition in real I/O work loads,
>> > I cannot imagine how it could happen in practice.
>> > - Therefore we can achieve a good enough low confliction rate with much
>>
>> ... low conflict rate ...
>>
>> > simpler barrier algorithm and implementation.
>> >
>> > There are two changes should be noticed,
>> > - In raid1d(), I change the code to decrease conf->nr_pending[idx] into
>> > single loop, it looks like this,
>> >         spin_lock_irqsave(&conf->device_lock, flags);
>> >         conf->nr_queued[idx]--;
>> >         spin_unlock_irqrestore(&conf->device_lock, flags);
>> > This change generates more spin lock operations, but in next patch of
>> > this patch set, it will be replaced by a single line code,
>> >         atomic_dec(&conf->nr_queueud[idx]);
>> > So we don't need to worry about spin lock cost here.
>> > - Mainline raid1 code split original raid1_make_request() into
>> > raid1_read_request() and raid1_write_request(). If the original bio
>> > goes across an I/O barrier unit size, this bio will be split before
>> > calling raid1_read_request() or raid1_write_request(), this change
>> > the code logic more simple and clear.
>> > - In this patch wait_barrier() is moved from raid1_make_request() to
>> > raid1_write_request(). In raid_read_request(), original wait_barrier()
>> > is replaced by raid1_read_request().
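As a rough sketch of the splitting rule described above, the boundary clamping
could look like the helper below. The function name align_to_barrier_unit_end()
comes from the patch description; the body here is only an assumed illustration
of the stated behaviour (never let a bio cross into the next 64MB barrier unit):

    /* Return how many of 'sectors' can be issued starting at 'start_sector'
     * without crossing the end of the barrier unit that contains it. */
    static sector_t align_to_barrier_unit_end(sector_t start_sector,
                                              sector_t sectors)
    {
            sector_t len;

            /* sectors remaining up to the next barrier unit boundary */
            len = (((start_sector >> BARRIER_UNIT_SECTOR_BITS) + 1)
                    << BARRIER_UNIT_SECTOR_BITS) - start_sector;
            return min_t(sector_t, sectors, len);
    }

A caller then splits the bio whenever the returned value is smaller than
bio_sectors(bio), and sector_to_idx() of the (possibly split) bio's start
sector selects the barrier bucket that the request must serialize against.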
>> > The differnece is wait_read_barrier() only waits if array is frozen,
>> > using different barrier function in different code path makes the code
>> > more clean and easy to read.
>>
>> Thank you for putting the effort into writing a comprehensive change
>> description. I really appreciate it.
>>
>> >
>> > @@ -1447,36 +1501,26 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>> >
>> >  static void raid1_make_request(struct mddev *mddev, struct bio *bio)
>> >  {
>> > -        struct r1conf *conf = mddev->private;
>> > -        struct r1bio *r1_bio;
>> > +        void (*make_request_fn)(struct mddev *mddev, struct bio *bio);
>> > +        struct bio *split;
>> > +        sector_t sectors;
>> >
>> > -        /*
>> > -         * make_request() can abort the operation when read-ahead is being
>> > -         * used and no empty request is available.
>> > -         *
>> > -         */
>> > -        r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
>> > +        make_request_fn = (bio_data_dir(bio) == READ) ?
>> > +                        raid1_read_request : raid1_write_request;
>> >
>> > -        r1_bio->master_bio = bio;
>> > -        r1_bio->sectors = bio_sectors(bio);
>> > -        r1_bio->state = 0;
>> > -        r1_bio->mddev = mddev;
>> > -        r1_bio->sector = bio->bi_iter.bi_sector;
>> > -
>> > -        /*
>> > -         * We might need to issue multiple reads to different devices if there
>> > -         * are bad blocks around, so we keep track of the number of reads in
>> > -         * bio->bi_phys_segments. If this is 0, there is only one r1_bio and
>> > -         * no locking will be needed when requests complete. If it is
>> > -         * non-zero, then it is the number of not-completed requests.
>> > -         */
>> > -        bio->bi_phys_segments = 0;
>> > -        bio_clear_flag(bio, BIO_SEG_VALID);
>> > +        /* if bio exceeds barrier unit boundary, split it */
>> > +        do {
>> > +                sectors = align_to_barrier_unit_end(
>> > +                                bio->bi_iter.bi_sector, bio_sectors(bio));
>> > +                if (sectors < bio_sectors(bio)) {
>> > +                        split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
>> > +                        bio_chain(split, bio);
>> > +                } else {
>> > +                        split = bio;
>> > +                }
>> >
>> > -        if (bio_data_dir(bio) == READ)
>> > -                raid1_read_request(mddev, bio, r1_bio);
>> > -        else
>> > -                raid1_write_request(mddev, bio, r1_bio);
>> > +                make_request_fn(mddev, split);
>> > +        } while (split != bio);
>> >  }
>>
>> I know you are going to change this as Shaohua wants the splitting
>> to happen in a separate function, which I agree with, but there is
>> something else wrong here.
>> Calling bio_split/bio_chain repeatedly in a loop is dangerous.
>> It is OK for simple devices, but when one request can wait for another
>> request to the same device it can deadlock.
>> This can happen with raid1. If a resync request calls raise_barrier()
>> between one request and the next, then the next has to wait for the
>> resync request, which has to wait for the first request.
>> As the first request will be stuck in the queue in
>> generic_make_request(), you get a deadlock.
>> It is much safer to:
>>
>>     if (need to split) {
>>         split = bio_split(bio, ...)
>>         bio_chain(...)
>>         make_request_fn(mddev, split);
>>         generic_make_request(bio);
>>     } else
>>         make_request_fn(mddev, bio);
>>
>> This way we first process the initial section of the bio (in 'split')
>> which will queue some requests to the underlying devices. These
>> requests will be queued in generic_make_request.
>> Then we queue the remainder of the bio, which will be added to the end
>> of the generic_make_request queue.
>> Then we return.
>> generic_make_request() will pop the lower-level device requests off the
>> queue and handle them first.
>> Then it will process the remainder
>> of the original bio once the first section has been fully processed.
>
> Good point! raid10 has the same problem. Looks this doesn't solve the issue for
> device with 3 times stack though.
>
> I knew you guys are working on this issue in block layer. Should we fix the
> issue in MD side (for 2 stack devices) or wait for the block layer patch?

We cannot fix everything at the block layer, or at the individual device
layer. We need changes in both.

I think that looping over bios in a device driver is wrong and can easily
lead to deadlocks. We should remove that from md.

If the block layer gets fixed the way I want it to, then we could move the
generic_make_request() call earlier, so that the above could be

>>     if (need to split) {
>>         split = bio_split(bio, ...)
>>         bio_chain(...)
>>         generic_make_request(bio);
>>         bio = split
>>     }
>>     make_request_fn(mddev, bio);

which is slightly simpler. But the original would still work.

So yes, I think we need this change in md/raid1.

I suspect that if you built a kernel with a smaller BARRIER_UNIT_SECTOR_BITS
- e.g. 4 - you could very easily trigger a deadlock with md/raid1 on scsi.
At 17, it is not quite so easy, but it is a real possibility. I've had
similar deadlocks reported before when the code wasn't quite careful enough.

NeilBrown
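For clarity, the loop-free shape recommended above - the variant that works
without any block layer change - might look roughly like the sketch below,
reusing the names from the quoted diff (raid1_read_request(),
raid1_write_request(), align_to_barrier_unit_end()); it is an illustration of
the suggestion, not the final patch:

    static void raid1_make_request(struct mddev *mddev, struct bio *bio)
    {
            void (*make_request_fn)(struct mddev *mddev, struct bio *bio);
            struct bio *split;
            sector_t sectors;

            make_request_fn = (bio_data_dir(bio) == READ) ?
                            raid1_read_request : raid1_write_request;

            sectors = align_to_barrier_unit_end(bio->bi_iter.bi_sector,
                                                bio_sectors(bio));
            if (sectors < bio_sectors(bio)) {
                    /* Handle only the first barrier unit here and hand the
                     * remainder back to generic_make_request(), so it is
                     * queued rather than processed in a loop. This avoids
                     * the deadlock where a resync raise_barrier() sneaks in
                     * between two chunks of the same bio. */
                    split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
                    bio_chain(split, bio);
                    make_request_fn(mddev, split);
                    generic_make_request(bio);
            } else
                    make_request_fn(mddev, bio);
    }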