On Thu, 2020-11-12 at 12:19 +0100, Jan Kara wrote: > [added some relevant people and lists to CC] > > On Wed 11-11-20 17:44:05, Maxim Levitsky wrote: > > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote: > > > clone of "starship_production" > > > > The git-publish destroyed the cover letter: > > > > For the reference this is for bz #1872633 > > > > The issue is that current kernel code that implements 'fallocate' > > on kernel block devices roughly works like that: > > > > 1. Flush the page cache on the range that is about to be discarded. > > 2. Issue the discard and wait for it to finish. > > (as far as I can see the discard doesn't go through the > > page cache). > > > > 3. Check if the page cache is dirty for this range, > > if it is dirty (meaning that someone wrote to it meanwhile) > > return -EBUSY. > > > > This means that if qemu (or qemu-img) issues a write, and then > > discard to the area that shares a page, -EBUSY can be returned by > > the kernel. > > Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back > EBUSY which seems wrong to me. IMO we should handle this gracefully in the > kernel so we need to fix this. > > > On the other hand, for example, the ext4 implementation of discard > > doesn't seem to be affected. It does take a lock on the inode to avoid > > concurrent IO and flushes O_DIRECT writers prior to doing discard thought. > > Well, filesystem hole punching is somewhat different beast than block device > discard (at least implementation wise). > > > Doing fsync and retrying is seems to resolve this issue, but it might be > > a too big hammer. Just retrying doesn't work, indicating that maybe the > > code that flushes the page cache in (1) doesn't do this correctly ? > > > > It also can be racy unless special means are done to block IO from happening > > from qemu during this fsync. > > > > This patch series contains two patches: > > > > First patch just lets the file-posix ignore the -EBUSY errors, which is > > technically enough to fail back to plain write in this case, but seems wrong. > > > > And the second patch adds an optimization to qemu-img to avoid such a > > fragmented write/discard in the first place. > > > > Both patches make the reproducer work for this particular bugzilla, > > but I don't think they are enough. > > > > What do you think? > > So if the EBUSY error happens because something happened to the page cache > outside of discarded range (like you describe above), that is a kernel bug > than needs to get fixed. EBUSY should really mean - someone wrote to the > discarded range while discard was running and userspace app has to deal > with that depending on what it aims to do... I double checked this, those are the writes/discards according to my debug prints (I print start and then start+len-1 for each request) I have attached the patch for this for reference. ZERO: 0x00007fe00000 00007fffefff (len:0x1ff000) fallocate 00007fe00000 00007fffefff WRITE: 0x00007ffff000 00007ffffdff (len:0xe00) write 00007ffff000 00007ffffdff ZERO: 0x00007ffffe00 0000801fefff (len:0x1ff200) fallocate 00007ffffe00 0000801fefff FALLOCATE failed with error 16 qemu-img: error while writing at byte 2147483136: Device or resource busy Best regards, Maxim Levitsky > > Honza
commit ce897250babe3527f451c1c54c86b62659a2c29e Author: Maxim Levitsky <mlevitsk@xxxxxxxxxx> Date: Thu Oct 15 17:02:58 2020 +0300 hacks diff --git a/block/file-posix.c b/block/file-posix.c index c63926d592..91ca690505 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -1440,6 +1440,12 @@ static ssize_t handle_aiocb_rw_linear(RawPosixAIOData *aiocb, char *buf) while (offset < aiocb->aio_nbytes) { if (aiocb->aio_type & QEMU_AIO_WRITE) { + + long unsigned int size = aiocb->aio_nbytes - offset; + long unsigned int off = aiocb->aio_offset + offset; + + printf(" write %012lx %012lx\n", off, off+size-1); + len = pwrite(aiocb->aio_fildes, (const char *)buf + offset, aiocb->aio_nbytes - offset, @@ -1581,10 +1587,15 @@ static int translate_err(int err) static int do_fallocate(int fd, int mode, off_t offset, off_t len) { do { + + printf(" fallocate %012lx %012lx\n", offset, offset+len-1); + if (fallocate(fd, mode, offset, len) == 0) { return 0; } } while (errno == EINTR); + + printf("FALLOCATE failed with error %d\n", errno); return translate_err(-errno); } #endif diff --git a/qemu-img.c b/qemu-img.c index c2c56fc797..64d3b84728 100644 --- a/qemu-img.c +++ b/qemu-img.c @@ -1803,6 +1803,7 @@ static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num) } s->sector_next_status = sector_num + n; + printf("NEXT_STATUS: %lx\n", s->sector_next_status << 9); } n = MIN(n, s->sector_next_status - sector_num); @@ -1878,6 +1879,10 @@ static int coroutine_fn convert_co_read(ImgConvertState *s, int64_t sector_num, return 0; } +static void test_print(const char* op, unsigned long start, unsigned long len) +{ + printf("%s: 0x%012lx %012lx (len:0x%lx)\n", op, start, start + len-1, len); +} static int coroutine_fn convert_co_write(ImgConvertState *s, int64_t sector_num, int nb_sectors, uint8_t *buf, @@ -1911,6 +1916,8 @@ static int coroutine_fn convert_co_write(ImgConvertState *s, int64_t sector_num, (s->compressed && !buffer_is_zero(buf, n * BDRV_SECTOR_SIZE))) { + test_print("WRITE", sector_num << BDRV_SECTOR_BITS, n << BDRV_SECTOR_BITS); + ret = blk_co_pwrite(s->target, sector_num << BDRV_SECTOR_BITS, n << BDRV_SECTOR_BITS, buf, flags); if (ret < 0) { @@ -1925,6 +1932,8 @@ static int coroutine_fn convert_co_write(ImgConvertState *s, int64_t sector_num, assert(!s->target_has_backing); break; } + test_print("ZERO", sector_num << BDRV_SECTOR_BITS, n << BDRV_SECTOR_BITS); + ret = blk_co_pwrite_zeroes(s->target, sector_num << BDRV_SECTOR_BITS, n << BDRV_SECTOR_BITS,