Hi Dave,
Okay, I think I've found a workaround, but I don't have the root cause yet. To recap: there is a macro defined in include/linux/blkdev.h called rq_for_each_segment(bvec, rq, iter). The arguments are of type struct bio_vec *, struct request *, and struct req_iterator, respectively. The macro expands to several other macro invocations, which eventually expand to roughly 40-50 lines of code.
(NOTE: in newer kernels the first argument is a bio_vec, and not a bio_vec*. I think this is due to the immutable bio_vec changes, but I could be wrong. For more on that, see http://lwn.net/Articles/544713/ , but I digress.)
In some instances when using O_DIRECT and iterating through the request with the above macro, some fields of iter.bio are not updated to the correct values, so macros like blk_rq_cur_sectors(), which ultimately rely on rq->bio, do not return the correct values either. If I assume (<--- dangerous word) that the sectors are all adjacent, as they really are supposed to be in everything I've investigated, and use only the bio_vec fields to track progress, then the problems go away. I have been running all sorts of data consistency tests since this discovery and haven't hit any further problems so far. Also, xfs_repair now returns cleanly, as expected.
So basically I only use blk_rq_pos(rq) once, before the loop, to get the start sector, and then rely solely on fields of bvec after that. For example, if I need to count the number of sectors, I can accumulate (bvec->bv_len >> 9) across all the iterations. Summing up blk_rq_cur_sectors(rq) only worked sometimes, and so was unreliable. The code to walk the request now looks something like this, where it used to make use of iter.bio->bi_sector and blk_rq_cur_sectors().
sector_t curr_sector = blk_rq_pos(req);
unsigned curr_num_sectors = 0;
unsigned total_sectors = 0;
int ret = 0;	/* set by the (elided) transfer below */

/* Must not map the payload of a discard request; it is invalid */
BUG_ON(unlikely(req->cmd_flags & REQ_DISCARD));

/* Iterate through the request */
rq_for_each_segment(bvec, req, iter) {
	kern_buffer = kmap(bvec->bv_page) + bvec->bv_offset;
	curr_sector += curr_num_sectors;
	curr_num_sectors = bvec->bv_len >> KERNEL_SECTOR_SHIFT;
	/*
	 * Do the transfer to the device here. This part is not really
	 * relevant to the discussion; you can imagine it as a memcpy,
	 * because that's effectively what happens. A failed copy sets
	 * ret to -EFAULT and the request is eventually ended with
	 * blk_end_request_all() and an error code.
	 */
	kunmap(bvec->bv_page);
	if (ret)	/* transfer failed */
		return ret;
	total_sectors += curr_num_sectors;	/* just for accounting */
}
I am going to run the compiler with -E and examine the rq_for_each_segment() macro expansion to see where the suspected problem happens. If it does look like a bug, I will write to LKML. Perhaps it's something I am supposed to implicitly understand that is not well documented (like the rest of O_DIRECT), and they can set me straight.
As it stands this doesn't seem to be an xfs issue, so I'm going to drop this discussion unless anyone else is interested or wants more info.
Regards,
Jamie
> > > While iterating through segments in rq_for_each_segment(), for some
> > > requests I am seeing some odd behavior.
> > >
> > > segment 0: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 903 // Ok,
> > > this looks normal
> > > segment 1: iter.bio->bi_sector = 1023, blk_rq_cur_sectors(rq) = 7 //
> > > Whoah... this doesn't look right to me
> >
> > Seems fine to me. There's absolutely no reason two separate IOs
> > can't be sub-page sector aligned or discontiguous given the above
> > configuration. If that's what the getblocks callback returned to the
> > DIO layer, then that's what you're going to see in the bios...
> >
> I guess my assumption that sectors would be adjacent was invalid, but this
> is what I have read in ldd3 and other books and seen in some driver
> examples. So I apologize for my folly; it's another lesson in not
> believing everything you see on the net. The back half of my driver is
> really optimized for transferring a single contiguous lba & transfer length
> pair (a la SCSI WRITE 16, for example). However, in desperation, I had
> changed the driver to support requests like this (with disjoint sector
> ranges), but it still seemed odd to me that occasionally some requests
> would even have multiple segments that would *write* the same
> sectors/lengths. For example, I also see requests like the following.
>
> Given: rq_data_dir(rq) == 1 (i.e. a write)
> segment 0: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 7 // Ok
> segment 1: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 7 // Again?
> segment 2: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 7 // Yet again?
> segment 3: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 7 // Really?
> ...
> segment 126: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 7 // What's going on?
> // Reminder: segment limit is 128
> segment 127: iter.bio->bi_sector = 1023, blk_rq_cur_sectors(rq) = 1 // Haha, bet you thought I was going to say sector = 0!

That really looks to me like a bug in whatever is building that iter
structure. It looks like there's an off-by-one on each page (should
be 8 sectors, not 7) and the bi_sector is not being set correctly to
the sector offset of the bio within the IO. i.e. for a contiguous
IO it should look something like:

segment 0: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 8
segment 1: iter.bio->bi_sector = 8, blk_rq_cur_sectors(rq) = 8
segment 2: iter.bio->bi_sector = 16, blk_rq_cur_sectors(rq) = 8
....
segment 127: iter.bio->bi_sector = 1016, blk_rq_cur_sectors(rq) = 8

I highly doubt the direct IO code is doing something wrong, because
a mapping bug like that would show up everywhere and not just in your
special driver....
> > The read syscall is for a byte offset (from the fd, set by lseek)
> > and a length, not a range of contiguous sectors on the device. That
> > off/len tuple gets mapped by the underlying filesystem or device
> > into a sector/len via a getblocks callback in the dio code, and the
> > bios are then built according to the mappings that are returned. So
> > in many cases the IO that hits the block device looks nothing at all
> > like the IO that came from userspace.
> >
> In general, if I were reading a file, I would agree 100%, but this was
> the read call issued by xfs_repair to read the superblock directly from
> a block device. So I, perhaps wrongly, anticipated straightforward
> requests, or maybe even a single 512k request coming down to the
> request_fn. I should have mentioned earlier in the email that this was
> observed while using xfs_repair, though, so no offense intended. There's
> more of the xfs_repair strace in my response to your next comment below.
> It fails very early, as you can imagine, when xfs_repair can't read the
> superblock.

I figured that this is what you were talking about, but remember
that the block device also does its own offset->sector mapping in
the direct IO layer. So it's entirely possible that the mapping
function (i.e. the getblocks callback) is broken and that is why you
are seeing weird mappings in the bios being built by the direct IO
code.
_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs