Re: [PATCH] block: bio-integrity: fix potential null-ptr-deref in bio_integrity_free

On Mon, Jun 17, 2024 at 11:49:59AM +0800, yebin (H) wrote:
> 
> 
> On 2024/6/11 11:29, Ming Lei wrote:
> > On Tue, Jun 11, 2024 at 10:48:13AM +0800, yebin (H) wrote:
> > > 
> > > On 2024/6/7 9:35, Ming Lei wrote:
> > > > On Fri, Jun 07, 2024 at 09:32:29AM +0800, yebin wrote:
> > > > > On 2024/6/7 8:13, Ming Lei wrote:
> > > > > > On Thu, Jun 06, 2024 at 02:26:55PM +0800, Ye Bin wrote:
> > > > > > > From: Ye Bin <yebin10@xxxxxxxxxx>
> > > > > > > 
> > > > > > > The following issue occurs when formatting an NVMe device while IO is in flight:
> > > > > > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> > > > > > > PGD 101727f067 P4D 1011fae067 PUD fbed78067 PMD 0
> > > > > > > Oops: 0000 [#1] SMP NOPTI
> > > > > > > RIP: 0010:kfree+0x4f/0x160
> > > > > > > RSP: 0018:ff705a800912b910 EFLAGS: 00010247
> > > > > > > RAX: 0000000000000000 RBX: 0d06d30000000000 RCX: ff4fb320260ad990
> > > > > > > RDX: ff4fb30ee7acba40 RSI: 0000000000000000 RDI: 00b04cff80000000
> > > > > > > RBP: ff4fb30ee7acba40 R08: 0000000000000200 R09: ff705a800912bb60
> > > > > > > R10: 0000000000000000 R11: ff4fb3103b67c750 R12: ffffffff9a62d566
> > > > > > > R13: ff4fb30aa0530000 R14: 0000000000000000 R15: 000000000000000a
> > > > > > > FS:  00007f4399b6b700(0000) GS:ff4fb31040140000(0000) knlGS:0000000000000000
> > > > > > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > > > CR2: 0000000000000008 CR3: 0000001014cd4002 CR4: 0000000000761ee0
> > > > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > > > DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> > > > > > > PKRU: 55555554
> > > > > > > Call Trace:
> > > > > > >     bio_integrity_free+0xa6/0xb0
> > > > > > >     __bio_integrity_endio+0x8c/0xa0
> > > > > > >     bio_endio+0x2b/0x130
> > > > > > >     blk_update_request+0x78/0x2b0
> > > > > > >     blk_mq_end_request+0x1a/0x140
> > > > > > >     blk_mq_try_issue_directly+0x5d/0xc0
> > > > > > >     blk_mq_make_request+0x46b/0x540
> > > > > > >     generic_make_request+0x121/0x300
> > > > > > >     submit_bio+0x6c/0x140
> > > > > > >     __blkdev_direct_IO_simple+0x1ca/0x3a0
> > > > > > >     blkdev_direct_IO+0x3d9/0x460
> > > > > > >     generic_file_read_iter+0xb4/0xc60
> > > > > > >     new_sync_read+0x121/0x170
> > > > > > >     vfs_read+0x89/0x130
> > > > > > >     ksys_read+0x52/0xc0
> > > > > > >     do_syscall_64+0x5d/0x1d0
> > > > > > >     entry_SYSCALL_64_after_hwframe+0x65/0xca
> > > > > > > 
> > > > > > > Assume a 512-byte direct IO is issued while the block device's logical
> > > > > > > block size is initially 512 bytes and is later changed to 4096 bytes.
> > > > > > > The above issue may happen as follows:
> > > > > > >             Direct read                    format NVME
> > > > > > > __blkdev_direct_IO_simple(iocb, iter, nr_pages);
> > > > > > >      if ((pos | iov_iter_alignment(iter)) & (bdev_logical_block_size(bdev) - 1))
> > > > > > > 	-->The logical block size is still 512 bytes and the IO issued is
> > > > > > > 	   512 bytes, so the alignment check passes
> > > > > > >        return -EINVAL;
> > > > > > >      submit_bio(&bio);
> > > > > > >                                          nvme_dev_ioctl
> > > > > > >                                            case NVME_IOCTL_RESCAN:
> > > > > > >                                              nvme_queue_scan(ctrl);
> > > > > > >                                                 ...
> > > > > > >                                                nvme_update_disk_info(disk, ns, id);
> > > > > > >                                                  blk_queue_logical_block_size(disk->queue, bs);
> > > > > > >                                                    --> 512->4096
> > > > > > >         blk_queue_enter(q, flags)
> > > > > > >         blk_mq_make_request(q, bio)
> > > > > > >           bio_integrity_prep(bio)
> > > > > > > 	 len = bio_integrity_bytes(bi, bio_sectors(bio));
> > > > > > > 	   -->At this point, because the logical block size has increased to
> > > > > > > 	      4096 bytes, the calculated 'len' here is 0
> > > > > > >             buf = kmalloc(len, GFP_NOIO | q->bounce_gfp);
> > > > > > > 	   -->len == 0 is passed in, and kmalloc() returns ZERO_SIZE_PTR (16)
> > > > > > >             end = (((unsigned long) buf) + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > > > > > >             start = ((unsigned long) buf) >> PAGE_SHIFT;
> > > > > > >             nr_pages = end - start;  -->nr_pages == 1
> > > > > > >             bip->bip_flags |= BIP_BLOCK_INTEGRITY;
> > > > > > >             for (i = 0 ; i < nr_pages ; i++) {
> > > > > > >               if (len <= 0)
> > > > > > >                  -->The bip_vec of the bio integrity payload is never
> > > > > > > 		 initialized, which leads to a NULL pointer access when it is
> > > > > > > 		 later freed. Even if it were initialized, the free path would
> > > > > > > 		 still access an invalid pointer because the buffer address is
> > > > > > > 		 wrong.
> > > > > > >                 break;
> > > > > > > 
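> > > > > > > The page arithmetic above can be reproduced in plain C. This is a
> > > > > > > standalone sketch (with ZERO_SIZE_PTR hard-coded to 16, as on
> > > > > > > x86-64), not kernel code:
> > > > > > > 
> > > > > > > 	#include <stdio.h>
> > > > > > > 
> > > > > > > 	#define PAGE_SHIFT 12
> > > > > > > 	#define PAGE_SIZE  (1UL << PAGE_SHIFT)
> > > > > > > 
> > > > > > > 	int main(void)
> > > > > > > 	{
> > > > > > > 		/* kmalloc(0, ...) returns ZERO_SIZE_PTR, i.e. (void *)16 */
> > > > > > > 		unsigned long buf = 16, len = 0;
> > > > > > > 		unsigned long end = (buf + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > > > > > > 		unsigned long start = buf >> PAGE_SHIFT;
> > > > > > > 
> > > > > > > 		/* prints "nr_pages = 1": the copy loop is entered once,
> > > > > > > 		 * then breaks on len <= 0, leaving bip_vec uninitialized */
> > > > > > > 		printf("nr_pages = %lu\n", end - start);
> > > > > > > 		return 0;
> > > > > > > 	}
> > > > > > > 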
> > > > > > > Firstly, it is unreasonable to format an NVMe device while IO is in
> > > > > > > flight. Still, with this kind of concurrency the block layer can see IO
> > > > > > > smaller than the logical block size. The device is expected to fail such
> > > > > > > IO, so the block layer should handle it gracefully as well, to prevent a
> > > > > > > NULL pointer access from crashing the system.
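> > > > > > > 
> > > > > > > As a sketch of the kind of handling meant here (illustrative only, not
> > > > > > > the exact posted patch, and assuming the err_end_io error path that
> > > > > > > bio_integrity_prep() already uses for allocation failures),
> > > > > > > bio_integrity_prep() could fail such a bio instead of allocating a
> > > > > > > zero-length buffer:
> > > > > > > 
> > > > > > > 	len = bio_integrity_bytes(bi, bio_sectors(bio));
> > > > > > > 	if (!len) {
> > > > > > > 		/*
> > > > > > > 		 * The logical block size grew under us and this bio is
> > > > > > > 		 * now smaller than one logical block: fail it rather
> > > > > > > 		 * than set up an empty integrity payload.
> > > > > > > 		 */
> > > > > > > 		goto err_end_io;
> > > > > > > 	}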
> > > > > > Actually, unaligned IO handling is a mess for NVMe hardware. Yes, the IO
> > > > > > may fail, but it has been observed that the meta buffer gets overwritten
> > > > > > by DMA during read IO.
> > > > > > 
> > > > > > Ye and Yi, can you test the following patch in your 'nvme format' & IO workload?
> > > > > > 
> > > > > > 
> > > > > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > > > > index 82c3ae22d76d..a41ab4a3a398 100644
> > > > > > --- a/block/blk-core.c
> > > > > > +++ b/block/blk-core.c
> > > > > > @@ -336,6 +336,19 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
> > > > > >     	return 0;
> > > > > >     }
> > > > > > +static bool bio_unaligned(struct bio *bio)
> > > > > > +{
> > > > > > +	unsigned int bs = bdev_logical_block_size(bio->bi_bdev);
> > > > > > +
> > > > > > +	if (bio->bi_iter.bi_size & (bs - 1))
> > > > > > +		return true;
> > > > > > +
> > > > > > +	if ((bio->bi_iter.bi_sector << SECTOR_SHIFT) & (bs - 1))
> > > > > > +		return true;
> > > > > > +
> > > > > > +	return false;
> > > > > > +}
> > > > > I think this check is not sufficient. It should not be enough to only
> > > > > verify that the length and the starting sector are aligned to the
> > > > > logical block size.
> > > > Can you explain why those two checks are not enough? Other limits should
> > > > be handled by bio splitting.
> > > If the logical block size is 512 bytes and the bio has 8 segments, each
> > > 512 bytes long, with bio->bi_iter.bi_sector == 0, then after the logical
> > > block size changes to 4096 bytes, bio_unaligned() will still return
> > > false.
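> > > 
> > > To make the numbers concrete, here is a simplified standalone rendering
> > > of the same two checks (with SECTOR_SHIFT == 9):
> > > 
> > > 	unsigned int bs = 4096;			/* new logical block size */
> > > 	unsigned int bi_size = 8 * 512;		/* total bytes in the bio */
> > > 	unsigned long long bi_sector = 0;	/* bio->bi_iter.bi_sector */
> > > 
> > > 	int unaligned = (bi_size & (bs - 1)) ||
> > > 			((bi_sector << 9) & (bs - 1));
> > > 	/* unaligned == 0, even though every 512-byte bvec is smaller
> > > 	 * than one 4096-byte logical block */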
> > Yes, this IO is still 4096-byte aligned at the block size level.
> > 
> > It is just that each bvec buffer isn't page-aligned; for nvme, if virt_boundary
> > is set, this bio will be split. However, we don't add a logical block size
> > check in submit_bio_noacct() yet, so a 512-byte bio can still be sent to the
> > device.
> > 
> > > I'm not sure whether the example I gave is appropriate.
> > Absolutely, it is a good example.
> > 
> > BTW, Yi has tested both your patch and my patch (which checks the logical
> > block size in blk_queue_enter()), and it looks like the slab corruption can
> > still be triggered with either one.
> Yes, my patch only fixes the integrity path at a single point, to avoid
> affecting the normal IO path. I am not sure whether other paths have
> similar issues.
> > 
> > Yi, can you test the following patch?
> > 
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 82c3ae22d76d..c47e69795c86 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -336,6 +336,19 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
> >   	return 0;
> >   }
> > +static inline bool bio_unaligned(struct bio *bio)
> > +{
> > +	unsigned int bs = bdev_logical_block_size(bio->bi_bdev);
> > +
> > +	if (bio->bi_iter.bi_size & (bs - 1))
> > +		return true;
> > +
> > +	if ((bio->bi_iter.bi_sector << SECTOR_SHIFT) & (bs - 1))
> > +		return true;
> > +
> > +	return false;
> > +}
> > +
> >   int __bio_queue_enter(struct request_queue *q, struct bio *bio)
> >   {
> >   	while (!blk_try_enter_queue(q, false)) {
> > @@ -362,6 +375,15 @@ int __bio_queue_enter(struct request_queue *q, struct bio *bio)
> >   			   test_bit(GD_DEAD, &disk->state));
> >   		if (test_bit(GD_DEAD, &disk->state))
> >   			goto dead;
> > +		/*
> > +		 * Not like other queue limits, logical block size is one
> > +		 * fundamental limit which can't be covered by bio split.
> > +		 *
> > +		 * Device reconfiguration may happen and logical block size
> > +		 * is changed, so fail the IO if that is true.
> > +		 */
> > +		if (bio_unaligned(bio))
> > +			goto dead;
> >   	}
> >   	return 0;
> > @@ -765,6 +787,8 @@ void submit_bio_noacct(struct bio *bio)
> >   	if (should_fail_bio(bio))
> >   		goto end_io;
> > +	if (bio->bi_iter.bi_size && bio_unaligned(bio))
> > +		goto end_io;
> I think this check should be added after the bio_queue_enter() call;
> checking before entering the queue may be unreliable.

The test patch checks bio alignment in both __bio_queue_enter() and
submit_bio_noacct(), so it should solve this problem if that is indeed
the cause.

And the final fix can merge the two checks into a single one in
blk_try_enter_queue() or __bio_queue_enter(), which should not add extra
cost if both ->chunk_sectors and ->logical_block_size are placed in a
single cache line.
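
For instance, on top of the test patch above, the final version could
simply drop the submit_bio_noacct() hunk and keep only the check in
__bio_queue_enter(), which runs after the queue has been entered and so
can't race with a freeze-protected limit update (untested sketch):

diff --git a/block/blk-core.c b/block/blk-core.c
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ ... @@ void submit_bio_noacct(struct bio *bio)
 	if (should_fail_bio(bio))
 		goto end_io;
-	if (bio->bi_iter.bi_size && bio_unaligned(bio))
-		goto end_io;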


Thanks,
Ming
