On Fri, May 06 2016 at 11:56am -0400,
Christoph Hellwig <hch@xxxxxx> wrote:

> On Fri, May 06, 2016 at 05:25:35PM +0200, Christoph Hellwig wrote:
> > Can you explain that code flow to me?  I still fail to see why
> > exactly dm-thinp (and only dm-thinp) needs this.  Maybe start by
> > pointing me to the additional bio_endio that pairs with the
> > bio_inc_remaining call.
> >
> > > All said, bio_inc_remaining() should really only be used in conjunction
> > > with bio_chain().  It isn't intended for generic bio reference counting.
> >
> > It's obviously used outside bio_chain in dm-thinp, so this sentence
> > doesn't make too much sense to me.
>
> FYI, this untested patch should fix the abuse in dm-thinp.  Not abusing
> bio_inc_remaining actually makes the code a lot simpler and more obvious.

But sadly I've tried this very same approach and it crashes, as I just
verified again with your patch (never mind the fugly proprietary
fusionio symbols ;).

(this is a direct result of completing the discard bio before all
mappings, and associated sub-discards, have completed):

[73232.788728] BUG: unable to handle kernel paging request at 0000000000619df0
[73232.796519] IP: [<ffffffff810ca68a>] native_queued_spin_lock_slowpath+0x10a/0x1a0
[73232.804876] PGD 2a19e6067 PUD 29d5d0067 PMD 2a1cac067 PTE 0
[73232.811132] Oops: 0002 [#1] SMP
[73232.814750] Modules linked in: dm_thin_pool(O) dm_persistent_data(O) dm_bio_prison(O) dm_bufio(O) ext4 jbd2 mbcache iomemory_vsl(POE) iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel iTCO_wdt glue_helper iTCO_vendor_support lrw sg i7core_edac gf128mul ablk_helper edac_core ipmi_si pcspkr acpi_power_meter cryptd shpchp ipmi_msghandler lpc_ich i2c_i801 mfd_core acpi_cpufreq ip_tables xfs libcrc32c sr_mod sd_mod cdrom mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ata_generic ttm pata_acpi ixgbe igb drm mdio ata_piix ptp libata crc32c_intel i2c_algo_bit megaraid_sas pps_core skd i2c_core dca dm_mod [last unloaded: dm_bufio]
[73232.888557] CPU: 1 PID: 29156 Comm: fct0-worker Tainted: P IOE 4.6.0-rc6.snitm+ #162
[73232.898266] Hardware name: FUJITSU PRIMERGY RX300 S6 /D2619, BIOS 6.00 Rev. 1.10.2619.N1 05/24/2011
[73232.912919] task: ffff880330901900 ti: ffff8802a9b54000 task.ti: ffff8802a9b54000
[73232.921269] RIP: 0010:[<ffffffff810ca68a>] [<ffffffff810ca68a>] native_queued_spin_lock_slowpath+0x10a/0x1a0
[73232.932347] RSP: 0018:ffff8802a9b57848 EFLAGS: 00010006
[73232.938272] RAX: 0000000000003ffe RBX: 0000000000000086 RCX: 0000000000080000
[73232.946234] RDX: 0000000000619df0 RSI: 00000000ffffc900 RDI: ffff88029fc9a2cc
[73232.954197] RBP: ffff8802a9b57848 R08: ffff880333257900 R09: 0000000000000000
[73232.962161] R10: 000000009fcc5b01 R11: ffff88029fcc5e00 R12: ffff88029fc9a280
[73232.970122] R13: ffff88032f9472a0 R14: ffff88029fc9a2cc R15: 0000000000000000
[73232.978085] FS: 0000000000000000(0000) GS:ffff880333240000(0000) knlGS:0000000000000000
[73232.987115] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[73232.993524] CR2: 0000000000619df0 CR3: 00000002a1597000 CR4: 00000000000006e0
[73233.001485] Stack:
[73233.003727]  ffff8802a9b57858 ffffffff8117faf1 ffff8802a9b57870 ffffffff81695427
[73233.012017]  0000000000000000 ffff8802a9b578a8 ffffffffa0671788 ffff88029fc9a420
[73233.020307]  0000000000000000 ffff88029fc9a420 ffff88032a6442a0 0000000000000000
[73233.028598] Call Trace:
[73233.031326]  [<ffffffff8117faf1>] queued_spin_lock_slowpath+0xb/0xf
[73233.038321]  [<ffffffff81695427>] _raw_spin_lock_irqsave+0x37/0x40
[73233.045221]  [<ffffffffa0671788>] cell_defer_no_holder+0x28/0x80 [dm_thin_pool]
[73233.053378]  [<ffffffffa0671a4d>] thin_endio+0x14d/0x180 [dm_thin_pool]
[73233.060760]  [<ffffffff811e5e69>] ? kmem_cache_free+0x1c9/0x1e0
[73233.067372]  [<ffffffffa00016aa>] clone_endio+0x3a/0xe0 [dm_mod]
[73233.074076]  [<ffffffff812fb345>] bio_endio+0x55/0x60
[73233.079713]  [<ffffffffa0000d3e>] dec_pending+0x12e/0x280 [dm_mod]
[73233.086611]  [<ffffffffa00016cb>] clone_endio+0x5b/0xe0 [dm_mod]
[73233.093313]  [<ffffffff812fb345>] bio_endio+0x55/0x60
[73233.098951]  [<ffffffffa0000d3e>] dec_pending+0x12e/0x280 [dm_mod]
[73233.105849]  [<ffffffffa00016cb>] clone_endio+0x5b/0xe0 [dm_mod]
[73233.112552]  [<ffffffff812fb345>] bio_endio+0x55/0x60
[73233.118187]  [<ffffffff81303fc7>] blk_update_request+0x87/0x320
[73233.124793]  [<ffffffff8130427c>] blk_update_bidi_request+0x1c/0x70
[73233.131786]  [<ffffffff81304da7>] __blk_end_bidi_request+0x17/0x40
[73233.138684]  [<ffffffff81304de0>] __blk_end_request+0x10/0x20
[73233.145128]  [<ffffffffa0530a68>] complete_list_entries.isra.9+0x38/0x90 [iomemory_vsl]
[73233.154074]  [<ffffffffa0530b66>] kfio_blk_complete_request+0xa6/0x140 [iomemory_vsl]
[73233.162828]  [<ffffffffa0530c2c>] kfio_req_completor+0x2c/0x50 [iomemory_vsl]
[73233.170810]  [<ffffffffa054e282>] ifio_f9142.4bb664afe4728d2c3817c99088f2b5dd004.3.2.9.1461+0x12/0x50 [iomemory_vsl]
[73233.182574]  [<ffffffffa054b62c>] ? fio_device_ptrim_available+0x9c/0x2c0 [iomemory_vsl]
[73233.191606]  [<ffffffff810b57c8>] ? dequeue_entity+0x468/0x900
[73233.198136]  [<ffffffffa05b84ad>] ? ifio_1e5a3.5e62afc90b8d1b25ae145e583e480994677.3.2.9.1461+0xb0d/0x17c0 [iomemory_vsl]
[73233.210390]  [<ffffffffa05b913f>] ? ifio_1e5a3.5e62afc90b8d1b25ae145e583e480994677.3.2.9.1461+0x179f/0x17c0 [iomemory_vsl]
[73233.222740]  [<ffffffffa05bcbd0>] ? ifio_fbeca.b91e0233dee7fc9112bb37f58c4e526bcd8.3.2.9.1461+0x840/0xf20 [iomemory_vsl]
[73233.234889]  [<ffffffffa0532343>] ? __fusion_condvar_timedwait+0xb3/0x120 [iomemory_vsl]
[73233.243941]  [<ffffffffa05bdd78>] ? ifio_5fb65.a9c484151da25a9eb60ef9a6e7309d1a95f.3.2.9.1461+0xf8/0x2a0 [iomemory_vsl]
[73233.255998]  [<ffffffffa05bdc80>] ? ifio_f0cee.039ff5a1c1f314a171e2a8a425361a5875d.3.2.9.1461+0x50/0x50 [iomemory_vsl]
[73233.267960]  [<ffffffffa05bdc80>] ? ifio_f0cee.039ff5a1c1f314a171e2a8a425361a5875d.3.2.9.1461+0x50/0x50 [iomemory_vsl]
[73233.279899]  [<ffffffff8109ffe8>] ? kthread+0xd8/0xf0
[73233.285535]  [<ffffffff81695702>] ? ret_from_fork+0x22/0x40
[73233.291754]  [<ffffffff8109ff10>] ? kthread_park+0x60/0x60
[73233.297873] Code: c1 e0 10 45 31 c9 85 c0 74 46 48 89 c2 c1 e8 12 48 c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 00 79 01 00 48 03 14 c5 c0 87 d1 81 <4c> 89 02 41 8b 40 08 85 c0 75 0a f3 90 41 8b 40 08 85 c0 74 f6
[73233.319551] RIP [<ffffffff810ca68a>] native_queued_spin_lock_slowpath+0x10a/0x1a0
[73233.328010]  RSP <ffff8802a9b57848>
[73233.331899] CR2: 0000000000619df0
[73233.341412] ---[ end trace c81cf06a98ad2d88 ]---
> I'm not 100% sure, but it seems the current pattern can also lead
> to a leak of the bi_remaining references when set_pool_mode managed
> to set a new process_prepared_discard function pointer after the
> references have been grabbed already.

We have quite a few tests in the device-mapper-test-suite that hammer
discards (in conjunction with the pool running out of space or if the
pool fails).  The replacement process_prepared_discard functions will
issue a bio_endio() per mapping anyway, so there should be no risk of a
bi_remaining reference leaking like you thought.
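For reference, the fail variant amounts to exactly one parent-bio
completion per prepared mapping, roughly like this (paraphrased from
memory rather than pasted, so treat the exact field names as
approximate):

static void process_prepared_discard_fail(struct dm_thin_new_mapping *m)
{
        /*
         * One completion of the parent bio per prepared mapping: this
         * consumes the per-mapping bi_remaining reference that
         * break_up_discard_bio() took via bio_inc_remaining().
         */
        bio_io_error(m->bio);
        cell_defer_no_holder(m->tc, m->cell);
        mempool_free(m, m->tc->pool->mapping_pool);
}

So a mode switch only changes which function drops each per-mapping
reference, not whether it gets dropped.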
> Jens, I noticed you've already applied this patch - I'd really prefer
> to see it reverted as using bio_inc_remaining outside bio_chain leads
> to this sort of convoluted code.

As you know, not all code is simple.  I've looked at this for quite a
bit this week and unfortunately I don't see a way forward (yet) that
doesn't require the use of bio_inc_remaining() to take extra
bi_remaining references.
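To spell out the pairing you asked about, the current code reduces to
this (a deliberately simplified sketch, not the real functions;
submit_sub_discard() is a hypothetical stand-in for the per-mapping
work that ultimately ends in end_discard()):

#include <linux/bio.h>

/* Hypothetical stand-in for queueing one sub-discard mapping. */
static void submit_sub_discard(struct bio *parent);

static void discard_flow_sketch(struct bio *bio, unsigned nr_mappings)
{
        unsigned i;

        /* break_up_discard_bio(): pin the parent once per mapping. */
        for (i = 0; i < nr_mappings; i++) {
                bio_inc_remaining(bio);
                submit_sub_discard(bio);
        }

        /*
         * process_discard_cell_passdown(): drop the submitter's own
         * reference.  This bio_endio() only decrements bi_remaining;
         * the parent really completes when the last sub-discard's
         * bio_endio() -- the one in end_discard() -- drops the final
         * reference.
         */
        bio_endio(bio);
}

Every bio_inc_remaining() is balanced by exactly one bio_endio() on the
parent, and it's that ordering the crash above trips over once the
parent is allowed to complete early.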
> ---
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index e4bb9da..5875749 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -382,7 +382,6 @@ static void end_discard(struct discard_op *op, int r)
>  	 */
>  	if (r && !op->parent_bio->bi_error)
>  		op->parent_bio->bi_error = r;
> -	bio_endio(op->parent_bio);
>  }
>  
>  /*----------------------------------------------------------------*/
> @@ -1056,11 +1055,9 @@ static void process_prepared_discard_passdown(struct dm_thin_new_mapping *m)
>  	r = dm_thin_remove_range(tc->td, m->virt_begin, m->virt_end);
>  	if (r) {
>  		metadata_operation_failed(pool, "dm_thin_remove_range", r);
> -		bio_io_error(m->bio);
> -
> +		m->bio->bi_error = -EIO;
>  	} else if (m->maybe_shared) {
>  		passdown_double_checking_shared_status(m);
> -
>  	} else {
>  		struct discard_op op;
>  		begin_discard(&op, tc, m->bio);
> @@ -1069,6 +1066,7 @@ static void process_prepared_discard_passdown(struct dm_thin_new_mapping *m)
>  		end_discard(&op, r);
>  	}
>  
> +	bio_endio(m->bio);
>  	cell_defer_no_holder(tc, m->cell);
>  	mempool_free(m, pool->mapping_pool);
>  }
> @@ -1537,15 +1535,6 @@ static void break_up_discard_bio(struct thin_c *tc, dm_block_t begin, dm_block_t
>  	m->cell = data_cell;
>  	m->bio = bio;
>  
> -	/*
> -	 * The parent bio must not complete before sub discard bios are
> -	 * chained to it (see end_discard's bio_chain)!
> -	 *
> -	 * This per-mapping bi_remaining increment is paired with
> -	 * the implicit decrement that occurs via bio_endio() in
> -	 * end_discard().
> -	 */
> -	bio_inc_remaining(bio);
>  	if (!dm_deferred_set_add_work(pool->all_io_ds, &m->list))
>  		pool->process_prepared_discard(m);
> 
> @@ -1565,13 +1554,6 @@ static void process_discard_cell_passdown(struct thin_c *tc, struct dm_bio_priso
>  	 */
>  	h->cell = virt_cell;
>  	break_up_discard_bio(tc, virt_cell->key.block_begin, virt_cell->key.block_end, bio);
> -
> -	/*
> -	 * We complete the bio now, knowing that the bi_remaining field
> -	 * will prevent completion until the sub range discards have
> -	 * completed.
> -	 */
> -	bio_endio(bio);
>  }
>  
>  static void process_discard_bio(struct thin_c *tc, struct bio *bio)

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html