Hello David,

Thank you for your reply.

While analyzing the code over the past few days, I found that the code below can be improved.
(The changes are based on the git master branch.)

---------------
commit 3768196011fb01e4016510bfab9eef0c7bdc04f5 (HEAD -> master)
Author: Zhao Heming <heming.zhao@xxxxxxxx>
Date:   Sat Oct 12 14:28:06 2019 +0800

    fix typo in lib/cache/lvmcache.c
    enhance error handling in bcache
    fix constant var 'error' in _scan_list
    fix gcc warning in _lvconvert_split_cache_single

    Signed-off-by: Zhao Heming <heming.zhao@xxxxxxxx>

diff --git a/lib/cache/lvmcache.c b/lib/cache/lvmcache.c
index f6e792459b..499f9437cb 100644
--- a/lib/cache/lvmcache.c
+++ b/lib/cache/lvmcache.c
@@ -939,7 +939,7 @@ int lvmcache_label_rescan_vg_rw(struct cmd_context *cmd, const char *vgname, con
  * incorrectly placed PVs should have been moved from the orphan vginfo
  * onto their correct vginfo's, and the orphan vginfo should (in theory)
  * represent only real orphan PVs. (Note: if lvmcache_label_scan is run
- * after vg_read udpates to lvmcache state, then the lvmcache will be
+ * after vg_read updates to lvmcache state, then the lvmcache will be
  * incorrect again, so do not run lvmcache_label_scan during the
  * processing phase.)
  *
diff --git a/lib/device/bcache.c b/lib/device/bcache.c
index d100419770..cfe01bac2f 100644
--- a/lib/device/bcache.c
+++ b/lib/device/bcache.c
@@ -292,6 +292,10 @@ static bool _async_issue(struct io_engine *ioe, enum dir d, int fd,
 	} while (r == -EAGAIN);
 
 	if (r < 0) {
+		((struct block *)context)->error = r;
+		log_warn("io_submit <%c> off %llu bytes %llu return %d:%s",
+			(d == DIR_READ) ? 'R' : 'W', (long long unsigned)offset,
+			(long long unsigned)nbytes, r, strerror(-r));
 		_cb_free(e->cbs, cb);
 		return false;
 	}
@@ -842,7 +846,7 @@ static void _complete_io(void *context, int err)
 
 	if (b->error) {
 		dm_list_add(&cache->errored, &b->list);
-
+		log_warn("fd: %d error: %d", b->fd, err);
 	} else {
 		_clear_flags(b, BF_DIRTY);
 		_link_block(b);
@@ -869,8 +873,7 @@ static void _issue_low_level(struct block *b, enum dir d)
 	dm_list_move(&cache->io_pending, &b->list);
 
 	if (!cache->engine->issue(cache->engine, d, b->fd, sb, se, b->data, b)) {
-		/* FIXME: if io_submit() set an errno, return that instead of EIO? */
-		_complete_io(b, -EIO);
+		_complete_io(b, b->error);
 		return;
 	}
 }
diff --git a/lib/label/label.c b/lib/label/label.c
index dc4d32d151..60ad387219 100644
--- a/lib/label/label.c
+++ b/lib/label/label.c
@@ -647,7 +647,6 @@ static int _scan_list(struct cmd_context *cmd, struct dev_filter *f,
 	int submit_count;
 	int scan_failed;
 	int is_lvm_device;
-	int error;
 	int ret;
 
 	dm_list_init(&wait_devs);
@@ -694,12 +693,12 @@ static int _scan_list(struct cmd_context *cmd, struct dev_filter *f,
 
 	dm_list_iterate_items_safe(devl, devl2, &wait_devs) {
 		bb = NULL;
-		error = 0;
 		scan_failed = 0;
 		is_lvm_device = 0;
 
 		if (!bcache_get(scan_bcache, devl->dev->bcache_fd, 0, 0, &bb)) {
-			log_debug_devs("Scan failed to read %s error %d.", dev_name(devl->dev), error);
+			log_debug_devs("Scan failed to read %s error %d.",
+					dev_name(devl->dev), bb ? bb->error : 0);
 			scan_failed = 1;
 			scan_read_errors++;
 			scan_failed_count++;
diff --git a/tools/lvconvert.c b/tools/lvconvert.c
index 60ab956614..4939e5ec7d 100644
--- a/tools/lvconvert.c
+++ b/tools/lvconvert.c
@@ -4676,7 +4676,7 @@ static int _lvconvert_split_cache_single(struct cmd_context *cmd,
 	struct logical_volume *lv_main = NULL;
 	struct logical_volume *lv_fast = NULL;
 	struct lv_segment *seg;
-	int ret;
+	int ret = 0;
 
 	if (lv_is_writecache(lv)) {
 		lv_main = lv;
---

Thanks
zhm

On 10/11/19 11:14 PM, David Teigland wrote:
> On Fri, Oct 11, 2019 at 08:11:29AM +0000, Heming Zhao wrote:
>
>> I analyze this issue for some days. It looks a new bug.
>
> Yes, thanks for the thorough analysis.
>
>> In user machine, this write action was failed, the PV header data (first
>> 4K) save in bcache (cache->errored list), and then write (by
>> bcache_flush) to another disk (f748).
>
> It looks like we need to get rid of cache->errored completely.
>
>> If dev_write_bytes failed, the bcache never clean last_byte. and the fd
>> is closed at same time, but cache->errored still have errored fd's data.
>> later lvm open new disk, the fd may reuse the old-errored fd number,
>> error data will be written when later lvm call bcache_flush.
>
> That's a bad bug.
>
>> 2> duplicated pv header.
>> as <1> description, fc68 metadata was overwritten to f748.
>> this cause by lvm bug (I said in <1>).
>>
>> 3> device not correct
>> I don't know why the disk scsi-360060e80072a670000302a670000fc68 has below wrong metadata:
>>
>> pre_pvr/scsi-360060e80072a670000302a670000fc68
>> (please also read the comments in below metadata area.)
>> ```
>> vgpocdbcdb1_r2 {
>>     id = "PWd17E-xxx-oANHbq"
>>     seqno = 20
>>     format = "lvm2"
>>     status = ["RESIZEABLE", "READ", "WRITE"]
>>     flags = []
>>     extent_size = 65536
>>     max_lv = 0
>>     max_pv = 0
>>     metadata_copies = 0
>>
>>     physical_volumes {
>>
>>         pv0 {
>>             id = "3KTOW5-xxxx-8g0Rf2"
>>             device = "/dev/disk/by-id/scsi-360060e80072a660000302a660000f768"
>>                                                                 Wrong!! ^^^^^
>>                  I don't know why there is f768, please ask customer
>>             status = ["ALLOCATABLE"]
>>             flags = []
>>             dev_size = 860160
>>             pe_start = 2048
>>             pe_count = 13
>>         }
>>     }
>> ```
>> fc68 => f768 the 'c' (b1100) change to '7' (b0111).
>> maybe disk bit overturn, maybe lvm has bug. I don't know & have no idea.
>
> Is scsi-360060e80072a660000302a660000f768 the correct device for
> PVID 3KTOW5...? If so, then it's consistent. If not, then I suspect
> this is a result of duplicating the PVID on multiple devices above.
>
>
>> On 9/11/19 5:17 PM, Gang He wrote:
>>> Hello List,
>>>
>>> Our user encountered a meta-data corruption problem, when run pvresize command after upgrading to LVM2 v2.02.180 from v2.02.120.
>>>
>>> The details are as below,
>>> we have following environment:
>>> - Storage: HP XP7 (SAN) - LUN's are presented to ESX via RDM
>>> - VMWare ESXi 6.5
>>> - SLES 12 SP 4 Guest
>>>
>>> Resize happened this way (is our standard way since years) - however - this is our first resize after upgrading SLES 12 SP3 to SLES 12 SP4 - until this upgrade, we
>>> never had a problem like this:
>>> - split continous access on storage box, resize lun on XP7
>>> - recreate ca on XP7
>>> - scan on ESX
>>> - rescan-scsi-bus.sh -s on SLES VM
>>> - pvresize ( at this step the error happened)
>>>
>>> huns1vdb01:~ # pvresize /dev/disk/by-id/scsi-360060e80072a660000302a6600003274
>>
>> _______________________________________________
>> linux-lvm mailing list
>> linux-lvm@xxxxxxxxxx
>> https://www.redhat.com/mailman/listinfo/linux-lvm
>> read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/
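
A side note on the fd reuse mentioned in the quoted analysis ("the fd may reuse the old-errored fd number"). The sketch below is only an illustration and is not lvm2 code: struct stale_block and stale_block_matches_new_fd() are hypothetical names made up for this example. The only real behaviour it relies on is that POSIX open() returns the lowest-numbered unused descriptor, so in a single-threaded process the number that was just closed is normally handed out again.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for a block left behind on an error list:
 * it remembers only the fd number, not which device that fd meant.
 */
struct stale_block {
	int fd;
};

/*
 * Returns 1 if a stale block recorded for path_old would match the fd
 * handed out for path_new, 0 if it would not, -1 if an open() fails.
 */
int stale_block_matches_new_fd(const char *path_old, const char *path_new)
{
	struct stale_block stale;
	int fd_new;
	int match;

	assert(path_old != NULL && path_new != NULL);

	stale.fd = open(path_old, O_RDONLY);
	if (stale.fd < 0)
		return -1;

	/* Model the failure path: the fd is closed, the block is kept. */
	close(stale.fd);

	/* open() picks the lowest unused descriptor, normally the same number. */
	fd_new = open(path_new, O_RDONLY);
	if (fd_new < 0)
		return -1;

	match = (fd_new == stale.fd);
	close(fd_new);
	return match;
}
```

On Linux, stale_block_matches_new_fd("/dev/null", "/dev/zero") should normally return 1, which is how a block that is keyed only by fd can end up matching a different disk once the number is reused.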