Actually, I made a mistake: I failed to drop the system cache between the two dd runs when generating the comparison. There is one difference, in sector 2056 of the device! This must be the key.
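[Editorial note] A minimal sketch of how such a dump comparison localizes the differing sector, using a scratch file as a stand-in for /dev/sbd13 (the dump file names and the sector-2056 offset mirror the thread; the images and the injected byte are illustrative):

```shell
# Stand-in "device": 2 MiB of zeros, plus a copy differing at byte
# offset 1052672 (512-byte sector 2056, as in the report).
dd if=/dev/zero of=dev.img bs=1M count=2 2>/dev/null
cp dev.img dev2.img
printf 'X' | dd of=dev2.img bs=1 seek=1052672 conv=notrunc 2>/dev/null

# Dump the second MiB of each image, as in the thread's dd commands.
# (Against a real device, drop the page cache between dumps first --
# sync; echo 3 > /proc/sys/vm/drop_caches -- or read with iflag=direct.)
dd if=dev.img bs=1M count=1 skip=1 of=sbd13.cache 2>/dev/null
dd if=dev2.img bs=1M count=1 skip=1 of=sbd13.nocache 2>/dev/null

# cmp prints the first differing byte, 1-based: 4097 into the dump here.
# cmp exits non-zero when the files differ, hence the || true.
cmp sbd13.nocache sbd13.cache || true
```

cmp's 1-based offset plus the 1 MiB skip recovers the absolute sector: (1048576 + 4097 - 1) / 512 = 2056.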
On Wed, 16 Sep 2015 at 16:31 Aaron Young <aaron.young@ctl.io> wrote:
Yes, I have lots of data to share; I thought first to open at a high level. This is all happening inside a single VM. Archives are available; I will post them shortly. No lvmetad. No errors that I can tell (at least not on console or syslog).

root@VA1CTLT-SRN2-03:/etc/lvm/archive# grep seqno test_dvol-13-vg_00*
test_dvol-13-vg_00261-1410850844.vg: seqno = 0 <---- before vgcreate
test_dvol-13-vg_00262-1188507802.vg: seqno = 1 <---- before lvcreate 1
test_dvol-13-vg_00263-1818746321.vg: seqno = 2 <---- before lvcreate 2
test_dvol-13-vg_00264-1122545952.vg: seqno = 3 <---- before lvcreate 3
test_dvol-13-vg_00265-1497145254.vg: seqno = 4 <---- before lvcreate 4
test_dvol-13-vg_00266-1300493675.vg: seqno = 5 <---- before lvs
test_dvol-13-vg_00267-490193445.vg: seqno = 4 <---- disabled device cache, lvs
test_dvol-13-vg_00268-2051497792.vg: seqno = 4 <---- disabled device cache, lvs
test_dvol-13-vg_00269-370016695.vg: seqno = 5 <---- enabled device cache, lvs

The contents of the metadata area seem to be the same (both contain seqno 5):

dd if=/dev/sbd13 bs=1M count=1 skip=1 of=sbd13.cache
dd if=/dev/sbd13 bs=1M count=1 skip=1 of=sbd13.nocache
cmp sbd13.nocache sbd13.cache

I tracked down these sectors by running strace on pvcreate/vgcreate/lvcreate. As far as I can tell, all the sectors involved are being written correctly.
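[Editorial note] Since LVM stores VG metadata as plain text in the metadata area, the seqno copies present in such a dump can be listed directly; against the real dump the command would simply be `strings sbd13.cache | grep seqno`. A small self-contained sketch using a synthetic stand-in for the dump:

```shell
# Synthetic stand-in for a metadata-area dump: a plain-text metadata
# fragment (here just a seqno line) surrounded by binary padding,
# roughly how LVM keeps text copies in its metadata ring buffer.
printf 'seqno = 5\n' > meta.txt
dd if=/dev/zero bs=512 count=1 of=pad.bin 2>/dev/null
cat pad.bin meta.txt pad.bin > dump.bin

# strings extracts the printable runs, so the seqno values in the
# dump show up directly.
strings dump.bin | grep seqno
# -> seqno = 5
```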
Random facts:
1. Device-mapper still correctly lists the logical volume that is missing from lvs.
2. Kernel 3.13.0-44-generic, Ubuntu 14.04.
3. LVM version: 2.02.98(2) (2012-10-15); Library version: 1.02.77 (2012-10-15); Driver version: 4.27.0.

A random suspicious snippet generated by lvscan -vvv:
/dev/mapper/sbd13p1: lvm2 label detected at sector 1
lvmcache: /dev/mapper/sbd13p1: now in VG #orphans_lvm2 (#orphans_lvm2) with 1 mdas
/dev/mapper/sbd13p1: Found metadata at 8704 size 1749 (in area at 4096 size 1044480) for test_dvol-13-vg (DFvQDG-nYVS-QQlT-Uv35-aPr4-2pY0-zMQ0dr)
lvmcache: /dev/mapper/sbd13p1: now in VG test_dvol-13-vg with 1 mdas
lvmcache: /dev/mapper/sbd13p1: setting test_dvol-13-vg VGID to DFvQDGnYVSQQlTUv35aPr42pY0zMQ0dr
lvmcache: /dev/mapper/sbd13p1: VG test_dvol-13-vg: Set creation host to VA1CTLT-SRN2-03. Allocated VG test_dvol-13-vg at 0x257bc00.
Using cached label for /dev/mapper/sbd13p1
Read test_dvol-13-vg metadata (4) from /dev/mapper/sbd13p1 at 8704 size 1749
/dev/mapper/sbd13p1 0: 0 19: VM-test_dvol-13-0-hard-drive-0(0:0)
/dev/mapper/sbd13p1 1: 19 19: VM-test_dvol-13-0-hard-drive-1(0:0)
/dev/mapper/sbd13p1 2: 38 19: VM-test_dvol-13-1-hard-drive-0(0:0)
/dev/mapper/sbd13p1 3: 57 42: NULL(0:0) *<---- missing logical volume*

I don't understand how this is possible if that sector (8704) is identical in both cases.

Attached are two verbose straces of vgdisplay, one of which discovered 3 logical volumes and one of which discovered 4.

I am looking for insight into the disk contents that are necessary for this discovery. Thank you very much.

Aaron

On Wed, 16 Sep 2015 at 03:05 Zdenek Kabelac <zkabelac@redhat.com> wrote:
On 15 Sep 2015 at 23:18, Aaron Young wrote:
> Hello, I'm deep into debugging an issue we have with a disk driver of ours and
> LVM.
>
> Long story short:
>
> create vg -> seqno 1
> create lv1 -> seqno 2
> create lv2 -> seqno 3
> create lv3 -> seqno 4
> create lv4 -> seqno 5
> <clear our device cache> (note, this generates no IO)
> vgdisplay: seqno = 4, lv4 is missing
>
> * This happens only after dozens to hundreds of iterations. Most of the time
> it is fine.
>
> I dd'd all the metadata blocks off of the PV, and yep, seqno 5 is in the
> on-disk metadata area perfectly fine. But the system believes 4 is the
> current version.
> Shouldn't the system be using the highest value? Or is it stored somewhere?
> What mechanism is responsible for changing the seqno? And where does it change
> it? (Not the metadata contents, just the number)
Hi
Your email is quite 'mystic' - I'd need lots of crystal balls to see your
surrounding conditions.
1.) Is this a 'clustered' environment or a 'single' host setup?
2.) Do you have 'archive' backup enabled - can you check what the last
operations in the history are before the problem happens?
3.) Are you using 'lvmetad' ? (if so, try use_lvmetad=0 )
4.) Kernel version, lvm2 version ?
5.) Was there any lvm2 command error ?
(as vgdisplay may just do a backup of the most recent metadata in case they
are missing after some command failure)
Zdenek
_______________________________________________
linux-lvm mailing list
linux-lvm@redhat.com
https://www.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/