On 23. 09. 20 at 21:54, Duncan Townsend wrote:
On Wed, Sep 23, 2020 at 2:49 PM Zdenek Kabelac <zkabelac@xxxxxxxxxx> wrote:
On 23. 09. 20 at 20:13, Duncan Townsend wrote:
On Tue, Sep 22, 2020, 5:02 PM Zdenek Kabelac <zkabelac@xxxxxxxxxx> wrote:
I have encountered a further problem in the process of restoring my thin pool
to a working state. After using vgcfgrestore to fix the mismatching metadata
using the file Zdenek kindly provided privately, when I try to activate my
thin LVs, I'm now getting the error message:
Thin pool <THIN POOL LONG NAME>-tpool transaction_id (MAJOR:MINOR)
transaction_id is XXX, while expected YYY.
Set the transaction_id to the right number in the ASCII lvm2 metadata file.
I apologize, but I am back with a related, similar problem. After
editing the metadata file and replacing the transaction number, my
system became serviceable again. After making absolutely sure that
dmeventd was running correctly, my next order of business was to
finish backing up before any other tragedy happens. Unfortunately,
taking a snapshot as part of the backup process has once again brought
my system to its knees. The first error message I saw was:
Hi
And now you've hit an interesting bug inside the lvm2 code - I've opened a new BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1882483
This actually explains a few problems I've seen before that were not well
understood so far, because there was no good explanation of how to hit them.
WARNING: Sum of all thin volume sizes (XXX TiB) exceeds the size of
thin pool <VG>/<THIN POOL LV> and the size of whole volume group (YYY
TiB).
device-mapper: message ioctl on (MAJOR:MINOR) failed: File exists
Failed to process thin pool message "create_snap 11 4".
Failed to suspend thin snapshot origin <VG>/<THIN LV>.
Internal error: Writing metadata in critical section.
Releasing activation in critical section.
libdevmapper exiting with 1 device(s) still suspended.
So I now have a quite simple reproducer for an unhandled error case.
It basically exposes a mismatch between the kernel (_tmeta) metadata and the
lvm2 metadata content. lvm2 can handle this discovery better than what you
see now.
There were further error messages as further snapshots were attempted,
but I was unable to capture them as my system went down. Upon reboot,
the "transaction_id" message that I referred to in my previous message
was repeated (but with increased transaction IDs).
For a better fix, it would need to be better understood what happened
in parallel while the 'lvm' inside dmeventd was resizing the pool data.
It looks like the 'other' lvm managed to create another snapshot
(and thus the DeviceID appeared to already exist, while according to the
lvm2 metadata it should not) before it hit the problem with the mismatch
of transaction_id.
I will reply privately with my lvm metadata archive and with my
header. My profuse thanks, again, for assisting me getting my system
back up and running.
So the valid fix would be to take a 'thin_dump' of the kernel metadata
(i.e. the content of the _tmeta device).
Then check what you have in the lvm2 metadata; you will likely
find some devices in the kernel metadata for which you have no match
in the lvm2 metadata. These devices would need to be copied
from your other sequence of lvm2 metadata.
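The comparison described above can be sketched roughly as follows. This is
only a minimal sketch: it assumes the thin_dump output is the usual XML with
a <superblock> root and <device dev_id="..."> children, and that thin LVs in
the lvm2 text metadata carry a "device_id = N" line. The function names are
made up for illustration and are not part of lvm2 or thin-provisioning-tools.

```python
import re
import xml.etree.ElementTree as ET


def kernel_device_ids(thin_dump_xml):
    """Collect the dev_id of every <device> in a thin_dump XML string."""
    root = ET.fromstring(thin_dump_xml)
    return {int(dev.get("dev_id")) for dev in root.iter("device")}


def lvm_device_ids(lvm_metadata_text):
    """Collect every 'device_id = N' value from lvm2 text metadata."""
    pattern = re.compile(r"^\s*device_id\s*=\s*(\d+)", re.MULTILINE)
    return {int(value) for value in pattern.findall(lvm_metadata_text)}


def unmatched_devices(thin_dump_xml, lvm_metadata_text):
    """Return (only_in_kernel, only_in_lvm) as sorted lists of device ids."""
    kernel = kernel_device_ids(thin_dump_xml)
    lvm = lvm_device_ids(lvm_metadata_text)
    return sorted(kernel - lvm), sorted(lvm - kernel)
```

Any id reported as "only in kernel" is a candidate for a thin device that
the kernel knows about but the current lvm2 metadata does not.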
Another, maybe simpler, way could be to just remove those devices
from the thin_dump XML and thin_restore the metadata, which should
then match lvm2.
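Removing devices from the dumped XML could look roughly like this. Again
just a sketch under the assumption that each thin device is a <device>
element directly below the <superblock> root; drop_devices is a hypothetical
helper name, and the edited XML would still have to be written back with
thin_restore while the pool is not active.

```python
import xml.etree.ElementTree as ET


def drop_devices(thin_dump_xml, dev_ids_to_drop):
    """Return the XML with the listed device ids removed from the superblock."""
    root = ET.fromstring(thin_dump_xml)
    # Copy the child list first so removal does not disturb the iteration.
    for dev in list(root.findall("device")):
        if int(dev.get("dev_id")) in dev_ids_to_drop:
            root.remove(dev)
    return ET.tostring(root, encoding="unicode")
```

Keeping an untouched copy of the original dump before editing is cheap
insurance in case a dropped device turns out to be needed after all.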
The last issue is then to match the 'transaction_id' with the number
stored in the kernel metadata.
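A quick way to see both numbers side by side, as a sketch only: it assumes
the kernel value is the "transaction" attribute of the <superblock> element
in the thin_dump XML, and that the lvm2 value is the "transaction_id" line
inside the segment whose type is "thin-pool". It also assumes a single thin
pool in the metadata file and returns the first one found. The helper names
are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def kernel_transaction_id(thin_dump_xml):
    """Read the pool transaction id from the thin_dump <superblock>."""
    return int(ET.fromstring(thin_dump_xml).get("transaction"))


def lvm_pool_transaction_id(lvm_metadata_text):
    """Read transaction_id from the first 'thin-pool' segment in lvm2 metadata."""
    for match in re.finditer(r"segment\d+\s*\{([^{}]*)\}", lvm_metadata_text):
        body = match.group(1)
        if re.search(r'type\s*=\s*"thin-pool"', body):
            found = re.search(r"transaction_id\s*=\s*(\d+)", body)
            if found:
                return int(found.group(1))
    raise ValueError("no thin-pool segment with a transaction_id found")


def transaction_ids_match(thin_dump_xml, lvm_metadata_text):
    """True when the kernel and lvm2 metadata agree on the transaction id."""
    return kernel_transaction_id(thin_dump_xml) == lvm_pool_transaction_id(
        lvm_metadata_text
    )
```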
So I'm not sure which way you want to go, and how important those
snapshots (that could be dropped) are?
Zdenek
_______________________________________________
linux-lvm mailing list
linux-lvm@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/