Re: thin: pool target too small

On Fri, 2 Oct 2020, Duncan Townsend wrote:

On Wed, Sep 30, 2020, 1:00 PM Duncan Townsend <duncancmt@xxxxxxxxx> wrote:

On Tue, Sep 29, 2020, 10:54 AM Zdenek Kabelac <zkabelac@xxxxxxxxxx> wrote:

On 29. 09. 2020 at 16:33, Duncan Townsend wrote:

So lvm2 has been fixed upstream to report more instructive messages to
the user - although it still requires some experience in managing
thin-pool kernel metadata and lvm2 metadata.


That's good news! However, I believe I lack the requisite experience. Is
there some documentation that I ought to read as a starting point? Or is it
best to just read the source?

In your case, dmeventd did an 'unlocked' resize while another command
was taking a snapshot - and it happened that the 'snapshot' sequence
won - so until the thin-pool was reloaded, lvm2 did not spot the difference.
(This is simply a bad race caused by locking not working properly on your
system.)


After reading more about LVM locking, it looks like the original issue
might have been that the locking directory lives on an LV instead of on a
non-LVM-managed block device. (Although the locking directory is on a
different VG, on a different PV, from the one that had the error.)

Is there a way to make dmeventd (or any other LVM program) abort if this
locking fails? Should I switch to using a clustered locking daemon (even
though I have only a single, non-virtualized host)?
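For reference, the file-based locking discussed here is configured in the global section of lvm.conf, roughly as follows. This is a sketch with common default values, not taken from this system:

```
# excerpt from /etc/lvm/lvm.conf (illustrative defaults)
global {
    # 1 = local, file-based locking
    locking_type = 1
    # when 1, commands wait for locks; when 0, they fail immediately
    wait_for_locks = 1
    # directory holding the lock files -- should not itself live on LVM
    locking_dir = "/run/lock/lvm"
}
```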

You will need to vgcfgrestore - but I think you've misused the recovered
piece I passed you, where I specifically asked you to replace only the
specific segments of the resized thin-pool within your latest VG metadata -
since those likely have all the proper mappings to thin LVs.


All I did was use vgcfgrestore to apply the metadata file attached to your
previous private email. I had to edit the transaction number, as I noted
previously; that was a single-line change. Was that the wrong thing to do?
I lack experience with LVM thin metadata, so I am flying a bit blind
here. I apologize if I've made things worse.
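For anyone finding this thread later, the edit in question was of this shape. This is an illustrative sketch on a toy fragment, not the real backup file; the /tmp path, pool name, and transaction numbers are examples:

```shell
# Illustrative only: bump a thin-pool transaction_id in a metadata backup,
# shown on a toy fragment. Real backups live under /etc/lvm/backup.
cat > /tmp/meta-fragment.txt <<'EOF'
pool0 {
    transaction_id = 41
}
EOF
sed -i 's/transaction_id = 41/transaction_id = 42/' /tmp/meta-fragment.txt
grep transaction_id /tmp/meta-fragment.txt
# The actual restore would then be along the lines of:
#   vgcfgrestore --force --file /path/to/edited-backup <vgname>
```

(vgcfgrestore requires --force when the VG contains thin pools, which is one reason restoring such metadata should only be done with guidance, as in this thread.)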

While you have taken the metadata from the moment of the 'resize', you've
lost the lvm2 metadata for the thin LVs created later.

I'll try to make one for you.


Thank you very much. I am extremely grateful that you've helped me so much
in repairing my system.

Well - lvm2 is a glibc-oriented project - so users of those 'esoteric'
distributions need to be experts on their own.

If you can provide a coredump, or even better a patch for the crash, we
might replace the code with something more usable - but there is zero
testing with anything other than glibc...


Noted. I believe I'll be switching to glibc, because a number of other
packages are also broken on this distro.

If you have an interest, this is the issue I've opened with my distro
about the crash: https://github.com/void-linux/void-packages/issues/25125
I despair of this receiving much attention, given that not even gdb works
properly.


Hello! Could somebody advise whether restoring the VG metadata is likely to
cause this system's condition to worsen? At this point, all I want to do is
get the data off this drive and then start over with something more
stable.

I believe I have repaired my system to the point of usability. I ended up editing *JUST* the transaction number in the metadata backup from before the snapshot creation that brought the system down. Restoring that VG metadata backup enabled me to activate my thin pool and LVs.

After booting the system and inspecting further, I discovered that the locking directory for dmeventd was *NOT* mounted on a thin LV as I stated earlier, but was instead mounted on a tmpfs. As Zdenek hinted that the root cause of my original problem may have been a failure of local, file-based locking, my conclusion is that I may have gotten unlucky with the timing of the delivery of a SIGTERM from my broken runit script. I'm happy to be shown wrong about that conclusion, but it's the best I can draw from the information available to me.

The follow-on problem appears to have been a metadata mismatch on the transaction ID, which was effectively repaired by the second transaction-ID edit and restoration described above.
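For reference, one way to see such a mismatch is to compare what lvm2 has recorded against what the kernel reports. This is a hypothetical sketch; the VG and pool names are examples, and by default it only prints the commands (set RUN=eval and run as root to execute them):

```shell
# Sketch: compare lvm2's recorded thin-pool transaction_id with the
# kernel's. Names are examples; default is to print commands, not run them.
RUN=${RUN:-echo}
$RUN lvs --noheadings -o transaction_id vg0/pool0   # lvm2 metadata view
$RUN dmsetup status vg0-pool0-tpool                 # kernel view: the field
                                                    # after "thin-pool" is
                                                    # the transaction id
```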

I have now been able to boot my system and complete several full iterations of my backup script, which uses thin LV snapshots. There were no problems. FWIW, I have been using this backup script as a cron job for the last 8 years with minimal modification. I would be surprised to discover that there's something fundamentally erroneous about the script.
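A thin-snapshot backup cycle of the kind described is roughly the following. This is a minimal sketch, not the actual script; all VG/LV names and paths are examples, and by default each step is only printed (set RUN=eval and run as root to execute it):

```shell
# Sketch of a thin-snapshot backup cycle. Names and paths are examples.
# Default behaviour is a dry run: each command is printed, not executed.
RUN=${RUN:-echo}
VG=vg0; LV=home; SNAP="${LV}_backup"
$RUN lvcreate -s -n "$SNAP" "$VG/$LV"     # thin snapshot: no size needed
$RUN lvchange -ay -K "$VG/$SNAP"          # activate despite activation skip
$RUN mount -o ro "/dev/$VG/$SNAP" /mnt/snap
$RUN rsync -a /mnt/snap/ /srv/backup/home/
$RUN umount /mnt/snap
$RUN lvremove -y "$VG/$SNAP"
```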

Thanks again to Zdenek Kabelac for all the help. If there's any additional information I can provide to stop this from happening to anyone else in the future, I'm happy to assist. Otherwise, I'm going to consider this problem solved.

--Duncan Townsend

_______________________________________________
linux-lvm mailing list
linux-lvm@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/



