Hello List,

what should I expect when I replace the device that contains my XFS log? Is there a specific procedure to follow? Also, did the expected behaviour change at some point in kernel history?

Some background: when I joined my current employer in 2006, I performed some benchmarks to see which FS would provide the best performance for our workload: multiple NFS clients appending to large (multi-GB) files. XFS was the clear winner, so since then we have had several dozen workstations with XFS on mdraid (with LVM in between). For performance reasons, we keep the log on a partition of the SSD that also holds the OS. In all those years, the only data loss I can remember was caused by a flaky controller that threw disks out of a RAID6 faster than they could be rebuilt.

When a user leaves us but their data should still be kept online, we move the HDDs to a disk array connected to a special fileserver. The workstation can then get a fresh install and be used by someone else. On the fileserver, for every set of disks added, we create a new LV in a VG dedicated to XFS logs and use that to mount the FS. That, too, has never posed any problems (except for a duplicate UUID at some point, but that was easily fixed with xfs_db).

However, I've been bitten by a nasty problem twice in recent weeks. In the first instance, I wanted to replace a bunch of disks in a machine (something like 4x10TB to 4x16TB). Usually we do that by setting up a new machine, rsyncing all the data, and then swapping the machines. In this instance, I refrained from swapping the machines (due to lack of hardware) and merely swapped the disks.

Initially, the kernel refused to mount the new disks. This was expected: the UUID of the log was incorrect, as I had only swapped the HDDs, not the log device. I called xfs_repair to fix that. xfs_repair completed successfully, and the only modification it reported was reformatting the log. However, the kernel still refused to mount the file system ("structure needs cleaning"), and a second run of xfs_repair reported hundreds of problems. It managed to repair them all, but afterwards the file system was empty. I started over, this time calling xfs_repair -L, but the results were the same. The hardware, kernel version, and Linux distribution were exactly the same on both machines. At the time, I thought maybe there was a strange bug in that (quite old) kernel (4.12.14 from openSUSE 15.1), so I resorted to waiting for new hardware and setting up a fresh machine.

Yesterday, I did a Linux upgrade for a different user. After a clean shutdown, I wiped the SSD (including the XFS log) and re-imaged it with an up-to-date openSUSE install. Afterwards, everything went as described above.

I find this extremely puzzling, especially since we've moved disks like this to our file server more than a dozen times without any problems, and I fail to see what is different there. I'd be happy for an explanation of what can happen to damage the FS in this scenario -- just out of curiosity -- but of course, any steps I can take to keep the FS intact during this procedure are also very welcome.

Thank you,

A.

--
Ansgar Esztermann
Sysadmin Dep. Theoretical and Computational Biophysics
https://www.mpinat.mpg.de/person/11315/3883774
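
P.S. In case it helps to see the procedure spelled out, this is roughly what the external-log setup and the repair attempts look like on our side. The VG/LV names, device nodes, mount point, and sizes below are placeholders for illustration, not our exact configuration:

    # on the fileserver: one log LV per imported set of disks,
    # in a VG reserved for XFS logs
    lvcreate -L 2G -n log_userX vg_xfslogs

    # mount the data RAID with its external log
    mount -t xfs -o logdev=/dev/vg_xfslogs/log_userX /dev/md0 /data

    # what I did (approximately) after the log device had been replaced:
    # first attempt, pointing xfs_repair at the new log device
    xfs_repair -l /dev/vg_xfslogs/log_userX /dev/md0

    # second attempt, forcing the log to be zeroed
    xfs_repair -L -l /dev/vg_xfslogs/log_userX /dev/md0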