Hello List,

what should I expect when I replace the device that contains my XFS log? Is there a specific procedure to follow? Also, did the expected behaviour change at some point in kernel history?

Some background: when I joined my current employer in 2006, I performed some benchmarks to see which FS would provide the best performance for our workload: multiple NFS clients appending to large (multi-GB) files. XFS was the clear winner, so since then we have had several dozen workstations with XFS on mdraid (with LVM in between). For performance reasons, we keep the log on a partition of the SSD that also holds the OS. In all those years, the only data loss I can remember was caused by a flaky controller that threw disks out of a RAID6 faster than they could be rebuilt.

When a user leaves us but their data should still be kept online, we move the HDDs to a disk array connected to a special fileserver. The workstation can then get a fresh install and be used by someone else. On the fileserver, for every set of disks added, we create a new LV in a VG dedicated to XFS logs and use that to mount the FS. That, too, has never posed any problems (except for a duplicate UUID at some point, but that was easily fixed with xfs_db).

However, I've been bitten by a nasty problem twice in recent weeks. In the first instance, I wanted to replace a bunch of disks in a machine (something like 4x10TB to 4x16TB). Usually we do that by setting up a new machine, rsyncing all the data, and then swapping the machines. In this instance, I refrained from swapping the machines (due to lack of hardware) and merely swapped the disks.

Initially, the kernel refused to mount the new disks. This was expected: the UUID of the log was incorrect, as I had only swapped the HDDs, not the log device. I called xfs_repair to fix that. xfs_repair completed successfully, and the only modification it reported was reformatting the log. However, the kernel still refused to mount the file system ("structure needs cleaning"), and a second run of xfs_repair reported hundreds of problems. It managed to repair them all, but afterwards the file system was empty. I started over, this time calling xfs_repair -L, but the results were the same. The hardware, kernel version, and Linux distribution were exactly the same on both machines. At the time, I thought maybe there was a strange bug in that (quite old) kernel (4.12.14 from openSUSE 15.1), so I resorted to waiting for new hardware and setting up a fresh machine.

Yesterday, I did a Linux upgrade for a different user. After a clean shutdown, I wiped the SSD (including the XFS log) and re-imaged it with an up-to-date openSUSE install. Afterwards, everything went as described above.

I find this extremely puzzling, especially since we've moved disks like this to our file server more than a dozen times without any problems, and I fail to see what is different there. I'd be happy for an explanation of what can happen to damage the FS in this scenario -- just out of curiosity -- but of course, any steps I can take to keep the FS intact during this procedure are also very welcome.

Thank you,

A.

--
Ansgar Esztermann
Sysadmin Dep. Theoretical and Computational Biophysics
https://www.mpinat.mpg.de/person/11315/3883774
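
P.S. In case it helps to see the procedure spelled out, this is roughly what the external-log setup and the repair attempts look like on our side. The VG/LV names, device nodes, mount point, and sizes below are placeholders for illustration, not our exact configuration:

    # on the fileserver: one log LV per imported set of disks,
    # in a VG reserved for XFS logs
    lvcreate -L 2G -n log_userX vg_xfslogs

    # mount the data RAID with its external log
    mount -t xfs -o logdev=/dev/vg_xfslogs/log_userX /dev/md0 /data

    # what I did (approximately) after the log device had been replaced:
    # first attempt, pointing xfs_repair at the new log device
    xfs_repair -l /dev/vg_xfslogs/log_userX /dev/md0

    # second attempt, forcing the log to be zeroed
    xfs_repair -L -l /dev/vg_xfslogs/log_userX /dev/md0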