On Tue, Oct 11, 2022 at 10:29:32AM +1100, Dave Chinner wrote:
> On Mon, Oct 10, 2022 at 03:40:51PM +0000, Darrick Wong wrote:
> > LGTM, want to send this to the upstream list to start that
> > discussion?

UGH, so I thought this was an internal thread, but it turns out that
linux-xfs has been cc'd for a while but none of the messages made it
to lore. I'll fill in some missing context below.

> > --D
> >
> > ________________________________________
> > From: Srikanth C S <srikanth.c.s@xxxxxxxxxx>
> > Sent: Monday, October 10, 2022 08:24
> > To: linux-xfs@xxxxxxxxxxxxxxx; Darrick Wong
> > Cc: Rajesh Sivaramasubramaniom; Junxiao Bi
> > Subject: [PATCH] fsck.xfs: mount/umount xfs fs to replay log before running xfs_repair
> >
> > fsck.xfs does xfs_repair -e if fsck.mode=force is set. It is
> > possible that when the machine crashes, the fs is in an
> > inconsistent state with the journal log not yet replayed. This can
> > put the machine into a rescue shell. To address this problem,
> > mount and umount the fs before running xfs_repair.
>
> What's the purpose of forcing xfs_repair to be run on every boot?
> The whole point of having a journalling filesystem is to avoid
> needing to run fsck on every boot.

I don't think repair-at-every-boot is the overall goal for our
customer base here. We've had some <cough> major support events over
the last few months. There are two things going on: some of the
problems are due to our datacenters abending, and the rest of it is
the pile of data corruption problems that you and I and Chandan have
been working through for months now.

This means that customer support has 30,000 VMs to reboot. 90% of the
systems involved seem to have survived more or less intact, but that
leaves 10% of them with latent errors, unmountable root filesystems,
etc. They probably have even more than that, but I don't recommend
inviting the G-men for a visit to find out the real sum.

Since these machines are remotely manageable, support /can/ inject
the kernel command line with 'fsck.mode=force' to kick off xfs_repair
if the machine won't come up or if they suspect there might be deeper
issues with latent errors in the fs metadata, which is what they did
to try to get everyone running ASAP while anticipating any future
problems...

> I get why one might want to occasionally force a repair check on
> boot (e.g. to repair a problem with the root filesystem), but this
> is a -rescue operation- and really shouldn't be occurring
> automatically on every boot or after a kernel crash.
>
> If it is only occurring during rescue operations, then why is it a
> problem dumping out to a shell for the admin performing rescue
> operations to deal with this directly? e.g. if the fs has a
> corrupted journal, then a mount cycle will not fix the problem and
> the admin will still get dumped into a rescue shell to fix the
> problem manually.

...however, most of those filesystems in the abended datacenter had
dirty logs, so repair failed and dumped all those machines to the
emergency shell. 3000 machines * 15 minutes per ticket is a lot of
downtime and a lot of manual labor. Support would really like a means
to automate as much of this as they can.

They had assumed that fsck.repair=yes would DTRT to try to recover
and/or repair the fs, and were surprised to discover that this script
would not even try to recover the log.
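(Aside, for anyone who hasn't internalized the plumbing here:
systemd-fsck turns fsck.mode=force on the kernel command line into
'fsck -f' and fsck.repair=yes into 'fsck -y', and fsck.xfs maps those
flags onto the $FORCE and $AUTO variables that show up in the patch
below. Going from memory, the option parsing at the top of
xfs_fsck.sh is roughly this, so double-check me on the exact getopts
string before relying on it:

	AUTO=false
	FORCE=false
	while getopts ":aApyf" c
	do
		case $c in
		a|A|p|y)	AUTO=true;;	# -a/-p/-y from fsck.repair=preen/yes
		f)		FORCE=true;;	# -f from fsck.mode=force
		esac
	done

So fsck.mode=force by itself only gets you $FORCE; you need
fsck.repair=yes or preen to also get $AUTO.)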
> Hence I don't really know why anyone would be configuring their
> systems like this:

Some of the machines are VM hypervisors, where crashes due to the
rootfs going offline unexpectedly create unplanned downtime for a lot
of machines. It would be less disruptive to our cloud fleet overall
for a hypervisor to take a little longer to boot if it's running
fsck, since the rootfs is the bare minimum needed to get the smartnic
started. For this case it might be preferable to configure this
permanently, since there aren't /that/ many VM hypervisors. Nobody
else gets fsck.mode=force in the grub config.

> > Run xfs_repair -e when fsck.mode=force and repair=auto or yes.

(If it were me writing the patch, I'd have made repair=auto detect a
dirty log and continue the boot without running repair at all; rough
sketch in the P.S. below...)

> as it makes no sense at all for a journalling filesystem.
>
> > If fsck.mode=force and fsck.repair=no, run xfs_repair -n without
> > replaying the logs.
>
> Nor is it clear why anyone would want to force a boot time fsck and
> then not repair the damage that might be found....

That part's my fault; I suggested that we should fix the script so
that repair=no selects dry run mode like you might expect.

> More explanation, please!

Frankly, I /don't/ want to expend a lot of time wringing our hands
over how exactly we hammer 2022 XFS tools into 1994 ext2 behavioral
semantics. What they really want is online fsck to detect and fix
problems in real time, but I can't seem to engage the community on
how exactly we land this thing now that I've finished writing it.

--D

> > Signed-off-by: Srikanth C S <srikanth.c.s@xxxxxxxxxx>
> > ---
> >  fsck/xfs_fsck.sh | 20 ++++++++++++++++++--
> >  1 file changed, 18 insertions(+), 2 deletions(-)
> >
> > diff --git a/fsck/xfs_fsck.sh b/fsck/xfs_fsck.sh
> > index 6af0f22..21a8c19 100755
> > --- a/fsck/xfs_fsck.sh
> > +++ b/fsck/xfs_fsck.sh
> > @@ -63,8 +63,24 @@ if [ -n "$PS1" -o -t 0 ]; then
> >  fi
> >
> >  if $FORCE; then
> > -	xfs_repair -e $DEV
> > -	repair2fsck_code $?
> > +	if $AUTO; then
> > +		xfs_repair -e $DEV
> > +		error=$?
> > +		if [ $error -eq 2 ]; then
> > +			echo "Replaying log for $DEV"
> > +			mkdir -p /tmp/tmp_mnt
> > +			mount $DEV /tmp/tmp_mnt
> > +			umount /tmp/tmp_mnt
> > +			xfs_repair -e $DEV
> > +			error=$?
> > +			rmdir /tmp/tmp_mnt
> > +		fi
> > +	else
> > +		#fsck.mode=force is set but fsck.repair=no
> > +		xfs_repair -n $DEV
> > +		error=$?
> > +	fi
> > +	repair2fsck_code $error
> >  	exit $?
> >  fi

> As a side note, the patch has damaged whitespace....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
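P.S. Since I mentioned it above, here's a rough (and untested, so
treat it as a sketch, not a patch) version of what I mean by
"repair=auto detects a dirty log and continues the boot". It keys off
the same thing the patch does (xfs_repair exits with status 2 when it
finds a dirty log), but instead of mount cycling, it exits 0 so
systemd keeps booting and log recovery happens on the real rootfs
mount:

	if $FORCE; then
		if $AUTO; then
			xfs_repair -e $DEV
			error=$?
			if [ $error -eq 2 ]; then
				# Dirty log: don't repair; the kernel will
				# replay the log when the real mount happens.
				echo "$DEV has a dirty log, deferring to mount"
				exit 0
			fi
		else
			# fsck.mode=force but fsck.repair=no: dry run only
			xfs_repair -n $DEV
			error=$?
		fi
		repair2fsck_code $error
		exit $?
	fi

Whether exit status 0 (vs. 4) is the right thing to report for a
dirty log is debatable, but that's the general shape of it.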