Fwd: Re: NILFS error after power loss

mikael@xxxxxxxxxxx · Wed, 30 Jan 2019 00:01:04 +0100

Hi,

I first sent this email to pg@xxxxxxxxxxxxxxxxxxxxx since it was the 
email list I sent my previous emails to. However, I am unsure whether my 
email really reached the email list.

Sorry it took me 1.5 years to respond to this (see below). I have been 
using another computer in the meantime, so I have not been stuck since 
then. :) I have a daughter who has just started kindergarten, so I have 
a bit more time now. :)

See my answers below.

2017-07-31 15:20 skrev pg@xxxxxxxxxxxxxxxxxxxxx:
[ ... ]

But as far as I understand it is not possible to mount a
previous snapshot as writable if there are snapshots/checkpoints
after this snapshot. Since I only get a filesystem error when
mounting a snapshot writable,

That seems unlikely to me. After mounting read-only, check whether
the whole filetree can be accessed error-free, with something like

  find $DIR -xdev -perm /07777 | wc -l

for metadata and then for data too:

  tar -f /dev/zero -c --one-file-system $DIR

The metadata test worked well, i.e. without any errors.

The other test resulted in:

root@ubuntu:/mnt/home/mikael# tar -f /dev/zero -c --one-file-system /mnt
...
tar: /mnt/var/log/journal/abdb30cb66eb43ec8f9c05e1bc6e2af5/system@06f2a50856274666b1535caf10c332fa-0000000000000001-00054a024e49b482.journal: Read error at byte 0, while reading 9728 bytes: Input/output error
tar: /mnt/nix/var/nix/daemon-socket/socket: socket ignored
tar: Removing leading `/' from hard link targets
tar: /mnt/nix/store/5n2f3kak5vf8978h98kw3zq5p191cvyl-ghostscript-fonts/n021003l.pfb: Read error at byte 0, while reading 7680 bytes: Input/output error
tar: Exiting with failure status due to previous errors
root@ubuntu:/mnt/home/mikael#

I suppose this does not look so good. Do you want me to send you some 
more information regarding the problem or should I just remove the 
newest checkpoint and see if that helps?
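
For repeating the two checks on another checkpoint, they could be 
wrapped in a small shell function. This is only a sketch: the name 
check_tree is made up for illustration, and it assumes GNU find and GNU 
tar and a filesystem that is already mounted read-only.

```shell
#!/bin/sh
# Sketch only: check_tree is a hypothetical helper, not part of nilfs-utils.
# It runs the metadata pass and the data pass quoted above on one directory.
check_tree() {
    dir=$1
    status=0

    # Metadata pass: find must stat every entry to evaluate -perm.
    if ! find "$dir" -xdev -perm /07777 > /dev/null; then
        echo "metadata pass: errors under $dir" >&2
        status=1
    fi

    # Data pass: tar reads the contents of every file. The archive goes to
    # /dev/zero because GNU tar skips reading file data for /dev/null.
    if ! tar -f /dev/zero -c --one-file-system "$dir" > /dev/null; then
        echo "data pass: errors under $dir" >&2
        status=1
    fi

    return $status
}
```

It would be called as "check_tree /mnt" and returns non-zero if either 
pass reports an error, as the data pass did in the output above.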

Eventually, if you can find a checkpoint/snapshot that is
error-free, you can delete any newer corrupted ones and mount that
one read-write. Ideally you would do a nice backup before doing
that.

If I remove checkpoints with errors until I reach an error-free one, 
should I not be able to simply reboot the system after that? Will NILFS 
not then automatically mount and continue from the last error-free 
checkpoint?

If you cannot find any that is error-free, probably that was
either a grievous IO error (most likely lack of proper barriers)
or the consequences of that recently discovered bug, if you are
very unlucky.

Usually the second newest checkpoint/snapshot is error-free when
a system crashes and the newest has errors; that is, usually
only the newest checkpoint is invalid.

--
Kind regards

Mikael Andersson