I just posted a blog post about the NFS outage, but I thought I would copy it here to get more feedback. http://scrye.com/wordpress-mu/nirik/2011/12/10/fedora-nfs-server-outage-retrospective/

As you may have seen if you are on the fedora announce list, we had an outage the other day of our main build system NFS storage. This meant that no builds could be made, and data could not be downloaded from koji (rpms, build info, etc). I thought I would share here what happened so we can learn from it and try to prevent or mitigate this happening again.

First, a bit of background on the setup: we have a storage device that exports raw storage as iSCSI. This is consumed by our main NFS server (nfs01), which uses the device with lvm2 and has an ext4 filesystem on it, around 12TB in size. This data is then exported to various other machines, including the builders, the kojipkgs squid frontend for packages, the koji hubs, and the release engineering boxes that push updates. We also have a backup NFS server (bnfs01) with its own separate storage holding a backup copy of the primary data.

On the morning of December 8th, the connection between the iSCSI backend and nfs01 had a hiccup. It retried the in-progress writes, decided it could resume ok, and kept going. The filesystem had "Errors behavior: Continue" set, so it also kept going (although no actual fs errors were logged, so that may not matter). Shortly after this, NFS locks started failing and builds were getting I/O errors.

An lvm snapshot was made and a fsck run on that snapshot, which completed after around 2 hours. A fsck was then run on the actual volume itself, but that took around 8 hours and showed a great deal more corruption than the snapshot had. In order to get things into a good state, we then rsynced the snapshot off to our backup storage (which took around 8 hours) and merged the snapshot back as the master fs on the volume (which took around 30 minutes to complete). Then, after a reboot, we were back up ok. A small number of builds had been made after the issue started; we purged them from the koji database and re-ran them against the current/valid/repaired filesystem. After that the builders were brought back on-line, the queued-up builds were processed, and things were back to normal.

So, some lessons/ideas here, in no particular order:

* 12TB means most anything you decide to do will take a while to finish. On the plus side, that gives you lots of time to think about the next step.

* We should change the default on-error behavior on that volume to at least 'read-only'. Then errors would at least stop further corruption, and at best prevent the need for a lengthy fsck. It's not entirely clear whether the iSCSI errors would have tripped the fs error condition or not, however. (There's a rough sketch of this change after the list.)

* We could do better about more regular backups of this data. A daily snapshot, with an rsync of that snapshot off to backup storage, could save us time if another backup sync is ever needed. We would also then have the snapshot itself to go back to if needed. (Also sketched below.)

* Down the road, some of the cluster filesystems might be a good thing to investigate and transition to. If we can spread the backend storage around and have enough nodes, the failure of any one of them might not have as much impact.

* Perhaps we could add monitoring for iSCSI errors, so we notice and react to them more quickly. (A rough check is sketched below.)

* lvm, its snapshots, and the ability to merge a snapshot back in as the primary really helped us out here.
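To make a few of the items above more concrete, here are some rough sketches. All of the device, volume group, and host names in them are placeholders I made up for illustration, not our actual configuration.

First, flipping the ext4 on-error behavior from "continue" to "remount read-only":

#!/usr/bin/python
# Minimal sketch: change the ext4 "Errors behavior" so a future fs error
# remounts the volume read-only instead of letting corruption pile up.
# The LV path below is a made-up placeholder, not the real volume on nfs01.
import subprocess

DEVICE = "/dev/vg_nfs/koji"   # placeholder

# tune2fs -e remount-ro updates the superblock; it persists across reboots.
subprocess.check_call(["tune2fs", "-e", "remount-ro", DEVICE])

# dumpe2fs -h prints the superblock header, including "Errors behavior",
# so we can confirm the change took.
subprocess.check_call(["dumpe2fs", "-h", DEVICE])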
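Next, the daily snapshot-and-rsync idea, assuming an lvm layout like /dev/vg_nfs/koji and the backup target living on bnfs01 (sizes and paths are again placeholders):

#!/usr/bin/python
# Rough sketch of a nightly snapshot + rsync backup.  Volume group, LV,
# mountpoint, snapshot size and target names are all placeholders.
import subprocess

VG, LV, SNAP = "vg_nfs", "koji", "koji-nightly-snap"
MOUNTPOINT = "/mnt/koji-snap"
BACKUP_TARGET = "bnfs01:/srv/koji-backup/"

def run(*cmd):
    subprocess.check_call(list(cmd))

# 1. Snapshot the volume so the rsync sees a frozen, consistent view.
run("lvcreate", "--snapshot", "--size", "500G",
    "--name", SNAP, "/dev/%s/%s" % (VG, LV))
try:
    # 2. Mount the snapshot read-only and sync it off to the backup box.
    run("mount", "-o", "ro", "/dev/%s/%s" % (VG, SNAP), MOUNTPOINT)
    run("rsync", "-aH", "--delete", MOUNTPOINT + "/", BACKUP_TARGET)
finally:
    # 3. Clean up; a forgotten snapshot will eventually fill up.  Keeping
    #    yesterday's snapshot around instead would give us something local
    #    to roll back to, at the cost of snapshot space.
    subprocess.call(["umount", MOUNTPOINT])
    subprocess.call(["lvremove", "-f", "/dev/%s/%s" % (VG, SNAP)])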
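For the monitoring item, something along these lines could hang off our existing checks. The strings it matches on are guesses at what the iSCSI initiator logs on a connection error, and would need to be verified against real logs before anyone trusts it:

#!/usr/bin/python
# Sketch of a check that flags iSCSI errors in the kernel ring buffer.
# The match strings are assumptions; verify against real initiator logs.
import subprocess
import sys

log = subprocess.check_output(["dmesg"]).decode("utf-8", "replace")

hits = [line for line in log.splitlines()
        if "iscsi" in line.lower()
        and ("conn error" in line.lower() or "recovery" in line.lower())]

if hits:
    print("WARNING: %d possible iSCSI error(s) in the kernel log" % len(hits))
    for line in hits[-5:]:
        print("  " + line)
    sys.exit(1)   # non-zero exit so the monitoring system raises an alert
else:
    print("OK: no iSCSI errors seen")
    sys.exit(0)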
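Finally, since the snapshot/merge trick is what saved us, here is the general shape of it, again with placeholder names; this is just to illustrate the order of operations, not how it was actually run during the outage:

#!/usr/bin/python
# General shape of the snapshot -> fsck -> merge recovery, placeholder names.
import subprocess

def run(*cmd):
    subprocess.check_call(list(cmd))

VG, LV, SNAP = "vg_nfs", "koji", "koji-repair-snap"

# 1. Freeze a point-in-time copy of the (possibly damaged) filesystem.
run("lvcreate", "--snapshot", "--size", "500G",
    "--name", SNAP, "/dev/%s/%s" % (VG, LV))

# 2. Check and repair the snapshot rather than the live volume.
run("fsck.ext4", "-f", "-y", "/dev/%s/%s" % (VG, SNAP))

# 3. Merge the snapshot back so it becomes the origin again.  If the origin
#    is still in use, lvm defers the merge until the volume is next activated.
run("lvconvert", "--merge", "/dev/%s/%s" % (VG, SNAP))

The nice property of the merge is that only the blocks that differ have to be copied back, which is a big part of why it finished in about 30 minutes while the full rsync took around 8 hours.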
Feel free to chime in with other thoughts or ideas. Hopefully it will be quite some time before we have another outage like this one.