On Wed, Jan 25, 2017 at 10:44:34PM -0800, Darrick J. Wong wrote:
> On Thu, Jan 26, 2017 at 01:08:38PM +0800, Eryu Guan wrote:
> > On Sat, Jan 21, 2017 at 12:10:19AM -0800, Darrick J. Wong wrote:
> > > Hi all,
> > >
> > > This is the fifth revision of a patchset that adds to the XFS
> > > userland tools support for online metadata scrubbing and repair.
> > >
> > > The new patches in this series do three things: first, they expand
> > > the filesystem populate commands inside xfstests to be able to
> > > create all types of XFS metadata. Second, they create a bunch of
> > > xfs_db wrapper functions to iterate all fields present in a given
> > > metadata object and fuzz them in various ways. Finally, for each
> > > metadata object type there is a separate test that iteratively
> > > fuzzes all fields of that object and runs it through the
> > > mount/scrub/repair loop to see what happens.
> > >
> > > If you're going to start using this mess, you probably ought to
> > > just pull from my github trees for the kernel[1], xfsprogs[2], and
> > > xfstests[3].
> >
> > Are your github trees synced with your kernel.org trees? It seems
> > so, and I ran my tests with your kernel.org trees.
>
> Yes, they are. (Or at least they should be, if I did it correctly.)
>
> > > The kernel patches in the git trees should apply to 4.10-rc4; the
> > > xfsprogs patches to for-next; and the xfstests patches to master.
> > >
> > > The patches have survived all auto group xfstests, both in
> > > scrub-only mode and with a special debugging mode to xfs_scrub
> > > that forces it to rebuild the metadata structures even if they're
> > > not damaged.
> >
> > I've had trouble finishing a full run of the tests so far; they
> > take a long time, and in some tests xfs_repair or xfs_scrub just
>
> Yes, the amount of dmesg noise slows the tests wayyyyyy down. One of
> the newer patches reduces the amount of spew when the scrubbers are
> running.
>
> (FWIW, when I run them I have a debug patch that shuts up all the
> warnings.)
>
> > sit there spinning; sometimes I can kill them to let the test
> > continue,
>
> There are some undiagnosed deadlocks in xfs_repair, and some OOM
> problems in xfs_db that didn't get fixed until recently.
>
> > and sometimes I can't (e.g. xfs/1312: I tried to kill the
> > xfs_scrub process, but it became <defunct>).
>
> That's odd. Next time that happens, can you sysrq-t to find out where
> the scrub threads are stuck, please?

I still have it in the zombie state; the attachment is the sysrq-t
dump saved from /var/log/messages (it's not easy to read, though).

> > And in most of the tests I have run, I see failures like these:
> >
> > +scrub didn't fail with length = ones.
> > +scrub didn't fail with length = firstbit.
> > +scrub didn't fail with length = middlebit.
> > +scrub didn't fail with length = lastbit.
> > ....
> >
> > Not sure if that's expected?
>
> Yes, that's expected. The fuzz tests expect that the repairing
> program (xfs_{scrub,repair}) will complain about the corrupt field,
> repair it, and that a subsequent re-run will exit cleanly. But there
> are quite a few fields, like uid/gid and timestamps, that have no
> inherent meaning to XFS, so for those there's no problem to be
> detected. Some of the fuzzes also prevent the fs from mounting, which
> causes other error messages.
>
> The rest could be undiagnosed problems in other parts of XFS (or
> scrub); I've not had time to triage a lot of it. I've been recording
> exactly what and where things fail, and I'll have a look at them as
> time allows.
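
(As background for anyone reading along: each of these fuzz tests
boils down to a loop roughly like the sketch below. This is a minimal
illustration, not the real test code -- the SCRATCH_* paths and the
fuzzed field are made up, and "fuzz" is the new xfs_db command this
series adds; the first four verbs are the ones visible in the test
output above.)

    # Sketch: fuzz one superblock field with each verb while the fs
    # is offline, then check whether online scrub notices.
    SCRATCH_DEV=/dev/sdb1
    SCRATCH_MNT=/mnt/scratch
    for verb in zeroes ones firstbit middlebit lastbit random; do
        umount $SCRATCH_MNT 2>/dev/null
        mkfs.xfs -f $SCRATCH_DEV >/dev/null
        # corrupt the field (the fs label here, chosen so the fuzz
        # usually doesn't prevent mounting)
        xfs_db -x -c 'sb 0' -c "fuzz fname $verb" $SCRATCH_DEV
        mount $SCRATCH_DEV $SCRATCH_MNT || continue
        # a clean scrub exit means the fuzz went undetected
        xfs_scrub $SCRATCH_MNT && \
            echo "scrub didn't fail with fname = $verb"
        umount $SCRATCH_MNT
    done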
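
(For reference, the attached dump was captured the usual sysrq way;
kernel.sysrq has to permit the 't' function first:)

    # enable all sysrq functions, then dump every task's stack
    # trace to the kernel log
    sysctl -w kernel.sysrq=1
    echo t > /proc/sysrq-trigger
    dmesg > sysrq-t.log    # or pull it from /var/log/messages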

> > I also hit an xfs_scrub and xfs_repair double free bug in
> > xfs/1312 (perhaps that's why I can't kill it).
>
> Maybe. In theory the page refcounts get reset, I think, but I've
> seen the VM crash with double-fault errors and other weirdness that
> seems to go away when the refcount bugs go away.
>
> > OTOH, all these failures/issues look like kernel or userspace
> > bugs; I went through all the patches and new tests and didn't
> > find anything obviously wrong. So I think it's fine to merge them
> > in this week's update. Unless you have second thoughts?
>
> Sure. I will never enable them in any of the heavily used groups, so
> that should be fine. Though I do have a request -- the 13xx numbers
> are set up so that if test (1300+x) fuzzes object X and tries to
> xfs_repair it, then test (1340+x) fuzzes the same X but tries to
> xfs_scrub it. Could you interweave them when you renumber the tests?

Perhaps that explains why there's no 1336-1340 :)

> e.g. 1302 -> 510, 1342 -> 511, 1303 -> 512, 1343 -> 513?
>
> That'll help me keep the repair & scrub fuzz tests together.

Sure, I'll renumber the tests and let you review them first before
pushing them upstream.
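
Something like the following is what I have in mind (an untested
sketch -- the starting number comes from your example, the loop
bounds are a guess from the 1336-1340 gap, and each test's .out file
and entry in the group file would need the same move):

    # interleave repair (1300+x) and scrub (1340+x) fuzz tests into
    # consecutive new numbers starting at 510
    new=510
    for x in $(seq 2 35); do
        git mv tests/xfs/$((1300 + x)) tests/xfs/$new           # repair
        git mv tests/xfs/$((1340 + x)) tests/xfs/$((new + 1))   # scrub
        new=$((new + 2))
    done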

Thanks,
Eryu

Attachment: xfs_scrub-hang-sysrq-t.log.gz (application/gzip)