On Tue, Apr 24, 2012 at 10:55:22AM +0200, Juerg Haefliger wrote:
> On Tue, Apr 24, 2012 at 1:58 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > On Mon, Apr 23, 2012 at 05:33:40PM +0200, Juerg Haefliger wrote:
> >> Hi Dave,
> >>
> >>
> >> On Mon, Apr 23, 2012 at 4:38 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >> > On Mon, Apr 23, 2012 at 02:09:53PM +0200, Juerg Haefliger wrote:
> >> >> Hi,
> >> >>
> >> >> I have a test system that I'm using to try to force an XFS filesystem
> >> >> hang since we're encountering that problem sporadically in production
> >> >> running a 2.6.38-8 Natty kernel. The original idea was to use this
> >> >> system to find the patches that fix the issue but I've tried a whole
> >> >> bunch of kernels and they all hang eventually (anywhere from 5 to 45
> >> >> mins) with the stack trace shown below.
> >> >
> >> > If you kill the workload, does the file system recover normally?
> >>
> >> The workload can't be killed.
> >
> > OK.
> >
> >> >> Only an emergency flush will
> >> >> bring the filesystem back. I tried kernels 3.0.29, 3.1.10, 3.2.15,
> >> >> 3.3.2. From reading through the mail archives, I get the impression
> >> >> that this should be fixed in 3.1.
> >> >
> >> > What you see is not necessarily a hang. It may just be that you've
> >> > caused your IO subsystem to have so much IO queued up it's completely
> >> > overwhelmed. How much RAM do you have in the machine?
> >>
> >> When it hangs, there are zero IOs going to the disk. The machine has
> >> 100GB of RAM.
> >
> > Can you get an event trace across the period where the hang occurs?
> >
> > ....
> >
> >> >> I can't seem to hit the problem without the above modifications.
> >> >
> >> > How on earth did you come up with this configuration?
> >>
> >> Just plain ol' luck. I was looking for a configuration that would
> >> allow me to reproduce the hangs and I accidentally picked a machine
> >> with a faulty controller battery which disabled the cache.
> >
> > Wonderful.
> >
> >> >> For the IO workload I pre-create 8000 files with random content and
> >> >> sizes between 1k and 128k on the test partition. Then I run a tool
> >> >> that spawns a bunch of threads which just copy these files to a
> >> >> different directory on the same partition.
> >> >
> >> > So, your workload also has a significant amount of parallelism and
> >> > concurrency on a filesystem with only 4 AGs?
> >>
> >> Yes. Excuse my ignorance but what are AGs?
> >
> > Allocation groups.
> >
> >> >> At the same time there are
> >> >> other threads that rename, remove and overwrite random files in the
> >> >> destination directory keeping the file count at around 500.
> >> >
> >> > And you've added as much concurrent metadata modification as
> >> > possible, too, which makes me wonder.....
> >> >
> >> >> Let me know what other information I can provide to pin this down.
> >> >
> >> > .... exactly what are you trying to achieve with this test? From my
> >> > point of view, you're doing something completely and utterly insane.
> >> > Your filesystem config and workload is so far outside normal
> >> > configurations and workloads that I'm not surprised you're seeing
> >> > some kind of problem.....
> >>
> >> No objection from my side. It's a silly configuration but it's the
> >> only one I've found that lets me reproduce a hang at will.
> >
> > Ok, that's fair enough - it's handy to tell us that up front,
> > though. ;)
>
> Ah sorry for not being clear enough. I thought my intentions could be
> deduced from the information that I provided :-)
>
>
> > Alright, then I need all the usual information. I suspect an event
> > trace is the only way I'm going to see what is happening. I just
> > updated the FAQ entry, so all the necessary info for gathering a
> > trace should be there now.
> >
> > http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>
> Very good. Will do. What kernel do you want me to run? I would prefer
> our current production kernel (2.6.38-8-server) but I understand if
> you want something newer.

If you can reproduce it on a current kernel, that would be best:
3.4-rc4 if possible, otherwise a 3.3.x stable kernel. 2.6.38 is simply
too old to be useful for debugging these sorts of problems...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
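
For anyone wanting to approximate the workload described in the quoted text, a rough Python sketch follows. It is not the tool Juerg used (that tool is not shown in this thread). The function names, thread counts, run duration and file naming scheme are all assumptions made for illustration; only the file count (8000), the size range (1k to 128k) and the target of roughly 500 destination files come from the description above.

```python
import os
import random
import shutil
import threading
import time


def create_source_files(src_dir, count, min_size, max_size):
    """Pre-create `count` files with random content and random sizes."""
    os.makedirs(src_dir, exist_ok=True)
    for i in range(count):
        size = random.randint(min_size, max_size)
        with open(os.path.join(src_dir, f"src_{i:05d}"), "wb") as f:
            f.write(os.urandom(size))


def copier(src_dir, dst_dir, stop, worker_id):
    """Copy randomly chosen source files into the destination directory."""
    rng = random.Random(worker_id)
    sources = os.listdir(src_dir)
    n = 0
    while not stop.is_set():
        dst = os.path.join(dst_dir, f"copy_{worker_id}_{n}")
        n += 1
        try:
            shutil.copyfile(os.path.join(src_dir, rng.choice(sources)), dst)
        except OSError:
            pass  # tolerate races with the churn threads


def churner(dst_dir, stop, worker_id, target_count, max_size):
    """Rename, remove and overwrite random files in the destination."""
    rng = random.Random(1000 + worker_id)
    n = 0
    while not stop.is_set():
        names = os.listdir(dst_dir)
        if not names:
            time.sleep(0.001)
            continue
        path = os.path.join(dst_dir, rng.choice(names))
        try:
            if len(names) > target_count:
                os.remove(path)  # trim back towards the target count
            elif rng.random() < 0.5:
                os.rename(path, os.path.join(dst_dir, f"ren_{worker_id}_{n}"))
                n += 1
            else:
                # Overwrite the start of the file in place with new data.
                with open(path, "r+b") as f:
                    f.write(os.urandom(rng.randint(1, max_size)))
        except OSError:
            pass  # another thread got to this file first


def run_workload(root, n_files=8000, min_size=1024, max_size=128 * 1024,
                 target_count=500, n_copiers=8, n_churners=4, duration=60.0):
    """Run the workload under `root`; return the final destination count.

    n_copiers, n_churners and duration are assumed values, not taken
    from the original report.
    """
    src_dir = os.path.join(root, "src")
    dst_dir = os.path.join(root, "dst")
    create_source_files(src_dir, n_files, min_size, max_size)
    os.makedirs(dst_dir, exist_ok=True)

    stop = threading.Event()
    threads = [threading.Thread(target=copier,
                                args=(src_dir, dst_dir, stop, i))
               for i in range(n_copiers)]
    threads += [threading.Thread(target=churner,
                                 args=(dst_dir, stop, i, target_count, max_size))
                for i in range(n_churners)]
    for t in threads:
        t.start()
    time.sleep(duration)
    stop.set()
    for t in threads:
        t.join()
    return len(os.listdir(dst_dir))
```

Calling `run_workload("/path/to/test/partition")` with a directory on the filesystem under test would run the default 60 seconds. Python's global interpreter lock limits how much pressure a script like this can generate, so a native tool may be needed to reach the concurrency described in the report.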