Hi,

On Wed, 2008-05-07 at 13:44 +0200, Jonas Björklund wrote:
> Hello,
>
> I would like to know also...
>
> /Jonas
>
> On Wed, 7 May 2008, Vimal Gupta wrote:
>
> > Hi,
> >
> > I have the same question. Does anybody have the answer, please?
> >
> > Chris Picton wrote:
> >> Hi All
> >>
> >> I am investigating a new cluster installation.
> >>
> >> Documentation from Red Hat indicates that GFS2 is not yet production
> >> ready. Tests I have run show it is *much* faster than gfs for my
> >> workload.
> >>
> >> Is GFS2 not production-ready due to lack of testing, or due to known
> >> bugs?
> >>
> >> Any advice would be appreciated
> >>
> >> Chris

The answer is a bit of both. We are getting to the stage where the known
bugs are mostly solved, or will be very shortly. You can see the state of
the bug list at any time by going to bugzilla.redhat.com and looking for
any bug with gfs2 in the summary line. There are currently approx 70 such
bugs, but please bear in mind that a large number of these are requests
for new features, and some of them are duplicates of the same bug across
different versions of RHEL and/or Fedora.

We are currently at a stage where having a large number of people helping
us with testing would be very helpful. If you have your own favourite
filesystem test, or if you are in a position to run a test application,
then we would be very interested in any reports of success/failure.

If you do have any problems, then please do:

 o Check bugzilla to see if someone else has had the same problem.
 o Report them (preferably via bugzilla, as that ensures that they won't
   get lost somewhere).
 o Report them as "Fedora, rawhide" if they relate to the upstream kernel
   (either Linus' tree or my -nmw git tree) and indicate in the comments
   section which of these kernels you were using.
 o Send patches if you have them, but please don't let that stop you
   reporting bugs. All reports are useful.
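As an aside for anyone gathering the debug data requested on this list: a per-node collector might look like the sketch below. The debugfs paths and the table/lockspace names here are assumptions (they depend on your kernel version and mount configuration), so verify them on your own systems before relying on this.

```shell
#!/bin/sh
# Hypothetical per-node hang-report collector. Path layout is an assumption:
# it presumes debugfs is mounted at /sys/kernel/debug, and that the dlm
# lockspace shares its name with the filesystem part of the GFS2 table name.

glock_dump_path() {
    # $1 = GFS2 table name as shown in the mount table, e.g. "mycluster:vol0"
    printf '/sys/kernel/debug/gfs2/%s/glocks' "$1"
}

dlm_dump_path() {
    # $1 = dlm lockspace name, e.g. "vol0"
    printf '/sys/kernel/debug/dlm/%s' "$1"
}

collect_hang_report() {
    # $1 = table name, $2 = lockspace name; needs root for debugfs and sysrq
    out="/tmp/gfs2-hang-$(hostname)"
    mkdir -p "$out"
    cat "$(glock_dump_path "$1")" > "$out/glocks"     # glock dump
    cat "$(dlm_dump_path "$2")"   > "$out/dlm-locks"  # dlm lock dump
    echo t > /proc/sysrq-trigger                      # task stacks -> kernel log
    dmesg > "$out/stacks"
    echo "report written to $out"
}
```

Run such a collector on every node in the cluster and attach the resulting directories to the bugzilla report.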
We might not always be able to fix each and every report right away, but
sometimes patterns emerge across a number of reports which allow us to home
in on a particularly tricky issue.

 o If you experience a hang, then please include (if possible):
   - A glock lock dump from all nodes (via debugfs)
   - A dlm lock dump from all nodes (via debugfs)
   - A stack trace from all nodes (echo t >/proc/sysrq-trigger)
 o If you experience an oops, then please make sure that you include all
   the messages (including those which might have been logged just before
   the oops itself).

The more people we have testing & reporting bugs, the quicker we can
approach stability.

There is one issue which I'm currently working on relating to a (fairly
rare, but nonetheless possible) race. This happens when two threads calling
->readpage() race with each other. The reason that this is problematic is
that it is the one place left where we are using "try locks" to get around
the page lock/glock lock ordering problem, and the VFS's AOP_TRUNCATED_PAGE
return code is not guaranteed to result in ->readpage() being called again
if another ->readpage() has raced with it and brought the page up to date.
As a result "try locks" are the only option, but, for long and complicated
reasons, when a "try lock" is queued it might end up triggering a demotion
(if a request is pending from a remote node) which deadlocks due to page
lock/glock ordering.

The patch I'm working on at the moment fixes that problem by failing the
glock (GLR_TRYFAILED) if a demote is needed and scheduling the glock
workqueue to deal with the demotion, thus avoiding the race. The try lock
will then be retried at a later date, when it can succeed. The bugzilla
for this is #432057 if you want to follow my progress.

Steve.

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster