Hmm, that's an interesting point about the autoscaling, John. Thanks for
mentioning this.

Would the GlusterFS devs care to comment on this?

Geoff.

On Tue, 7 Jul 2009, Alpha Electronics wrote:
> I recommended GlusterFS to my client without reservation, but got pissed
> off because bugs were found from time to time and too much time was
> wasted tracing the source of the problems - and also the Gluster team is
> hiding the problems.
>
> For an actual example, check this link:
> http://www.gluster.org/docs/index.php?title=Translators_options&diff=4891&oldid=4799
> GlusterFS has autoscaling issues, but the Gluster team never made this
> public; they just quietly hid it by removing the autoscaling part of the
> wiki. Those of us following the previous document on the wiki spent a
> lot of time and energy and learned about the problem the hard way.
>
> John
>
> On Mon, Jul 6, 2009 at 10:20 AM, Geoff Kassel
> <gkassel@xxxxxxxxxxxxxxxxxxxxx> wrote:
> > Hi Anand,
> >
> > Thank you for your explanation. I appreciate the circumstances you're
> > in - I'm in a not-too-dissimilar environment myself.
> >
> > If you don't mind taking some more advice - do you mind taking down
> > your current QA process document? It does not seem to be an accurate
> > representation of your QA process at all.
> >
> > Alternatively, you could document what you really do, and then try to
> > improve on it - a technique common to many quality management
> > methodologies. If that doesn't look so good at first - well, you don't
> > have to publish it openly. You're running an open source project;
> > people are prepared for things to be a bit rough and ready. Just don't
> > make representations that it's otherwise. (Major version numbers and
> > marketing spiel are what I'm talking about here.)
> >
> > Misleading people - intentionally or otherwise - kills community
> > support and commercial trust in your product fast. Open source
> > projects in particular need to be more open than purely commercial
> > efforts, because not only do you lose users, you lose current and
> > potential developers when this happens.
> >
> > On the code front - can you please start using code comments? It's
> > really hard to follow the purpose of some parts of the code otherwise,
> > and that makes it difficult for those in the community to help you fix
> > problems or provide new functionality. After all, isn't getting the
> > community to help write and debug the software part of the cost
> > effectiveness of the open source development technique?
> >
> > (I understand that there may be language issues at stake here. But
> > this is the era of automatic translation, after all - hackers like me
> > will get along okay so long as we can get the gist :)
> >
> > Please don't be afraid to use code quality analysis tools, even if
> > they do insert some less-than-attractive comments. Tools like RATS and
> > FlawFinder are free, they catch a lot of potential and actual
> > stability and security issues, and can be partially automated as part
> > of wider testing frameworks (see the rough sketch below).
> >
> > GlusterFS should be eligible to sign up to use Coverity's scan for
> > free. It's a highly recommended static analysis tool, and if you make
> > use of the results, there are some quite dramatic gains in stability
> > and reliability to be made.
> >
> > Also, having a general look over the code every now and then will do
> > wonders for these aspects as well - look at the security record of
> > OpenBSD to see how effective code audits can be.
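> >
> > For instance - and this is just a rough, untested sketch off the top
> > of my head, so treat the "xlators/" path and the parsing of
> > FlawFinder's "Hits = N" summary line as my assumptions rather than
> > gospel - a small wrapper like this could gate a nightly build on the
> > scan results:
> >
> >   #!/usr/bin/env python
> >   # Rough sketch: fail the build when FlawFinder reports risky hits.
> >   # Assumes flawfinder is on $PATH and that its plain-text report
> >   # ends with a "Hits = N" summary line (true of the versions I've
> >   # used - please double-check against yours).
> >   import subprocess
> >   import sys
> >
> >   SOURCE_DIR = "xlators/"  # hypothetical: point at the tree to scan
> >   MIN_LEVEL = "4"          # only the riskier findings (scale 0-5)
> >
> >   def main():
> >       report = subprocess.run(
> >           ["flawfinder", "--minlevel", MIN_LEVEL, SOURCE_DIR],
> >           capture_output=True, text=True,
> >       ).stdout
> >       print(report)
> >       for line in report.splitlines():
> >           if line.startswith("Hits = "):
> >               hits = int(line.split("=")[1])
> >               if hits > 0:
> >                   print("FlawFinder: %d risky hits, failing build"
> >                         % hits, file=sys.stderr)
> >                   return 1
> >       return 0
> >
> >   if __name__ == "__main__":
> >       sys.exit(main())
> >
> > Run from cron or a pre-release hook, something like that would at
> > least stop the riskier call sites from regressing silently, without
> > anyone having to remember to run the tool by hand.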
> >
> > On the testing framework front - I know how hard it is to start
> > writing unit and regression tests for a project already under way. The
> > answer I've found to this is to get developers writing tests for the
> > new functionality they write, as they write it. (Leaving it to later -
> > say, for the QA team to do - makes this process a lot more difficult,
> > as I've found.) This documents in live code how the system should
> > work, and if run whenever changes to that functionality are made,
> > detects breakages fast.
> >
> > When the QA team or the community uncovers a bug, get the QA team to
> > write a test case covering that issue, documenting (again in live
> > code) what the correct behaviour should be. Between these two
> > activities, the coverage of the testing framework will improve in
> > leaps and bounds.
> >
> > Over time, you'll develop a full regression testing suite, which, if
> > run before major releases (if not before each repository commit), will
> > save a lot of time and embarrassment when the occasional bug pops up
> > to affect older features negatively or cause known bugs to resurface.
> >
> > Thank you for listening to me, and I hope this advice proves useful to
> > you.
> >
> > Geoff.
> >
> > On Mon, 6 Jul 2009, Anand Babu Periasamy wrote:
> > > Gordon, Geoff, Fillipe,
> > >
> > > We are sorry! We admit we had a rough and difficult past.
> > >
> > > Here are the reasons why it was difficult for us:
> > > * Limited staff and QA environment.
> > > * GlusterFS is a programmable file system. It supports many OS
> > >   distros, applications, hardware and storage architectures. It was
> > >   impossible to QA all possible combinations. What we declared as
> > >   stable is just one of many such use-cases.
> > > * Poor documentation.
> > >
> > > We are now VC funded. We have increased the size of our team and
> > > hardware lab significantly. 2.0 is an outcome of this investment.
> > > 2.0.3, scheduled for this week, will be a lot more stable. A
> > > dedicated technical writer is now working on an improved version of
> > > our installation guide. We are going to templatize GlusterFS stable
> > > configurations through a tool for generating and managing volume
> > > spec files. GlusterSP (storage platform) will completely automate
> > > the installation and management of a ruggedized release of GlusterFS
> > > in an embedded OS form. The first beta of GlusterSP 2010 will be out
> > > in two months. With its web-based UI and pre-configured system
> > > image, a number of error factors are reduced.
> > >
> > > We are constantly learning and improving. You are making a valuable
> > > contribution by constructively criticizing us with details and
> > > proposals. We take them seriously and positively.
> > >
> > > Happy Hacking,
> > > --
> > > Anand Babu Periasamy
> > > GPG Key ID: 0x62E15A31
> > > Blog [http://unlocksmith.org]
> > > GlusterFS [http://www.gluster.org]
> > > GNU/Linux [http://www.gnu.org]
> > >
> > > Geoff Kassel wrote:
> > > > Hi Gordan,
> > > >
> > > >> What is production unready (more than Gluster) about PeerFS or
> > > >> SeznamFS?
> > > >
> > > > Well, I'm mostly going by your email of a few months ago comparing
> > > > these. Your needs are not that dissimilar to mine.
> > > >
> > > > I see on the project page for SeznamFS now that there's apparently
> > > > support for SeznamFS to do master-master replication 'MySQL' style
> > > > - with the limitations of MySQL's master-master replication,
> > > > apparently.
> > > >
> > > > However, I can't seem to find out exactly what those limitations
> > > > entail - or how to set it up in this mode. (And I am looking for a
> > > > system that would allow more than two masters/peers, which is why
> > > > I passed over DRBD for GlusterFS originally.)
> > > >
> > > > I can't even get the PeerFS web page to load. That's a disturbing
> > > > sign to me.
> > > >
> > > >> You can fail over NFS servers. If the servers themselves are
> > > >> mirrored (DRBD) and/or have a shared file system, NFS should be
> > > >> able to handle the IP being migrated between servers. I've found
> > > >> this tends to work better with NFS over UDP, provided you have a
> > > >> network that doesn't normally suffer packet loss.
> > > >
> > > > Sorry, I thought you were talking about NFS exports from just one
> > > > local drive/RAID array.
> > > >
> > > > My leading fallback option for when I give up on Gluster is pretty
> > > > much exactly what you've just described. However - I have the same
> > > > (potential) issue as you with DRBD and WANs looming over my
> > > > project, i.e. the eventual need to run masters/peers in
> > > > geographically distributed sites.
> > > >
> > > >> How do you mean? GFS1 has been in the vanilla kernel for a while.
> > > >
> > > > I don't use a vanilla kernel. I use a 'hardened' kernel patched
> > > > with PaX and a few other security systems, to protect against
> > > > stack smashing attacks and other nasties. (Just a little bit of
> > > > extra, relative security, to make would-be attackers go after
> > > > softer targets.)
> > > >
> > > > PaX is especially intolerant of memory faults in general, which is
> > > > where my efforts in patching GlusterFS were focused. (And yes, I
> > > > have disabled PaX features for Gluster. No, it didn't improve
> > > > anything.)
> > > >
> > > > When I was looking into GFS, I found that the GFS patches (perhaps
> > > > I was looking at v2) didn't work with the hardened patchset.
> > > > GlusterFS had more promise than GFS anyway, so I went with
> > > > GlusterFS.
> > > >
> > > >>> An older version of GlusterFS - as buggy as it is for me - is
> > > >>> unfortunately still the best option.
> > > >>
> > > >> Out of interest, what was the last version of Gluster you deemed
> > > >> completely stable?
> > > >
> > > > What works for me with only (only!) a few crashes a day, and no
> > > > apparent data corruption, is 1.4.0tla849. TLA 636 worked a little
> > > > better for me - only random crashes once in a while. (But again -
> > > > backwards-incompatible changes had crept in between the two
> > > > versions, so I couldn't go back.)
> > > >
> > > > I had much better stability with the earlier 1.3 releases. I can't
> > > > remember exactly which ones now. (I suspect it was 1.3.3, but I'm
> > > > no longer sure.) It's been quite a while.
> > > >
> > > >> I don't agree on that particular point, since the last
> > > >> outstanding bug I'm seeing with any significant frequency in my
> > > >> use case is the one of having to wait a few seconds for the FS to
> > > >> settle after mounting before doing anything, or the operation
> > > >> fails. And to top it off, I've just had it succeed without the
> > > >> wait. That seems quite heisenbuggy/racey to me. :)
> > > >
> > > > Sorry, I was talking about the data corruption bugs. Not your
> > > > first-access issue.
> > > >
> > > >> That doesn't help - the first-access-settle-time bug has been
> > > >> around for a very long time. ;)
> > > >
> > > > Indeed.
> > > >
> > > > It's my hope that once testing frameworks (and syslog logging, in
> > > > your case) are made available to the community, people like us can
> > > > attempt to debug our systems with some degree of confidence that
> > > > we're not causing other subtle issues with our patches.
> > > >
> > > > That's got to be better for the project as a whole.
> > > >
> > > > Geoff.
> > > >
> > > > On Sun, 5 Jul 2009, Gordan Bobic wrote:
> > > >> Geoff Kassel wrote:
> > > >>>> Sounds like a lot of effort and micro-downtime compared to a
> > > >>>> migration to something else. Have you explored other options
> > > >>>> like PeerFS, GFS and SeznamFS? Or NFS exports with failover
> > > >>>> rather than Gluster clients, with Gluster only
> > > >>>> server-to-server?
> > > >>>
> > > >>> These options are not production ready (as I believe has been
> > > >>> pointed out already to the list) for what I need;
> > > >>
> > > >> What is production unready (more than Gluster) about PeerFS or
> > > >> SeznamFS?
> > > >>
> > > >>> or in the case of NFS, defeating the point of redundancy in the
> > > >>> first place.
> > > >>
> > > >> You can fail over NFS servers. If the servers themselves are
> > > >> mirrored (DRBD) and/or have a shared file system, NFS should be
> > > >> able to handle the IP being migrated between servers. I've found
> > > >> this tends to work better with NFS over UDP, provided you have a
> > > >> network that doesn't normally suffer packet loss.
> > > >>
> > > >>> (Also, GFS is also not compatible with the kernel patchset I
> > > >>> need to use.)
> > > >>
> > > >> How do you mean? GFS1 has been in the vanilla kernel for a while.
> > > >>
> > > >>> I have tried AFR on the server side and the client side. Both
> > > >>> display similar issues.
> > > >>>
> > > >>> An older version of GlusterFS - as buggy as it is for me - is
> > > >>> unfortunately still the best option.
> > > >>
> > > >> Out of interest, what was the last version of Gluster you deemed
> > > >> completely stable?
> > > >>
> > > >>> (That doesn't mean I can't complain about the lack of progress
> > > >>> towards stability and reliability, though :)
> > > >>
> > > >> Heh - and would you believe I just rebooted one of my
> > > >> root-on-glusterfs nodes and it came up OK without the bail-out
> > > >> requiring manual intervention caused by the bug that causes first
> > > >> access after mounting to fail before things have settled.
> > > >>
> > > >>>> One of the problems is that some tests in this case are
> > > >>>> impossible to carry out without having multiple nodes up and
> > > >>>> running, as a number of bugs have been arising in cases where
> > > >>>> nodes join/leave or cause race conditions.
> > > >>>> It would require a distributed test harness which would be
> > > >>>> difficult to implement so that it runs on any client that
> > > >>>> builds the binaries. Just because the test harness doesn't ship
> > > >>>> with the sources doesn't mean it doesn't exist on a test rig
> > > >>>> the developers use.
> > > >>>
> > > >>> Okay, so what about the volume of test cases that can be tested
> > > >>> without a distributed test harness? I don't see any sign of
> > > >>> testing mechanisms for that.
> > > >>
> > > >> That point is hard to argue against. :)
> > > >>
> > > >>> And wouldn't it be prudent anyway - given how often the
> > > >>> GlusterFS devs do not have access to the platform with the
> > > >>> reported problem - to provide this harness so that people can
> > > >>> generate the appropriate test results the devs need for
> > > >>> themselves? (Giving a complete stranger from overseas root
> > > >>> access is a legal minefield to those who have to work with data
> > > >>> held in confidence.)
> > > >>
> > > >> Indeed. And shifting test-case VM images tends to be impractical
> > > >> (even though I have provided both to the gluster developers in
> > > >> the past for specific error-case analysis).
> > > >>
> > > >>> It's been my impression, though, that the relevant bugs are not
> > > >>> heisenbugs or race conditions.
> > > >>
> > > >> I don't agree on that particular point, since the last
> > > >> outstanding bug I'm seeing with any significant frequency in my
> > > >> use case is the one of having to wait a few seconds for the FS to
> > > >> settle after mounting before doing anything, or the operation
> > > >> fails. And to top it off, I've just had it succeed without the
> > > >> wait. That seems quite heisenbuggy/racey to me. :)
> > > >>
> > > >>> (I'm judging that on the speed of the follow-up patch, by the
> > > >>> way - race conditions notoriously can take a long time to track
> > > >>> down.)
> > > >>
> > > >> That doesn't help - the first-access-settle-time bug has been
> > > >> around for a very long time. ;)
> > > >>
> > > >> Gordan
> >
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel@xxxxxxxxxx
> > http://lists.nongnu.org/mailman/listinfo/gluster-devel