Hmm, that's an interesting point about the autoscaling, John. Thanks for
mentioning this.

Would the GlusterFS devs care to comment on this?

Geoff.

On Tue, 7 Jul 2009, Alpha Electronics wrote:
> I recommended GlusterFS to my client without reservation, but got pissed
> off because bugs were found from time to time and too much time was
> wasted tracing the source of the problems - and also the Gluster team is
> hiding the problems.
>
> For an actual example, check this link:
> http://www.gluster.org/docs/index.php?title=Translators_options&diff=4891&oldid=4799
> GlusterFS has autoscaling issues, but the Gluster team never made this
> public; they just quietly hid it by removing the autoscaling part of the
> wiki. Those of us following the previous document on the wiki spent a
> lot of time and energy and learned about the problem the hard way.
>
> John
>
> On Mon, Jul 6, 2009 at 10:20 AM, Geoff Kassel
> <gkassel@xxxxxxxxxxxxxxxxxxxxx> wrote:
> > Hi Anand,
> >
> > Thank you for your explanation. I appreciate the circumstances you're
> > in - I'm in a not-too-dissimilar environment myself.
> >
> > If you don't mind taking some more advice - do you mind taking down
> > your current QA process document? It does not seem to be an accurate
> > representation of your QA process at all.
> >
> > Alternatively, you could document what you really do, and then try to
> > improve on it - a technique common to many quality management
> > methodologies. If that doesn't look so good at first - well, you don't
> > have to publish it openly. You're running an open source project;
> > people are prepared for things to be a bit rough and ready. Just don't
> > make representations that it's otherwise. (Major version numbers and
> > marketing spiel are what I'm talking about here.)
> >
> > Misleading people - intentionally or otherwise - kills community
> > support and commercial trust in your product fast. Open source
> > projects in particular need to be more open than purely commercial
> > efforts, because not only do you lose users, you lose current and
> > potential developers when this happens.
> >
> > On the code front - can you please start using code comments? It's
> > really hard to follow the purpose of some parts of the code otherwise,
> > and that makes it difficult for those in the community to help you fix
> > problems or provide new functionality. After all, isn't getting the
> > community to help write and debug the software part of the cost
> > effectiveness of the open source development technique?
> >
> > (I understand that there may be language issues at stake here. But
> > this is the era of automatic translation, after all - hackers like me
> > will get along okay so long as we can get the gist :)
> >
> > Please don't be afraid to use code quality analysis tools, even if
> > they do insert some less-than-attractive comments. Tools like RATS and
> > FlawFinder are free, they catch a lot of potential and actual
> > stability and security issues, and can be partially automated as part
> > of wider testing frameworks (see the rough sketch below).
> >
> > GlusterFS should be eligible to sign up to use Coverity's scan for
> > free. It's a highly recommended static analysis tool, and if you make
> > use of the results, there are some quite dramatic gains in stability
> > and reliability to be made.
> >
> > Also, having a general look over the code every now and then will do
> > wonders for these aspects as well - look at the security record of
> > OpenBSD to see how effective code audits can be.
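> >
> > For instance - and this is just a rough, untested sketch off the top
> > of my head, so treat the "xlators/" path and the parsing of
> > FlawFinder's "Hits = N" summary line as my assumptions rather than
> > gospel - a small wrapper like this could gate a nightly build on the
> > scan results:
> >
> >   #!/usr/bin/env python
> >   # Rough sketch: fail the build when FlawFinder reports risky hits.
> >   # Assumes flawfinder is on $PATH and that its plain-text report
> >   # ends with a "Hits = N" summary line (true of the versions I've
> >   # used - please double-check against yours).
> >   import subprocess
> >   import sys
> >
> >   SOURCE_DIR = "xlators/"  # hypothetical: point at the tree to scan
> >   MIN_LEVEL = "4"          # only the riskier findings (scale 0-5)
> >
> >   def main():
> >       report = subprocess.run(
> >           ["flawfinder", "--minlevel", MIN_LEVEL, SOURCE_DIR],
> >           capture_output=True, text=True,
> >       ).stdout
> >       print(report)
> >       for line in report.splitlines():
> >           if line.startswith("Hits = "):
> >               hits = int(line.split("=")[1])
> >               if hits > 0:
> >                   print("FlawFinder: %d risky hits, failing build"
> >                         % hits, file=sys.stderr)
> >                   return 1
> >       return 0
> >
> >   if __name__ == "__main__":
> >       sys.exit(main())
> >
> > Run from cron or a pre-release hook, something like that would at
> > least stop the riskier call sites from regressing silently, without
> > anyone having to remember to run the tool by hand.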
> >
> > On the testing framework front - I know how hard it is to start
> > writing unit and regression tests for a project already under way. The
> > answer I've found to this is to get developers writing tests for the
> > new functionality they write, as they write it. (Leaving it to later -
> > say, for the QA team to do - makes this process a lot more difficult,
> > as I've found.) This documents in live code how the system should
> > work, and if run whenever changes to that functionality are made,
> > detects breakages fast.
> >
> > When the QA team or the community uncovers a bug, get the QA team to
> > write a test case covering that issue, documenting (again in live
> > code) what the correct behaviour should be. Between these two
> > activities, the coverage of the testing framework will improve in
> > leaps and bounds.
> >
> > Over time, you'll develop a full regression testing suite, which, if
> > run before major releases (if not before each repository commit), will
> > save a lot of time and embarrassment when the occasional bug pops up
> > to affect older features negatively or cause known bugs to resurface.
> >
> > Thank you for listening to me, and I hope this advice proves useful to
> > you.
> >
> > Geoff.
> >
> > On Mon, 6 Jul 2009, Anand Babu Periasamy wrote:
> > > Gordon, Geoff, Fillipe,
> > >
> > > We are sorry! We admit we had a rough and difficult past.
> > >
> > > Here are the reasons why it was difficult for us:
> > > * Limited staff and QA environment.
> > > * GlusterFS is a programmable file system. It supports many OS
> > >   distros, applications, hardware and storage architectures. It was
> > >   impossible to QA all possible combinations. What we declared as
> > >   stable is just one of many such use-cases.
> > > * Poor documentation.
> > >
> > > We are now VC funded. We have increased the size of our team and
> > > hardware lab significantly. 2.0 is an outcome of this investment.
> > > 2.0.3, scheduled for this week, will be a lot more stable. A
> > > dedicated technical writer is now working on an improved version of
> > > our installation guide. We are going to templatize GlusterFS stable
> > > configurations through a tool for generating and managing volume
> > > spec files. GlusterSP (storage platform) will completely automate
> > > the installation and management of a ruggedized release of GlusterFS
> > > in an embedded OS form. The first beta of GlusterSP 2010 will be out
> > > in two months. With its web-based UI and pre-configured system
> > > image, a number of error factors are reduced.
> > >
> > > We are constantly learning and improving. You are making a valuable
> > > contribution by constructively criticizing us with details and
> > > proposals. We take them seriously and positively.
> > >
> > > Happy Hacking,
> > > --
> > > Anand Babu Periasamy
> > > GPG Key ID: 0x62E15A31
> > > Blog [http://unlocksmith.org]
> > > GlusterFS [http://www.gluster.org]
> > > GNU/Linux [http://www.gnu.org]
> > >
> > > Geoff Kassel wrote:
> > > > Hi Gordan,
> > > >
> > > >> What is production unready (more than Gluster) about PeerFS or
> > > >> SeznamFS?
> > > >
> > > > Well, I'm mostly going by your email of a few months ago comparing
> > > > these. Your needs are not that dissimilar to mine.
> > > >
> > > > I see on the project page for SeznamFS now that there's apparently
> > > > support for SeznamFS to do master-master replication 'MySQL' style
> > > > - with the limitations of MySQL's master-master replication,
> > > > apparently.
> > > >
> > > > However, I can't seem to find out exactly what those limitations
> > > > entail - or how to set it up in this mode. (And I am looking for a
> > > > system that would allow more than two masters/peers, which is why
> > > > I passed over DRBD for GlusterFS originally.)
> > > >
> > > > I can't even get the PeerFS web page to load. That's a disturbing
> > > > sign to me.
> > > >
> > > >> You can fail over NFS servers. If the servers themselves are
> > > >> mirrored (DRBD) and/or have a shared file system, NFS should be
> > > >> able to handle the IP being migrated between servers. I've found
> > > >> this tends to work better with NFS over UDP, provided you have a
> > > >> network that doesn't normally suffer packet loss.
> > > >
> > > > Sorry, I thought you were talking about NFS exports from just one
> > > > local drive/RAID array.
> > > >
> > > > My leading fallback option for when I give up on Gluster is pretty
> > > > much exactly what you've just described. However - I have the same
> > > > (potential) issue as you with DRBD and WANs looming over my
> > > > project, i.e. the eventual need to run masters/peers in
> > > > geographically distributed sites.
> > > >
> > > >> How do you mean? GFS1 has been in the vanilla kernel for a while.
> > > >
> > > > I don't use a vanilla kernel. I use a 'hardened' kernel patched
> > > > with PaX and a few other security systems, to protect against
> > > > stack smashing attacks and other nasties. (Just a little bit of
> > > > extra, relative security, to make would-be attackers go after
> > > > softer targets.)
> > > >
> > > > PaX is especially intolerant of memory faults in general, which is
> > > > where my efforts in patching GlusterFS were focused. (And yes, I
> > > > have disabled PaX features for Gluster. No, it didn't improve
> > > > anything.)
> > > >
> > > > When I was looking into GFS, I found that the GFS patches (perhaps
> > > > I was looking at v2) didn't work with the hardened patchset.
> > > > GlusterFS had more promise than GFS anyway, so I went with
> > > > GlusterFS.
> > > >
> > > >>> An older version of GlusterFS - as buggy as it is for me - is
> > > >>> unfortunately still the best option.
> > > >>
> > > >> Out of interest, what was the last version of Gluster you deemed
> > > >> completely stable?
> > > >
> > > > What works for me with only (only!) a few crashes a day, and no
> > > > apparent data corruption, is 1.4.0tla849. TLA 636 worked a little
> > > > better for me - only random crashes once in a while. (But again -
> > > > backwards-incompatible changes had crept in between the two
> > > > versions, so I couldn't go back.)
> > > >
> > > > I had much better stability with the earlier 1.3 releases. I can't
> > > > remember exactly which ones now. (I suspect it was 1.3.3, but I'm
> > > > no longer sure.) It's been quite a while.
> > > >
> > > >> I don't agree on that particular point, since the last
> > > >> outstanding bug I'm seeing with any significant frequency in my
> > > >> use case is the one of having to wait a few seconds for the FS to
> > > >> settle after mounting before doing anything, or the operation
> > > >> fails. And to top it off, I've just had it succeed without the
> > > >> wait. That seems quite heisenbuggy/racey to me. :)
> > > >
> > > > Sorry, I was talking about the data corruption bugs. Not your
> > > > first-access issue.
> > > >
> > > >> That doesn't help - the first-access-settle-time bug has been
> > > >> around for a very long time. ;)
> > > >
> > > > Indeed.
> > > >
> > > > It's my hope that once testing frameworks (and syslog logging, in
> > > > your case) are made available to the community, people like us can
> > > > attempt to debug our systems with some degree of confidence that
> > > > we're not causing other subtle issues with our patches.
> > > >
> > > > That's got to be better for the project as a whole.
> > > >
> > > > Geoff.
> > > >
> > > > On Sun, 5 Jul 2009, Gordan Bobic wrote:
> > > >> Geoff Kassel wrote:
> > > >>>> Sounds like a lot of effort and micro-downtime compared to a
> > > >>>> migration to something else. Have you explored other options
> > > >>>> like PeerFS, GFS and SeznamFS? Or NFS exports with failover
> > > >>>> rather than Gluster clients, with Gluster only
> > > >>>> server-to-server?
> > > >>>
> > > >>> These options are not production ready (as I believe has been
> > > >>> pointed out already to the list) for what I need;
> > > >>
> > > >> What is production unready (more than Gluster) about PeerFS or
> > > >> SeznamFS?
> > > >>
> > > >>> or in the case of NFS, defeating the point of redundancy in the
> > > >>> first place.
> > > >>
> > > >> You can fail over NFS servers. If the servers themselves are
> > > >> mirrored (DRBD) and/or have a shared file system, NFS should be
> > > >> able to handle the IP being migrated between servers. I've found
> > > >> this tends to work better with NFS over UDP, provided you have a
> > > >> network that doesn't normally suffer packet loss.
> > > >>
> > > >>> (Also, GFS is also not compatible with the kernel patchset I
> > > >>> need to use.)
> > > >>
> > > >> How do you mean? GFS1 has been in the vanilla kernel for a while.
> > > >>
> > > >>> I have tried AFR on the server side and the client side. Both
> > > >>> display similar issues.
> > > >>>
> > > >>> An older version of GlusterFS - as buggy as it is for me - is
> > > >>> unfortunately still the best option.
> > > >>
> > > >> Out of interest, what was the last version of Gluster you deemed
> > > >> completely stable?
> > > >>
> > > >>> (That doesn't mean I can't complain about the lack of progress
> > > >>> towards stability and reliability, though :)
> > > >>
> > > >> Heh - and would you believe I just rebooted one of my
> > > >> root-on-glusterfs nodes and it came up OK without the bail-out
> > > >> requiring manual intervention caused by the bug that causes first
> > > >> access after mounting to fail before things have settled.
> > > >>
> > > >>>> One of the problems is that some tests in this case are
> > > >>>> impossible to carry out without having multiple nodes up and
> > > >>>> running, as a number of bugs have been arising in cases where
> > > >>>> nodes join/leave or cause race conditions.
> > > >>>> It would require a distributed test harness which would be
> > > >>>> difficult to implement so that it runs on any client that
> > > >>>> builds the binaries. Just because the test harness doesn't ship
> > > >>>> with the sources doesn't mean it doesn't exist on a test rig
> > > >>>> the developers use.
> > > >>>
> > > >>> Okay, so what about the volume of test cases that can be tested
> > > >>> without a distributed test harness? I don't see any sign of
> > > >>> testing mechanisms for that.
> > > >>
> > > >> That point is hard to argue against. :)
> > > >>
> > > >>> And wouldn't it be prudent anyway - given how often the
> > > >>> GlusterFS devs do not have access to the platform with the
> > > >>> reported problem - to provide this harness so that people can
> > > >>> generate the appropriate test results the devs need for
> > > >>> themselves? (Giving a complete stranger from overseas root
> > > >>> access is a legal minefield to those who have to work with data
> > > >>> held in confidence.)
> > > >>
> > > >> Indeed. And shifting test-case VM images tends to be impractical
> > > >> (even though I have provided both to the gluster developers in
> > > >> the past for specific error-case analysis).
> > > >>
> > > >>> It's been my impression, though, that the relevant bugs are not
> > > >>> heisenbugs or race conditions.
> > > >>
> > > >> I don't agree on that particular point, since the last
> > > >> outstanding bug I'm seeing with any significant frequency in my
> > > >> use case is the one of having to wait a few seconds for the FS to
> > > >> settle after mounting before doing anything, or the operation
> > > >> fails. And to top it off, I've just had it succeed without the
> > > >> wait. That seems quite heisenbuggy/racey to me. :)
> > > >>
> > > >>> (I'm judging that on the speed of the follow-up patch, by the
> > > >>> way - race conditions notoriously can take a long time to track
> > > >>> down.)
> > > >>
> > > >> That doesn't help - the first-access-settle-time bug has been
> > > >> around for a very long time. ;)
> > > >>
> > > >> Gordan
> >
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel@xxxxxxxxxx
> > http://lists.nongnu.org/mailman/listinfo/gluster-devel