Hi Mickey,

> Wow, you really hit my biggest fear, the one thing I try to test for... data corruption.

Yep, it's mine too. Hence all this ranting and raving of mine :)

> I'm doing a simplified version of the first set of testing you mentioned

Sounds good :)

> I would add a few but I haven't had the time to google how to do the following without writing a C prog:
> check flock()
> check mmap writing

Perl and Python can do these, but I'd rather not script in Perl if I can possibly avoid it. (Python's much nicer :) I've roughed out below what those two checks might look like in Python.

> I'm thinking a simple collection of bash or perl scripts would work for a first pass at this. Do you have any suggestions on a good collab site for scripting? If we came up with a basic format we could create and then mix and match them as we saw fit. We just need them all to be called with the same args, then have a master run that executes all of them in a tests dir.

The devs claim to have an unpublished, incomplete framework underway. I've asked for this to be published - no matter how incomplete it may be. Maybe we could take some cues from that, once we get a look at it? (After all, we want the devs to ultimately use what we produce. We should use the same underlying tools.)

As for a collaboration place - there's the GlusterFS traditional location for random code dumps - http://glusterfs.pastebin.com

> It would also be nice if there was a sort of standard output, both for giving to devels as well as rolling up nicely if we get 1E3 of these things.

Agreed. I've been using PyUnit a lot lately (I've become more of a Python programmer than a C programmer these days), so anything that aggregates test suites and produces results in a similar manner wins big in my book.

Geoff.

On Wed, 8 Jul 2009, Mickey Mazarick wrote:

> Wow, you really hit my biggest fear, the one thing I try to test for... data corruption.
> That's what I wake up afraid of at night...
>
> I'm doing a simplified version of the first set of testing you mentioned but nothing as detailed. Really creating a random file and doing an md5 check on it, but now that you mention all the possibilities of files moving in from the back end, I'm really doing nothing to test the dht or namespace distribution at all....
> I would add a few but I haven't had the time to google how to do the following without writing a C prog:
> check flock()
> check mmap writing
> Also I have yet to get this to work all the time, but starting a large write and losing a brick under afr.. (usually terminating the write)
>
> I'm thinking a simple collection of bash or perl scripts would work for a first pass at this. Do you have any suggestions on a good collab site for scripting? If we came up with a basic format we could create and then mix and match them as we saw fit. We just need them all to be called with the same args, then have a master run that executes all of them in a tests dir. It would also be nice if there was a sort of standard output, both for giving to devels as well as rolling up nicely if we get 1E3 of these things.
>
> -Mic
>
> Geoff Kassel wrote:
> > Hi Mickey,
> >
> >> Thanks I am well versed in unit testing but probably disagree on the level of use in a development cycle. Instead of writing a long email back about testing theory, nondeterministic problems, highly connected dependent systems blah blah
> >
> > Sorry, I was just trying to make sure we were all on the same page - define some common terminology, etc. for anyone else who wanted to join in.
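(Coming back to the flock() and mmap question near the top of this mail: here's a rough, untested Python (2.x-era) sketch of how those two checks could go, no C required. The test path is a placeholder - point it at a file on a Gluster mount.)

#!/usr/bin/env python
# Sketch: exercise flock() and mmap writes on a mounted GlusterFS path.
import fcntl
import mmap

TEST_FILE = '/mnt/gluster/locktest'  # placeholder - adjust to your mount

def check_flock(path):
    # Take and release an exclusive lock; raises IOError if the fs refuses.
    f = open(path, 'w')
    try:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)
        fcntl.flock(f.fileno(), fcntl.LOCK_UN)
    finally:
        f.close()

def check_mmap_write(path):
    # Write through an mmap'd region, then read it back through the
    # normal file interface to make sure the write stuck.
    f = open(path, 'w+b')
    try:
        f.write('\0' * mmap.PAGESIZE)
        f.flush()
        m = mmap.mmap(f.fileno(), mmap.PAGESIZE)
        m[:5] = 'hello'
        m.flush()
        m.close()
        f.seek(0)
        assert f.read(5) == 'hello', 'mmap write did not stick'
    finally:
        f.close()

if __name__ == '__main__':
    check_flock(TEST_FILE)
    check_mmap_write(TEST_FILE)
    print 'flock and mmap checks passed'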
> > I'm well aware of the limits of testing, having most of a PhD in related formal methods topics and having taught Uni subjects in this area. (But consider me optimistic anyway :)
> >
> > It's just about improving confidence, after all. Not about achieving some nebulous notion of perfection.
> >
> >> I'll just say that most of the problems that have plagued me have been because of interactions between translators, kernel mods etc, which unit testing doesn't really approach.
> >
> > That's the focus of integration testing, not unit tests... I did mention integration testing.
> >
> >> Since I'm running my setup as a storage farm it just doesn't matter to me if there's a memory leak or if a server daemon crashes, I have cron jobs that restart it and I barely take notice.
> >
> > You're very lucky that a crash doesn't cause you much annoyance. My annoyances in this area are well documented in the list, so I won't repeat them again :)
> >
> >> I would rather encourage the dev team to add hot-upgrade and hot-add features. These things would keep my cluster going even if there were catastrophic problems.
> >
> > These are good features to have, yes. However, I'd like to make sure there's something uncorrupted to recover first.
> >
> > If a feature freeze was necessary to get a proper QA framework put in place and working towards avoiding more data corruption bugs, then I would vote for the feature freeze over more features, no matter how useful.
> >
> >> What I'm saying is that a good top-down testing system is something we can discuss here, spec out and perhaps create independently of the development team. I think what most people want is a more stable product, and I think a top-down approach will get it there faster than trying to implement a given UT system from the bottom up. It will definitely answer the question "should I upgrade to this release?"
> >
> > Alright. We'll let the devs concentrate on bottom-up testing (they know the code better anyway), and we in the wider community can look at top-down testing.
> >
> >> You mentioned that you had outlined some integration and function tests previously, perhaps you could paste some into this thread so that we could expand on them.
> >
> > Okay. The test I outlined was for checking for data corruption bugs in AFR and Unify with cryptographic hashes. The idea actually expands into a class of test cases. I'll flesh those out a bit more now.
> >
> > Generate a number of files of varying length (zero size, single byte, transfer block size - 1, transfer block size, transfer block size + 1, multiple meg, multiple gig etc) in a directory tree of varying depths. Take the cryptographic hash of each file.
> >
> > One test can be starting with an empty set of GlusterFS back end data blocks. Insert the files and directories through the client - check the hashes of the files stored on the back ends, and as read back through each of the client(s). If the hashes mismatch the original computed hashes at any point, the test has failed.
> >
> > Another test can be starting with the files already on the back end. (But without having had Gluster assign metadata attributes yet.) Start the server, read the files through each of the client(s) and directly from the back end. As before, if the hashes mismatch at any point - failure.
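(The generate-and-hash legwork in the tests just quoted is easy to script, by the way. A rough Python sketch - the mount path, block size and file sizes are all placeholders, and it only covers the write-then-read-back-through-the-client comparison, not the separate direct back end checks:)

#!/usr/bin/env python
# Sketch: create files of awkward sizes from a seeded PRNG, then
# re-read them through the mount and compare MD5s.
import hashlib
import os
import random

MOUNT = '/mnt/gluster'   # placeholder client mount
BLOCK = 128 * 1024       # placeholder transfer block size
SIZES = [0, 1, BLOCK - 1, BLOCK, BLOCK + 1, 5 * 1024 * 1024]

def make_file(path, size, seed):
    # Deterministic content, so the test data set never needs storing.
    rng = random.Random(seed)
    md5 = hashlib.md5()
    f = open(path, 'wb')
    remaining = size
    while remaining > 0:
        chunk = ''.join(chr(rng.randrange(256))
                        for _ in xrange(min(remaining, 65536)))
        f.write(chunk)
        md5.update(chunk)
        remaining -= len(chunk)
    f.close()
    return md5.hexdigest()

def hash_file(path):
    md5 = hashlib.md5()
    f = open(path, 'rb')
    for chunk in iter(lambda: f.read(65536), ''):
        md5.update(chunk)
    f.close()
    return md5.hexdigest()

failures = 0
for i, size in enumerate(SIZES):
    path = os.path.join(MOUNT, 'hashtest.%d' % i)
    expected = make_file(path, size, seed=i)
    actual = hash_file(path)
    if actual != expected:
        failures += 1
        print 'MISMATCH %s (%d bytes): %s != %s' % (path, size, actual, expected)
print '%d mismatches in %d files' % (failures, len(SIZES))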
> > A third test - start another set of back ends, this time partially populated. Start the server, read the existing files off, compare hashes. Add the remaining files. Compare the hashes of all files through the client(s), and as they end up on the back end.
> >
> > I don't know if 2.0.x Gluster supports this any more, but you used to be able to have one back end populated and the other empty, so long as a namespace block on all servers had zero-length file entries for all of the replicated files. (This being how you could add a node to your cluster originally.) Start back ends in this one-populated, others-empty configuration - read all the files through from a client connected only to a server with an empty back end. Check the hashes read through the client, and the hashes of the files that end up 'healed' onto the formerly empty back ends.
> >
> > Then there's a multitude of overwrite tests that could be done in this vein, as well as concurrent read and write tests to check atomicity etc.
> >
> > All these tests could be done under different performance translators, with different numbers of servers and clients. It's all just a matter of different configuration files, and different scripts to set up different test environments.
> >
> > All of these functional tests can be automated, can be done on a single system with some clever configuration files, or can be performed across a network to try to detect issues caused by networking.
> >
> > (I believe there are open source network simulation tools that might be able to be used to simulate lag, noise, congestion etc, and so reduce this network testing to being run on a single machine. Network simulation is not an area of expertise for me, so I don't know how effective or comparable this is to the real thing.)
> >
> > If the files in the tests are algorithmically generated (say, sourced from a pseudo-random number generator, or the various patterns favoured by memory testers), the back end test data sets can be quite small in size.
> >
> > (Hopefully this will all be small enough to add to the repository without adding much bulk to a checkout.)
> >
> > What do you think?
> >
> > Geoff.
> >
> > On Wed, 8 Jul 2009, Mickey Mazarick wrote:
> >> Geoff,
> >> Thanks I am well versed in unit testing but probably disagree on the level of use in a development cycle. Instead of writing a long email back about testing theory, nondeterministic problems, highly connected dependent systems blah blah, I'll just say that most of the problems that have plagued me have been because of interactions between translators, kernel mods etc, which unit testing doesn't really approach.
> >>
> >> Since I'm running my setup as a storage farm it just doesn't matter to me if there's a memory leak or if a server daemon crashes; I have cron jobs that restart it and I barely take notice. True, regression testing would get rid of the memory leak you hate, but if they have to start from the ground up I would rather encourage the dev team to add hot-upgrade and hot-add features. These things would keep my cluster going even if there were catastrophic problems.
> >> What I'm saying is that a good top-down testing system is something we can discuss here, spec out and perhaps create independently of the development team. I think what most people want is a more stable product, and I think a top-down approach will get it there faster than trying to implement a given UT system from the bottom up. It will definitely answer the question "should I upgrade to this release?"
> >>
> >> You mentioned that you had outlined some integration and function tests previously; perhaps you could paste some into this thread so that we could expand on them.
> >>
> >> Thanks!
> >> -Mickey Mazarick
> >>
> >> Geoff Kassel wrote:
> >>> Hi Mickey,
> >>> Just so that we're all on the same page here - a regression test suite at its most basic just has to include test cases (i.e. a set of inputs) that can trigger a previously known fault in the code if that fault is present. (i.e. it can see if the code has 'regressed' into a condition where a fault is present.)
> >>>
> >>> What it's also taken to mean (and typically includes) is a set of test cases covering corner cases and normal modes of operation, as expressed in a set of inputs to code paired with a set of expected outputs that may or may not include error messages.
> >>>
> >>> Test cases aimed at particular levels of the code have specific terminology associated with those levels. At the lowest level, the method level, they're called unit tests. At the module/API level - integration tests. At the system/user interface level - system aka function aka functional aka functionality tests.
> >>>
> >>> When new functionality is introduced or a bug is patched, the regression test suite (which in the case of unit tests is typically fully automated) is run to see whether the expected behaviour occurs, and none of the old faults recur.
> >>>
> >>> A lot of the tests you've described fall into the category of function tests - and from my background in automated testing, I know we need a bit more than that to get the stability and reliability results we want. (Simply because you cannot test every corner case within a project the size and complexity of GlusterFS reliably from the command line.)
> >>>
> >>> Basically, what GlusterFS needs is a fairly even coverage of test cases at all the levels I've just mentioned.
> >>>
> >>> What I want to see particularly - and what the devs stated nearly a year ago was already in existence - is unit tests. Particularly the kind that can be run automatically.
> >>>
> >>> This is so that developers (inside the GlusterFS team or otherwise) can hack on a piece of code to fix a bug or implement new functionality, then run the unit tests to see that they (most likely) haven't caused a regression with their new code.
> >>>
> >>> (It's somewhat difficult for outsiders to write unit and integration tests, because typically only the original developers have the in-depth knowledge of the expected behaviour of the code in the low-level detail required.)
> >>>
> >>> Perhaps developed in parallel should be integration and function tests. Tests like these (I've outlined elsewhere specifically what kind) would quite likely have picked up the data corruption bugs before they made their way into the first 2.0.x releases.
> >>>
> >>> (Pretty much anyone familiar with the goal of the project can write function tests, documenting in live code their expectations for how the system should work.)
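(To make 'documenting in live code' concrete: a bare-bones PyUnit function test could look something like the sketch below. The mount point and the rename scenario are invented for illustration - the point is that any previously reported fault can be pinned down the same way.)

#!/usr/bin/env python
# Sketch: a function-level regression test in PyUnit. Assumes some
# setup script has already started the servers/clients under test.
import os
import unittest

MOUNT = '/mnt/gluster'  # placeholder client mount

class RenameRegressionTest(unittest.TestCase):
    """Example: a previously reported fault, captured as a test case."""

    def setUp(self):
        self.src = os.path.join(MOUNT, 'rename-src')
        self.dst = os.path.join(MOUNT, 'rename-dst')
        open(self.src, 'w').write('payload')

    def tearDown(self):
        for path in (self.src, self.dst):
            if os.path.exists(path):
                os.unlink(path)

    def test_rename_preserves_contents(self):
        # Expected behaviour: rename moves the file and keeps its data.
        os.rename(self.src, self.dst)
        self.assertEqual(open(self.dst).read(), 'payload')
        self.assertFalse(os.path.exists(self.src))

if __name__ == '__main__':
    unittest.main()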
> >>> Long-running stability and load tests like you've proposed are also kinds of function tests, but without the narrowly defined inputs and outputs of specific test cases. They're basically the equivalent of mine shaft canaries - they signal the presence of race conditions, memory leaks, design flaws, and other subtle issues, but often without specifics as to what 'killed' the canary. Once the cause is found, though, a new, more specific test case can be added at the appropriate level.
> >>>
> >>> (Useful, yes, but mostly as a starting point for more intensive QA efforts.)
> >>>
> >>> The POSIX compliance tests you mentioned are more traditional function-level tests - but I think the GlusterFS devs have wandered a little away from full POSIX compliance on some points, so these tests may not be 100% relevant.
> >>>
> >>> (This is not necessarily a bad thing - the POSIX standard is apparently ambiguous at times, and there is some wider community feeling that improvements to the standard are overdue. And I'm not sure the POSIX standard was ever written with massively scalable, pluggable, distributed file systems in mind, either :)
> >>>
> >>> I hope my extremely long-winded rant here :) has explained adequately what I feel GlusterFS needs to have in a regression testing system.
> >>>
> >>> Geoff.
> >>>
> >>> On Tue, 7 Jul 2009, Mickey Mazarick wrote:
> >>>> What kind of requirements does everyone see as necessary for a regression test system?
> >>>> Ultimately the best testing system would use the tracing translator and be able to run tests and generate traces for any problems that occur, giving us something very concrete to provide the developers. That's a few steps ahead, however; initially we should start to outline some must-haves in terms of how a test setup is run. Obviously we want something we can run for many hours or days to test long-term stability, and it would be nice if there was some central way to spin up new clients to test reliability under a load.
> >>>>
> >>>> For basic file operation tests I use the below:
> >>>> An initial look would be to use some tools like http://www.ntfs-3g.org/pjd-fstest.html
> >>>> I've seen it mentioned before, but it's a good start for testing anything POSIX. Here's a simple script that will download and build it if it's missing, and run a test on a given mount point.
> >>>>
> >>>> #!/bin/bash
> >>>> if [ "$#" -lt 1 ]
> >>>> then
> >>>>     echo "usage: $0 gluster_mount"
> >>>>     exit 65
> >>>> fi
> >>>> GLUSTER_MOUNT=$1
> >>>> INSTALL_DIR="/usr"
> >>>> if [ ! -d $INSTALL_DIR/fstest ]; then
> >>>>     cd $INSTALL_DIR
> >>>>     wget http://www.ntfs-3g.org/sw/qa/pjd-fstest-20080816.tgz
> >>>>     tar -xzf pjd-fstest-20080816.tgz
> >>>>     mv pjd-fstest-20080816 fstest
> >>>>     cd fstest
> >>>>     make
> >>>>     # first run only: tests/conf needs editing by hand (e.g. the target fs)
> >>>>     vi tests/conf
> >>>> fi
> >>>> cd $GLUSTER_MOUNT
> >>>> prove -r $INSTALL_DIR/fstest/
> >>>>
> >>>> Jacques Mattheij wrote:
> >>>>> hello Anand, Geoff & others,
> >>>>>
> >>>>> This pretty much parallels my interaction with the team about a year ago: lots of really good intentions but no actual follow-up.
> >>>>>
> >>>>> We agreed that an automated test suite was a must, and that a whole bunch of other things would have to be done to get glusterfs out of the experimental stage and into production grade.
> >>>>> It's a real pity because I still feel that glusterfs is one of the major contenders to become *the* cluster file system.
> >>>>>
> >>>>> A lot of community goodwill has been lost. I've kept myself subscribed to this mailing list because I hoped that at some point we'd move past this endless cat and mouse game with stability issues, but for some reason that never happened.
> >>>>>
> >>>>> Anand, you have a very capable team of developers, and a once-in-a-lifetime opportunity to make this happen. Please take Geoff's comments to heart and get serious about QA and community support, because that is the key to any successful FOSS project. Fan that fire and you can't go wrong; lose the community support and your project might as well be dead.
> >>>>>
> >>>>> I realize this may come across as harsh, but it is intended to make it painfully obvious that the most staunch supporters of glusterfs are getting discouraged, and that is a loss no serious project can afford.
> >>>>>
> >>>>> Jacques
> >>>>>
> >>>>> Geoff Kassel wrote:
> >>>>>> Hi Anand,
> >>>>>> If you look back through the list archives, no one other than me replied to the original QA thread where I first posted my patches. Nor to the Savannah patch tracker thread where I also posted my patches. (Interesting how those trackers have been disabled now...)
> >>>>>>
> >>>>>> It took me pressing the issue after discovering yet another bug before we even started talking about my patches. So yes, my patches were effectively ignored.
> >>>>>>
> >>>>>> At the time, you did mention that the code the patches were to be applied against was being reworked, in addition to your comments about my code comments.
> >>>>>>
> >>>>>> I explained the comments as being necessary to avoid the automated tool flagging potential issues again on reuse of that tool - the other comments were for future QA work. There was no follow-up on that from you, nor any suggestion on how I might improve these comments to your standards.
> >>>>>>
> >>>>>> I continued to supply patches in the Savannah tracker against the latest stable 1.3 branch - which included some refactoring for your reworked code, IIRC - for some time after that discussion. All of my patches were in sync with the code from the publicly available 1.3 branch repository within days of a new TLA patchset.
> >>>>>>
> >>>>>> None of these were adopted either.
> >>>>>>
> >>>>>> I simply ran out of spare time to maintain this patchset, and I got tired of pressing an issue (QA) that you and the dev team clearly weren't interested in.
> >>>>>>
> >>>>>> I don't have the kind of spare time needed to do the sort of in-depth re-audit of your code from scratch (as would be needed) in the manner that I did back then. So I can't meet your request at this time, sorry.
> >>>>>>
> >>>>>> As I've suggested elsewhere, now that you apparently have the resources for a stand-alone QA team - this team might want to at least use the tools I used to generate these patches - RATS and FlawFinder.
> >>>>>>
> >>>>>> That way you can generate the kind of QA work I was producing, with the kind of comment style you prefer.
> >>>>>> The only way I can conceive of being able to help now is in patching individual issues. However, I can really only feasibly do that, with my time constraints, if I've got regression tests to make sure I'm not inadvertently breaking other functionality.
> >>>>>>
> >>>>>> Hence my continued requests for these.
> >>>>>>
> >>>>>> Geoff.
> >>>>>>
> >>>>>> On Tue, 7 Jul 2009, Anand Avati wrote:
> >>>>>>>> I've also gone one better than just advice - I've given up significant portions of my limited spare time to audit and patch a not-insignificant portion of the GlusterFS code, in order to deal with the stability issues I and others were encountering. My patches were ignored, on the grounds that they contained otherwise unobtrusive comments which were quite necessary to the audit.
> >>>>>>>
> >>>>>>> Geoff, we really appreciate your efforts, both on the front of your patch submissions and for voicing your opinions freely. We also acknowledge the positive intentions behind this thread. As far as your patch submissions are concerned, there is probably a misunderstanding. Your patches were not ignored. We do value your efforts. The patches which you submitted were, even at the time of your submission, not applicable to the codebase.
> >>>>>>>
> >>>>>>> Patch 1 (in glusterfsd.c) -- this file was reworked and almost rewritten from scratch to work as both client and server.
> >>>>>>>
> >>>>>>> Patch 2 (glusterfs-fuse/src/glusterfs.c) -- this module was reimplemented as a new translator (since a separate client was no longer needed).
> >>>>>>>
> >>>>>>> Patch 3 (protocol.c) -- with the introduction of non-blocking IO and a binary protocol, nothing of this file remained.
> >>>>>>>
> >>>>>>> What I am hoping to convey is that the reason your patches did not make it to the repository was that they needed significant reworking to even apply. I did indeed comment about code comments of the style /* FlawFinder: */ but that definitely was _not_ the reason they weren't included. Please understand that nothing was ignored intentionally.
> >>>>>>>
> >>>>>>> This being said, I can totally understand the effort you have been putting in to maintain patchsets by yourself and keep them up to date with the repository. I request you to resubmit them (with git format-patch) against the HEAD of the repository.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Avati
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> Gluster-devel mailing list
> >>>>>> Gluster-devel@xxxxxxxxxx
> >>>>>> http://lists.nongnu.org/mailman/listinfo/gluster-devel