Hello,

  Michael and I ran a battery of tests today and closed out
the two issues identified below (of March 11).


FYI RE the "background-self-heal-only" patch;

  It has now been tested to our satisfaction and works as
described/intended.

  http://midnightcode.org/projects/saturn/code/glusterfs-3.2.6-background-only.patch


FYI RE the 2GB replicate error;

>>>     2) Of the files that were replicated, not all were
>>>          corrupted (capped at 2G -- note that we
>>>          confirmed that this was the first 2G of the
>>>          source file contents).
>>>
>>> So is there a known replicate issue with files
>>> greater than 2GB?

  We have confirmed this issue and the referenced patch
appears to correct the problem.  We were able to get one
particular file to fail reliably at 2GB under GlusterFS 3.2.6,
and then to transfer it and many other >2GB files correctly
after applying this patch.

  The error stems from putting the off_t (64bit) offset
value into a void * cookie value typecast as long (signed
32bit on these i586 builds) and then restoring it into an
off_t again.  The tip-off was a recurring offset of
18446744071562067968 (2^64 - 2^31, i.e. a sign-extended
-2^31) seen in the logs.  The effect is described well here;

  http://stackoverflow.com/questions/5628484/unexpected-behavior-from-unsigned-int64

  We can't explain why this issue was intermittent, and
we're not sure if rw_sh->offset is the correct 64bit
offset to use.  However, that offset appeared to match the
cookie value in all tested pre-failure states.  Please advise
if there is a better (more correct) off_t offset to use.

  http://midnightcode.org/projects/saturn/code/glusterfs-3.2.6-2GB.patch


Thanks for your help,


----- Original Message -----
>From: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>
>To: "Pranith Kumar K" <pranithk@xxxxxxxxxxx>
>Subject: Re: replicate background threads
>Date: Tue, 03 Apr 2012 20:41:48 +1000
>
>
> Pizza reveals all ;-)
>
> There's an error in there with the LOCK going
> without a paired UNLOCK in the afr-common
> test.  Revised (untested) patch attached.
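
For reference, the 2GB truncation described at the top of this message can be reproduced outside GlusterFS. What follows is only a minimal sketch, assuming a signed 32bit long as on the i586 builds; the helper names (roundtrip_via_32bit_long, report_roundtrip, check_2gb_boundary) are hypothetical and do not come from the GlusterFS source or from the patch. It models the narrowing and widening explicitly so that it behaves the same on a 64bit host.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper, not GlusterFS code: model what happens when a
 * 64-bit offset is squeezed through a signed 32-bit long (the size of
 * long on i586) and then widened back to 64 bits. */
static int64_t roundtrip_via_32bit_long(int64_t offset)
{
    uint32_t low = (uint32_t)offset;      /* only the low 32 bits survive */
    int64_t restored = (int64_t)low;

    if (low & UINT32_C(0x80000000))       /* bit 31 set: widening a signed */
        restored -= INT64_C(1) << 32;     /* long sign-extends the value   */
    return restored;
}

/* Print an offset before and after the round trip, as a log line might. */
static void report_roundtrip(int64_t offset)
{
    int64_t restored = roundtrip_via_32bit_long(offset);

    printf("offset %" PRId64 " -> restored %" PRId64
           " (as unsigned: %" PRIu64 ")\n",
           offset, restored, (uint64_t)restored);
}

/* Self-check: an offset of exactly 2^31 comes back as the logged value. */
static void check_2gb_boundary(void)
{
    int64_t restored = roundtrip_via_32bit_long(INT64_C(2147483648));

    assert((uint64_t)restored == UINT64_C(18446744071562067968));
    report_roundtrip(INT64_C(2147483648));
}
```

In this model every offset below 2^31 survives the round trip unchanged and the first offset to break is 2^31 itself, which would be consistent with the first 2G of each affected file being intact.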
>
>
> ----- Original Message -----
> >From: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>
> >To: "Pranith Kumar K" <pranithk@xxxxxxxxxxx>
> >Subject: Re: replicate background threads
> >Date: Tue, 03 Apr 2012 19:46:51 +1000
> >
> >
> > FYI - untested patch attached.
> >
> >
> > ----- Original Message -----
> > >From: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>
> > >To: "Pranith Kumar K" <pranithk@xxxxxxxxxxx>
> > >Subject: Re: replicate background threads
> > >Date: Tue, 03 Apr 2012 18:50:11 +1000
> > >
> > >
> > > FYI - I can see that this option doesn't exist, I'm
> > > adding it now.
> > >
> > >
> > > ----- Original Message -----
> > > >From: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>
> > > >To: "Pranith Kumar K" <pranithk@xxxxxxxxxxx>
> > > >Subject: Re: replicate background threads
> > > >Date: Mon, 02 Apr 2012 18:02:26 +1000
> > > >
> > > >
> > > > Hello Pranith,
> > > >
> > > >
> > > >   Michael has come back from his business trip and
> > > > we're about to start testing again (though now under
> > > > kernel 3.2.13 and GlusterFS 3.2.6).
> > > >
> > > >   I've published the 32bit (i586) client on the Saturn
> > > > project site if anyone is chasing it;
> > > >   http://midnightcode.org/projects/saturn/
> > > >
> > > >   One quick question, is there a tune-able parameter
> > > > that will allow a stat to be non blocking (i.e. to stop
> > > > self-heal going foreground) when the background
> > > > self heal count is reached?
> > > >   I.e. rather than having the stat hang for 2 days
> > > > while the files are replicated, we'd rather it fell
> > > > through and allowed subsequent stats to attempt
> > > > background self healing (perhaps at a time when
> > > > background self heal slots are available).
> > > >
> > > >
> > > > Thanks,
> > > >
> > > >
> > > > ----- Original Message -----
> > > > >From: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>
> > > > >To: "Pranith Kumar K" <pranithk@xxxxxxxxxxx>
> > > > >Subject: Re: replicate background threads
> > > > >Date: Wed, 14 Mar 2012 19:36:24 +1000
> > > > >
> > > > > Hello,
> > > > >
> > > > > > hi Ian,
> > > > > >      Maintaining a queue of files that need to be
> > > > > > self-healed does not scale in practice, in cases
> > > > > > where there are millions of files that need
> > > > > > self-heal. So such a thing is not implemented. The
> > > > > > idea is to make self-heal foreground after a
> > > > > > certain-limit (background-self-heal-count) so
> > > > > > there is no necessity for such a queue.
> > > > > >
> > > > > > Pranith.
> > > > >
> > > > > Ok, I understand - it will be interesting to observe
> > > > > the system with the new knowledge from your
> > > > > messages - thanks for your help, appreciate it.
> > > > >
> > > > >
> > > > > Cheers,
> > > > >
> > > > > ----- Original Message -----
> > > > > >From: "Pranith Kumar K" <pranithk@xxxxxxxxxxx>
> > > > > >To: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>
> > > > > >Subject: Re: replicate background threads
> > > > > >Date: Wed, 14 Mar 2012 07:33:32 +0530
> > > > > >
> > > > > > On 03/14/2012 01:47 AM, Ian Latter wrote:
> > > > > > > Thanks for the info Pranith;
> > > > > > >
> > > > > > > <pranithk>  the option to increase the max num of background
> > > > > > > self-heals is cluster.background-self-heal-count. Default value
> > > > > > > of which is 16. I assume you know what you are doing to the
> > > > > > > performance of the system by increasing this number.
> > > > > > >
> > > > > > >
> > > > > > > I didn't know this.  Is there a queue length for what
> > > > > > > is yet to be handled by the background self heal
> > > > > > > count?  If so, can it also be adjusted?
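
The behaviour discussed above (a fixed number of background self-heal slots, with further heals going foreground once the limit is reached) can be sketched in a few lines. This is only an illustrative model and not the AFR implementation: the type and function names (heal_slots_t, heal_admit, heal_done) are hypothetical, the background_only flag merely models the non-blocking fall-through requested in this thread rather than how either patch works, and there is no locking, so it assumes a single caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not GlusterFS code. */
typedef enum {
    HEAL_BACKGROUND,   /* a background slot was free and has been taken   */
    HEAL_FOREGROUND,   /* limit reached: the caller blocks while healing  */
    HEAL_SKIPPED       /* limit reached: fall through, retry on next stat */
} heal_mode_t;

typedef struct {
    unsigned int limit;    /* models background-self-heal-count (default 16) */
    unsigned int active;   /* background heals currently in flight           */
    bool background_only;  /* models the fall-through asked for in the thread */
} heal_slots_t;

/* Decide how a newly triggered self-heal should run. */
static heal_mode_t heal_admit(heal_slots_t *s)
{
    assert(s != NULL);

    if (s->active < s->limit) {
        s->active++;                 /* no queue: just count the slot */
        return HEAL_BACKGROUND;
    }
    return s->background_only ? HEAL_SKIPPED : HEAL_FOREGROUND;
}

/* Release a slot when a background heal completes. */
static void heal_done(heal_slots_t *s)
{
    assert(s != NULL);

    if (s->active > 0)
        s->active--;
}
```

With limit set to 16, the first 16 calls to heal_admit() return HEAL_BACKGROUND and the 17th returns HEAL_FOREGROUND, or HEAL_SKIPPED when background_only is set. Nothing is remembered about the skipped file, so no queue is needed: a later stat simply tries again.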
> > > > > > >
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > >> From: "Pranith Kumar K"<pranithk@xxxxxxxxxxx>
> > > > > > >> To: "Ian Latter"<ian.latter@xxxxxxxxxxxxxxxx>
> > > > > > >> Subject: Re: replicate background threads
> > > > > > >> Date: Tue, 13 Mar 2012 21:07:53 +0530
> > > > > > >>
> > > > > > >> On 03/13/2012 07:52 PM, Ian Latter wrote:
> > > > > > >>> Hello,
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>     Well we've been privy to our first true error in
> > > > > > >>> Gluster now, and we're not sure of the cause.
> > > > > > >>>
> > > > > > >>>     The SaturnI machine with 1Gbyte of RAM was
> > > > > > >>> exhausting its memory and crashing and we saw
> > > > > > >>> core dumps on SaturnM and MMC.  Replacing
> > > > > > >>> the SaturnI hardware with identical hardware to
> > > > > > >>> SaturnM, but retaining SaturnI's original disks,
> > > > > > >>> (so fixing the memory capacity problem) we saw
> > > > > > >>> crashes randomly at all nodes.
> > > > > > >>>
> > > > > > >>>     Looking for irregularities at the file system
> > > > > > >>> we noticed that (we'd estimate) about 60% of
> > > > > > >>> the files at the OS/EXT3 layer of SaturnI
> > > > > > >>> (sourced via replicate from SaturnM) were of
> > > > > > >>> size 2147483648 (2^31) where they should
> > > > > > >>> have been substantially larger.  While we would
> > > > > > >>> happily accept "you shouldn't be using a 32bit
> > > > > > >>> gluster package" as the answer, we note two
> > > > > > >>> deltas;
> > > > > > >>>     1) All files used in testing were copied on from
> > > > > > >>>          32 bit clients to 32 bit servers, with no
> > > > > > >>>          observable errors
> > > > > > >>>     2) Of the files that were replicated, not all were
> > > > > > >>>          corrupted (capped at 2G -- note that we
> > > > > > >>>          confirmed that this was the first 2G of the
> > > > > > >>>          source file contents).
> > > > > > >>> > > > > > > >>> > > > > > > >>> So is there a known replicate issue with files > > > > > > >>> greater than 2GB? Has anyone done any > > > > > > >>> serious testing with significant numbers of files > > > > > > >>> of this size? Are there configurations specific > > > > > > >>> to files/structures of these dimensions? > > > > > > >>> > > > > > > >>> We noted that reversing the configuration, such > > > > > > >>> that SaturnI provides the replicate Brick amongst > > > > > > >>> a local distribute and a remote map to SaturnM > > > > > > >>> where SaturnM simply serves a local distribute; > > > > > > >>> that the data served to MMC is accurate (it > > > > > > >>> continues to show 15GB files, even where there > > > > > > >>> is a local 2GB copy). Further, a client "cp" at > > > > > > >>> MMC, of a file with a 2GB local replicate of a > > > > > > >>> 15GB file, will result in a 15GB file being > > > > > > >>> created and replicated via Gluster (i.e. the > > > > > > >>> correct specification at both server nodes). > > > > > > >>> > > > > > > >>> So my other question is; Is it possible that we've > > > > > > >>> managed to corrupt something in this > > > > > > >>> environment? I.e. during the initial memory > > > > > > >>> exhaustion events? And is there a robust way > > > > > > >>> to have the replicate files revalidated by gluster > > > > > > >>> as a stat doesn't seem to be correcting files in > > > > > > >>> this state (i.e. replicate on SaturnM results in > > > > > > >>> daemon crashes, replicate on SaturnI results > > > > > > >>> in files being left in the bad state). > > > > > > >>> > > > > > > >>> > > > > > > >>> Also, I'm not a member of the users list; if these > > > > > > >>> questions are better posed there then let me > > > > > > >>> know and I'll re-post them there. 
> > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> Thanks, > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> ----- Original Message ----- > > > > > > >>>> From: "Ian Latter"<ian.latter@xxxxxxxxxxxxxxxx> > > > > > > >>>> To:<gluster-devel@xxxxxxxxxx> > > > > > > >>>> Subject: replicate background > > > threads > > > > > > >>>> Date: Sun, 11 Mar 2012 20:17:15 +1000 > > > > > > >>>> > > > > > > >>>> Hello, > > > > > > >>>> > > > > > > >>>> > > > > > > >>>> My mate Michael and I have been steadily > > > > > > >>>> advancing our Gluster testing and today we > finally > > > > > > >>>> reached some heavier conditions. The outcome > > > > > > >>>> was different from expectations built from > our more > > > > > > >>>> basic testing so I think we have a couple of > > > > > > >>>> questions regarding the AFR/Replicate background > > > > > > >>>> threads that may need a developer's contribution. > > > > > > >>>> Any help appreciated. > > > > > > >>>> > > > > > > >>>> > > > > > > >>>> The setup is a 3 box environment, but lets > > start > > > > > > >>>> with two; > > > > > > >>>> > > > > > > >>>> SaturnM (Server) > > > > > > >>>> - 6core CPU, 16GB RAM, 1Gbps net > > > > > > >>>> - 3.2.6 Kernel (custom distro) > > > > > > >>>> - 3.2.5 Gluster (32bit) > > > > > > >>>> - 3x2TB drives, CFQ, EXT3 > > > > > > >>>> - Bricked up into a single local 6TB > > > > > > >>>> "distribute" brick > > > > > > >>>> - "brick" served to the network > > > > > > >>>> > > > > > > >>>> MMC (Client) > > > > > > >>>> - 4core CPU, 8GB RAM, 1Gbps net > > > > > > >>>> - Ubuntu > > > > > > >>>> - 3.2.5 Gluster (32bit) > > > > > > >>>> - Don't recall the disk space locally > > > > > > >>>> - "brick" from SaturnM mounted > > > > > > >>>> > > > > > > >>>> 500 x 15Gbyte files were copied from MMC > > > > > > >>>> to a single sub-directory on the brick served > from > > > > > > >>>> SaturnM, all went well and dandy. 
> > > > > > >>>> So then we moved on to a 3 box environment;
> > > > > > >>>>
> > > > > > >>>>       SaturnI (Server)
> > > > > > >>>>          = 1core CPU, 1GB RAM, 1Gbps net
> > > > > > >>>>          = 3.2.6 Kernel (custom distro)
> > > > > > >>>>          = 3.2.5 Gluster (32bit)
> > > > > > >>>>          = 4x2TB drives, CFQ, EXT3
> > > > > > >>>>          = Bricked up into a single local 8TB
> > > > > > >>>>             "distribute" brick
> > > > > > >>>>          = "brick" served to the network
> > > > > > >>>>
> > > > > > >>>>       SaturnM (Server/Client)
> > > > > > >>>>          - 6core CPU, 16GB RAM, 1Gbps net
> > > > > > >>>>          - 3.2.6 Kernel (custom distro)
> > > > > > >>>>          - 3.2.5 Gluster (32bit)
> > > > > > >>>>          - 3x2TB drives, CFQ, EXT3
> > > > > > >>>>          - Bricked up into a single local 6TB
> > > > > > >>>>             "distribute" brick
> > > > > > >>>>          = Replicate brick added to sit over
> > > > > > >>>>             the local distribute brick and a
> > > > > > >>>>             client "brick" mapped from SaturnI
> > > > > > >>>>          - Replicate "brick" served to the network
> > > > > > >>>>
> > > > > > >>>>       MMC (Client)
> > > > > > >>>>          - 4core CPU, 8GB RAM, 1Gbps net
> > > > > > >>>>          - Ubuntu
> > > > > > >>>>          - 3.2.5 Gluster (32bit)
> > > > > > >>>>          - Don't recall the disk space locally
> > > > > > >>>>          - "brick" from SaturnM mounted
> > > > > > >>>>          = "brick" from SaturnI mounted
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>>     Now, in lesser testing in this scenario all was
> > > > > > >>>> well - any files on SaturnI would appear on SaturnM
> > > > > > >>>> (not a functional part of our test) and the content on
> > > > > > >>>> SaturnM would appear on SaturnI (the real
> > > > > > >>>> objective).
> > > > > > >>>>
> > > > > > >>>>     Earlier testing used a handful of smaller files (10s
> > > > > > >>>> to 100s of Mbytes) and a single 15Gbyte file.
> > > > > > >>>> The 15Gbyte file would be "stat" via an "ls", which would
> > > > > > >>>> kick off a background replication (ls appeared
> > > > > > >>>> un-blocked) and the file would be transferred.  Also,
> > > > > > >>>> interrupting the transfer (pulling the LAN cable)
> > > > > > >>>> would result in a partial 15Gbyte file being corrected
> > > > > > >>>> in a subsequent background process on another
> > > > > > >>>> stat.
> > > > > > >>>>
> > > > > > >>>>     *However* .. when confronted with 500 x 15Gbyte
> > > > > > >>>> files, in a single directory (but not the root directory)
> > > > > > >>>> things don't quite work out as nicely.
> > > > > > >>>>     - First, the "ls" (at MMC against the SaturnM brick)
> > > > > > >>>>       is blocking and hangs the terminal (ctrl-c doesn't
> > > > > > >>>>       kill it).
> > > > > > >> <pranithk> At max 16 files can be self-healed in the back-ground in
> > > > > > >> parallel. 17th file self-heal will happen in the foreground.
> > > > > > >>>>     - Then, looking from MMC at the SaturnI file
> > > > > > >>>>        system (ls -s) once per second, and then
> > > > > > >>>>        comparing the output (diff ls1.txt ls2.txt |
> > > > > > >>>>        grep -v '>') we can see that between 10 and 17
> > > > > > >>>>        files are being updated simultaneously by the
> > > > > > >>>>        background process
> > > > > > >> <pranithk> This is expected.
> > > > > > >>>>     - Further, a request at MMC for a single file that
> > > > > > >>>>       was originally in the 500 x 15Gbyte sub-dir on
> > > > > > >>>>       SaturnM (which should return unblocked with
> > > > > > >>>>       correct results) will;
> > > > > > >>>>         a) work as expected if there are less than 17
> > > > > > >>>>             active background file tasks
> > > > > > >>>>         b) block/hang if there are more
> > > > > > >>>>     - Where-as a stat (ls) outside of the 500 x 15
> > > > > > >>>>        sub-directory, such as the root of that brick,
> > > > > > >>>>        would always work as expected (return
> > > > > > >>>>        immediately, unblocked).
> > > > > > >> <pranithk> stat on the directory will only create the files with '0'
> > > > > > >> file size. Then when you ls/stat the actual file the self-heal for the
> > > > > > >> file gets triggered.
> > > > > > >>>>
> > > > > > >>>>     Thus, to us, it appears as though there is a
> > > > > > >>>> queue feeding a set of (around) 16 worker threads
> > > > > > >>>> in AFR.  If your request was to the loaded directory
> > > > > > >>>> then you would be blocked until a worker was
> > > > > > >>>> available, and if your request was to any other
> > > > > > >>>> location, it would return unblocked regardless of
> > > > > > >>>> the worker pool state.
> > > > > > >>>>
> > > > > > >>>>     The only thread metric that we could find to tweak
> > > > > > >>>> was performance/io-threads (which was set to
> > > > > > >>>> 16 per physical disk; well per locks + posix brick
> > > > > > >>>> stacks) but increasing this to 64 per stack didn't
> > > > > > >>>> change the outcome (10 to 17 active background
> > > > > > >>>> transfers).
> > > > > > >> <pranithk> the option to increase the max num of background self-heals
> > > > > > >> is cluster.background-self-heal-count. Default value of which is 16.
I > > > > > > >> assume you know what you are doing to the > > > performance of > > > > > > > the system by > > > > > > >> increasing this number. > > > > > > >>>> > > > > > > >>>> So, given the above, is our analysis > sound, and > > > > > > >>>> if so, is there a way to increase the size of the > > > > > > >>>> pool of active worker threads? The objective > > > > > > >>>> being to allow unblocked access to an existing > > > > > > >>>> repository of files (on SaturnM) while a > > > > > > >>>> secondary/back-up is being filled, via GlusterFS? > > > > > > >>>> > > > > > > >>>> Note that I understand that performance > > > > > > >>>> (through-put) will be an issue in the described > > > > > > >>>> environment: this replication process is > > > > > > >>>> estimated to run for between 10 and 40 hours, > > > > > > >>>> which is acceptable so long as it isn't blocking > > > > > > >>>> (there's a production-capable file set in place). > > > > > > >>>> > > > > > > >>>> > > > > > > >>>> > > > > > > >>>> > > > > > > >>>> > > > > > > >>>> Any help appreciated. > > > > > > >>>> > > > > > > >> Please let us know how it goes. > > > > > > >>>> Thanks, > > > > > > >>>> > > > > > > >>>> > > > > > > >>>> > > > > > > >>>> > > > > > > >>>> > > > > > > >>>> > > > > > > >>>> -- > > > > > > >>>> Ian Latter > > > > > > >>>> Late night coder .. > > > > > > >>>> http://midnightcode.org/ > > > > > > >>>> > > > > > > >>>> _______________________________________________ > > > > > > >>>> Gluster-devel mailing list > > > > > > >>>> Gluster-devel@xxxxxxxxxx > > > > > > >>>> > > > https://lists.nongnu.org/mailman/listinfo/gluster-devel > > > > > > >>>> > > > > > > >>> -- > > > > > > >>> Ian Latter > > > > > > >>> Late night coder .. 
> > > > > > >>> http://midnightcode.org/
> > > > > > >> hi Ian,
> > > > > > >>       inline replies with <pranithk>.
> > > > > > >>
> > > > > > >> Pranith.
> > > > > > >>
> > > > > > hi Ian,
> > > > > >      Maintaining a queue of files that need to be self-healed does not
> > > > > > scale in practice, in cases where there are millions of files that need
> > > > > > self-heal. So such a thing is not implemented. The idea is to make
> > > > > > self-heal foreground after a certain-limit (background-self-heal-count)
> > > > > > so there is no necessity for such a queue.
> > > > > >
> > > > > > Pranith.
> > > > > >


--
Ian Latter
Late night coder ..
http://midnightcode.org/