Sorry;

That "long (unsigned 32bit)" should have been
"long (signed 32bit)" ... so that's twice that bug has
bitten ;-)


Cheers,


----- Original Message -----
>From: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>
>To: "Pranith Kumar K" <pranithk@xxxxxxxxxxx>
>Subject: SOLVED - Re: replicate background threads
>Date: Wed, 04 Apr 2012 21:51:11 +1000
>
> Hello,
>
>
> Michael and I ran a battery of testing today and
> closed out the two issues identified below (of March
> 11).
>
>
> FYI RE the "background-self-heal-only" patch;
>
> It has been tested now to our satisfaction and
> works as described/intended.
>
> http://midnightcode.org/projects/saturn/code/glusterfs-3.2.6-background-only.patch
>
>
> FYI RE the 2GB replicate error;
>
> >>> 2) Of the file that were replicated, not all were
> >>>    corrupted (capped at 2G -- note that we
> >>>    confirmed that this was the first 2G of the
> >>>    source file contents).
> >>>
> >>> So is there a known replicate issue with files
> >>> greater than 2GB?
>
> We have confirmed this issue and the referenced
> patch appears to correct the problem. We were
> able to get one particular file to reliably fail at 2GB
> under GlusterFS 3.2.6, and then correctly
> transfer it and many other >2GB files, after
> applying this patch.
>
> The error stems from putting the off_t (64bit)
> offset value into a void * cookie value typecast
> as long (unsigned 32bit) and then restoring it into
> an off_t again. The tip-off was a recurring offset
> of 18446744071562067968 seen in the logs. The
> effect is described well here;
>
> http://stackoverflow.com/questions/5628484/unexpected-behavior-from-unsigned-int64
>
> We can't explain why this issue was intermittent,
> and we're not sure if the rw_sh->offset is the
> correct 64bit offset to use. However that offset
> appeared to match the cookie value in all tested
> pre-failure states. Please advise if there is a
> better (more correct) off_t offset to use.
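As a rough illustration of the round trip described above, the following sketch models the C integer widths with Python's ctypes. It assumes the intermediate type is a signed 32-bit long, as the correction at the top of this thread states. The function names are hypothetical and are not taken from the GlusterFS source or from the patch.

```python
import ctypes


def roundtrip_through_long32(offset: int) -> int:
    """Model an off_t squeezed through a signed 32-bit long and widened again."""
    # c_int32 keeps only the low 32 bits and reinterprets them as signed.
    narrowed = ctypes.c_int32(offset).value
    # Widening a signed value back to 64 bits sign-extends it.
    return ctypes.c_int64(narrowed).value


def as_unsigned_64(value: int) -> int:
    """Show a 64-bit value the way an unsigned 64-bit format would print it."""
    return ctypes.c_uint64(value).value


if __name__ == "__main__":
    # The last offset that survives, and the first one that does not.
    for offset in (2**31 - 1, 2**31):
        restored = roundtrip_through_long32(offset)
        print(offset, "->", restored, "logged as", as_unsigned_64(restored))
```

In this model an offset of 2147483647 survives unchanged, while 2147483648 comes back as -2147483648, which prints as 18446744071562067968 when treated as unsigned 64-bit. That is the value reported in the logs, and 2^31 is also the size at which the damaged files were capped. The sketch does not explain why the failure was intermittent.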
>
> http://midnightcode.org/projects/saturn/code/glusterfs-3.2.6-2GB.patch
>
>
> Thanks for your help,
>
>
> ----- Original Message -----
> >From: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>
> >To: "Pranith Kumar K" <pranithk@xxxxxxxxxxx>
> >Subject: Re: replicate background threads
> >Date: Tue, 03 Apr 2012 20:41:48 +1000
> >
> >
> > Pizza reveals all ;-)
> >
> > There's an error in there with the LOCK going
> > without a paired UNLOCK in the afr-common
> > test. Revised (untested) patch attached.
> >
> >
> > ----- Original Message -----
> > >From: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>
> > >To: "Pranith Kumar K" <pranithk@xxxxxxxxxxx>
> > >Subject: Re: replicate background threads
> > >Date: Tue, 03 Apr 2012 19:46:51 +1000
> > >
> > >
> > > FYI - untested patch attached.
> > >
> > >
> > > ----- Original Message -----
> > > >From: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>
> > > >To: "Pranith Kumar K" <pranithk@xxxxxxxxxxx>
> > > >Subject: Re: replicate background threads
> > > >Date: Tue, 03 Apr 2012 18:50:11 +1000
> > > >
> > > >
> > > > FYI - I can see that this option doesn't exist, I'm
> > > > adding it now.
> > > >
> > > >
> > > > ----- Original Message -----
> > > > >From: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>
> > > > >To: "Pranith Kumar K" <pranithk@xxxxxxxxxxx>
> > > > >Subject: Re: replicate background threads
> > > > >Date: Mon, 02 Apr 2012 18:02:26 +1000
> > > > >
> > > > >
> > > > > Hello Pranith,
> > > > >
> > > > >
> > > > > Michael has come back from his business trip and
> > > > > we're about to start testing again (though now under
> > > > > kernel 3.2.13 and GlusterFS 3.2.6).
> > > > >
> > > > > I've published the 32bit (i586) client on the Saturn
> > > > > project site if anyone is chasing it;
> > > > > http://midnightcode.org/projects/saturn/
> > > > >
> > > > > One quick question, is there a tune-able parameter
> > > > > that will allow a stat to be non blocking (i.e. to stop
> > > > > self-heal going foreground) when the background
> > > > > self heal count is reached?
> > > > > I.e. rather than having the stat hang for 2 days
> > > > > while the files are replicated, we'd rather it fell
> > > > > through and allowed subsequent stats to attempt
> > > > > background self healing (perhaps at a time when
> > > > > background self heal slots are available).
> > > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > >From: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>
> > > > > >To: "Pranith Kumar K" <pranithk@xxxxxxxxxxx>
> > > > > >Subject: Re: replicate background threads
> > > > > >Date: Wed, 14 Mar 2012 19:36:24 +1000
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > > hi Ian,
> > > > > > > Maintaining a queue of files that need to be
> > > > > > > self-healed does not scale in practice, in cases
> > > > > > > where there are millions of files that need
> > > > > > > self-heal. So such a thing is not implemented. The
> > > > > > > idea is to make self-heal foreground after a
> > > > > > > certain-limit (background-self-heal-count) so
> > > > > > > there is no necessity for such a queue.
> > > > > > >
> > > > > > > Pranith.
> > > > > >
> > > > > > Ok, I understand - it will be interesting to observe
> > > > > > the system with the new knowledge from your
> > > > > > messages - thanks for your help, appreciate it.
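The policy Pranith describes (no queue, up to background-self-heal-count heals in the background, anything beyond that in the foreground) can be pictured with a toy model. This is only a sketch of the behaviour as reported in this thread. The class and method names are hypothetical and do not correspond to AFR code, and it assumes a foreground heal does not occupy a background slot.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SelfHealAdmission:
    """Toy model of the admission rule; not GlusterFS code."""
    background_limit: int = 16   # default value quoted in the thread
    active_background: int = 0

    def start_heal(self) -> str:
        # A free slot means the heal runs in the background and the
        # caller (for example a stat) returns without waiting.
        if self.active_background < self.background_limit:
            self.active_background += 1
            return "background"
        # No slot and no queue: the caller performs the heal itself
        # and blocks until it completes.
        return "foreground"

    def finish_background_heal(self) -> None:
        if self.active_background > 0:
            self.active_background -= 1


def admit(count: int, limit: int = 16) -> List[str]:
    """Return the mode assigned to each of `count` back-to-back heal requests."""
    model = SelfHealAdmission(background_limit=limit)
    return [model.start_heal() for _ in range(count)]


if __name__ == "__main__":
    modes = admit(17)
    print(modes.count("background"), "background,", modes.count("foreground"), "foreground")
```

With the default limit, the first 16 requests are admitted to the background and the 17th falls to the foreground, which matches the blocking "ls" seen once the slots were full.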
> > > > > >
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > >From: "Pranith Kumar K" <pranithk@xxxxxxxxxxx>
> > > > > > >To: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>
> > > > > > >Subject: Re: replicate background threads
> > > > > > >Date: Wed, 14 Mar 2012 07:33:32 +0530
> > > > > > >
> > > > > > > On 03/14/2012 01:47 AM, Ian Latter wrote:
> > > > > > > > Thanks for the info Pranith;
> > > > > > > >
> > > > > > > > <pranithk> the option to increase the max num of
> > > > > > > > background self-heals is
> > > > > > > > cluster.background-self-heal-count. Default value of
> > > > > > > > which is 16. I assume you know what you are doing to
> > > > > > > > the performance of the system by increasing this
> > > > > > > > number.
> > > > > > > >
> > > > > > > >
> > > > > > > > I didn't know this. Is there a queue length for what
> > > > > > > > is yet to be handled by the background self heal
> > > > > > > > count? If so, can it also be adjusted?
> > > > > > > >
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > >> From: "Pranith Kumar K"<pranithk@xxxxxxxxxxx>
> > > > > > > >> To: "Ian Latter"<ian.latter@xxxxxxxxxxxxxxxx>
> > > > > > > >> Subject: Re: replicate background threads
> > > > > > > >> Date: Tue, 13 Mar 2012 21:07:53 +0530
> > > > > > > >>
> > > > > > > >> On 03/13/2012 07:52 PM, Ian Latter wrote:
> > > > > > > >>> Hello,
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>> Well we've been privy to our first true error in
> > > > > > > >>> Gluster now, and we're not sure of the cause.
> > > > > > > >>>
> > > > > > > >>> The SaturnI machine with 1Gbyte of RAM was
> > > > > > > >>> exhausting its memory and crashing and we saw
> > > > > > > >>> core dumps on SaturnM and MMC. Replacing
> > > > > > > >>> the SaturnI hardware with identical hardware to
> > > > > > > >>> SaturnM, but retaining SaturnI's original disks,
> > > > > > > >>> (so fixing the memory capacity problem) we saw
> > > > > > > >>> crashes randomly at all nodes.
> > > > > > > >>>
> > > > > > > >>> Looking for irregularities at the file system
> > > > > > > >>> we noticed that (we'd estimate) about 60% of
> > > > > > > >>> the files at the OS/EXT3 layer of SaturnI
> > > > > > > >>> (sourced via replicate from SaturnM) were of
> > > > > > > >>> size 2147483648 (2^31) where they should
> > > > > > > >>> have been substantially larger. While we would
> > > > > > > >>> happily accept "you shouldn't be using a 32bit
> > > > > > > >>> gluster package" as the answer, we note two
> > > > > > > >>> deltas;
> > > > > > > >>> 1) All files used in testing were copied on from
> > > > > > > >>>    32 bit clients to 32 bit servers, with no
> > > > > > > >>>    observable errors
> > > > > > > >>> 2) Of the file that were replicated, not all were
> > > > > > > >>>    corrupted (capped at 2G -- note that we
> > > > > > > >>>    confirmed that this was the first 2G of the
> > > > > > > >>>    source file contents).
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>> So is there a known replicate issue with files
> > > > > > > >>> greater than 2GB? Has anyone done any
> > > > > > > >>> serious testing with significant numbers of files
> > > > > > > >>> of this size? Are there configurations specific
> > > > > > > >>> to files/structures of these dimensions?
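For anyone wanting to repeat the check described above (looking on the backend file system for replicas capped at exactly 2^31 bytes), a minimal sketch follows. The function name and the example path in the usage note are hypothetical. It only reports candidates by size and does not verify their contents against the source.

```python
import os
from typing import Iterator

TWO_GIB = 2**31  # 2147483648, the size at which the damaged replicas were capped


def files_of_exact_size(root: str, size: int = TWO_GIB) -> Iterator[str]:
    """Yield paths under `root` whose size in bytes is exactly `size`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getsize(path) == size:
                    yield path
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
```

A call such as `list(files_of_exact_size("/path/to/brick"))`, run against the EXT3 directory behind the replica rather than the Gluster mount, would list the suspect files. A file that is legitimately 2 GiB long would also match, so the result is a list of candidates to compare against the source, not proof of truncation.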
> > > > > > > >>>
> > > > > > > >>> We noted that reversing the configuration, such
> > > > > > > >>> that SaturnI provides the replicate Brick amongst
> > > > > > > >>> a local distribute and a remote map to SaturnM
> > > > > > > >>> where SaturnM simply serves a local distribute;
> > > > > > > >>> that the data served to MMC is accurate (it
> > > > > > > >>> continues to show 15GB files, even where there
> > > > > > > >>> is a local 2GB copy). Further, a client "cp" at
> > > > > > > >>> MMC, of a file with a 2GB local replicate of a
> > > > > > > >>> 15GB file, will result in a 15GB file being
> > > > > > > >>> created and replicated via Gluster (i.e. the
> > > > > > > >>> correct specification at both server nodes).
> > > > > > > >>>
> > > > > > > >>> So my other question is; Is it possible that we've
> > > > > > > >>> managed to corrupt something in this
> > > > > > > >>> environment? I.e. during the initial memory
> > > > > > > >>> exhaustion events? And is there a robust way
> > > > > > > >>> to have the replicate files revalidated by gluster
> > > > > > > >>> as a stat doesn't seem to be correcting files in
> > > > > > > >>> this state (i.e. replicate on SaturnM results in
> > > > > > > >>> daemon crashes, replicate on SaturnI results
> > > > > > > >>> in files being left in the bad state).
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>> Also, I'm not a member of the users list; if these
> > > > > > > >>> questions are better posed there then let me
> > > > > > > >>> know and I'll re-post them there.
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>> Thanks,
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>> ----- Original Message -----
> > > > > > > >>>> From: "Ian Latter"<ian.latter@xxxxxxxxxxxxxxxx>
> > > > > > > >>>> To:<gluster-devel@xxxxxxxxxx>
> > > > > > > >>>> Subject: replicate background threads
> > > > > > > >>>> Date: Sun, 11 Mar 2012 20:17:15 +1000
> > > > > > > >>>>
> > > > > > > >>>> Hello,
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>> My mate Michael and I have been steadily
> > > > > > > >>>> advancing our Gluster testing and today we finally
> > > > > > > >>>> reached some heavier conditions. The outcome
> > > > > > > >>>> was different from expectations built from our more
> > > > > > > >>>> basic testing so I think we have a couple of
> > > > > > > >>>> questions regarding the AFR/Replicate background
> > > > > > > >>>> threads that may need a developer's contribution.
> > > > > > > >>>> Any help appreciated.
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>> The setup is a 3 box environment, but lets start
> > > > > > > >>>> with two;
> > > > > > > >>>>
> > > > > > > >>>> SaturnM (Server)
> > > > > > > >>>>  - 6core CPU, 16GB RAM, 1Gbps net
> > > > > > > >>>>  - 3.2.6 Kernel (custom distro)
> > > > > > > >>>>  - 3.2.5 Gluster (32bit)
> > > > > > > >>>>  - 3x2TB drives, CFQ, EXT3
> > > > > > > >>>>  - Bricked up into a single local 6TB
> > > > > > > >>>>    "distribute" brick
> > > > > > > >>>>  - "brick" served to the network
> > > > > > > >>>>
> > > > > > > >>>> MMC (Client)
> > > > > > > >>>>  - 4core CPU, 8GB RAM, 1Gbps net
> > > > > > > >>>>  - Ubuntu
> > > > > > > >>>>  - 3.2.5 Gluster (32bit)
> > > > > > > >>>>  - Don't recall the disk space locally
> > > > > > > >>>>  - "brick" from SaturnM mounted
> > > > > > > >>>>
> > > > > > > >>>> 500 x 15Gbyte files were copied from MMC
> > > > > > > >>>> to a single sub-directory on the brick served from
> > > > > > > >>>> SaturnM, all went well and dandy. So then we
> > > > > > > >>>> moved on to a 3 box environment;
> > > > > > > >>>>
> > > > > > > >>>> SaturnI (Server)
> > > > > > > >>>>  = 1core CPU, 1GB RAM, 1Gbps net
> > > > > > > >>>>  = 3.2.6 Kernel (custom distro)
> > > > > > > >>>>  = 3.2.5 Gluster (32bit)
> > > > > > > >>>>  = 4x2TB drives, CFQ, EXT3
> > > > > > > >>>>  = Bricked up into a single local 8TB
> > > > > > > >>>>    "distribute" brick
> > > > > > > >>>>  = "brick" served to the network
> > > > > > > >>>>
> > > > > > > >>>> SaturnM (Server/Client)
> > > > > > > >>>>  - 6core CPU, 16GB RAM, 1Gbps net
> > > > > > > >>>>  - 3.2.6 Kernel (custom distro)
> > > > > > > >>>>  - 3.2.5 Gluster (32bit)
> > > > > > > >>>>  - 3x2TB drives, CFQ, EXT3
> > > > > > > >>>>  - Bricked up into a single local 6TB
> > > > > > > >>>>    "distribute" brick
> > > > > > > >>>>  = Replicate brick added to sit over
> > > > > > > >>>>    the local distribute brick and a
> > > > > > > >>>>    client "brick" mapped from SaturnI
> > > > > > > >>>>  - Replicate "brick" served to the network
> > > > > > > >>>>
> > > > > > > >>>> MMC (Client)
> > > > > > > >>>>  - 4core CPU, 8GB RAM, 1Gbps net
> > > > > > > >>>>  - Ubuntu
> > > > > > > >>>>  - 3.2.5 Gluster (32bit)
> > > > > > > >>>>  - Don't recall the disk space locally
> > > > > > > >>>>  - "brick" from SaturnM mounted
> > > > > > > >>>>  = "brick" from SaturnI mounted
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>> Now, in lesser testing in this scenario all was
> > > > > > > >>>> well - any files on SaturnI would appear on SaturnM
> > > > > > > >>>> (not a functional part of our test) and the content on
> > > > > > > >>>> SaturnM would appear on SaturnI (the real
> > > > > > > >>>> objective).
> > > > > > > >>>>
> > > > > > > >>>> Earlier testing used a handful of smaller files (10s
> > > > > > > >>>> to 100s of Mbytes) and a single 15Gbyte file.
> > > > > > > >>>> The 15Gbyte file would be "stat" via an "ls", which
> > > > > > > >>>> would kick off a background replication (ls appeared
> > > > > > > >>>> un-blocked) and the file would be transferred. Also,
> > > > > > > >>>> interrupting the transfer (pulling the LAN cable)
> > > > > > > >>>> would result in a partial 15Gbyte file being corrected
> > > > > > > >>>> in a subsequent background process on another
> > > > > > > >>>> stat.
> > > > > > > >>>>
> > > > > > > >>>> *However* .. when confronted with 500 x 15Gbyte
> > > > > > > >>>> files, in a single directory (but not the root directory)
> > > > > > > >>>> things don't quite work out as nicely.
> > > > > > > >>>>  - First, the "ls" (at MMC against the SaturnM brick)
> > > > > > > >>>>    is blocking and hangs the terminal (ctrl-c doesn't
> > > > > > > >>>>    kill it).
> > > > > > > >> <pranithk> At max 16 files can be self-healed in the
> > > > > > > >> back-ground in parallel. 17th file self-heal will happen
> > > > > > > >> in the foreground.
> > > > > > > >>>>  - Then, looking from MMC at the SaturnI file
> > > > > > > >>>>    system (ls -s) once per second, and then
> > > > > > > >>>>    comparing the output (diff ls1.txt ls2.txt |
> > > > > > > >>>>    grep -v '>') we can see that between 10 and 17
> > > > > > > >>>>    files are being updated simultaneously by the
> > > > > > > >>>>    background process
> > > > > > > >> <pranithk> This is expected.
> > > > > > > >>>>  - Further, a request at MMC for a single file that
> > > > > > > >>>>    was originally in the 500 x 15Gbyte sub-dir on
> > > > > > > >>>>    SaturnM (which should return unblocked with
> > > > > > > >>>>    correct results) will;
> > > > > > > >>>>    a) work as expected if there are less than 17
> > > > > > > >>>>       active background file tasks
> > > > > > > >>>>    b) block/hang if there are more
> > > > > > > >>>>  - Where-as a stat (ls) outside of the 500 x 15
> > > > > > > >>>>    sub-directory, such as the root of that brick,
> > > > > > > >>>>    would always work as expected (return
> > > > > > > >>>>    immediately, unblocked).
> > > > > > > >> <pranithk> stat on the directory will only create the
> > > > > > > >> files with '0' file size. Then when you ls/stat the actual
> > > > > > > >> file the self-heal for the file gets triggered.
> > > > > > > >>>>
> > > > > > > >>>> Thus, to us, it appears as though there is a
> > > > > > > >>>> queue feeding a set of (around) 16 worker threads
> > > > > > > >>>> in AFR. If your request was to the loaded directory
> > > > > > > >>>> then you would be blocked until a worker was
> > > > > > > >>>> available, and if your request was to any other
> > > > > > > >>>> location, it would return unblocked regardless of
> > > > > > > >>>> the worker pool state.
> > > > > > > >>>>
> > > > > > > >>>> The only thread metric that we could find to tweak
> > > > > > > >>>> was performance/io-threads (which was set to
> > > > > > > >>>> 16 per physical disk; well per locks + posix brick
> > > > > > > >>>> stacks) but increasing this to 64 per stack didn't
> > > > > > > >>>> change the outcome (10 to 17 active background
> > > > > > > >>>> transfers).
> > > > > > > >> <pranithk> the option to increase the max num of
> > > > > > > >> background self-heals is
> > > > > > > >> cluster.background-self-heal-count. Default value of
> > > > > > > >> which is 16. I assume you know what you are doing to the
> > > > > > > >> performance of the system by increasing this number.
> > > > > > > >>>>
> > > > > > > >>>> So, given the above, is our analysis sound, and
> > > > > > > >>>> if so, is there a way to increase the size of the
> > > > > > > >>>> pool of active worker threads? The objective
> > > > > > > >>>> being to allow unblocked access to an existing
> > > > > > > >>>> repository of files (on SaturnM) while a
> > > > > > > >>>> secondary/back-up is being filled, via GlusterFS?
> > > > > > > >>>>
> > > > > > > >>>> Note that I understand that performance
> > > > > > > >>>> (through-put) will be an issue in the described
> > > > > > > >>>> environment: this replication process is
> > > > > > > >>>> estimated to run for between 10 and 40 hours,
> > > > > > > >>>> which is acceptable so long as it isn't blocking
> > > > > > > >>>> (there's a production-capable file set in place).
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>> Any help appreciated.
> > > > > > > >>>>
> > > > > > > >> Please let us know how it goes.
> > > > > > > >>>> Thanks,
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>> --
> > > > > > > >>>> Ian Latter
> > > > > > > >>>> Late night coder ..
> > > > > > > >>>> http://midnightcode.org/
> > > > > > > >>>>
> > > > > > > >>>> _______________________________________________
> > > > > > > >>>> Gluster-devel mailing list
> > > > > > > >>>> Gluster-devel@xxxxxxxxxx
> > > > > > > >>>> https://lists.nongnu.org/mailman/listinfo/gluster-devel
> > > > > > > >>>>
> > > > > > > >>> --
> > > > > > > >>> Ian Latter
> > > > > > > >>> Late night coder ..
> > > > > > > >>> http://midnightcode.org/
> > > > > > > >>>
> > > > > > > >> hi Ian,
> > > > > > > >> inline replies with<pranithk>.
> > > > > > > >>
> > > > > > > >> Pranith.
> > > > > > > >>
> > > > > > > hi Ian,
> > > > > > > Maintaining a queue of files that need to be
> > > > > > > self-healed does not scale in practice, in cases
> > > > > > > where there are millions of files that need
> > > > > > > self-heal. So such a thing is not implemented. The
> > > > > > > idea is to make self-heal foreground after a
> > > > > > > certain-limit (background-self-heal-count) so
> > > > > > > there is no necessity for such a queue.
> > > > > > >
> > > > > > > Pranith.

--
Ian Latter
Late night coder ..
http://midnightcode.org/