Hello,

> hi Ian,
>      Maintaining a queue of files that need to be self-healed does not
> scale in practice, in cases where there are millions of files that need
> self-heal. So such a thing is not implemented. The idea is to make
> self-heal foreground after a certain-limit (background-self-heal-count)
> so there is no necessity for such a queue.
>
> Pranith.

Ok, I understand - it will be interesting to observe the system with the
new knowledge from your messages - thanks for your help, appreciate it.


Cheers,
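For anyone who wants to raise that limit in a setup like this one, where
the volfiles are written by hand rather than managed by the CLI, below is
a rough sketch of the relevant cluster/replicate section. The subvolume
names and the value 32 are placeholders rather than anything taken from
the volfiles in this thread; the option is the same knob that the CLI
layer exposes as cluster.background-self-heal-count.

    volume replicate0
      type cluster/replicate
      # default is 16; raising it lets more files heal in the background
      # before self-heal becomes foreground (and blocking) for the caller
      option background-self-heal-count 32
      subvolumes distribute-local saturni-remote
    end-volume
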
----- Original Message -----
> From: "Pranith Kumar K" <pranithk@xxxxxxxxxxx>
> To: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>
> Subject: Re: replicate background threads
> Date: Wed, 14 Mar 2012 07:33:32 +0530
>
> On 03/14/2012 01:47 AM, Ian Latter wrote:
> > Thanks for the info Pranith;
> >
> > <pranithk> the option to increase the max num of background self-heals
> > is cluster.background-self-heal-count. Default value of which is 16. I
> > assume you know what you are doing to the performance of the system by
> > increasing this number.
> >
> > I didn't know this. Is there a queue length for what is yet to be
> > handled by the background self heal count? If so, can it also be
> > adjusted?
> >
> > ----- Original Message -----
> >> From: "Pranith Kumar K" <pranithk@xxxxxxxxxxx>
> >> To: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>
> >> Subject: Re: replicate background threads
> >> Date: Tue, 13 Mar 2012 21:07:53 +0530
> >>
> >> On 03/13/2012 07:52 PM, Ian Latter wrote:
> >>> Hello,
> >>>
> >>> Well we've been privy to our first true error in Gluster now, and
> >>> we're not sure of the cause.
> >>>
> >>> The SaturnI machine with 1Gbyte of RAM was exhausting its memory
> >>> and crashing, and we saw core dumps on SaturnM and MMC. Replacing
> >>> the SaturnI hardware with identical hardware to SaturnM, but
> >>> retaining SaturnI's original disks (so fixing the memory capacity
> >>> problem), we saw crashes randomly at all nodes.
> >>>
> >>> Looking for irregularities at the file system, we noticed that
> >>> (we'd estimate) about 60% of the files at the OS/EXT3 layer of
> >>> SaturnI (sourced via replicate from SaturnM) were of size
> >>> 2147483648 (2^31) where they should have been substantially
> >>> larger. While we would happily accept "you shouldn't be using a
> >>> 32bit gluster package" as the answer, we note two deltas;
> >>> 1) All files used in testing were copied on from 32 bit clients
> >>>    to 32 bit servers, with no observable errors
> >>> 2) Of the files that were replicated, not all were corrupted
> >>>    (capped at 2G -- note that we confirmed that this was the first
> >>>    2G of the source file contents).
> >>>
> >>> So is there a known replicate issue with files greater than 2GB?
> >>> Has anyone done any serious testing with significant numbers of
> >>> files of this size? Are there configurations specific to
> >>> files/structures of these dimensions?
> >>>
> >>> We noted that when reversing the configuration (such that SaturnI
> >>> provides the replicate brick over a local distribute and a remote
> >>> map to SaturnM, while SaturnM simply serves a local distribute),
> >>> the data served to MMC is accurate (it continues to show 15GB
> >>> files, even where there is a local 2GB copy). Further, a client
> >>> "cp" at MMC, of a file with a 2GB local replicate of a 15GB file,
> >>> will result in a 15GB file being created and replicated via
> >>> Gluster (i.e. the correct specification at both server nodes).
> >>>
> >>> So my other question is: is it possible that we've managed to
> >>> corrupt something in this environment, i.e. during the initial
> >>> memory exhaustion events? And is there a robust way to have the
> >>> replicated files revalidated by gluster, as a stat doesn't seem to
> >>> be correcting files in this state (i.e. replicate on SaturnM
> >>> results in daemon crashes, replicate on SaturnI results in files
> >>> being left in the bad state)?
> >>>
> >>> Also, I'm not a member of the users list; if these questions are
> >>> better posed there then let me know and I'll re-post them there.
> >>>
> >>> Thanks,
> >>>
> >>> ----- Original Message -----
> >>>> From: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>
> >>>> To: <gluster-devel@xxxxxxxxxx>
> >>>> Subject: replicate background threads
> >>>> Date: Sun, 11 Mar 2012 20:17:15 +1000
> >>>>
> >>>> Hello,
> >>>>
> >>>> My mate Michael and I have been steadily advancing our Gluster
> >>>> testing and today we finally reached some heavier conditions. The
> >>>> outcome was different from expectations built from our more basic
> >>>> testing, so I think we have a couple of questions regarding the
> >>>> AFR/Replicate background threads that may need a developer's
> >>>> contribution. Any help appreciated.
> >>>>
> >>>> The setup is a 3 box environment, but let's start with two;
> >>>>
> >>>> SaturnM (Server)
> >>>>   - 6core CPU, 16GB RAM, 1Gbps net
> >>>>   - 3.2.6 Kernel (custom distro)
> >>>>   - 3.2.5 Gluster (32bit)
> >>>>   - 3x2TB drives, CFQ, EXT3
> >>>>   - Bricked up into a single local 6TB "distribute" brick
> >>>>   - "brick" served to the network
> >>>>
> >>>> MMC (Client)
> >>>>   - 4core CPU, 8GB RAM, 1Gbps net
> >>>>   - Ubuntu
> >>>>   - 3.2.5 Gluster (32bit)
> >>>>   - Don't recall the disk space locally
> >>>>   - "brick" from SaturnM mounted
> >>>>
> >>>> 500 x 15Gbyte files were copied from MMC to a single
> >>>> sub-directory on the brick served from SaturnM; all went well and
> >>>> dandy. So then we moved on to a 3 box environment;
> >>>>
> >>>> SaturnI (Server)
> >>>>   = 1core CPU, 1GB RAM, 1Gbps net
> >>>>   = 3.2.6 Kernel (custom distro)
> >>>>   = 3.2.5 Gluster (32bit)
> >>>>   = 4x2TB drives, CFQ, EXT3
> >>>>   = Bricked up into a single local 8TB "distribute" brick
> >>>>   = "brick" served to the network
> >>>>
> >>>> SaturnM (Server/Client)
> >>>>   - 6core CPU, 16GB RAM, 1Gbps net
> >>>>   - 3.2.6 Kernel (custom distro)
> >>>>   - 3.2.5 Gluster (32bit)
> >>>>   - 3x2TB drives, CFQ, EXT3
> >>>>   - Bricked up into a single local 6TB "distribute" brick
> >>>>   = Replicate brick added to sit over the local distribute brick
> >>>>     and a client "brick" mapped from SaturnI
> >>>>   - Replicate "brick" served to the network
> >>>>
> >>>> MMC (Client)
> >>>>   - 4core CPU, 8GB RAM, 1Gbps net
> >>>>   - Ubuntu
> >>>>   - 3.2.5 Gluster (32bit)
> >>>>   - Don't recall the disk space locally
> >>>>   - "brick" from SaturnM mounted
> >>>>   = "brick" from SaturnI mounted
> >>>>
> >>>> Now, in lesser testing in this scenario all was well - any files
> >>>> on SaturnI would appear on SaturnM (not a functional part of our
> >>>> test) and the content on SaturnM would appear on SaturnI (the
> >>>> real objective).
> >>>>
> >>>> Earlier testing used a handful of smaller files (10s to 100s of
> >>>> Mbytes) and a single 15Gbyte file.
> >>>> The 15Gbyte file would be "stat" via an "ls", which would kick
> >>>> off a background replication (ls appeared unblocked) and the file
> >>>> would be transferred. Also, interrupting the transfer (pulling
> >>>> the LAN cable) would result in a partial 15Gbyte file being
> >>>> corrected in a subsequent background process on another stat.
> >>>>
> >>>> *However* .. when confronted with 500 x 15Gbyte files, in a
> >>>> single directory (but not the root directory) things don't quite
> >>>> work out as nicely.
> >>>> - First, the "ls" (at MMC against the SaturnM brick) is blocking
> >>>>   and hangs the terminal (ctrl-c doesn't kill it).
> >> <pranithk> At max 16 files can be self-healed in the background in
> >> parallel. The 17th file's self-heal will happen in the foreground.
> >>>> - Then, looking from MMC at the SaturnI file system (ls -s) once
> >>>>   per second, and then comparing the output (diff ls1.txt ls2.txt
> >>>>   | grep -v '>'), we can see that between 10 and 17 files are
> >>>>   being updated simultaneously by the background process
> >> <pranithk> This is expected.
> >>>> - Further, a request at MMC for a single file that was originally
> >>>>   in the 500 x 15Gbyte sub-dir on SaturnM (which should return
> >>>>   unblocked with correct results) will;
> >>>>   a) work as expected if there are less than 17 active background
> >>>>      file tasks
> >>>>   b) block/hang if there are more
> >>>> - Whereas a stat (ls) outside of the 500 x 15 sub-directory, such
> >>>>   as the root of that brick, would always work as expected
> >>>>   (return immediately, unblocked).
> >> <pranithk> stat on the directory will only create the files with
> >> '0' file size. Then when you ls/stat the actual file, the self-heal
> >> for the file gets triggered.
> >>>>
> >>>> Thus, to us, it appears as though there is a queue feeding a set
> >>>> of (around) 16 worker threads in AFR. If your request was to the
> >>>> loaded directory then you would be blocked until a worker was
> >>>> available, and if your request was to any other location, it
> >>>> would return unblocked regardless of the worker pool state.
> >>>>
> >>>> The only thread metric that we could find to tweak was
> >>>> performance/io-threads (which was set to 16 per physical disk;
> >>>> well, per locks + posix brick stacks), but increasing this to 64
> >>>> per stack didn't change the outcome (10 to 17 active background
> >>>> transfers).
> >> <pranithk> the option to increase the max num of background
> >> self-heals is cluster.background-self-heal-count. Default value of
> >> which is 16. I assume you know what you are doing to the
> >> performance of the system by increasing this number.
> >>>>
> >>>> So, given the above, is our analysis sound, and if so, is there a
> >>>> way to increase the size of the pool of active worker threads?
> >>>> The objective being to allow unblocked access to an existing
> >>>> repository of files (on SaturnM) while a secondary/back-up is
> >>>> being filled, via GlusterFS?
> >>>>
> >>>> Note that I understand that performance (throughput) will be an
> >>>> issue in the described environment: this replication process is
> >>>> estimated to run for between 10 and 40 hours, which is acceptable
> >>>> so long as it isn't blocking (there's a production-capable file
> >>>> set in place).
> >>>>
> >>>> Any help appreciated.
> >>>>
> >> Please let us know how it goes.
> >>>> Thanks,
> >>>>
> >>>> --
> >>>> Ian Latter
> >>>> Late night coder ..
> >>>> http://midnightcode.org/
> >>>>
> >>>> _______________________________________________
> >>>> Gluster-devel mailing list
> >>>> Gluster-devel@xxxxxxxxxx
> >>>> https://lists.nongnu.org/mailman/listinfo/gluster-devel
> >>>>
> >>> --
> >>> Ian Latter
> >>> Late night coder ..
> >>> http://midnightcode.org/
> >>>
> >>> _______________________________________________
> >>> Gluster-devel mailing list
> >>> Gluster-devel@xxxxxxxxxx
> >>> https://lists.nongnu.org/mailman/listinfo/gluster-devel
> >> hi Ian,
> >>      inline replies with <pranithk>.
> >>
> >> Pranith.
> >>
> > --
> > Ian Latter
> > Late night coder ..
> > http://midnightcode.org/
>
> hi Ian,
>      Maintaining a queue of files that need to be self-healed does not
> scale in practice, in cases where there are millions of files that need
> self-heal. So such a thing is not implemented. The idea is to make
> self-heal foreground after a certain-limit (background-self-heal-count)
> so there is no necessity for such a queue.
>
> Pranith.
>

--
Ian Latter
Late night coder ..
http://midnightcode.org/
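
For anyone wanting to reproduce the "10 to 17 active background
transfers" observation above, here is a minimal sketch of the
snapshot-and-diff check described in the thread. The directory path and
snapshot filenames are placeholders, not the ones actually used; it
simply lists which files changed size between two snapshots taken one
second apart on the receiving brick.

    #!/bin/sh
    # Watch which files in the healing directory are still changing size.
    # DIR is a placeholder for the local path of the brick sub-directory.
    DIR=/data/brick/bigdir

    ls -s "$DIR" > ls1.txt
    while sleep 1; do
        ls -s "$DIR" > ls2.txt
        # Entries from the older snapshot that differ; files still in
        # flight show up here (roughly the active self-heal count).
        diff ls1.txt ls2.txt | grep -v '>'
        mv ls2.txt ls1.txt
    done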