Thanks for the info Pranith;

  <pranithk> the option to increase the max number of background
  self-heals is cluster.background-self-heal-count; its default
  value is 16.  I assume you know what you are doing to the
  performance of the system by increasing this number.

I didn't know this.  Is there a queue length for what is yet to be
handled by the background self-heal count?  If so, can it also be
adjusted?
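For the archives: my reading is that with hand-written volfiles (as
we use) this knob is an option on the cluster/replicate translator.
The fragment below is an untested sketch based on that reading, with
illustrative volume names, rather than a verified config:

  volume replicate
    type cluster/replicate
    # default is 16; raising it allows more concurrent background
    # self-heals at the cost of memory, disk and network contention
    option background-self-heal-count 32
    subvolumes local-distribute saturni-remote
  end-volume

On a glusterd-managed volume I believe the equivalent would be:

  gluster volume set <volname> cluster.background-self-heal-count 32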
----- Original Message -----
> From: "Pranith Kumar K" <pranithk@xxxxxxxxxxx>
> To: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>
> Subject: Re: replicate background threads
> Date: Tue, 13 Mar 2012 21:07:53 +0530
>
> On 03/13/2012 07:52 PM, Ian Latter wrote:
> > Hello,
> >
> > Well, we've now been privy to our first true error in Gluster,
> > and we're not sure of the cause.
> >
> > The SaturnI machine, with 1GB of RAM, was exhausting its memory
> > and crashing, and we saw core dumps on SaturnM and MMC.  After
> > replacing the SaturnI hardware with hardware identical to
> > SaturnM's, but retaining SaturnI's original disks (so fixing the
> > memory capacity problem), we saw random crashes at all nodes.
> >
> > Looking for irregularities at the file system, we noticed that
> > (we'd estimate) about 60% of the files at the OS/EXT3 layer of
> > SaturnI (sourced via replicate from SaturnM) were of size
> > 2147483648 (2^31) where they should have been substantially
> > larger.  While we would happily accept "you shouldn't be using
> > a 32-bit gluster package" as the answer, we note two deltas:
> >   1) All files used in testing were copied from 32-bit clients
> >      to 32-bit servers, with no observable errors.
> >   2) Of the files that were replicated, not all were corrupted
> >      (capped at 2GB -- note that we confirmed this was the
> >      first 2GB of the source file's contents).
> >
> > So is there a known replicate issue with files greater than
> > 2GB?  Has anyone done any serious testing with significant
> > numbers of files of this size?  Are there configurations
> > specific to files/structures of these dimensions?
> >
> > We noted that when we reversed the configuration, such that
> > SaturnI provides the replicate brick over its local distribute
> > plus a remote map to SaturnM, and SaturnM simply serves its
> > local distribute, the data served to MMC is accurate (it
> > continues to show 15GB files, even where there is a local 2GB
> > copy).  Further, a client "cp" at MMC of a 15GB file that has a
> > 2GB local replica will result in a 15GB file being created and
> > replicated via Gluster (i.e. the correct size at both server
> > nodes).
> >
> > So my other question is: is it possible that we've managed to
> > corrupt something in this environment, i.e. during the initial
> > memory exhaustion events?  And is there a robust way to have
> > the replicated files revalidated by Gluster?  A stat doesn't
> > seem to be correcting files in this state (i.e. replicate on
> > SaturnM results in daemon crashes, while replicate on SaturnI
> > results in files being left in the bad state).
> >
> > Also, I'm not a member of the users list; if these questions
> > are better posed there then let me know and I'll re-post them
> > there.
> >
> > Thanks,
> >
> >
> > ----- Original Message -----
> >> From: "Ian Latter" <ian.latter@xxxxxxxxxxxxxxxx>
> >> To: <gluster-devel@xxxxxxxxxx>
> >> Subject: replicate background threads
> >> Date: Sun, 11 Mar 2012 20:17:15 +1000
> >>
> >> Hello,
> >>
> >> My mate Michael and I have been steadily advancing our Gluster
> >> testing, and today we finally reached some heavier conditions.
> >> The outcome was different from the expectations built from our
> >> more basic testing, so I think we have a couple of questions
> >> regarding the AFR/replicate background threads that may need a
> >> developer's contribution.  Any help appreciated.
> >>
> >> The setup is a 3-box environment, but let's start with two:
> >>
> >> SaturnM (Server)
> >>   - 6-core CPU, 16GB RAM, 1Gbps net
> >>   - 3.2.6 kernel (custom distro)
> >>   - 3.2.5 Gluster (32-bit)
> >>   - 3 x 2TB drives, CFQ, EXT3
> >>   - Bricked up into a single local 6TB "distribute" brick
> >>   - "brick" served to the network
> >>
> >> MMC (Client)
> >>   - 4-core CPU, 8GB RAM, 1Gbps net
> >>   - Ubuntu
> >>   - 3.2.5 Gluster (32-bit)
> >>   - Don't recall the disk space locally
> >>   - "brick" from SaturnM mounted
> >>
> >> 500 x 15GB files were copied from MMC to a single
> >> sub-directory on the brick served from SaturnM, and all went
> >> well and dandy.  So then we moved on to a 3-box environment
> >> (lines marked "=" are new or changed relative to the two-box
> >> setup):
> >>
> >> SaturnI (Server)
> >>   = 1-core CPU, 1GB RAM, 1Gbps net
> >>   = 3.2.6 kernel (custom distro)
> >>   = 3.2.5 Gluster (32-bit)
> >>   = 4 x 2TB drives, CFQ, EXT3
> >>   = Bricked up into a single local 8TB "distribute" brick
> >>   = "brick" served to the network
> >>
> >> SaturnM (Server/Client)
> >>   - 6-core CPU, 16GB RAM, 1Gbps net
> >>   - 3.2.6 kernel (custom distro)
> >>   - 3.2.5 Gluster (32-bit)
> >>   - 3 x 2TB drives, CFQ, EXT3
> >>   - Bricked up into a single local 6TB "distribute" brick
> >>   = Replicate brick added to sit over the local distribute
> >>     brick and a client "brick" mapped from SaturnI
> >>   - Replicate "brick" served to the network
> >>
> >> MMC (Client)
> >>   - 4-core CPU, 8GB RAM, 1Gbps net
> >>   - Ubuntu
> >>   - 3.2.5 Gluster (32-bit)
> >>   - Don't recall the disk space locally
> >>   - "brick" from SaturnM mounted
> >>   = "brick" from SaturnI mounted
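To save anyone reconstructing it from the description above: SaturnM's
stack in that test is conceptually the server volfile below.  Treat it
as a from-memory sketch rather than our exact config -- the volume
names, hostname and per-disk stacks are illustrative only:

  # disk1..disk3 are each a storage/posix -> features/locks ->
  # performance/io-threads stack, one per physical disk (elided here)

  volume saturni-remote
    type protocol/client
    option transport-type tcp
    option remote-host saturni
    option remote-subvolume brick
  end-volume

  volume local-distribute
    type cluster/distribute
    subvolumes disk1 disk2 disk3
  end-volume

  # replicate sits over the local distribute plus the remote SaturnI
  # brick; this volume is what protocol/server exports as "brick"
  volume replicate
    type cluster/replicate
    subvolumes local-distribute saturni-remote
  end-volume

The point relevant to this thread is the cluster/replicate volume at
the top: AFR's background self-heals happen in that translator, which
would explain why tuning the per-disk io-threads pools underneath it
(as we describe further down) didn't change the 16-way limit.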
> >> Now, in lesser testing in this scenario all was well -- any
> >> files on SaturnI would appear on SaturnM (not a functional
> >> part of our test) and the content on SaturnM would appear on
> >> SaturnI (the real objective).
> >>
> >> Earlier testing used a handful of smaller files (10s to 100s
> >> of MB) and a single 15GB file.  The 15GB file would be
> >> "stat"ed via an "ls", which would kick off a background
> >> replication (the ls appeared unblocked) and the file would be
> >> transferred.  Also, interrupting the transfer (pulling the LAN
> >> cable) would result in the partial 15GB file being corrected
> >> by a subsequent background process on another stat.
> >>
> >> *However* .. when confronted with 500 x 15GB files in a single
> >> directory (but not the root directory), things don't quite
> >> work out as nicely.
> >>   - First, the "ls" (at MMC against the SaturnM brick) is
> >>     blocking and hangs the terminal (ctrl-c doesn't kill it).
> <pranithk> At max 16 files can be self-healed in the background in
> parallel.  The 17th file's self-heal will happen in the foreground.
> >>   - Then, looking from MMC at the SaturnI file system (ls -s)
> >>     once per second, and comparing the output
> >>     (diff ls1.txt ls2.txt | grep -v '>'), we can see that
> >>     between 10 and 17 files are being updated simultaneously
> >>     by the background process.
> <pranithk> This is expected.
> >>   - Further, a request at MMC for a single file that was
> >>     originally in the 500 x 15GB sub-dir on SaturnM (which
> >>     should return unblocked with correct results) will:
> >>       a) work as expected if there are fewer than 17 active
> >>          background file tasks;
> >>       b) block/hang if there are more.
> >>   - Whereas a stat (ls) outside of the 500 x 15GB
> >>     sub-directory, such as at the root of that brick, would
> >>     always work as expected (return immediately, unblocked).
> <pranithk> stat on the directory will only create the files with '0'
> file size.  Then when you ls/stat the actual file, the self-heal for
> that file gets triggered.
> >>
> >> Thus, to us, it appears as though there is a queue feeding a
> >> set of (around) 16 worker threads in AFR.  If your request was
> >> to the loaded directory then you would be blocked until a
> >> worker was available, and if your request was to any other
> >> location, it would return unblocked regardless of the worker
> >> pool state.
> >>
> >> The only thread metric that we could find to tweak was
> >> performance/io-threads (which was set to 16 per physical disk
> >> -- well, per locks + posix brick stack), but increasing this
> >> to 64 per stack didn't change the outcome (10 to 17 active
> >> background transfers).
> <pranithk> the option to increase the max number of background
> self-heals is cluster.background-self-heal-count; its default value
> is 16.  I assume you know what you are doing to the performance of
> the system by increasing this number.
> >>
> >> So, given the above: is our analysis sound, and if so, is
> >> there a way to increase the size of the pool of active worker
> >> threads?  The objective is to allow unblocked access to an
> >> existing repository of files (on SaturnM) while a
> >> secondary/back-up is being filled, via GlusterFS.
> >>
> >> Note that I understand that performance (throughput) will be
> >> an issue in the described environment: this replication
> >> process is estimated to run for between 10 and 40 hours, which
> >> is acceptable so long as it isn't blocking (there's a
> >> production-capable file set in place).
> >>
> >> Any help appreciated.
> Please let us know how it goes.
> >>
> >> Thanks,
> >>
> >> --
> >> Ian Latter
> >> Late night coder ..
> >> http://midnightcode.org/
> >>
> >> _______________________________________________
> >> Gluster-devel mailing list
> >> Gluster-devel@xxxxxxxxxx
> >> https://lists.nongnu.org/mailman/listinfo/gluster-devel
> >
> > --
> > Ian Latter
> > Late night coder ..
> > http://midnightcode.org/
> >
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel@xxxxxxxxxx
> > https://lists.nongnu.org/mailman/listinfo/gluster-devel
> hi Ian,
>     inline replies with <pranithk>.
>
> Pranith.

--
Ian Latter
Late night coder ..
http://midnightcode.org/
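P.S. For anyone who wants to reproduce the "between 10 and 17 files"
observation quoted above, this is roughly the one-second sampling
loop we ran at MMC; the mount point is illustrative:

  # snapshot the file sizes once per second and show what changed;
  # entries whose size moved between samples are the files that the
  # background self-heal is actively writing
  ls -s /mnt/saturni/bigdir > ls1.txt
  while sleep 1; do
    ls -s /mnt/saturni/bigdir > ls2.txt
    diff ls1.txt ls2.txt | grep -v '>'
    mv ls2.txt ls1.txt
  done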