Thanks Brent.

Others: the setup has two server machines, each exporting 16 volumes of
type "storage/posix"; each volume on the first node has an AFR mirror on
the second node. The problem is seen mostly (and most easily) when the
performance translators are used. (For reference, a rough sketch of this
spec layout and some notes on capturing a backtrace are appended after
the quoted thread below.)

Krishna

> I did a quick test, converting my storage into one giant stripe, just
> to see what would happen. It, too, would die after a while. Just
> watching it with top, I realized that glusterfs's memory consumption
> was growing rapidly (at the same rate it was reading data with dd)
> until it probably couldn't allocate any more RAM and died.
>
> Wondering if this might account for what I was experiencing on my
> mirroring configuration (in this case, with all performance translators
> except for io-threads, which dies immediately), sure enough:
>
>   PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
>  7622 root   25   0 1768m 1.7g  776 R  100 44.9  8:37.00 glusterfs
>
> It appears that something triggers a memory leak when reading lots of
> data. Once it starts, it keeps growing until it can't get any more
> memory and dies.
>
> I don't know if this would account for io-threads, though, which causes
> glusterfs to die instantly...
>
> Thanks,
>
> Brent
>
> On Mon, 5 Mar 2007, Brent A Nelson wrote:
>
>> I should have mentioned that my dds are like this:
>>
>>   dd if=/dev/zero of=/phys/blah4 bs=10M count=1024
>>
>> and this:
>>
>>   dd if=/phys/blah4 of=/dev/null bs=10M
>>
>> Both nodes are either writing or reading at the same time.
>>
>> When applying all performance translators, the glusterfs process dies
>> quickly with the least bit of activity. Without io-threads (but with
>> all others), dds that are writing succeed, but reading will knock out
>> glusterfs or the dd may hang (and I also got incomplete du output once
>> during the writing). I tried "glusterfs -s thebe /phys -l DEBUG -N",
>> but nothing was reported in either situation before a segfault. I'll
>> have to recompile with debugging to get more info.
>>
>> Thanks,
>>
>> Brent
>>
>> On Mon, 5 Mar 2007, Brent A Nelson wrote:
>>
>>> Attached are my spec files. Below are some details to start with;
>>> sorry I don't have something more coherent yet:
>>>
>>> I have 2 servers, each with 16 disks shared out individually. The
>>> disks from one node are mirrors of the other node. My clients are the
>>> same machines. I test by running a 10GB dd read or write on one node
>>> while doing the same thing on the other node. Then, if the filesystem
>>> is still running, I may throw in a du of a 200MB copy of /usr that is
>>> on the GlusterFS while the dd processes are running, to see how
>>> metadata is handled while things are busy.
>>>
>>> pre2.2 does not help. I've been tinkering over the weekend, and it
>>> seems that the client stays alive when I don't use any performance
>>> translators, although I still get the error below. Without
>>> performance translators, the error seems to result in an abnormal du,
>>> where du complains that it can't find a few directories (a second du,
>>> under the same circumstances, may work just fine, though), but the dd
>>> processes succeed. With performance translators, I get other
>>> breakage: either the dd processes hang forever or the glusterfs
>>> processes die outright. Glusterfsd has never died on me.
>>>
>>> Here are some other types of errors I can get with the performance
>>> translators (alas, I can't tell you which translators cause which
>>> errors):
>>>
>>> glusterfs:
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:107/full_rwv()] libglusterfs:full_rwv: 73996 bytes r/w instead of 74151 (Bad address)
>>> [Mar 04 00:53:35] [ERROR/client-protocol.c:183/client_protocol_xfer()] protocol/client: client_protocol_xfer: :transport_submit failed
>>> [Mar 04 17:22:27] [ERROR/tcp.c:38/tcp_recieve()] ERROR:../../../../transport/tcp/tcp.c: tcp_recieve: ((buf) == NULL) is true
>>> [Mar 04 17:22:27] [ERROR/tcp.c:38/tcp_recieve()] ERROR:../../../../transport/tcp/tcp.c: tcp_recieve: ((buf) == NULL) is true
>>>
>>> The errors above correspond to the usual glusterfsd error:
>>>
>>> glusterfsd:
>>> [Mar 04 00:52:33] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 00:53:35] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 17:22:12] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>> [Mar 04 17:22:27] [ERROR/common-utils.c:52/full_rw()] libglusterfs:full_rw: 0 bytes r/w instead of 113
>>>
>>> Additional errors I've seen from glusterfsd (seen in conjunction with
>>> the tcp_recieve error messages from glusterfs, just like above):
>>>
>>> [Mar 04 18:46:32] [ERROR/common-utils.c:107/full_rwv()] libglusterfs:full_rwv: 65328 bytes r/w instead of 65744 (Connection reset by peer)
>>> [Mar 04 18:46:32] [ERROR/proto-srv.c:117/generic_reply()] protocol/server:generic_reply: transport_writev failed
>>> [Mar 04 20:05:18] [ERROR/common-utils.c:107/full_rwv()] libglusterfs:full_rwv: 28376 bytes r/w instead of 65744 (Connection reset by peer)
>>> [Mar 04 20:05:18] [ERROR/proto-srv.c:117/generic_reply()] protocol/server:generic_reply: transport_writev failed
>>> [Mar 04 20:05:30] [ERROR/common-utils.c:107/full_rwv()] libglusterfs:full_rwv: 65326 bytes r/w instead of 65746 (Connection reset by peer)
>>> [Mar 04 20:05:30] [ERROR/proto-srv.c:117/generic_reply()] protocol/server:generic_reply: transport_writev failed
>>>
>>> Everything seems related to the error I mentioned originally (0 bytes
>>> r/w instead of 113), though, with or without performance translators.
>>>
>>> Thanks,
>>>
>>> Brent
>>>
>>> On Mon, 5 Mar 2007, Krishna Srinivas wrote:
>>>
>>>> Hi Brent,
>>>> Can you help us get to the root cause of the problem?
>>>> It will be of great help.
>>>> Thanks
>>>> Krishna
>>>>
>>>> On 3/3/07, Anand Avati <avati@xxxxxxxxxxxxx> wrote:
>>>>> Brent,
>>>>> First off, thank you for trying GlusterFS. Can you give a few more
>>>>> details:
>>>>>
>>>>> * Is the log from the server or the client?
>>>>> * The log message from the other one as well, please.
>>>>> * If possible, a backtrace from the core of the one that died.
>>>>>
>>>>> Can you also tell us what I/O pattern triggered the crash? Was it
>>>>> heavy I/O on a single file? Creation of a lot of files? Metadata
>>>>> operations? And is it possible to reproduce it consistently with
>>>>> some steps?
>>>>>
>>>>> Also, we recently uploaded the pre2-1 release tarball. It had a
>>>>> couple of bug fixes, but I need your answers to say whether those
>>>>> fixes apply to you as well.
>>>>>
>>>>> Please attach your spec files as well.
>>>>>
>>>>> regards,
>>>>> avati
>>>>>
>>>>> On Fri, Mar 02, 2007 at 04:05:17PM -0500, Brent A Nelson wrote:
>>>>> > So, I compiled 1.3.0pre2 as soon as it came out (nice,
>>>>> > trouble-free standard configure and make), and I found it very
>>>>> > easy to set up a GlusterFS with one node mirroring 16 disks to
>>>>> > another, all optimizers loaded.
>>>>> >
>>>>> > However, it isn't stable under load. I get errors like the
>>>>> > following, and glusterfs exits:
>>>>> >
>>>>> > [Mar 02 14:23:29] [ERROR/common-utils.c:52/full_rw()]
>>>>> > libglusterfs:full_rw: 0 bytes r/w instead of 113
>>>>> >
>>>>> > I thought it might be because I was using the stock fuse module
>>>>> > with my kernel, but I replaced it with the 2.6.3 fuse module and
>>>>> > it still dies in this way.
>>>>> >
>>>>> > Is this a bug, or is it just that my setup is poor (one node
>>>>> > serves 16 individual shares through a single glusterfsd, the
>>>>> > mirror node does the same, and the servers are also acting as my
>>>>> > test clients), or that I'm not using the deadline scheduler
>>>>> > (yet), or...?
>>>>> >
>>>>> > Thanks,
>>>>> >
>>>>> > Brent
>>>>>
>>>>> --
>>>>> Shaw's Principle:
>>>>>         Build a system that even a fool can use,
>>>>>         and only a fool will want to use it.
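
For readers who don't have the attachments: Brent's actual spec files were
attached to the list mail and are not reproduced here, so the following is
only a rough, illustrative sketch of the layout described in this thread
(one storage/posix brick per disk exported by glusterfsd, a client-side
cluster/afr pair mirroring each brick across the two nodes, and the
performance translators stacked on top of the mirror). Only the host name
thebe and the /phys mount point come from the thread; the second host name
(hyperion), the directory paths, the volume names, and the auth option
line are assumptions, and exact option names varied between early 1.3
releases.

  # --- server side (glusterfsd spec on thebe); repeated for each of the 16 disks ---
  volume brick1                      # hypothetical volume name
    type storage/posix
    option directory /export/disk1   # hypothetical path; one brick per disk
  end-volume

  volume server
    type protocol/server
    option transport-type tcp/server
    subvolumes brick1                # brick2 .. brick16 as well in the real setup
    option auth.ip.brick1.allow *    # auth syntax here is an assumption for this release
  end-volume

  # --- client side (glusterfs spec, mounted on /phys); one pair per disk ---
  volume thebe1
    type protocol/client
    option transport-type tcp/client
    option remote-host thebe
    option remote-subvolume brick1
  end-volume

  volume hyperion1                   # "hyperion" is a stand-in name for the second node
    type protocol/client
    option transport-type tcp/client
    option remote-host hyperion
    option remote-subvolume brick1
  end-volume

  volume mirror1
    type cluster/afr                 # AFR mirrors brick1 across the two nodes
    subvolumes thebe1 hyperion1
  end-volume

  # Performance translators stacked over the mirror; this is the combination
  # the thread says triggers the crashes, and dropping io-threads changes the
  # failure mode.
  volume iothreads
    type performance/io-threads
    subvolumes mirror1               # in the real setup there are 16 such mirrors
  end-volume

  volume writebehind
    type performance/write-behind
    subvolumes iothreads
  end-volume

  volume readahead
    type performance/read-ahead
    subvolumes writebehind
  end-volume

The quoted dd commands are then run against the resulting /phys mount from
both nodes at once, which is the load pattern that reproduces the problem.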
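
On the debugging side, Brent mentions recompiling with debugging, and Avati
asks for a backtrace from the core of the process that died. The following
is a minimal sketch of one way to watch the memory growth and capture a
backtrace using only standard tools (ps, ulimit, gdb); the install path of
the glusterfs binary and the name/location of the core file are assumptions,
since the thread does not specify them.

  # Watch the client's memory use while the dd reads run; the leak shows up
  # as steadily growing RSS/VSZ for the glusterfs process.
  watch -n 5 'ps -C glusterfs -o pid,rss,vsz,comm'

  # Let the crashing client leave a core file behind.
  ulimit -c unlimited

  # Rebuild with debug symbols and no optimisation (standard autoconf flags),
  # then reinstall as root.
  CFLAGS="-g -O0" ./configure && make && make install

  # Run the client in the foreground, exactly as in the thread.
  glusterfs -s thebe /phys -l DEBUG -N

  # After the segfault, pull a full backtrace out of the core file.
  # The binary path and the core file name/location are assumptions.
  gdb --batch -ex "bt full" -ex "thread apply all bt" \
      /usr/local/sbin/glusterfs core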