Please hold back from using 3.0.1. We have found some issues and are preparing 3.0.2 very quickly. Apologies for all the inconvenience.

Avati

On Tue, Jan 26, 2010 at 6:30 PM, Gordan Bobic <gordan@xxxxxxxxxx> wrote:
> I upgraded to 3.0.1 last night and it still doesn't seem as stable as 2.0.9. Things I have bumped into since the upgrade:
>
> 1) I've had unfsd lock up hard when exporting the volume; it couldn't be "kill -9"-ed. This happened just after a spurious disconnect (see 2).
>
> 2) I'm seeing random disconnects/timeouts between the servers, which are on the same switch (this was happening with 2.0.x as well, though, so I'm not sure what's going on). This is where the file clobbering/corruption used to occur that causes the contents of one file to be replaced with the contents of a different file while the files are open. I HAVEN'T observed clobbering with 3.0.1 (yet, at least - it wasn't a particularly frequent occurrence, but the chances of it were high on shared libraries during a big yum update when glfs is the rootfs), but the disconnects still happen occasionally, usually under heavy-ish load.
>
> My main concern here is that open-file self-healing may cover up the underlying bug that causes the clobbering, and possibly make it occur in even more heisenbuggy ways.
>
> ssh sessions to both servers don't show any problems/disconnections/dropouts at the same time as the disconnects on glfs happen. Is there a setting to control how many heartbeat packets have to be lost before the disconnect is initiated?
>
> This is the sort of thing I see in the logs:
> [2010-01-26 07:36:56] N [server-protocol.c:6780:notify] server: 10.2.0.13:1010 disconnected
> [2010-01-26 07:36:56] N [server-protocol.c:6780:notify] server: 10.2.0.13:1013 disconnected
> [2010-01-26 07:36:56] N [server-helpers.c:849:server_connection_destroy] server: destroyed connection of thor.winterhearth.co.uk-11823-2010/01/26-05:29:32:239464-home2
> [2010-01-26 07:37:25] E [saved-frames.c:165:saved_frames_unwind] home3: forced unwinding frame type(1) op(SETATTR)
> [2010-01-26 07:37:25] E [saved-frames.c:165:saved_frames_unwind] home3: forced unwinding frame type(1) op(SETXATTR)
> [2010-01-26 07:37:25] E [saved-frames.c:165:saved_frames_unwind] home3: forced unwinding frame type(2) op(PING)
> [2010-01-26 07:37:25] N [client-protocol.c:6973:notify] home3: disconnected
> [2010-01-26 07:38:19] E [client-protocol.c:415:client_ping_timer_expired] home3: Server 10.2.0.13:6997 has not responded in the last 42 seconds, disconnecting.
> [2010-01-26 07:38:19] E [saved-frames.c:165:saved_frames_unwind] home3: forced unwinding frame type(2) op(SETVOLUME)
> [2010-01-26 07:38:19] E [saved-frames.c:165:saved_frames_unwind] home3: forced unwinding frame type(2) op(SETVOLUME)
> [2010-01-26 08:06:17] N [server-protocol.c:5811:mop_setvolume] server: accepted client from 10.2.0.13:1018
> [2010-01-26 08:06:17] N [server-protocol.c:5811:mop_setvolume] server: accepted client from 10.2.0.13:1017
> [2010-01-26 08:06:17] N [client-protocol.c:6225:client_setvolume_cbk] home3: Connected to 10.2.0.13:6997, attached to remote volume 'home3'.
> [2010-01-26 08:06:17] N [client-protocol.c:6225:client_setvolume_cbk] home3: Connected to 10.2.0.13:6997, attached to remote volume 'home3'.
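
The "42 seconds" in the timeout message above matches the default ping-timeout of the protocol/client translator, so that is presumably the knob to look at. As a rough sketch only (the volume name, host and port are taken from the log excerpt; everything else, including whether this release accepts a ping-timeout option here, is an assumption to be verified against the docs), the client-side volume definition might be tuned along these lines:

  volume home3
    type protocol/client
    option transport-type tcp
    option remote-host 10.2.0.13        # server address from the log excerpt
    option remote-port 6997             # port from the log excerpt
    option remote-subvolume home3       # exported volume name from the log excerpt
    # Assumed knob: seconds of unanswered pings tolerated before the client
    # declares a disconnect; the 42-second default matches the log message above.
    option ping-timeout 120
  end-volume
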

> 3) Something that started off as not being able to ssh in using public keys turned out to be due to my home directory somehow acquiring 777 permissions. I certainly didn't do it, so at a guess it's a file corruption issue, possibly during an unclean shutdown. Further, I've found that the / directory (I'm running glusterfs root on this cluster) had permissions 777 too, which seems to have happened at the same time as the home directory getting 777 permissions. If sendmail and ssh weren't failing to work properly because of this, it's possible I wouldn't have noticed. It's a potentially quite concerning problem, even if it is caused by an unclean shutdown (put it this way: I've never seen it happen on any other file system).
>
> 4) This looks potentially a bit concerning:
>
>   PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
>  5633 root  15  0 25.8g 119m 1532 S 36.7  3.0 36:25.42 /usr/sbin/glusterfs --log-level=NONE --log-file=/dev/null --disable-direct-io-mode --volfile=/etc/glusterfs.root/root2.vol /mnt/newroot
>
> This is the rootfs daemon. 25.8GB of virtual address space mapped? Surely that can't be right, even if the resident size looks reasonably sane.
>
> Worse, it's growing by about 100MB/minute during heavy compiling on the system. I've just tried to test the nvidia driver installer to see if that old bug report I filed is still valid, and it doesn't seem to get anywhere (it just makes glusterfsd and gcc use CPU time but never finishes - which is certainly a different failure case from 2.0.9; that at least finishes the compile stage).
>
> The virtual memory bloat is rather reminiscent of the memory fragmentation/leak problem that was fixed on the 2.0.x branch a while back, which arose when shared libraries were on glusterfs - a bit leaked every time a shared library call was made. A regression, perhaps? Wasn't there a memory consumption sanity check added to the test suite after this was fixed last time?
>
> Other glfs daemons are exhibiting similar behaviour:
>
>   PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
>  5633 root  15  0 26.1g 119m 1532 S  0.7  3.0 37:57.01 /usr/sbin/glusterfs --log-level=NONE --log-file=/dev/null --disable-direct-io-mode --volfile=/etc/glusterfs.root/root2.vol /mnt/newroot
> 12037 root  15  0 24.8g  68m 1072 S  0.0  1.7  3:21.41 /usr/sbin/glusterfs --log-level=NORMAL --volfile=/etc/glusterfs/shared.vol /shared
> 11977 root  15  0 24.8g  67m 1092 S  0.7  1.7  3:59.11 /usr/sbin/glusterfs --log-level=NORMAL --disable-direct-io-mode --volfile=/etc/glusterfs/home.vol /home
> 11915 root  15  0 24.9g  32m  972 S  0.0  0.8  0:21.65 /usr/sbin/glusterfs --log-level=NORMAL --volfile=/etc/glusterfs/boot.vol /boot
>
> The home, shared and boot volumes don't have any shared libraries on them, and 24.9GB of virtual memory mapped for the /boot volume, which is backed by a 250MB file system, also seems a bit excessive.
>
> Gordan
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxx
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>
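
Regarding the growing VIRT figures under point 4: it may help to capture the growth rate over time rather than eyeballing top. A minimal sketch only - the PID (5633) is taken from the top output above and the output path is made up; substitute whatever fits:

  # Sample the virtual and resident size of the rootfs glusterfs client once a minute.
  # PID 5633 is the example from the top output above; replace it with the real one.
  while sleep 60; do
      printf '%s ' "$(date '+%F %T')"
      awk '/^Vm(Size|RSS):/ {printf "%s %s %s  ", $1, $2, $3}' /proc/5633/status
      echo
  done >> /var/tmp/glusterfs-vmsize.log

A steadily climbing VmSize alongside a roughly flat VmRSS would point at address-space growth (fragmentation or unreleased mappings) rather than a plain heap leak, which would fit the pattern the top output above suggests.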