Please hold back from using 3.0.1. We have found some issues and are preparing 3.0.2 very quickly. Apologies for all the inconvenience.

Avati

On Tue, Jan 26, 2010 at 6:30 PM, Gordan Bobic <gordan@xxxxxxxxxx> wrote:
> I upgraded to 3.0.1 last night and it still doesn't seem as stable as 2.0.9. Things I have bumped into since the upgrade:
>
> 1) I've had unfsd lock up hard when exporting the volume; it couldn't be "kill -9"-ed. This happened just after a spurious disconnect (see 2).
>
> 2) I'm seeing random disconnects/timeouts between the servers, which are on the same switch (this was happening with 2.0.x as well, though, so I'm not sure what's going on). This is where the file clobbering/corruption used to occur that causes the contents of one file to be replaced with the contents of a different file while the files are open. I HAVEN'T observed clobbering with 3.0.1 (yet, at least - it wasn't a particularly frequent occurrence, but the chances of it were high on shared libraries during a big yum update when glfs is the rootfs), but the disconnects still happen occasionally, usually under heavy-ish load.
>
> My main concern here is that open-file self-healing may cover up the underlying bug that causes the clobbering, and possibly make it occur in even more heisenbuggy ways.
>
> ssh sessions to both servers don't show any problems/disconnections/dropouts at the same time as the disconnects on glfs happen. Is there a setting to control how many heartbeat packets have to be lost before the disconnect is initiated?
>
> This is the sort of thing I see in the logs:
> [2010-01-26 07:36:56] N [server-protocol.c:6780:notify] server: 10.2.0.13:1010 disconnected
> [2010-01-26 07:36:56] N [server-protocol.c:6780:notify] server: 10.2.0.13:1013 disconnected
> [2010-01-26 07:36:56] N [server-helpers.c:849:server_connection_destroy] server: destroyed connection of thor.winterhearth.co.uk-11823-2010/01/26-05:29:32:239464-home2
> [2010-01-26 07:37:25] E [saved-frames.c:165:saved_frames_unwind] home3: forced unwinding frame type(1) op(SETATTR)
> [2010-01-26 07:37:25] E [saved-frames.c:165:saved_frames_unwind] home3: forced unwinding frame type(1) op(SETXATTR)
> [2010-01-26 07:37:25] E [saved-frames.c:165:saved_frames_unwind] home3: forced unwinding frame type(2) op(PING)
> [2010-01-26 07:37:25] N [client-protocol.c:6973:notify] home3: disconnected
> [2010-01-26 07:38:19] E [client-protocol.c:415:client_ping_timer_expired] home3: Server 10.2.0.13:6997 has not responded in the last 42 seconds, disconnecting.
> [2010-01-26 07:38:19] E [saved-frames.c:165:saved_frames_unwind] home3: forced unwinding frame type(2) op(SETVOLUME)
> [2010-01-26 07:38:19] E [saved-frames.c:165:saved_frames_unwind] home3: forced unwinding frame type(2) op(SETVOLUME)
> [2010-01-26 08:06:17] N [server-protocol.c:5811:mop_setvolume] server: accepted client from 10.2.0.13:1018
> [2010-01-26 08:06:17] N [server-protocol.c:5811:mop_setvolume] server: accepted client from 10.2.0.13:1017
> [2010-01-26 08:06:17] N [client-protocol.c:6225:client_setvolume_cbk] home3: Connected to 10.2.0.13:6997, attached to remote volume 'home3'.
> [2010-01-26 08:06:17] N [client-protocol.c:6225:client_setvolume_cbk] home3: Connected to 10.2.0.13:6997, attached to remote volume 'home3'.
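
The "42 seconds" in the timeout message above matches the default ping-timeout of the protocol/client translator, so that is presumably the knob to look at. As a rough sketch only (the volume name, host and port are taken from the log excerpt; everything else, including whether this release accepts a ping-timeout option here, is an assumption to be verified against the docs), the client-side volume definition might be tuned along these lines:

  volume home3
    type protocol/client
    option transport-type tcp
    option remote-host 10.2.0.13        # server address from the log excerpt
    option remote-port 6997             # port from the log excerpt
    option remote-subvolume home3       # exported volume name from the log excerpt
    # Assumed knob: seconds of unanswered pings tolerated before the client
    # declares a disconnect; the 42-second default matches the log message above.
    option ping-timeout 120
  end-volume
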

> 3) Something that started off as not being able to ssh in using public keys turned out to be due to my home directory somehow acquiring 777 permissions. I certainly didn't do it, so at a guess it's a file corruption issue, possibly during an unclean shutdown. Further, I've found that the / directory (I'm running glusterfs root on this cluster) had permissions 777 too, which seems to have happened at the same time as the home directory getting 777 permissions. If sendmail and ssh weren't failing to work properly because of this, it's possible I wouldn't have noticed. It's a potentially quite concerning problem, even if it is caused by an unclean shutdown (put it this way: I've never seen it happen on any other file system).
>
> 4) This looks potentially a bit concerning:
>
>   PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
>  5633 root  15  0 25.8g 119m 1532 S 36.7  3.0 36:25.42 /usr/sbin/glusterfs --log-level=NONE --log-file=/dev/null --disable-direct-io-mode --volfile=/etc/glusterfs.root/root2.vol /mnt/newroot
>
> This is the rootfs daemon. 25.8GB of virtual address space mapped? Surely that can't be right, even if the resident size looks reasonably sane.
>
> Worse, it's growing by about 100MB/minute during heavy compiling on the system. I've just tried to test the nvidia driver installer to see if that old bug report I filed is still valid, and it doesn't seem to get anywhere (it just makes glusterfsd and gcc use CPU time but never finishes - which is certainly a different failure case from 2.0.9; that at least finishes the compile stage).
>
> The virtual memory bloat is rather reminiscent of the memory fragmentation/leak problem that was fixed on the 2.0.x branch a while back, which arose when shared libraries were on glusterfs - a bit leaked every time a shared library call was made. A regression, perhaps? Wasn't there a memory consumption sanity check added to the test suite after this was fixed last time?
>
> Other glfs daemons are exhibiting similar behaviour:
>
>   PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
>  5633 root  15  0 26.1g 119m 1532 S  0.7  3.0 37:57.01 /usr/sbin/glusterfs --log-level=NONE --log-file=/dev/null --disable-direct-io-mode --volfile=/etc/glusterfs.root/root2.vol /mnt/newroot
> 12037 root  15  0 24.8g  68m 1072 S  0.0  1.7  3:21.41 /usr/sbin/glusterfs --log-level=NORMAL --volfile=/etc/glusterfs/shared.vol /shared
> 11977 root  15  0 24.8g  67m 1092 S  0.7  1.7  3:59.11 /usr/sbin/glusterfs --log-level=NORMAL --disable-direct-io-mode --volfile=/etc/glusterfs/home.vol /home
> 11915 root  15  0 24.9g  32m  972 S  0.0  0.8  0:21.65 /usr/sbin/glusterfs --log-level=NORMAL --volfile=/etc/glusterfs/boot.vol /boot
>
> The home, shared and boot volumes don't have any shared libraries on them, and 24.9GB of virtual memory mapped for the /boot volume, which is backed by a 250MB file system, also seems a bit excessive.
>
> Gordan
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxx
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>
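
Regarding the growing VIRT figures under point 4: it may help to capture the growth rate over time rather than eyeballing top. A minimal sketch only - the PID (5633) is taken from the top output above and the output path is made up; substitute whatever fits:

  # Sample the virtual and resident size of the rootfs glusterfs client once a minute.
  # PID 5633 is the example from the top output above; replace it with the real one.
  while sleep 60; do
      printf '%s ' "$(date '+%F %T')"
      awk '/^Vm(Size|RSS):/ {printf "%s %s %s  ", $1, $2, $3}' /proc/5633/status
      echo
  done >> /var/tmp/glusterfs-vmsize.log

A steadily climbing VmSize alongside a roughly flat VmRSS would point at address-space growth (fragmentation or unreleased mappings) rather than a plain heap leak, which would fit the pattern the top output above suggests.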