Most recently this happened on Gluster 3.6.6; I know it also happened on an earlier 3.6 minor release, maybe 3.6.4. I'm currently on 3.6.8 and can try to re-create it on another replica volume. Which logs would give some useful info, and at which logging level?

From the host with the brick down (2016-02-06 00:40 is approximately when I restarted glusterd to get the brick to start properly):

glfsheal-vm-storage.log
...
[2015-11-30 20:37:17.348673] I [glfs-resolve.c:836:__glfs_active_subvol] 0-vm-storage: switched to graph 676c7573-7465-7230-312e-706369632e75 (0)
[2016-02-06 00:27:15.282280] E [client-handshake.c:1496:client_query_portmap_cbk] 0-vm-storage-client-0: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2016-02-06 00:27:49.797465] E [client-handshake.c:1496:client_query_portmap_cbk] 0-vm-storage-client-0: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2016-02-06 00:27:54.126627] E [client-handshake.c:1496:client_query_portmap_cbk] 0-vm-storage-client-0: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2016-02-06 00:27:58.449801] E [client-handshake.c:1496:client_query_portmap_cbk] 0-vm-storage-client-0: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2016-02-06 00:31:56.139278] E [client-handshake.c:1496:client_query_portmap_cbk] 0-vm-storage-client-0: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
<nothing newer in logs>

The brick log has a massive number of these errors (full log: https://dl.dropboxusercontent.com/u/21916057/mnt-lv-vm-storage-vm-storage.log-20160207.tar.gz):

[2016-02-06 00:43:43.280048] E [socket.c:1972:__socket_read_frag] 0-rpc: wrong MSG-TYPE (1700885605) received from 142.104.230.33:38710
[2016-02-06 00:43:43.280159] E [socket.c:1972:__socket_read_frag] 0-rpc: wrong MSG-TYPE (1700885605) received from 142.104.230.33:38710
[2016-02-06 00:43:43.280325] E [socket.c:1972:__socket_read_frag] 0-rpc: wrong MSG-TYPE (1700885605) received from 142.104.230.33:38710

I only peer and mount gluster on a private subnet, so seeing that address is a bit odd, but I don't know if it's related. I've put a rough sketch of the commands I'm thinking of running (brick status, log levels, re-sparsing) at the bottom of this mail.

On Tue, Feb 9, 2016 at 5:38 PM, Ravishankar N <ravishankar@xxxxxxxxxx> wrote:
> Hi Steve,
> The patch already went in for 3.6.3
> (https://bugzilla.redhat.com/show_bug.cgi?id=1187547). What version are you
> using? If it is 3.6.3 or newer, can you share the logs if this happens
> again? (or possibly try if you can reproduce the issue on your setup).
> Thanks,
> Ravi
>
>
> On 02/10/2016 02:25 AM, FNU Raghavendra Manjunath wrote:
>
> Adding Pranith, maintainer of the replicate feature.
>
> Regards,
> Raghavendra
>
>
> On Tue, Feb 9, 2016 at 3:33 PM, Steve Dainard <sdainard@xxxxxxxx> wrote:
>>
>> There is a thread from 2014 mentioning that the heal process on a
>> replica volume was de-sparsing sparse files.(1)
>>
>> I've been experiencing the same issue on Gluster 3.6.x. I see there is
>> a bug closed for a fix on Gluster 3.7 (2) and I'm wondering if this
>> fix can be back-ported to Gluster 3.6.x?
>>
>> My experience has been:
>> Replica 3 volume
>> 1 brick went offline
>> Brought brick back online
>> Heal full on volume
>> My 500G vm-storage volume went from ~280G used to >400G used.
>>
>> I've experienced this a couple times previously, and used fallocate to
>> re-sparse files but this is cumbersome at best, and lack of proper
>> heal support on sparse files could be disastrous if I didn't have
>> enough free space and ended up crashing my VM's when my storage domain
>> ran out of space.
>>
>> Seeing as 3.6 is still a supported release, and 3.7 feels too bleeding
>> edge for production systems, I think it makes sense to back-port this
>> fix if possible.
>>
>> Thanks,
>> Steve
>>
>> 1. https://www.gluster.org/pipermail/gluster-users/2014-November/019512.html
>> 2. https://bugzilla.redhat.com/show_bug.cgi?id=1166020
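
Back to my logging question above: here is a rough sketch of what I was planning to run to check the brick and capture more detail, assuming I have the volume option names right (volume name vm-storage as above). Happy to adjust if different logs are more useful:

# check whether the brick process is running and has a port assigned
gluster volume status vm-storage

# if a brick shows offline, respawn just the missing brick process
gluster volume start vm-storage force

# raise brick- and client-side log levels before reproducing the heal
gluster volume set vm-storage diagnostics.brick-log-level DEBUG
gluster volume set vm-storage diagnostics.client-log-level DEBUG

# trigger the heal and watch progress
gluster volume heal vm-storage full
gluster volume heal vm-storage info

# drop the log levels back down afterwards
gluster volume set vm-storage diagnostics.brick-log-level INFO
gluster volume set vm-storage diagnostics.client-log-level INFO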
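
And for reference, the fallocate workaround mentioned in my earlier mail quoted above is roughly the following. This is only a sketch: it assumes a util-linux new enough to have fallocate --dig-holes, /path/to/images is a stand-in for the actual image directory, and I only run it with the affected VMs shut down:

# compare blocks actually allocated vs. apparent file size
du -h /path/to/images/vm1.img
du -h --apparent-size /path/to/images/vm1.img

# punch holes back into regions that are all zeroes
# (data is left intact, but the guest using the image should be shut down first)
fallocate --dig-holes /path/to/images/vm1.img

# or sweep the whole directory
for f in /path/to/images/*.img; do fallocate --dig-holes "$f"; done

It works, but as I said it's cumbersome at best on a production storage domain, which is why a proper fix in 3.6 would be much better.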