Re: gNFS service management from glusterd

On Fri, Feb 23, 2018 at 1:04 PM, Niels de Vos <ndevos@xxxxxxxxxx> wrote:
On Wed, Feb 21, 2018 at 08:25:21PM +0530, Atin Mukherjee wrote:
> On Wed, Feb 21, 2018 at 4:24 PM, Xavi Hernandez <jahernan@xxxxxxxxxx> wrote:
>
> > Hi all,
> >
> > currently glusterd sends a SIGKILL to stop gNFS, while all other services
> > are stopped with a SIGTERM signal first (this can be seen in
> > glusterd_svc_stop() function of mgmt/glusterd xlator).
> >
>
> > The question is why it cannot be stopped with SIGTERM like all other
> > services. Using SIGKILL blindly while write I/O is happening can cause
> > multiple inconsistencies at the same time. For a replicated volume this is
> > not a problem because it will take one of the replicas as the "good" one
> > and continue, but for a disperse volume, if the number of inconsistencies
> > is bigger than the redundancy value, a serious problem could appear.
> >
> > The probability of this is very small (I tried to reproduce the
> > problem on my laptop without success), but it exists.
> >
> > Is there any known issue that prevents gNFS from being stopped with
> > SIGTERM? Or can it be changed safely?
> >
>
> I firmly believe that we need to send SIGTERM, as that's the right way to
> gracefully shut down a running process, but I'd request the NFS folks to
> confirm whether there's any background on why it was done with SIGKILL.

No background about this is known to me. I had a quick look through the
git logs, but could not find an explanation.

I agree that SIGTERM would be more appropriate.
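
For context on why SIGTERM allows a graceful stop while SIGKILL does not: a process can install a handler for SIGTERM and finish in-flight work before exiting, whereas SIGKILL cannot be caught or ignored. The following is only a minimal sketch of that pattern, not gNFS or glusterd code; the names stop_requested, on_sigterm and install_sigterm_handler are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set from the signal handler; the main loop is assumed to poll it. */
static volatile sig_atomic_t stop_requested = 0;

/* Async-signal-safe handler: it only records that a stop was requested. */
static void on_sigterm(int signo)
{
    (void)signo;
    stop_requested = 1;
}

/* Hypothetical helper: install the SIGTERM handler.
 * Returns 0 on success, -1 on error (as sigaction() does). */
int install_sigterm_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigterm;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, NULL);
}
```

A service loop would check stop_requested between requests, flush or complete pending writes, and then exit. No equivalent hook exists for SIGKILL.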



I think there were two reasons for replacing SIGTERM with SIGKILL in gNFS:

1.  To avoid races in the graceful shutdown path that would affect the restart of the gNFS process.

2.  Graceful shutdown of gNFS might have caused clients to return errors to applications.

Improvements made to the graceful shutdown of GlusterFS might have already addressed 1. I am not certain whether 2. was ever an issue or whether it still is one. If we attempt to replace SIGKILL with SIGTERM, it would be worth testing these scenarios carefully.
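
As a rough illustration of the "SIGTERM first, SIGKILL only as a fallback" approach discussed here, a stop routine could look like the sketch below. This is not the glusterd_svc_stop() implementation; stop_gracefully, wait_for_exit and the grace period are hypothetical, and the 50 ms polling interval is an arbitrary choice.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: poll until pid is gone or timeout_ms expires.
 * Returns 1 if the process exited, 0 if it is still running. */
static int wait_for_exit(pid_t pid, int timeout_ms)
{
    const struct timespec step = {0, 50 * 1000 * 1000}; /* 50 ms */

    for (int waited = 0; waited <= timeout_ms; waited += 50) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);

        if (r == pid)
            return 1; /* our child exited and was reaped */
        if (r == -1 && errno == ECHILD) {
            /* Not our child (e.g. a daemonized service): probe with signal 0. */
            if (kill(pid, 0) == -1 && errno == ESRCH)
                return 1;
        }
        nanosleep(&step, NULL);
    }
    return 0;
}

/* Hypothetical stop routine: returns 0 if the process stopped after SIGTERM,
 * 1 if SIGKILL was needed, -1 on error. */
int stop_gracefully(pid_t pid, int grace_ms)
{
    if (kill(pid, SIGTERM) == -1)
        return errno == ESRCH ? 0 : -1;
    if (wait_for_exit(pid, grace_ms))
        return 0;

    /* The process ignored or could not complete SIGTERM handling in time. */
    if (kill(pid, SIGKILL) == -1)
        return errno == ESRCH ? 0 : -1;
    return wait_for_exit(pid, grace_ms) ? 1 : -1;
}
```

The return value distinguishes a clean stop from a forced one, which would make it easy to log how often the fallback is hit while testing the two scenarios above.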

I also see references to other SIGKILLs in glusterd and other components:

xlators/mgmt/glusterd/src/glusterd-bitd-svc.c:1
xlators/mgmt/glusterd/src/glusterd-geo-rep.c:3
xlators/mgmt/glusterd/src/glusterd-nfs-svc.c:1
xlators/mgmt/glusterd/src/glusterd-proc-mgmt.c:1
xlators/mgmt/glusterd/src/glusterd-quota.c:1
xlators/mgmt/glusterd/src/glusterd-scrub-svc.c:1
xlators/mgmt/glusterd/src/glusterd-svc-helper.c:1
xlators/mgmt/glusterd/src/glusterd-utils.c:2
xlators/nfs/server/src/nlm4.c:1

It might be worth analyzing why we need these SIGKILLs and documenting the reasons if they are indeed necessary.

HTH,
Vijay
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-devel