On Thu, Apr 02, 2015 at 11:35:35AM +0300, Ilya Dryomov wrote:
> On Thu, Apr 2, 2015 at 8:41 AM, Mel Gorman <mgorman@xxxxxxx> wrote:
> > On Thu, Apr 02, 2015 at 02:40:19AM +0300, Ilya Dryomov wrote:
> >> On Thu, Apr 2, 2015 at 2:03 AM, Mel Gorman <mgorman@xxxxxxx> wrote:
> >> > On Wed, Apr 01, 2015 at 08:19:20PM +0300, Ilya Dryomov wrote:
> >> >> Following nbd and iscsi, commit 89baaa570ab0 ("libceph: use memalloc
> >> >> flags for net IO") set the SOCK_MEMALLOC and PF_MEMALLOC flags for rbd
> >> >> and cephfs.  However, it turned out not to play nice with the loopback
> >> >> scenario, leading to lockups with a full socket send-q and an empty
> >> >> recv-q.
> >> >>
> >> >> While we have always advised against colocating the kernel client and
> >> >> ceph servers on the same box, a few people are doing it and it's also
> >> >> useful for light development testing, so rather than reverting, make
> >> >> sure not to set those flags in the loopback case.
> >> >>
> >> >
> >> > This does not clarify why the non-loopback case needs access to
> >> > pfmemalloc reserves.  Granted, I've spent zero time on this, but it's
> >> > really unclear what problem was originally being solved and why dirty
> >> > page limiting was insufficient.  Swap over NFS was always a very
> >> > special case, not least because it's immune to dirty page throttling.
> >>
> >> I don't think there was any particular problem being solved,
> >
> > Then please go back and look at why dirty page limiting is insufficient
> > for ceph.
> >
> >> certainly not one we hit and fixed with 89baaa570ab0.  Mike is out this
> >> week, but I'm pretty sure he said he copied this for iscsi from nbd
> >> because you nudged him to (and you yourself did this for nbd as part of
> >> the swap-over-NFS series).
> >
> > In http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/23708 I
> > stated that if ceph insisted on using nbd as justification for ceph
> > using __GFP_MEMALLOC, then it was preferred that nbd be broken instead.
> > In commit 7f338fe4540b1d0600b02314c7d885fd358e9eca, the use case in mind
> > was swap-over-nbd, and I regret that I didn't have userspace explicitly
> > tell the kernel that NBD was being used as a swap device.
>
> OK, it all starts to make sense now.  So ideally nbd would only use
> __GFP_MEMALLOC if nbd-client was invoked with -swap - you just didn't
> implement that.

Yes.

> I think ceph is fine with dirty page limiting in general,

Then I suggest removing ceph's use of __GFP_MEMALLOC until there is a
genuine problem that dirty page limiting is unable to handle.  Dirty
page limiting might stall in some cases, but the worst case for
__GFP_MEMALLOC abuse is a livelocked machine.

> so it's only if we wanted to support swap-over-rbd (cephfs is a bit of
> a weak link currently, so I'm not going there) that we would need to
> enable SOCK_MEMALLOC/PF_MEMALLOC, and only for that ceph_client
> instance.

Yes.

> Sounds like that will require a "swap" libceph option, which will also
> implicitly enable "noshare" to make sure a __GFP_MEMALLOC ceph_client
> is not shared with anything else - luckily, we don't have a userspace
> process a la nbd-client to worry about.

I'm not familiar enough with the ins and outs of rbd to know what sort
of implementation hazards might be encountered.

-- 
Mel Gorman
SUSE Labs
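
For readers following the thread, a minimal sketch of the per-client
gating being discussed, assuming a hypothetical CEPH_OPT_SWAP option
flag alongside the existing CEPH_OPT_NOSHARE.  This is not the actual
patch, and the flag does not exist in mainline libceph; it only
illustrates the idea of opting into pfmemalloc reserves solely for a
client created for swap, mirroring what an nbd-client "-swap" switch
would do for nbd:

    #include <net/sock.h>               /* sk_set_memalloc() */
    #include <linux/ceph/libceph.h>     /* struct ceph_options, CEPH_OPT_* */

    /* Hypothetical flag: a "swap" libceph option requested by userspace.
     * The bit value is chosen arbitrarily for this sketch. */
    #define CEPH_OPT_SWAP  (1 << 7)

    /*
     * Sketch: mark a freshly connected ceph socket as allowed to dip
     * into pfmemalloc reserves only when the client was created with
     * the hypothetical "swap" option.  The messenger work function
     * would similarly set/restore PF_MEMALLOC around socket I/O only
     * for such a client.
     */
    static void ceph_maybe_set_memalloc(struct socket *sock,
                                        struct ceph_options *opt)
    {
            if (opt->flags & CEPH_OPT_SWAP)
                    sk_set_memalloc(sock->sk);  /* sets SOCK_MEMALLOC */
    }

At option-parsing time, "swap" would presumably also force
CEPH_OPT_NOSHARE, so that a __GFP_MEMALLOC-enabled ceph_client is never
shared with a non-swap user, as suggested above.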