On Thu, Apr 2, 2015 at 8:41 AM, Mel Gorman <mgorman@xxxxxxx> wrote:
> On Thu, Apr 02, 2015 at 02:40:19AM +0300, Ilya Dryomov wrote:
>> On Thu, Apr 2, 2015 at 2:03 AM, Mel Gorman <mgorman@xxxxxxx> wrote:
>> > On Wed, Apr 01, 2015 at 08:19:20PM +0300, Ilya Dryomov wrote:
>> >> Following nbd and iscsi, commit 89baaa570ab0 ("libceph: use memalloc
>> >> flags for net IO") set the SOCK_MEMALLOC and PF_MEMALLOC flags for rbd
>> >> and cephfs.  However, it turned out not to play nice with the loopback
>> >> scenario, leading to lockups with a full socket send-q and an empty
>> >> recv-q.
>> >>
>> >> While we have always advised against colocating the kernel client and
>> >> ceph servers on the same box, a few people are doing it and it's also
>> >> useful for light development testing, so rather than reverting, make
>> >> sure not to set those flags in the loopback case.
>> >>
>> >
>> > This does not clarify why the non-loopback case needs access to
>> > pfmemalloc reserves.  Granted, I've spent zero time on this, but it's
>> > really unclear what problem was originally being solved and why dirty
>> > page limiting was insufficient.  Swap over NFS was always a very
>> > special case, minimally because it's immune to dirty page throttling.
>>
>> I don't think there was any particular problem being solved,
>
> Then please go back and look at why dirty page limiting is insufficient
> for ceph.
>
>> certainly not one we hit and fixed with 89baaa570ab0.  Mike is out this
>> week, but I'm pretty sure he said he copied this for iscsi from nbd
>> because you nudged him to (and you yourself did this for nbd as part of
>> the swap-over-NFS series).
>
> In http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/23708 I
> stated that if ceph insisted on using nbd as justification for ceph
> using __GFP_MEMALLOC, then it was preferable that nbd be broken instead.
> In commit 7f338fe4540b1d0600b02314c7d885fd358e9eca, the use case in mind
> was the swap-over-nbd case, and I regret I didn't have userspace
> explicitly tell the kernel that NBD was being used as a swap device.

OK, it all starts to make sense now.  So ideally nbd would only use
__GFP_MEMALLOC if nbd-client was invoked with -swap - you just didn't
implement that.  I guess I should have gone deeper into the history of
your nbd patch when Mike cited it as a reason he did this for ceph.

I think ceph is fine with dirty page limiting in general, so it's only
if we wanted to support swap-over-rbd (cephfs is a bit of a weak link
currently, so I'm not going there) that we would need to enable
SOCK_MEMALLOC/PF_MEMALLOC, and only for that ceph_client instance.
Sounds like that will require a "swap" libceph option, which will also
implicitly enable "noshare" to make sure a __GFP_MEMALLOC ceph_client is
not shared with anything else - luckily we don't have a userspace
process a la nbd-client to worry about.

Thanks,

                Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
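
To make the proposed gating concrete, here is a minimal sketch of what a
"swap"-gated memalloc setup in the libceph messenger might look like.
CEPH_OPT_SWAP and the helper names (ceph_maybe_set_memalloc,
ceph_memalloc_begin/end) are assumptions for illustration only - no such
option exists in the thread above; only the sk_set_memalloc()/PF_MEMALLOC
pattern reflects what commit 89baaa570ab0 applied unconditionally.

#include <linux/sched.h>        /* current, PF_MEMALLOC */
#include <net/sock.h>           /* sk_set_memalloc() */
#include <linux/ceph/libceph.h> /* struct ceph_client, ceph_test_opt() */

/*
 * CEPH_OPT_SWAP is hypothetical; ceph_test_opt(client, SWAP) would check
 * client->options->flags the same way the existing NOSHARE option does.
 */

/* Mark the connection's socket SOCK_MEMALLOC only when the client was
 * created with the (hypothetical) "swap" option. */
static void ceph_maybe_set_memalloc(struct ceph_client *client,
				    struct socket *sock)
{
	if (ceph_test_opt(client, SWAP))
		sk_set_memalloc(sock->sk);
}

/* Enter PF_MEMALLOC around send/receive work, but only for a "swap"
 * client; returns the caller's original flags for later restore. */
static unsigned long ceph_memalloc_begin(struct ceph_client *client)
{
	unsigned long pflags = current->flags;

	if (ceph_test_opt(client, SWAP))
		current->flags |= PF_MEMALLOC;
	return pflags;
}

/* Restore the caller's original PF_MEMALLOC state. */
static void ceph_memalloc_end(unsigned long pflags)
{
	current->flags &= ~PF_MEMALLOC;
	current->flags |= pflags & PF_MEMALLOC;
}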