On Thu, Apr 02, 2015 at 11:35:35AM +0300, Ilya Dryomov wrote:
> On Thu, Apr 2, 2015 at 8:41 AM, Mel Gorman <mgorman@xxxxxxx> wrote:
> > On Thu, Apr 02, 2015 at 02:40:19AM +0300, Ilya Dryomov wrote:
> >> On Thu, Apr 2, 2015 at 2:03 AM, Mel Gorman <mgorman@xxxxxxx> wrote:
> >> > On Wed, Apr 01, 2015 at 08:19:20PM +0300, Ilya Dryomov wrote:
> >> >> Following nbd and iscsi, commit 89baaa570ab0 ("libceph: use memalloc
> >> >> flags for net IO") set the SOCK_MEMALLOC and PF_MEMALLOC flags for rbd
> >> >> and cephfs.  However, it turned out not to play nice with the loopback
> >> >> scenario, leading to lockups with a full socket send-q and an empty
> >> >> recv-q.
> >> >>
> >> >> While we have always advised against colocating the kernel client and
> >> >> ceph servers on the same box, a few people are doing it and it's also
> >> >> useful for light development testing, so rather than reverting, make
> >> >> sure not to set those flags in the loopback case.
> >> >>
> >> >
> >> > This does not clarify why the non-loopback case needs access to
> >> > pfmemalloc reserves.  Granted, I've spent zero time on this, but it's
> >> > really unclear what problem was originally being solved and why dirty
> >> > page limiting was insufficient.  Swap over NFS was always a very
> >> > special case, not least because it's immune to dirty page throttling.
> >>
> >> I don't think there was any particular problem being solved,
> >
> > Then please go back and look at why dirty page limiting is insufficient
> > for ceph.
> >
> >> certainly not one we hit and fixed with 89baaa570ab0.  Mike is out this
> >> week, but I'm pretty sure he said he copied this for iscsi from nbd
> >> because you nudged him to (and you yourself did this for nbd as part of
> >> the swap-over-NFS series).
> >
> > In http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/23708 I
> > stated that if ceph insisted on using nbd as justification for ceph
> > using __GFP_MEMALLOC, then it was preferred that nbd be broken instead.
> > In commit 7f338fe4540b1d0600b02314c7d885fd358e9eca, the use case in mind
> > was swap-over-nbd, and I regret that I didn't have userspace explicitly
> > tell the kernel that NBD was being used as a swap device.
>
> OK, it all starts to make sense now.  So ideally nbd would only use
> __GFP_MEMALLOC if nbd-client was invoked with -swap - you just didn't
> implement that.

Yes.

> I think ceph is fine with dirty page limiting in general,

Then I suggest removing ceph's use of __GFP_MEMALLOC until there is a
genuine problem that dirty page limiting is unable to handle.  Dirty
page limiting might stall in some cases, but the worst case for
__GFP_MEMALLOC abuse is a livelocked machine.

> so it's only if we wanted to support swap-over-rbd (cephfs is a bit of
> a weak link currently, so I'm not going there) that we would need to
> enable SOCK_MEMALLOC/PF_MEMALLOC, and only for that ceph_client
> instance.

Yes.

> Sounds like that will require a "swap" libceph option, which will also
> implicitly enable "noshare" to make sure a __GFP_MEMALLOC ceph_client
> is not shared with anything else - luckily, we don't have a userspace
> process a la nbd-client to worry about.

I'm not familiar enough with the ins and outs of rbd to know what sort
of implementation hazards might be encountered.

-- 
Mel Gorman
SUSE Labs
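
For readers following the thread, a minimal sketch of the per-client
gating being discussed, assuming a hypothetical CEPH_OPT_SWAP option
flag alongside the existing CEPH_OPT_NOSHARE.  This is not the actual
patch, and the flag does not exist in mainline libceph; it only
illustrates the idea of opting into pfmemalloc reserves solely for a
client created for swap, mirroring what an nbd-client "-swap" switch
would do for nbd:

    #include <net/sock.h>               /* sk_set_memalloc() */
    #include <linux/ceph/libceph.h>     /* struct ceph_options, CEPH_OPT_* */

    /* Hypothetical flag: a "swap" libceph option requested by userspace.
     * The bit value is chosen arbitrarily for this sketch. */
    #define CEPH_OPT_SWAP  (1 << 7)

    /*
     * Sketch: mark a freshly connected ceph socket as allowed to dip
     * into pfmemalloc reserves only when the client was created with
     * the hypothetical "swap" option.  The messenger work function
     * would similarly set/restore PF_MEMALLOC around socket I/O only
     * for such a client.
     */
    static void ceph_maybe_set_memalloc(struct socket *sock,
                                        struct ceph_options *opt)
    {
            if (opt->flags & CEPH_OPT_SWAP)
                    sk_set_memalloc(sock->sk);  /* sets SOCK_MEMALLOC */
    }

At option-parsing time, "swap" would presumably also force
CEPH_OPT_NOSHARE, so that a __GFP_MEMALLOC-enabled ceph_client is never
shared with a non-swap user, as suggested above.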