SOCK_MEMALLOC vs loopback

Hello,

A short while ago Mike added a patch to libceph to set SOCK_MEMALLOC on
libceph sockets and PF_MEMALLOC around send/receive paths (commit
89baaa570ab0, "libceph: use memalloc flags for net IO").  rbd is much
like nbd and is susceptible to all the same memory allocation
deadlocks, so it seemed like a step in the right direction.

However, that turned out not to play nicely with loopback.  Even
a workload as simple as 'dd if=/dev/zero of=/dev/rbd0 bs=4M' now locks
up in no time if one or more ceph-osd (think nbd-server) processes are
running on the same box.  As soon as memory gets tight and
__alloc_skb() dips into PF_MEMALLOC reserves and marks the skb as
pfmemalloc, packets start being dropped on the receiving side:

int sk_filter(struct sock *sk, struct sk_buff *skb)
{
        ...

        /*
         * If the skb was allocated from pfmemalloc reserves, only
         * allow SOCK_MEMALLOC sockets to use it as this socket is
         * helping free memory
         */
        if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
                return -ENOMEM;

as the receiving ceph-osd socket is not a SOCK_MEMALLOC socket.

The motivation behind this check is clear, but it makes loopback rbd
just plain unusable.  While we never recommended loopback to our users
and in fact advised against it, we had a few "it worked for us for more
than a year" kind of reports.  It's also very useful for testing.

Some googling revealed that I'm not the first one to hit this.  SUSE
guys carried (are carrying?) a patch to sk_filter() to allow pfmemalloc
skbs through to make up for GPFS's misuse of PF_MEMALLOC [1].  This was
mentioned tangentially by Eric in [2], and he suggested a possible fix
in [3]:

"When I discussed with David on this issue, I said that one possibility
would be to accept a pfmemalloc skb on regular skb if no other packet is
in a receive queue, to get a chance to make progress (and limit memory
consumption to no more than one skb per TCP socket)"
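
To make the difference concrete, here is a minimal userspace sketch of
the check as it stands today next to the relaxed variant described in
the quote above.  The model_* names, the rcv_qlen field and
model_deliver() are hypothetical stand-ins invented for illustration,
not kernel API, and this is not the patch from [3]; it only models the
"accept one pfmemalloc skb if the receive queue is empty" idea.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins, not the kernel's structures. */
struct model_skb {
        bool pfmemalloc;        /* allocated from PF_MEMALLOC reserves */
};

struct model_sock {
        bool memalloc;          /* models SOCK_MEMALLOC being set */
        size_t rcv_qlen;        /* packets already in the receive queue */
};

/* Models the existing check: a reserve-backed skb is refused by any
 * socket that is not SOCK_MEMALLOC. */
static int model_filter_current(const struct model_sock *sk,
                                const struct model_skb *skb)
{
        assert(sk != NULL && skb != NULL);

        if (skb->pfmemalloc && !sk->memalloc)
                return -ENOMEM;
        return 0;
}

/* Models the relaxed variant: a regular socket may still take a
 * reserve-backed skb, but only while its receive queue is empty. */
static int model_filter_proposed(const struct model_sock *sk,
                                 const struct model_skb *skb)
{
        assert(sk != NULL && skb != NULL);

        if (skb->pfmemalloc && !sk->memalloc && sk->rcv_qlen > 0)
                return -ENOMEM;
        return 0;
}

/* Run the chosen filter and queue the skb if it was accepted. */
static int model_deliver(struct model_sock *sk, const struct model_skb *skb,
                         int (*filter)(const struct model_sock *,
                                       const struct model_skb *))
{
        int err = filter(sk, skb);

        if (err)
                return err;
        sk->rcv_qlen++;
        return 0;
}
```

In this model, a non-SOCK_MEMALLOC receiver with an empty queue accepts
the first pfmemalloc skb under the relaxed filter and refuses further
ones until the queue drains, whereas the current filter refuses all of
them.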

Eric, was there any progress on this front?  We would like to work on
fixing this, but need some mm and net input.

(I also CC'ed Neil as he did the NFS loopback series recently and this
may touch on swap-on-nfs.)

[1] https://gitorious.org/opensuse/kernel-source/commit/a78bfd6
[2] http://article.gmane.org/gmane.linux.kernel/1418791
[3] http://article.gmane.org/gmane.linux.kernel.stable/46128

Thanks,

                Ilya