Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations

Michal Hocko <mhocko@xxxxxxxxxx> · Thu, 30 Mar 2017 08:25:00 +0200

On Wed 29-03-17 16:25:18, Ilya Dryomov wrote:
> On Wed, Mar 29, 2017 at 1:16 PM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> > On Wed 29-03-17 13:10:01, Ilya Dryomov wrote:
> >> On Wed, Mar 29, 2017 at 12:55 PM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> >> > On Wed 29-03-17 12:41:26, Michal Hocko wrote:
> >> > [...]
> >> >> > ceph_con_workfn
> >> >> >   mutex_lock(&con->mutex)  # ceph_connection::mutex
> >> >> >   try_write
> >> >> >     ceph_tcp_connect
> >> >> >       sock_create_kern
> >> >> >         GFP_KERNEL allocation
> >> >> >           allocator recurses into XFS, more I/O is issued
> >> >
> >> > One more note. So what happens if this is a GFP_NOIO request which
> >> > cannot make any progress? Your IO thread is blocked on con->mutex
> >> > as you write below but the above thread cannot proceed as well. So I am
> >> > _really_ not sure this actually helps.
> >>
> >> This is not the only I/O worker.  A ceph cluster typically consists of
> >> at least a few OSDs and can be as large as thousands of OSDs.  This is
> >> the reason we are calling sock_create_kern() on the writeback path in
> >> the first place: pre-opening thousands of sockets isn't feasible.
> >
> > Sorry for being dense here but what actually guarantees forward
> > progress? My current understanding is that the deadlock is caused by
> > con->mutex being held while the allocation cannot make forward
> > progress. I can imagine this would be possible if the other io flushers
> > depend on this lock. But then NOIO vs. KERNEL allocation doesn't make
> > much difference. What am I missing?
> 
> con->mutex is per-ceph_connection, osdc->request_mutex is global and is
> the real problem here because we need both on the submit side, at least
> in 3.18.  You are correct that even with GFP_NOIO this code may lock up
> in theory, however I think it's very unlikely in practice.

No, it would just make such a bug more obscure. The real problem seems
to be that you rely on locks which cannot guarantee forward progress
in the IO path. And that is a bug IMHO.

> We got rid of osdc->request_mutex in 4.7, so these workers are almost
> independent in newer kernels and should be able to free up memory for
> those blocked on GFP_NOIO retries with their respective con->mutex
> held.  Using GFP_KERNEL and thus allowing the recursion is just asking
> for an AA deadlock on con->mutex OTOH, so it does make a difference.

You keep saying this but so far I haven't heard how the AA deadlock is
possible. Both GFP_KERNEL and GFP_NOIO can stall for an unbounded amount
of time and that would cause you problems AFAIU.

> I'm a little confused by this discussion because for me this patch was
> a no-brainer...

No, it is a brainer, because recursion prevention should be carefully
thought through. The lack of such care is the reason we have
thousands of GFP_NOFS uses all over the kernel without a clear or proper
justification. Adding more on top doesn't help long-term
maintainability.

> Locking aside, you said it was the stack trace in the changelog that
> got your attention

No, it is the use of the scope GFP_NOIO API without a proper
explanation which caught my attention.
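
For anyone following along who is not familiar with the scope API, here
is a rough userspace model of the idea. It is only a sketch: every
model_* name and MODEL_* constant below is made up for illustration, and
task_flags merely stands in for current->flags. In the kernel the
corresponding calls are memalloc_noio_save() and memalloc_noio_restore(),
and the allocator masks the request internally. The point is that the
callee still asks for GFP_KERNEL and knows nothing about the caller's
context, which is exactly why the scope needs a documented reason.

```c
#include <assert.h>

/* Hypothetical stand-ins for __GFP_IO, __GFP_FS and PF_MEMALLOC_NOIO. */
#define MODEL_GFP_IO     0x1u
#define MODEL_GFP_FS     0x2u
#define MODEL_GFP_KERNEL (MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_PF_NOIO    0x1u

/* Models current->flags for a single task. */
static unsigned int task_flags;

/* Enter a NOIO scope and return the previous state so scopes can nest. */
static unsigned int model_noio_save(void)
{
	unsigned int old = task_flags & MODEL_PF_NOIO;

	task_flags |= MODEL_PF_NOIO;
	return old;
}

/* Leave the scope by putting back whatever state was saved. */
static void model_noio_restore(unsigned int old)
{
	task_flags = (task_flags & ~MODEL_PF_NOIO) | old;
}

/* What the allocator would really use for a given request. */
static unsigned int model_effective_gfp(unsigned int gfp)
{
	if (task_flags & MODEL_PF_NOIO)
		gfp &= ~(MODEL_GFP_IO | MODEL_GFP_FS);
	return gfp;
}

/* A callee that always asks for GFP_KERNEL, like sock_create_kern(). */
static unsigned int model_sock_create(void)
{
	return model_effective_gfp(MODEL_GFP_KERNEL);
}

/* A caller on the I/O path that wraps the callee in a NOIO scope. */
static unsigned int model_connect(void)
{
	unsigned int noio = model_noio_save();
	unsigned int used = model_sock_create();

	model_noio_restore(noio);
	return used;
}
```

Outside the scope model_sock_create() sees the full GFP_KERNEL mask;
through model_connect() the IO and FS bits are gone. Note that this only
prevents the recursion. It says nothing about whether the allocation
itself can make progress.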

> are you saying it's OK for a block
> device to recurse back into the filesystem when doing I/O, potentially
> generating more I/O?

No, a block device has to guarantee forward progress when
allocating, and so it should use mempools or other means to achieve the same.
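
To illustrate what I mean by a mempool-style guarantee, here is a
minimal userspace sketch, not the kernel mempool implementation. The
names (struct pool, pool_alloc, fail_malloc, POOL_MIN, ELEM_SIZE) are
invented for the example, and fail_malloc is just a knob to simulate
memory pressure. The idea is that a fixed reserve is set aside up front,
so the I/O path can still get an element when the normal allocator
fails, and every completed request refills the reserve.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_MIN  4	/* hypothetical reserve size */
#define ELEM_SIZE 64	/* hypothetical element size */

struct pool {
	void *reserve[POOL_MIN];
	int nr_free;		/* elements currently held in reserve */
};

/* Test knob: when true, the normal allocation path always fails. */
static bool fail_malloc;

static void *try_alloc(size_t size)
{
	return fail_malloc ? NULL : malloc(size);
}

/* Fill the reserve up front, while memory is still available. */
static int pool_init(struct pool *p)
{
	p->nr_free = 0;
	for (int i = 0; i < POOL_MIN; i++) {
		void *e = malloc(ELEM_SIZE);

		if (!e) {
			while (p->nr_free > 0)
				free(p->reserve[--p->nr_free]);
			return -1;
		}
		p->reserve[p->nr_free++] = e;
	}
	return 0;
}

/* Try the normal allocator first, then fall back to the reserve. */
static void *pool_alloc(struct pool *p)
{
	void *e = try_alloc(ELEM_SIZE);

	if (e)
		return e;
	if (p->nr_free > 0)
		return p->reserve[--p->nr_free];
	/* A real mempool would sleep here until an element is returned. */
	return NULL;
}

/* Returned elements refill the reserve before going back to the heap. */
static void pool_free(struct pool *p, void *e)
{
	if (p->nr_free < POOL_MIN)
		p->reserve[p->nr_free++] = e;
	else
		free(e);
}
```

With fail_malloc set, POOL_MIN allocations still succeed and the next
one only has to wait for an in-flight request to complete and call
pool_free(). That guarantee holds only as long as the holders of pool
elements can themselves complete without allocating more.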

-- 
Michal Hocko
SUSE Labs