Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Mar 30, 2017 at 8:25 AM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> On Wed 29-03-17 16:25:18, Ilya Dryomov wrote:
>> On Wed, Mar 29, 2017 at 1:16 PM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>> > On Wed 29-03-17 13:10:01, Ilya Dryomov wrote:
>> >> On Wed, Mar 29, 2017 at 12:55 PM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>> >> > On Wed 29-03-17 12:41:26, Michal Hocko wrote:
>> >> > [...]
>> >> >> > ceph_con_workfn
>> >> >> >   mutex_lock(&con->mutex)  # ceph_connection::mutex
>> >> >> >   try_write
>> >> >> >     ceph_tcp_connect
>> >> >> >       sock_create_kern
>> >> >> >         GFP_KERNEL allocation
>> >> >> >           allocator recurses into XFS, more I/O is issued
>> >> >
>> >> > One more note. So what happens if this is a GFP_NOIO request which
>> >> > cannot make any progress? Your IO thread is blocked on con->mutex
>> >> > as you write below but the above thread cannot proceed as well. So I am
>> >> > _really_ not sure this acutally helps.
>> >>
>> >> This is not the only I/O worker.  A ceph cluster typically consists of
>> >> at least a few OSDs and can be as large as thousands of OSDs.  This is
>> >> the reason we are calling sock_create_kern() on the writeback path in
>> >> the first place: pre-opening thousands of sockets isn't feasible.
>> >
>> > Sorry for being dense here but what actually guarantees the forward
>> > progress? My current understanding is that the deadlock is caused by
>> > con->mutext being held while the allocation cannot make a forward
>> > progress. I can imagine this would be possible if the other io flushers
>> > depend on this lock. But then NOIO vs. KERNEL allocation doesn't make
>> > much difference. What am I missing?
>>
>> con->mutex is per-ceph_connection, osdc->request_mutex is global and is
>> the real problem here because we need both on the submit side, at least
>> in 3.18.  You are correct that even with GFP_NOIO this code may lock up
>> in theory, however I think it's very unlikely in practice.
>
> No, it would just make such a bug more obscure. The real problem seems
> to be that you rely on locks which cannot guarantee a forward progress
> in the IO path. And that is a bug IMHO.

Just to be clear: the "may lock up" comment above goes for 3.18, which
is where these stack traces came from.  osdc->request_mutex which stood
in the way of other ceph_connection workers is no more.

>
>> We got rid of osdc->request_mutex in 4.7, so these workers are almost
>> independent in newer kernels and should be able to free up memory for
>> those blocked on GFP_NOIO retries with their respective con->mutex
>> held.  Using GFP_KERNEL and thus allowing the recursion is just asking
>> for an AA deadlock on con->mutex OTOH, so it does make a difference.
>
> You keep saying this but so far I haven't heard how the AA deadlock is
> possible. Both GFP_KERNEL and GFP_NOIO can stall for an unbounded amount
> of time and that would cause you problems AFAIU.

Suppose we have an I/O for OSD X, which means it's got to go through
ceph_connection X:

ceph_con_workfn
  mutex_lock(&con->mutex)
    try_write
      ceph_tcp_connect
        sock_create_kern
          GFP_KERNEL allocation

Suppose that generates another I/O for OSD X and blocks on it.  Well,
it's got to go through the same ceph_connection:

rbd_queue_workfn
  ceph_osdc_start_request
    ceph_con_send
      mutex_lock(&con->mutex)  # deadlock, OSD X worker is knocked out

Now if that was a GFP_NOIO allocation, we would simply block in the
allocator.  The placement algorithm distributes objects across the OSDs
in a pseudo-random fashion, so even if we had a whole bunch of I/Os for
that OSD, some other I/Os for other OSDs would complete in the meantime
and free up memory.  If we are under the kind of memory pressure that
makes GFP_NOIO allocations block for an extended period of time, we are
bound to have a lot of pre-open sockets, as we would have done at least
some flushing by then.

Thanks,

                Ilya
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux