On Thu 30-03-17 15:48:42, Ilya Dryomov wrote:
> On Thu, Mar 30, 2017 at 1:21 PM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
[...]
> > familiar with Ceph at all but does any of its (slab) shrinkers generate
> > IO to recurse back?
>
> We don't register any custom shrinkers. This is XFS on top of rbd,
> a ceph-backed block device.

OK, that was the part I was missing. So you depend on XFS to make
forward progress here.

> >> Well,
> >> it's got to go through the same ceph_connection:
> >>
> >>   rbd_queue_workfn
> >>     ceph_osdc_start_request
> >>       ceph_con_send
> >>         mutex_lock(&con->mutex) # deadlock, OSD X worker is knocked out
> >>
> >> Now if that was a GFP_NOIO allocation, we would simply block in the
> >> allocator. The placement algorithm distributes objects across the OSDs
> >> in a pseudo-random fashion, so even if we had a whole bunch of I/Os for
> >> that OSD, some other I/Os for other OSDs would complete in the meantime
> >> and free up memory. If we are under the kind of memory pressure that
> >> makes GFP_NOIO allocations block for an extended period of time, we are
> >> bound to have a lot of pre-open sockets, as we would have done at least
> >> some flushing by then.
> >
> > How is this any different from xfs waiting for its IO to be done?
>
> I feel like we are talking past each other here. If the worker in
> question isn't deadlocked, it will eventually get its socket and start
> flushing I/O. If it has deadlocked, it won't...

But if the allocation is stuck then the holder of the lock cannot make
forward progress, and it is effectively deadlocked because other IO
depends on the lock it holds. Maybe I just ask bad questions, but what
makes GFP_NOIO different from GFP_KERNEL here? We know that the latter
might need to wait for an IO to finish in the shrinker, but it itself
doesn't get the lock in question directly. The former depends on the
allocator's forward progress as well, and that in turn waits for somebody
else to proceed with the IO.
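
To make the dependency argument concrete, here is a minimal user-space
sketch, not kernel code, that models "who waits on whom" as a small
wait-for graph and checks it for a cycle. The node names, struct
wait_graph and the build_*() helpers are all hypothetical and exist only
for this illustration. The first graph follows the chain from the quoted
trace; the second assumes, as argued for GFP_NOIO, that the allocator
only ever waits on IO which does not need the held lock. Whether that
assumption actually holds is exactly what is being questioned.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical node names for this sketch; not kernel identifiers. */
enum node {
    WORKER,     /* OSD worker: holds con->mutex, blocked in an allocation */
    ALLOCATOR,  /* page allocator waiting for memory to be freed */
    WRITEBACK,  /* XFS-on-rbd IO that reclaim ends up waiting for */
    CON_MUTEX,  /* con->mutex, needed by that IO in ceph_con_send */
    OTHER_IO,   /* IO to other OSDs that needs no lock held by WORKER */
    NR_NODES
};

struct wait_graph {
    bool waits_on[NR_NODES][NR_NODES]; /* waits_on[a][b]: a waits on b */
};

static void add_wait(struct wait_graph *g, enum node from, enum node to)
{
    g->waits_on[from][to] = true;
}

/* Depth-first search; state: 0 = unvisited, 1 = on stack, 2 = finished. */
static bool dfs(const struct wait_graph *g, int n, int *state)
{
    state[n] = 1;
    for (int m = 0; m < NR_NODES; m++) {
        if (!g->waits_on[n][m])
            continue;
        if (state[m] == 1)
            return true; /* back edge: circular wait */
        if (state[m] == 0 && dfs(g, m, state))
            return true;
    }
    state[n] = 2;
    return false;
}

/* A cycle in the wait-for graph means nobody can make forward progress. */
static bool has_deadlock(const struct wait_graph *g)
{
    int state[NR_NODES] = {0};

    for (int n = 0; n < NR_NODES; n++)
        if (state[n] == 0 && dfs(g, n, state))
            return true;
    return false;
}

/* Allocation under the lock ends up waiting on IO that needs the lock. */
static struct wait_graph build_recursing_case(void)
{
    struct wait_graph g;

    memset(&g, 0, sizeof(g));
    add_wait(&g, WORKER, ALLOCATOR);
    add_wait(&g, ALLOCATOR, WRITEBACK);
    add_wait(&g, WRITEBACK, CON_MUTEX);
    add_wait(&g, CON_MUTEX, WORKER);
    return g;
}

/* Assumption: the allocator waits only on IO that needs no held lock. */
static struct wait_graph build_independent_case(void)
{
    struct wait_graph g;

    memset(&g, 0, sizeof(g));
    add_wait(&g, WORKER, ALLOCATOR);
    add_wait(&g, ALLOCATOR, OTHER_IO);
    add_wait(&g, CON_MUTEX, WORKER);
    return g;
}
```

In this model has_deadlock() reports a cycle for build_recursing_case()
and none for build_independent_case(). The second result depends entirely
on the OTHER_IO edge: if the IO the allocator is waiting for also has to
take con->mutex, the graph collapses back into the first case.
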
So to me any blocking allocation while holding a lock which prevents
further IO from completing is simply broken.
-- 
Michal Hocko
SUSE Labs