Hello,

TL;DR: 2 processes, one doing truncation and the other allocation. The truncate process holds the AGF0 buffer and waits on the AGF1 buffer, since it has to free an extent from that AG. The allocation process failed its first allocation attempt in xfs_bmap_btalloc (the first xfs_alloc_vextent call) for a blkno falling in AG1, returned with that AGF buffer locked, and is now trying a cyclic allocation beginning with AG0. The two processes deadlock, each holding the AGF buffer the other one requires.

---- Detailed crash analysis follows ----

I've been investigating a rather peculiar deadlock between block allocation and block freeing. I've observed this on 3.12.72 as well as on 4.4.0, but not on 4.11-rc6; a 4.8 run is still pending. However, I assume something in the code changed which made this race harder to hit, rather than a patch purposefully eliminating it. Essentially, running generic/299 sufficiently many times results in the following deadlock between 2 processes.

ProcessA, doing allocation:

PID: 4459  TASK: ffff8800a553d340  CPU: 3  COMMAND: "fio"
 #0 [ffff8800b81cf218] __schedule at ffffffff8165be8b
 #1 [ffff8800b81cf270] schedule at ffffffff8165c80c
 #2 [ffff8800b81cf288] schedule_timeout at ffffffff816608ef
 #3 [ffff8800b81cf328] __down at ffffffff8165f791
 #4 [ffff8800b81cf378] down at ffffffff8109a4a1
 #5 [ffff8800b81cf398] xfs_buf_lock at ffffffff812e25e7
 #6 [ffff8800b81cf3c0] _xfs_buf_find at ffffffff812e28f4
 #7 [ffff8800b81cf410] xfs_buf_get_map at ffffffff812e2c8a
 #8 [ffff8800b81cf450] xfs_buf_read_map at ffffffff812e3f30
 #9 [ffff8800b81cf498] xfs_trans_read_buf_map at ffffffff81316a7c
#10 [ffff8800b81cf4d8] xfs_read_agf at ffffffff8129ee05
#11 [ffff8800b81cf538] xfs_alloc_read_agf at ffffffff8129ef54
#12 [ffff8800b81cf578] xfs_alloc_fix_freelist at ffffffff8129f3fc
#13 [ffff8800b81cf650] xfs_alloc_vextent at ffffffff8129f6e5
#14 [ffff8800b81cf6a8] xfs_bmap_btalloc at ffffffff812b38b4  <- called from the "if (args.fsbno == NULLFSBLOCK && nullfb)" block
#15
    [ffff8800b81cf788] xfs_bmap_alloc at ffffffff812b3a8e
#16 [ffff8800b81cf798] xfs_bmapi_write at ffffffff812b4476
#17 [ffff8800b81cf8e0] xfs_iomap_write_direct at ffffffff812f2eea
#18 [ffff8800b81cf980] __xfs_get_blocks at ffffffff812da63b
#19 [ffff8800b81cfa18] xfs_get_blocks_direct at ffffffff812db1a7
#20 [ffff8800b81cfa28] __blockdev_direct_IO at ffffffff811e0384
#21 [ffff8800b81cfc58] xfs_vm_direct_IO at ffffffff812d962b
#22 [ffff8800b81cfca0] xfs_file_dio_aio_write at ffffffff812e9b6f
#23 [ffff8800b81cfd38] xfs_file_write_iter at ffffffff812ea167
#24 [ffff8800b81cfd68] aio_run_iocb at ffffffff811ef6a2
#25 [ffff8800b81cfe60] do_io_submit at ffffffff811f1076
#26 [ffff8800b81cff40] sys_io_submit at ffffffff811f13e0
#27 [ffff8800b81cff50] entry_SYSCALL_64_fastpath at ffffffff81661eb6

So what happened here is that this process was called with the following xfs_bmalloca:

struct xfs_bmalloca {
  firstblock = 0xffff8800b81cf930,  # dereferencing this address shows NULLFSBLOCK:
                                    # ffff8800b81cf930: ffffffffffffffff
  flist = 0xffff8800b81cf940,
  tp = 0xffff880133c951d0,
  ip = 0xffff88013a313400,
  prev = {
    br_startoff = 1123392,
    br_startblock = 88284,
    br_blockcount = 32,
    br_state = XFS_EXT_NORM
  },
  got = {
    br_startoff = 1124352,
    br_startblock = 1102322,
    br_blockcount = 32,
    br_state = XFS_EXT_NORM
  },
  offset = 1124320,
  length = 32,
  blkno = 1102290,
  cur = 0x0,
  idx = 1655,
  nallocs = 0,
  logflags = 0,
  total = 36,
  minlen = 1,
  minleft = 3,
  eof = false,
  wasdel = false,
  aeof = false,
  conv = false,
  userdata = 1 '\001',
  flags = 8
}

Based on that, what likely happened is that we first called xfs_alloc_vextent with args->type = XFS_ALLOCTYPE_START_BNO, set from xfs_bmap_btalloc_nullfb (which is called since nullfb is true). In xfs_alloc_vextent we set args->agno = 1 and args->type = XFS_ALLOCTYPE_THIS_AG and continue to call xfs_alloc_fix_freelist. It in turn reads AG1's AGF xfs_buf and eventually returns with this buffer locked and added to the transaction, with args->agbp pointing to it.
Subsequently we call xfs_alloc_ag_vextent, which returns without an error, and this breaks out of the loop. However, the following check in xfs_alloc_vextent triggers:

        if (args->agbno == NULLAGBLOCK)
                args->fsbno = NULLFSBLOCK;

So despite xfs_alloc_ag_vextent not returning an error, it apparently couldn't satisfy the block allocation. At this point we return to xfs_bmap_btalloc, which goes on to retry the allocation, this time iterating through every AG. The following command confirms this:

crash> dis -l ffffffff812b38b4
/home/nborisov/projects/kernel/source/fs/xfs/libxfs/xfs_bmap.c: 3850

Except this time xfs_alloc_vextent is entered with XFS_ALLOCTYPE_FIRST_AG, and as soon as it starts iterating the AGs it blocks trying to acquire the AGF0 xfs_buf. Here is the buf this process is actually waiting on:

crash> struct xfs_buf.b_bn ffff8800a5f1c280
  b_bn = 1

At the same time it is holding the AGF1 buf dirty. This is evident from the items in the transaction item list:

crash> struct -ox xfs_trans.t_items 0xffff880133c951d0
struct xfs_trans {
  [ffff880133c95290] struct list_head t_items;
}
crash> list -s xfs_log_item_desc -l xfs_log_item_desc.lid_trans -H ffff880133c95290
ffff8800ba9955a8
struct xfs_log_item_desc {
  lid_item = 0xffff88013956d8e8,
  lid_trans = {
    next = 0xffff8800ba995f88,
    prev = 0xffff880133c95290
  },
  lid_flags = 0 '\000'
}
ffff8800ba995f88
struct xfs_log_item_desc {
  lid_item = 0xffff8800a60b1570,
  lid_trans = {
    next = 0xffff880133c95290,
    prev = 0xffff8800ba9955a8
  },
  lid_flags = 0 '\000'
}
crash> struct xfs_buf_log_item.bli_buf,bli_item 0xffff8800a60b1570
  bli_buf = 0xffff8800a5f1d900
  bli_item = {
    li_ail = {
      next = 0xffff8800b5d14640,
      prev = 0xffff88013956d8e8
    },
    li_lsn = 8589944450,
    li_desc = 0xffff8800ba995f80,
    li_mountp = 0xffff8800a4e8b000,
    li_ailp = 0xffff880139ef6f00,
    li_type = 4668,
    li_flags = 1,
    li_bio_list = 0x0,
    li_cb = 0xffffffff8130cbc0 <xfs_buf_iodone>,
    li_ops = 0xffffffff8183d940 <xfs_buf_item_ops>,
    li_cil = {
      next = 0xffff8800a60b15c0,
      prev =
0xffff8800a60b15c0
    },
    li_lv = 0x0,
    li_seq = 82
  }
crash> struct xfs_buf.b_bn 0xffff8800a5f1d900
  b_bn = 7680001

Checking what 7680001 corresponds to:

xfs_db> agf 1
xfs_db> daddr
current daddr is 7680001

So this process is truly holding AGF1 locked. On the other hand I have the truncation process, ProcessB, whose call stack is:

PID: 4532  TASK: ffff8800a5b03780  CPU: 3  COMMAND: "xfs_io"
 #0 [ffff8800a4dbb808] __schedule at ffffffff8165be8b
 #1 [ffff8800a4dbb860] schedule at ffffffff8165c80c
 #2 [ffff8800a4dbb878] schedule_timeout at ffffffff816608ef
 #3 [ffff8800a4dbb918] __down at ffffffff8165f791
 #4 [ffff8800a4dbb980] xfs_buf_lock at ffffffff812e25e7
 #5 [ffff8800a4dbb9a8] _xfs_buf_find at ffffffff812e28f4
 #6 [ffff8800a4dbb9f8] xfs_buf_get_map at ffffffff812e2c8a
 #7 [ffff8800a4dbba38] xfs_buf_read_map at ffffffff812e3f30
 #8 [ffff8800a4dbba80] xfs_trans_read_buf_map at ffffffff81316a7c
 #9 [ffff8800a4dbbac0] xfs_read_agf at ffffffff8129ee05
#10 [ffff8800a4dbbb20] xfs_alloc_read_agf at ffffffff8129ef54
#11 [ffff8800a4dbbb60] xfs_alloc_fix_freelist at ffffffff8129f3fc
#12 [ffff8800a4dbbc38] xfs_free_extent at ffffffff8129ff1d
#13 [ffff8800a4dbbcd8] xfs_trans_free_extent at ffffffff813177d6
#14 [ffff8800a4dbbd08] xfs_bmap_finish at ffffffff812dea09
#15 [ffff8800a4dbbd40] xfs_itruncate_extents at ffffffff812f903e
#16 [ffff8800a4dbbdd8] xfs_setattr_size at ffffffff812f541b
#17 [ffff8800a4dbbe28] xfs_vn_setattr at ffffffff812f54fb
#18 [ffff8800a4dbbe50] notify_change at ffffffff811bf152
#19 [ffff8800a4dbbe90] do_truncate at ffffffff8119fcd5
#20 [ffff8800a4dbbf00] do_sys_ftruncate.constprop.4 at ffffffff811a003b
#21 [ffff8800a4dbbf40] sys_ftruncate at ffffffff811a00ce
#22 [ffff8800a4dbbf50] entry_SYSCALL_64_fastpath at ffffffff81661eb6

The first thing is to inspect what this process is waiting on:

crash> struct xfs_buf.b_bn ffff8800a5f1d900
  b_bn = 7680001

So a buffer describing AGF1, which is already held by ProcessA.
Looking at the extent list it wants to free, things add up:

crash> struct xfs_bmap_free_t ffff8800a4dbbd98
struct xfs_bmap_free_t {
  xbf_first = 0xffff8800ba9555e8,
  xbf_count = 1,
  xbf_low = 0
}
crash> struct xfs_bmap_free_item 0xffff8800ba9555e8
struct xfs_bmap_free_item {
  xbfi_startblock = 1060329,  # >> 20 = AG1
  xbfi_blockcount = 32,
  xbfi_next = 0x0
}

Inspecting this process's transaction item list I get the following items:

crash> list -s xfs_log_item_desc.lid_item,lid_flags -l xfs_log_item_desc.lid_trans -H ffff880133c95f40
ffff8800ba995f08
  lid_item = 0xffff8800b5d147d0
  lid_flags = 1 '\001'
ffff8800ba995248
  lid_item = 0xffff8801394ed2b8
  lid_flags = 1 '\001'
ffff8800ba9958a8
  lid_item = 0xffff8801394ed3a0
  lid_flags = 1 '\001'
ffff8800ba995448
  lid_item = 0xffff8801394ed488
  lid_flags = 1 '\001'

Printing each item's li_ops member reveals what they hold:

crash> struct xfs_log_item.li_ops 0xffff8800b5d147d0
  li_ops = 0xffffffff8183d980 <xfs_efd_item_ops>
crash> struct xfs_log_item.li_ops 0xffff8801394ed2b8
  li_ops = 0xffffffff8183d940 <xfs_buf_item_ops>
crash> struct xfs_log_item.li_ops 0xffff8801394ed3a0
  li_ops = 0xffffffff8183d940 <xfs_buf_item_ops>
crash> struct xfs_log_item.li_ops 0xffff8801394ed488
  li_ops = 0xffffffff8183d940 <xfs_buf_item_ops>

So one EFD item and 3 buf items. Printing each xfs_buf from the respective buf items:

crash> struct xfs_buf_log_item.bli_buf 0xffff8801394ed2b8
  bli_buf = 0xffff8800a5f1c280
crash> struct xfs_buf_log_item.bli_buf 0xffff8801394ed3a0
  bli_buf = 0xffff8800b80a5180
crash> struct xfs_buf_log_item.bli_buf 0xffff8801394ed488
  bli_buf = 0xffff8800a5f1e300
crash> struct xfs_buf.b_bn 0xffff8800a5f1c280
  b_bn = 1  <= AGF0
crash> struct xfs_buf.b_bn 0xffff8800b80a5180
  b_bn = 8
crash> struct xfs_buf.b_bn 0xffff8800a5f1e300
  b_bn = 16

So we are holding AGF0's buffer and 2 more buffers. At this point both processes are stuck, each waiting for a buffer held by the other.
What I don't understand about the allocating process is why it doesn't release args->agbp when it fails to allocate from that AG, and under what conditions the first execution of xfs_alloc_vextent returns no error yet with args->agbno == NULLAGBLOCK, so that args->fsbno ends up set to NULLFSBLOCK. On the other hand, I haven't managed to figure out why the truncation process, wanting to free a block residing in AG1, holds AGF0 locked.

The only commit which contains anything remotely similar to this is e04426b9202b ("xfs: move allocation stack switch up to xfs_bmapi_allocate"). But it dates way back in time, and it is already present in 4.4.