[XFS updates] XFS development tree branch, for-next, updated. v3.10-rc1-54-gddf6ad0

xfs@xxxxxxxxxxx · Thu, 27 Jun 2013 14:45:48 -0500 (CDT)

This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "XFS development tree".

The branch, for-next has been updated
  ddf6ad0 xfs: Use inode create transaction
  28c8e41 xfs: Inode create item recovery
  b8402b4 xfs: Inode create transaction reservations
  3ebe7d2 xfs: Inode create log items
  5f6bed7 xfs: Introduce an ordered buffer item
  fd63875 xfs: Introduce ordered log vector support
  1baaed8 xfs: xfs_ifree doesn't need to modify the inode buffer
  cca9f93 xfs: don't do IO when creating an new inode
  133eeb1 xfs: don't use speculative prealloc for small files
  34eefc0 xfs: plug directory buffer readahead
  cbb2864 xfs: add pluging for bulkstat readahead
      from  80a4049813a2ae0977d8e5db78e711c7f21c420b (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
commit ddf6ad01434e72bfc8423e1619abdaa0af9394a8
Author: Dave Chinner <david@xxxxxxxxxxxxx>
Date:   Thu Jun 27 16:04:56 2013 +1000

    xfs: Use inode create transaction

    Replace the use of buffer based logging of inode initialisation
    with the new logical form that describes the range to be initialised
    in recovery. We continue to "log" the inode buffers to push them
    into the AIL and ensure that the inode create transaction is not
    removed from the log before the inode buffers are written to disk.

    Update the transaction identifier and reservations to match the
    changed implementation.

    Signed-off-by: Dave Chinner <david@xxxxxxxxxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 28c8e41af693e4b5cd2d68218f144cf40ce15781
Author: Dave Chinner <david@xxxxxxxxxxxxx>
Date:   Thu Jun 27 16:04:55 2013 +1000

    xfs: Inode create item recovery

    When we find an icreate transaction, we need to get and initialise
    the buffers in the range that has been passed. Extract and verify
    the information in the item record, then loop over the range
    initialising the buffers and issuing them as delayed writes.

    Support an arbitrary size range to initialise so that in
    future when we allocate inodes in much larger chunks all kernels
    that understand this transaction can still recover them.

    Signed-off-by: Dave Chinner <david@xxxxxxxxxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit b8402b4729495ac719a3f532c2e33ac653b222a8
Author: Dave Chinner <david@xxxxxxxxxxxxx>
Date:   Thu Jun 27 16:04:54 2013 +1000

    xfs: Inode create transaction reservations

    Define the log and space transaction sizes. Factor the current
    create log reservation macro into the two logical halves and reuse
    one half for the new icreate transactions. The icreate transaction
    is transparent to all the high level create code - the
    pre-calculated reservations will correctly set the reservations
    dependent on whether the filesystem supports the icreate
    transaction.

    Signed-off-by: Dave Chinner <david@xxxxxxxxxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 3ebe7d2d73179c4874aee4f32e043eb5acd9fa0f
Author: Dave Chinner <david@xxxxxxxxxxxxx>
Date:   Thu Jun 27 16:04:53 2013 +1000

    xfs: Inode create log items

    Introduce the inode create log item type for logical inode create logging.
    Instead of logging the changes in buffers, pass the range to be
    initialised through the log by a new transaction type.  This reduces
    the amount of log space required to record initialisation during
    allocation from about 128 bytes per inode to a small fixed amount
    per inode extent to be initialised.

    This requires a new log item type to track it through the log
    and the AIL. This is a relatively simple item - most callbacks are
    noops as this item has the same life cycle as the transaction.

    Signed-off-by: Dave Chinner <david@xxxxxxxxxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>
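
The log space saving described in the "Inode create log items" commit above
can be illustrated with some back-of-envelope arithmetic. This is only a
sketch: the 128 bytes per inode figure is taken from the commit message and
the 64 inodes per chunk follows from the 16k / 256 byte figures quoted
elsewhere in this log, but ICREATE_ITEM_BYTES is a made-up placeholder for
the "small fixed amount", whose real size is not stated in this email.

```python
# Rough comparison of log space needed to record inode chunk initialisation.
PER_INODE_LOG_BYTES = 128   # "about 128 bytes per inode" (commit message)
ICREATE_ITEM_BYTES = 64     # hypothetical placeholder, for illustration only


def buffer_logging_bytes(inode_count: int) -> int:
    """Log space when every inode's initialisation is physically logged."""
    return inode_count * PER_INODE_LOG_BYTES


def icreate_logging_bytes(extent_count: int) -> int:
    """Log space when one fixed-size item describes each inode extent."""
    return extent_count * ICREATE_ITEM_BYTES
```

For one chunk of 64 inodes, buffer_logging_bytes(64) gives 8192 bytes,
while icreate_logging_bytes(1) is whatever the fixed item size turns out to
be, independent of how many inodes the extent holds.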

commit 5f6bed76c0c85cb4d04885a5de00b629deee550b
Author: Dave Chinner <david@xxxxxxxxxxxxx>
Date:   Thu Jun 27 16:04:52 2013 +1000

    xfs: Introduce an ordered buffer item

    If we have a buffer that we have modified but we do not wish to
    physically log in a transaction (e.g. we've logged a logical
    change), we still need to ensure that transactional integrity is
    maintained. Hence we must not move the tail of the log past the
    transaction that the buffer is associated with before the buffer is
    written to disk.

    This means these special buffers still need to be included in the
    transaction and added to the AIL just like a normal buffer, but we
    do not want the modifications to the buffer written into the
    transaction. IOWs, what we want is an "ordered buffer" that
    maintains the same transactional life cycle as a physically logged
    buffer, just without the transcribing of the modifications to the
    log.

    Hence we need to flag the buffer as an "ordered buffer" to avoid
    including it in vector size calculations or formatting during the
    transaction. Once the transaction is committed, the buffer appears
    for all intents to be the same as a physically logged buffer as it
    transitions through the log and AIL.

    Relogging will also work just fine for such an ordered buffer - the
    logical transaction will be replayed before the subsequent
    modifications that relog the buffer, so everything will be
    reconstructed correctly by recovery.

    Signed-off-by: Dave Chinner <david@xxxxxxxxxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit fd63875cc4cd60b9e5c609c24d75eaaad3e6d1c4
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Thu Jun 27 16:04:51 2013 +1000

    xfs: Introduce ordered log vector support

    An "ordered log vector" is a log vector that is used for
    tracking a log item through the CIL and into the AIL as part of the
    log checkpointing. These ordered log vectors are special in that
    they are not written to the journal in any way, and are not accounted
    to the checkpoint being written.

    The reason for this behaviour is to allow operations to attach items
    to transactions and have them follow the normal transactional
    lifecycle without actually having to write them to the journal. This
    allows logging of items that track high level logical changes and
    writing them to the log, while the physical items being modified
    pass through into the AIL and pin the tail of the log (and therefore
    the logical item in the log) until all the modified items are
    physically written to disk.

    IOWs, it allows us to write metadata without physically logging
    every individual change but still maintain the full transactional
    integrity guarantees we currently have w.r.t. crash recovery.

    This change modifies some of the CIL item insertion loops, as
    ordered log vectors introduce some new constraints as they don't
    track any data. One advantage of this change is that it combines
    two log vector chain walks into a single pass, so there is less
    overhead in the transaction commit pass as well. It also kills some
    unused code in the log vector walk loop when committing the CIL.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>
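
The behaviour described in the "ordered log vector" commit above can be
modelled with a toy log. This is a minimal sketch under stated assumptions,
not the kernel implementation: ToyLog, LogVector and all of their fields are
hypothetical names, LSNs are plain counters, and the journal is just a list
of byte strings. It shows only the two properties the commit message
describes: ordered vectors add nothing to the journal, yet their items
still sit in the AIL and hold back the tail of the log until written back.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LogVector:
    item: str              # name of the log item being tracked
    payload: bytes = b""   # formatted changes; unused for ordered vectors
    ordered: bool = False  # True: track the item but journal nothing


@dataclass
class ToyLog:
    journal: List[bytes] = field(default_factory=list)
    ail: Dict[str, int] = field(default_factory=dict)  # item -> LSN
    next_lsn: int = 1

    def checkpoint(self, vectors: List[LogVector]) -> int:
        """Write one checkpoint; return the number of bytes journalled."""
        lsn = self.next_lsn
        self.next_lsn += 1
        written = 0
        for lv in vectors:
            if not lv.ordered:
                self.journal.append(lv.payload)
                written += len(lv.payload)
            # Ordered or not, the item is tracked in the AIL at this LSN.
            self.ail[lv.item] = lsn
        return written

    def writeback(self, item: str) -> None:
        """The item has reached its final on-disk location; drop it."""
        self.ail.pop(item, None)

    def tail_lsn(self) -> int:
        """Oldest LSN still needed for recovery."""
        return min(self.ail.values(), default=self.next_lsn)
```

In this model, a checkpoint holding one logical item and one ordered buffer
journals only the logical item's bytes, and tail_lsn() stays at that
checkpoint's LSN until writeback() is called for the ordered buffer.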

commit 1baaed8fa955ab0d23aab24477dae566ed6a105b
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Thu Jun 27 16:04:50 2013 +1000

    xfs: xfs_ifree doesn't need to modify the inode buffer

    Long ago, bulkstat used to read inodes directly from the backing
    buffer for speed. This had the unfortunate problem of being cache
    incoherent with unlinks, and so xfs_ifree() had to mark the inode
    as free directly in the backing buffer. bulkstat was changed some
    time ago to use inode cache coherent lookups, and so will never see
    unlinked inodes in its lookups. Hence xfs_ifree() does not need to
    touch the inode backing buffer anymore.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit cca9f93a52d2ead50b5da59ca83d5f469ee4be5f
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Thu Jun 27 16:04:49 2013 +1000

    xfs: don't do IO when creating an new inode

    When we are allocating a new inode, we read the inode cluster off
    disk to increment the generation number. We are already using a
    random generation number for newly allocated inodes, so if we are not
    using the ikeep mode, we can just generate a new generation number
    when we initialise the newly allocated inode.

    This avoids the need for reading the inode buffer during inode
    creation. This will speed up allocation of inodes in cold, partially
    allocated clusters as they will no longer need to be read from disk
    during allocation. It will also reduce the CPU overhead of inode
    allocation by not having to process the buffer read, even on cache
    hits.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit 133eeb1747c33b6d75483c074b27d4e5e02286dc
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Thu Jun 27 16:04:48 2013 +1000

    xfs: don't use speculative prealloc for small files

    Dedicated small file workloads have been seeing significant free
    space fragmentation causing premature inode allocation failure
    when large inode sizes are in use. A particular test case showed
    that a workload that runs to a real ENOSPC on 256 byte inodes would
    fail inode allocation with ENOSPC at about 80% full with 512 byte
    inodes, and at about 50% full with 1024 byte inodes.

    The same workload, when run with -o allocsize=4096 on 1024 byte
    inodes would run to being 100% full before giving ENOSPC. That is,
    no freespace fragmentation at all.

    The issue was caused by the specific IO pattern the application had
    - the framework it was using did not support direct IO, and so it
    was emulating it by using fadvise(DONT_NEED). The result was that
    the data was getting written back before the speculative prealloc
    had been trimmed from memory by the close(), and so small single
    block files were being allocated with 2 blocks, and then having one
    truncated away. The result was lots of small 4k free space extents,
    and hence each new 8k allocation would take another 8k from
    contiguous free space and turn it into 4k of allocated space and 4k
    of free space.

    Hence inode allocation, which requires contiguous, aligned
    allocation of 16k (256 byte inodes), 32k (512 byte inodes) or 64k
    (1024 byte inodes) can fail to find sufficiently large freespace and
    hence fail while there is still lots of free space available.

    There's a simple fix for this, and one that has precedent in the
    allocator code already - don't do speculative allocation unless the
    size of the file is larger than a certain size. In this case, that
    size is the minimum default preallocation size:
    mp->m_writeio_blocks. And to keep with the concept of being nice to
    people when the files are still relatively small, cap the prealloc
    to mp->m_writeio_blocks until the file goes over a stripe unit in
    size, at which point we'll fall back to the current behaviour based
    on the last extent size.

    This will effectively turn off speculative prealloc for very small
    files, keep preallocation low for small files, and behave as it
    currently does for any file larger than a stripe unit. This
    completely avoids the freespace fragmentation problem this
    particular IO pattern was causing.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Brian Foster <bfoster@xxxxxxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>
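
The three-tier policy described in the speculative prealloc commit above can
be sketched as follows. This is an illustration of the commit message, not
the code in fs/xfs/xfs_iomap.c: the function and parameter names are
hypothetical, sizes are in filesystem blocks, the exact boundary comparisons
are an assumption, and heuristic_blocks simply stands in for the existing
behaviour based on the last extent size. The inode chunk helpers assume 64
inodes per chunk, which matches the 16k / 32k / 64k figures quoted above.

```python
INODES_PER_CHUNK = 64  # consistent with 16k chunks of 256 byte inodes


def prealloc_blocks(file_blocks: int, writeio_blocks: int,
                    stripe_unit_blocks: int, heuristic_blocks: int) -> int:
    """Blocks to preallocate speculatively beyond EOF (illustrative)."""
    if file_blocks < writeio_blocks:
        return 0  # very small file: no speculative prealloc at all
    if file_blocks <= stripe_unit_blocks:
        # Small file: cap the prealloc at the minimum default size.
        return min(heuristic_blocks, writeio_blocks)
    return heuristic_blocks  # larger file: behaviour is unchanged


def inode_chunk_bytes(inode_size: int) -> int:
    """Contiguous space needed to allocate one inode chunk."""
    return INODES_PER_CHUNK * inode_size


def can_allocate_inode_chunk(largest_free_extent: int, inode_size: int) -> bool:
    """True if the largest free extent (in bytes) can hold an inode chunk."""
    return largest_free_extent >= inode_chunk_bytes(inode_size)
```

With illustrative values of 16 blocks for the minimum preallocation size and
32 blocks for the stripe unit, a single-block file gets no preallocation, so
it no longer leaves a stray 4k free extent behind. That matters because
can_allocate_inode_chunk(4096, 256) is false: 4k holes are useless for inode
allocation no matter how many of them exist.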

commit 34eefc06a06f496b92c3267a0601129a932c7174
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Thu Jun 27 16:04:47 2013 +1000

    xfs: plug directory buffer readahead

    Similar to bulkstat inode chunk readahead, we need to plug directory
    data buffer readahead during getdents to ensure that we can merge
    adjacent readahead requests and sort out of order requests optimally
    before they are dispatched. This improves the readahead efficiency
    and reduces the IO load it generates as the IO patterns are
    significantly better for both contiguous and fragmented directories.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>

commit cbb2864aa48977205c76291ba5a23331393b2578
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date:   Thu Jun 27 16:04:46 2013 +1000

    xfs: add pluging for bulkstat readahead

    I was running some tests on bulkstat on CRC enabled filesystems when
    I noticed that all the IO being issued was 8k in size, regardless of
    the fact that we are issuing sequential 8k buffers for inode
    clusters. The IO size should be 16k for 256 byte inodes, and 32k for
    512 byte inodes, but this wasn't happening.

    blktrace showed that there was an explicit plug and unplug happening
    around each readahead IO from _xfs_buf_ioapply, and the unplug was
    causing the IO to be issued immediately. Hence no opportunity was
    being given to the elevator to merge adjacent readahead requests and
    dispatch them as a single IO.

    Add plugging around the inode chunk readahead dispatch loop in
    bulkstat to ensure that we don't unplug the queue between adjacent
    inode buffer readahead IOs and so we get fewer, larger IO requests
    hitting the storage subsystem for bulkstat.

    Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
    Reviewed-by: Mark Tinguely <tinguely@xxxxxxx>
    Signed-off-by: Ben Myers <bpm@xxxxxxx>
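
The effect of plugging described in the two readahead commits above can be
shown with a toy model. This is a sketch only: it is not the block layer's
plugging API, the function names are hypothetical, and real request merging
is subject to queue limits that are ignored here. Requests are (byte offset,
length) pairs.

```python
from typing import List, Tuple

Request = Tuple[int, int]  # (byte offset, length in bytes)


def dispatch_unplugged(requests: List[Request]) -> List[Request]:
    """Each request is issued as soon as it is submitted: nothing merges."""
    return list(requests)


def dispatch_plugged(requests: List[Request]) -> List[Request]:
    """Hold requests until unplug, then sort and merge adjacent ones."""
    merged: List[Request] = []
    for offset, length in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == offset:
            # This request starts where the previous one ends: extend it.
            prev_offset, prev_length = merged[-1]
            merged[-1] = (prev_offset, prev_length + length)
        else:
            merged.append((offset, length))
    return merged
```

Feeding four 8k readahead requests, two of them out of order, through
dispatch_plugged() yields two 16k IOs where dispatch_unplugged() issues four
separate 8k IOs, which is the pattern blktrace showed before this change.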

-----------------------------------------------------------------------

Summary of changes:
 fs/xfs/Makefile           |   1 +
 fs/xfs/xfs_buf_item.c     |  87 ++++++++++++++-------
 fs/xfs/xfs_buf_item.h     |   4 +-
 fs/xfs/xfs_dir2_leaf.c    |   3 +
 fs/xfs/xfs_ialloc.c       |  67 ++++++++++++----
 fs/xfs/xfs_ialloc.h       |   8 ++
 fs/xfs/xfs_icreate_item.c | 195 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_icreate_item.h |  52 +++++++++++++
 fs/xfs/xfs_inode.c        |  68 ++++++++--------
 fs/xfs/xfs_iomap.c        |  13 ++++
 fs/xfs/xfs_itable.c       |   3 +
 fs/xfs/xfs_log.c          |  22 +++++-
 fs/xfs/xfs_log.h          |   5 +-
 fs/xfs/xfs_log_cil.c      |  75 ++++++++++++------
 fs/xfs/xfs_log_recover.c  | 114 +++++++++++++++++++++++++--
 fs/xfs/xfs_super.c        |   8 ++
 fs/xfs/xfs_trace.h        |   4 +
 fs/xfs/xfs_trans.c        | 118 ++++++++++++++++++----------
 fs/xfs/xfs_trans.h        |   5 +-
 fs/xfs/xfs_trans_buf.c    |  34 +++++++-
 20 files changed, 724 insertions(+), 162 deletions(-)
 create mode 100644 fs/xfs/xfs_icreate_item.c
 create mode 100644 fs/xfs/xfs_icreate_item.h

hooks/post-receive
-- 
XFS development tree

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs