Re: xfs Digest, Vol 53, Issue 111

Sotirios Torentsis <torentsis@xxxxxxxxx> · Wed, 21 Nov 2012 07:16:43 +0100

Am 21.11.2012 02:38 schrieb  <xfs-request@xxxxxxxxxxx>:

Send xfs mailing list submissions to

        xfs@xxxxxxxxxxx



To subscribe or unsubscribe via the World Wide Web, visit

        http://oss.sgi.com/mailman/listinfo/xfs

or, via email, send a message with subject or body 'help' to

        xfs-request@xxxxxxxxxxx



You can reach the person managing the list at

        xfs-owner@xxxxxxxxxxx



When replying, please edit your Subject line so it is more specific

than "Re: Contents of xfs digest..."


Today's Topics:



   1. [PATCH 0/2] discontiguous buffer patches (Mark Tinguely)

   2. Re: [PATCH 0/2] xfsprogs: fixes for release candidate.

      (Mark Tinguely)

   3. Re: [PATCH] xfs: Don't flush inodes when project quota

      exceeded (Jan Kara)

   4. Re: [PATCH 2/9] ext4: honor the O_SYNC flag for aysnchronous

      direct I/O requests (Jan Kara)

   5. [PATCH] ext4: Reduce i_mutex usage in ext4_file_sync() (Jan Kara)

   6. Re: [PATCH 9/9] blkdev: Fix up AIO+DIO+O_SYNC to do the sync

      part      correctly (Jan Kara)

   7. Re: [PATCH] xfs: Don't flush inodes when project quota

      exceeded (Dave Chinner)



---------- Weitergeleitete Nachricht ----------
From: Mark Tinguely <tinguely@xxxxxxx>
To: xfs@xxxxxxxxxxx
Cc: 
Date: Tue, 20 Nov 2012 16:41:20 -0600

Subject: [PATCH 0/2] discontiguous buffer patches
Eric Sundeen's "userspace vs. fragmented multiblock dir2", xfstest 291

commit 2a4ed, causes a xfs_buf lock hang:



[<ffffffff81065d87>] down+0x47/0x50

[<ffffffffa03a25c6>] xfs_buf_lock+0x66/0xe0 [xfs]

[<ffffffffa03a33c9>] _xfs_buf_find+0x1f9/0x320 [xfs]

[<ffffffffa03a393f>] xfs_buf_get_map+0x2f/0x170 [xfs]

[<ffffffffa03a3f7a>] xfs_buf_read_map+0x2a/0x100 [xfs]

[<ffffffffa0411630>] xfs_trans_read_buf_map+0x3b0/0x590 [xfs]

[<ffffffffa03e1c5e>] xfs_da_read_buf+0xbe/0x230 [xfs]

[<ffffffffa03e78dc>] xfs_dir2_block_addname+0x7c/0x980 [xfs]

[<ffffffffa03f1468>] xfs_dir2_sf_addname+0x3e8/0x450 [xfs]

[<ffffffffa03e634c>] xfs_dir_createname+0x17c/0x1e0 [xfs]

[<ffffffffa03b9fe2>] xfs_create+0x4c2/0x5f0 [xfs]

[<ffffffffa03b0c8a>] xfs_vn_mknod+0x8a/0x1a0 [xfs]

[<ffffffffa03b0dce>] xfs_vn_create+0xe/0x10 [xfs]

[<ffffffff8115695c>] vfs_create+0xac/0xd0

[<ffffffff811587de>] do_last+0x8be/0x960

[<ffffffff8115940c>] path_openat+0xdc/0x410

[<ffffffff81159853>] do_filp_open+0x43/0xa0

[<ffffffff8114a502>] do_sys_open+0x152/0x1e0

[<ffffffff8114a5cc>] sys_open+0x1c/0x20

[<ffffffff81423df9>] system_call_fastpath+0x16/0x1b

[<ffffffffffffffff>] 0xffffffffffffffff



That bisect a problem to:



    commit 3605431fb9739a30ccd0c6380ae8e3c6f8e670a5

    Author: Dave Chinner <dchinner@xxxxxxxxxx>

    Date:   Fri Jun 22 18:50:13 2012 +1000

    xfs: use discontiguous xfs_buf support in dabuf wrappers



xfs_trans_buf_item_match() is looking for the block number of the

buffer in the single segment map area.



Futhermore, there are a couple issue with multi-segment buffer log

format.



Patch 1 cleans up the buffer map so that XFS always uses b_maps[].



Patch 2 fixes the buffer log format issues.



--Mark.









---------- Weitergeleitete Nachricht ----------
From: Mark Tinguely <tinguely@xxxxxxx>
To: Dave Chinner <david@xxxxxxxxxxxxx>

Cc: xfs@xxxxxxxxxxx
Date: Tue, 20 Nov 2012 17:17:36 -0600
Subject: Re: [PATCH 0/2] xfsprogs: fixes for release candidate.
On 11/09/12 01:02, Dave Chinner wrote:


Hi Ben,



These are two fixes for bugs that have shown up during the -rc

phase of the release. Please consider them for inclusion.



Cheers,



Dave.



_______________________________________________

xfs mailing list

xfs@xxxxxxxxxxx

http://oss.sgi.com/mailman/listinfo/xfs




This series has been committed to git://oss.sgi.com/xfs/cmds/xfsprogs master branch, commit 1adfff, 19473a and 36298.



--Mark.







---------- Weitergeleitete Nachricht ----------
From: Jan Kara <jack@xxxxxxx>
To: Dave Chinner <david@xxxxxxxxxxxxx>

Cc: Jan Kara <jack@xxxxxxx>, xfs@xxxxxxxxxxx
Date: Wed, 21 Nov 2012 01:24:59 +0100
Subject: Re: [PATCH] xfs: Don't flush inodes when project quota exceeded

On Tue 20-11-12 11:04:28, Dave Chinner wrote:

> On Mon, Nov 19, 2012 at 10:39:13PM +0100, Jan Kara wrote:

> > On Tue 13-11-12 01:36:13, Jan Kara wrote:

> > > When project quota gets exceeded xfs_iomap_write_delay() ends up flushing

> > > inodes because ENOSPC gets returned from xfs_bmapi_delay() instead of EDQUOT.

> > > This makes handling of writes over project quota rather slow as a simple test

> > > program shows:

> > >   fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);

> > >   for (i = 0; i < 50000; i++)

> > >           pwrite(fd, buf, 4096, i*4096);

> > >

> > > Writing 200 MB like this into a directory with 100 MB project quota takes

> > > around 6 minutes while it takes about 2 seconds with this patch applied. This

> > > actually happens in a real world load when nfs pushes data into a directory

> > > which is over project quota.

> > >

> > > Fix the problem by replacing XFS_QMOPT_ENOSPC flag with XFS_QMOPT_EPDQUOT.

> > > That makes xfs_trans_reserve_quota_bydquots() return new error EPDQUOT when

> > > project quota is exceeded. xfs_bmapi_delay() then uses this flag so that

> > > xfs_iomap_write_delay() can distinguish real ENOSPC (requiring flushing)

> > > from exceeded project quota (not requiring flushing).

> > >

> > > As a side effect this patch fixes inconsistency where e.g. xfs_create()

> > > returned EDQUOT even when project quota was exceeded.

> >   Ping? Any opinions?

>

> FWIW, it doesn't look like it'll apply to a current XFs tree:

>

> > > @@ -441,8 +442,11 @@ retry:

> > >    */

> > >   if (nimaps == 0) {

> > >           trace_xfs_delalloc_enospc(ip, offset, count);

> > > -         if (flushed)

> > > -                 return XFS_ERROR(error ? error : ENOSPC);

> > > +         if (flushed) {

> > > +                 if (error == 0 || error == EPDQUOT)

> > > +                         error = ENOSPC;

> > > +                 return XFS_ERROR(error);

> > > +         }

> > >

> > >           if (error == ENOSPC) {

> > >                   xfs_iunlock(ip, XFS_ILOCK_EXCL);

>

> This xfs_iomap_write_delay() looks like this now:

>

>         /*

>          * If bmapi returned us nothing, we got either ENOSPC or EDQUOT. Retry

>          * without EOF preallocation.

>          */

>         if (nimaps == 0) {

>                 trace_xfs_delalloc_enospc(ip, offset, count);

>                 if (prealloc) {

>                         prealloc = 0;

>                         error = 0;

>                         goto retry;

>                 }

>                 return XFS_ERROR(error ? error : ENOSPC);

>         }

>

> The flushing is now way up in xfs_file_buffered_aio_write(), and the

> implementation of xfs_flush_inodes() has changed as well. Hence it

> may or may not behave differently not....

  OK, so I tested latest XFS tree and changes by commit 9aa05000 (changing

xfs_flush_inodes()) indeed improve the performace from those ~6 minutes to

~6 seconds which is good enough I believe. Thanks for the pointer! I was

thinking for a while why sync_inodes_sb() is so much faster than the

original XFS implementation and I believe it's because we don't force the

log on each sync now.



                                                                Honza

--

Jan Kara <jack@xxxxxxx>

SUSE Labs, CR







---------- Weitergeleitete Nachricht ----------
From: Jan Kara <jack@xxxxxxx>
To: Jeff Moyer <jmoyer@xxxxxxxxxx>
Cc: axboe@xxxxxxxxx, tytso@xxxxxxx, "Darrick J. Wong" <darrick.wong@xxxxxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx, xfs@xxxxxxxxxxx, hch@xxxxxxxxxxxxx, bpm@xxxxxxx, viro@xxxxxxxxxxxxxxxxxx, linux-fsdevel@xxxxxxxxxxxxxxx, Jan Kara <jack@xxxxxxx>, linux-ext4@xxxxxxxxxxxxxxx

Date: Wed, 21 Nov 2012 01:56:26 +0100
Subject: Re: [PATCH 2/9] ext4: honor the O_SYNC flag for aysnchronous direct I/O requests
On Tue 20-11-12 15:02:15, Jeff Moyer wrote:

> Jan Kara <jack@xxxxxxx> writes:

>

> >> @@ -1279,6 +1280,9 @@ struct ext4_sb_info {

> >>    /* workqueue for dio unwritten */

> >>    struct workqueue_struct *dio_unwritten_wq;

> >>

> >> +  /* workqueue for aio+dio+o_sync disk cache flushing */

> >> +  struct workqueue_struct *aio_dio_flush_wq;

> >> +

> >   Umm, I'm not completely decided whether we really need a separate

> > workqueue. But it doesn't cost too much so I guess it makes some sense -

> > fsync() is rather heavy so syncing won't starve extent conversion...

>

> I'm assuming you'd like me to convert the names from flush to fsync,

> yes?

  Would be nicer, yes.



> >> +

> >> +  /*

> >> +   * If we are running in nojournal mode, just flush the disk

> >> +   * cache and return.

> >> +   */

> >> +  if (!journal)

> >> +          return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_NOIO, NULL);

> >   And this is wrong as well - you need to do work similar to what

> > ext4_sync_file() does. Actually it would be *much* better if these two

> > sites used the same helper function. Which also poses an interesting

> > question about locking - do we need i_mutex or not? Forcing a transaction

> > commit is definitely OK without it, similarly as grabbing transaction ids

> > from inode or ext4_should_journal_data() test. __sync_inode() call seems

> > to be OK without i_mutex as well so I believe we can just get rid of it

> > (getting i_mutex from the workqueue is a locking nightmare we don't want to

> > return to).

>

> Just to be clear, are you saying you would like me to remove the

> mutex_lock/unlock pair from ext4_sync_file?  (I had already factored out

> the common code between this new code path and the fsync path in my tree.)

  Yes, after some thinking I came to that conclusion. We actually need to

keep i_mutex around ext4_flush_unwritten_io() to avoid livelocks but the

rest doesn't need it. The change should be definitely a separate patch just

in case there's something subtle I missed and we need to bisect in

future... I've attached a patch for that so that blame for bugs goes my way

;) Compile tested only so far. I'll give it some more testing overnight.



                                                                Honza

--

Jan Kara <jack@xxxxxxx>

SUSE Labs, CR



---------- Weitergeleitete Nachricht ----------
From: Jan Kara <jack@xxxxxxx>
To: 
Cc: 
Date: Wed, 21 Nov 2012 01:46:51 +0100
Subject: [PATCH] ext4: Reduce i_mutex usage in ext4_file_sync()

ext4_file_sync() needs i_mutex only to avoid livelocks of

ext4_flush_unwritten_io() all other code doesn't need it. In particular

syncing of inode & metadata in non-journal case is safe (writeback doesn't

hold i_mutex either) and forcing of transaction commits doesn't need i_mutex

either (there's nothing inode specific in that code apart from grabbing

transaction ids from the inode). So shorten the span where i_mutex is held.



Signed-off-by: Jan Kara <jack@xxxxxxx>

---

 fs/ext4/fsync.c |    6 ++----

 1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c

index be1d89f..2268114 100644

--- a/fs/ext4/fsync.c

+++ b/fs/ext4/fsync.c

@@ -113,8 +113,6 @@ static int __sync_inode(struct inode *inode, int datasync)

  *

  * What we do is just kick off a commit and wait on it.  This will snapshot the

  * inode to disk.

- *

- * i_mutex lock is held when entering and exiting this function

  */



 int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)

@@ -133,12 +131,13 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)

        ret = filemap_write_and_wait_range(inode->i_mapping, start, end);

        if (ret)

                return ret;

-       mutex_lock(&inode->i_mutex);



        if (inode->i_sb->s_flags & MS_RDONLY)

                goto out;



+       mutex_lock(&inode->i_mutex);

        ret = ext4_flush_unwritten_io(inode);

+       mutex_unlock(&inode->i_mutex);

        if (ret < 0)

                goto out;



@@ -180,7 +179,6 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)

                        ret = err;

        }

  out:

-       mutex_unlock(&inode->i_mutex);

        trace_ext4_sync_file_exit(inode, ret);

        return ret;

 }

--

1.7.1





--k+w/mQv8wyuph6w0--







---------- Weitergeleitete Nachricht ----------
From: Jan Kara <jack@xxxxxxx>
To: Jeff Moyer <jmoyer@xxxxxxxxxx>
Cc: axboe@xxxxxxxxx, tytso@xxxxxxx, "Darrick J. Wong" <darrick.wong@xxxxxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx, xfs@xxxxxxxxxxx, hch@xxxxxxxxxxxxx, bpm@xxxxxxx, viro@xxxxxxxxxxxxxxxxxx, linux-fsdevel@xxxxxxxxxxxxxxx, Jan Kara <jack@xxxxxxx>, linux-ext4@xxxxxxxxxxxxxxx

Date: Wed, 21 Nov 2012 01:57:39 +0100
Subject: Re: [PATCH 9/9] blkdev: Fix up AIO+DIO+O_SYNC to do the sync part correctly
On Tue 20-11-12 15:47:54, Jeff Moyer wrote:

> Jan Kara <jack@xxxxxxx> writes:

>

> > On Mon 19-11-12 23:51:15, Darrick J. Wong wrote:

> >> When performing O_SYNC+AIO+DIO writes to block devices, use the DIO_SYNC_WRITES

> >> flag so that flushes are issued /after/ the write completes, not before.

> >>

> >> Note, however, that for block devices, the DIO setup code ensures that a flush

> >> wq is attached to the superblock of the bdevfs filesystem, not the filesystem

> >> that the device node happens to reside in.  This means that unlike regular

> >> files, iocb->ki_filp->f_mapping->host->i_sb != inode->i_sb.  Therefore, adjust

> >> Jeff's earlier patch to keep the pointer use consistent and avoid a NULL deref.

> >>

> >> Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>

> >> ---

> >>  fs/block_dev.c |    5 +++--

> >>  fs/direct-io.c |    3 ++-

> >>  2 files changed, 5 insertions(+), 3 deletions(-)

> >>

> >>

> >> diff --git a/fs/block_dev.c b/fs/block_dev.c

> >> index 1a1e5e3..05ff33a 100644

> >> --- a/fs/block_dev.c

> >> +++ b/fs/block_dev.c

> >> @@ -235,7 +235,8 @@ blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,

> >>    struct inode *inode = file->f_mapping->host;

> >>

> >>    return __blockdev_direct_IO(rw, iocb, inode, I_BDEV(inode), iov, offset,

> >> -                              nr_segs, blkdev_get_blocks, NULL, NULL, 0);

> >> +                              nr_segs, blkdev_get_blocks, NULL, NULL,

> >> +                              DIO_SYNC_WRITES);

> >>  }

> >>

> >>  int __sync_blockdev(struct block_device *bdev, int wait)

> >> @@ -1631,7 +1632,7 @@ ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,

> >>    percpu_down_read(&bdev->bd_block_size_semaphore);

> >>

> >>    ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);

> >> -  if (ret > 0 || ret == -EIOCBQUEUED) {

> >> +  if (ret > 0) {

> >>            ssize_t err;

> >>

> >>            err = generic_write_sync(file, pos, ret);

> >> diff --git a/fs/direct-io.c b/fs/direct-io.c

> >> index b7391d4..c626c43 100644

> >> --- a/fs/direct-io.c

> >> +++ b/fs/direct-io.c

> >> @@ -258,7 +258,8 @@ void generic_dio_end_io(struct kiocb *iocb, loff_t offset, ssize_t bytes,

> >>            work->ret = ret;

> >>            work->offset = offset;

> >>            work->len = bytes;

> >> -          queue_work(inode->i_sb->s_dio_flush_wq, &work->work);

> >> +          queue_work(iocb->ki_filp->f_mapping->host->i_sb->s_dio_flush_wq,

> >> +                     &work->work);

> >   This should be folded into the original patch introducing the

> > s_dio_flush_wq. And please add a comment before this line saying that block

> > devices need a dereference exactly like this... Otherwise the patch looks

> > good so you can add:

> >   Reviewed-by: Jan Kara <jack@xxxxxxx>

>

> When you say, "This," do you mean that one change, or the whole patch?

  I meant the whole patch after that hung is folded ;)



                                                                Honza

--

Jan Kara <jack@xxxxxxx>

SUSE Labs, CR







---------- Weitergeleitete Nachricht ----------
From: Dave Chinner <david@xxxxxxxxxxxxx>
To: Jan Kara <jack@xxxxxxx>

Cc: xfs@xxxxxxxxxxx
Date: Wed, 21 Nov 2012 12:38:21 +1100
Subject: Re: [PATCH] xfs: Don't flush inodes when project quota exceeded
On Wed, Nov 21, 2012 at 01:24:59AM +0100, Jan Kara wrote:


> On Tue 20-11-12 11:04:28, Dave Chinner wrote:

> > On Mon, Nov 19, 2012 at 10:39:13PM +0100, Jan Kara wrote:

> > > On Tue 13-11-12 01:36:13, Jan Kara wrote:

> > > > When project quota gets exceeded xfs_iomap_write_delay() ends up flushing

> > > > inodes because ENOSPC gets returned from xfs_bmapi_delay() instead of EDQUOT.

> > > > This makes handling of writes over project quota rather slow as a simple test

> > > > program shows:

> > > >         fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);

> > > >         for (i = 0; i < 50000; i++)

> > > >                 pwrite(fd, buf, 4096, i*4096);

> > > >

> > > > Writing 200 MB like this into a directory with 100 MB project quota takes

> > > > around 6 minutes while it takes about 2 seconds with this patch applied. This

> > > > actually happens in a real world load when nfs pushes data into a directory

> > > > which is over project quota.

> > > >

> > > > Fix the problem by replacing XFS_QMOPT_ENOSPC flag with XFS_QMOPT_EPDQUOT.

> > > > That makes xfs_trans_reserve_quota_bydquots() return new error EPDQUOT when

> > > > project quota is exceeded. xfs_bmapi_delay() then uses this flag so that

> > > > xfs_iomap_write_delay() can distinguish real ENOSPC (requiring flushing)

> > > > from exceeded project quota (not requiring flushing).

> > > >

> > > > As a side effect this patch fixes inconsistency where e.g. xfs_create()

> > > > returned EDQUOT even when project quota was exceeded.

> > >   Ping? Any opinions?

> >

> > FWIW, it doesn't look like it'll apply to a current XFs tree:

> >

> > > > @@ -441,8 +442,11 @@ retry:

> > > >          */

> > > >         if (nimaps == 0) {

> > > >                 trace_xfs_delalloc_enospc(ip, offset, count);

> > > > -               if (flushed)

> > > > -                       return XFS_ERROR(error ? error : ENOSPC);

> > > > +               if (flushed) {

> > > > +                       if (error == 0 || error == EPDQUOT)

> > > > +                               error = ENOSPC;

> > > > +                       return XFS_ERROR(error);

> > > > +               }

> > > >

> > > >                 if (error == ENOSPC) {

> > > >                         xfs_iunlock(ip, XFS_ILOCK_EXCL);

> >

> > This xfs_iomap_write_delay() looks like this now:

> >

> >         /*

> >          * If bmapi returned us nothing, we got either ENOSPC or EDQUOT. Retry

> >          * without EOF preallocation.

> >          */

> >         if (nimaps == 0) {

> >                 trace_xfs_delalloc_enospc(ip, offset, count);

> >                 if (prealloc) {

> >                         prealloc = 0;

> >                         error = 0;

> >                         goto retry;

> >                 }

> >                 return XFS_ERROR(error ? error : ENOSPC);

> >         }

> >

> > The flushing is now way up in xfs_file_buffered_aio_write(), and the

> > implementation of xfs_flush_inodes() has changed as well. Hence it

> > may or may not behave differently not....

>   OK, so I tested latest XFS tree and changes by commit 9aa05000 (changing

> xfs_flush_inodes()) indeed improve the performace from those ~6 minutes to

> ~6 seconds which is good enough I believe. Thanks for the pointer! I was

> thinking for a while why sync_inodes_sb() is so much faster than the

> original XFS implementation and I believe it's because we don't force the

> log on each sync now.



I think it's detter because we now don't scan every inode in the

inode cache doing a mapping_tagged(PAGECACHE_TAG_DIRTY) check to see

if they are dirty or not to decide whether it needs writeback. The

overhead of doing that adds up very quickly when you have lots of

cached inodes and you are scanning them for every page that is

written to....



Cheers,



Dave.

--

Dave Chinner

david@xxxxxxxxxxxxx






_______________________________________________

xfs mailing list

xfs@xxxxxxxxxxx

http://oss.sgi.com/mailman/listinfo/xfs



_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs