Re: file journal fadvise

On Mon, 1 Dec 2014, Mark Nelson wrote:
> On 11/30/2014 09:26 PM, Sage Weil wrote:
> > On Mon, 1 Dec 2014, ??? wrote:
> > > Hi sage:
> > >   fadvise_random only changes the file readahead, so I don't think it
> > > makes sense for xfs. Because xfs is not like btrfs, journal writes
> > > always go to the old place (where the file was first allocated). We
> > > can only make that place contiguous.
> > 
> > I'm thinking of the OSD journal, which can be a regular file.  I guess it
> > would probably be an allocator mode, set via a XFS_XFLAG_* flag passed to
> > an ioctl, which makes the delayed allocation especially unconcerned with
> > keeping blocks contiguous.  It would need to be combined with the discard
> > ioctl so that any journal write can be allocated wherever it is most
> > convenient (hopefully contiguous to some other write).
> > 
> > sage
> 
> Hi Sage,
> 
> Could you quickly write down the steps you are thinking we'd take to implement
> this?  I'm concerned about the amount of overhead this could cause but I want
> to make sure I'm thinking about it correctly.  I'm especially interested in
> when trim happens and what you expect to happen at the FS and device levels.

1- set journal_discard = true
2- add journal_preallocate = true config option, set it to false, and make 
the fallocate(2) call on journal create conditional on that.
3- test with defaults (discard = false, preallocate = true) and 
compare it to discard = true + preallocate = false (with file journal).
4- possibly add a call to set extsize to something small on the journal 
file.  Or give xfs some other appropriate hint, if one exists.
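
A minimal sketch of what steps 1, 2 and 4 could look like at the syscall
level, in plain C rather than the actual OSD code.  The function names are
hypothetical.  Using fallocate(2) hole punching as the "discard" for a
file-backed journal is an assumption (the existing journal_discard option
may only cover block devices), and the extent size hint goes through the
generic FS_IOC_FSSETXATTR ioctl, which is expected to fail on filesystems
that do not support it.

```c
#define _GNU_SOURCE             /* fallocate(2) and FALLOC_FL_* flags */
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <linux/fs.h>           /* struct fsxattr, FS_IOC_FS{GET,SET}XATTR */
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Step 4 (hypothetical helper): request a small extent size hint.
 * On XFS this has to happen before any blocks are allocated to the file,
 * and `bytes` should be a multiple of the filesystem block size.
 * Returns 0 on success, -1 with errno set otherwise (e.g. non-XFS). */
int journal_set_extsize_hint(int fd, unsigned int bytes)
{
    struct fsxattr fsx;

    if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
        return -1;
    fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
    fsx.fsx_extsize = bytes;
    return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}

/* Step 2 (hypothetical helper): create the journal file and make the
 * fallocate(2) call conditional on `preallocate`.  Without preallocation
 * the file is only sized with ftruncate(2), so it starts out sparse and
 * the allocator decides where blocks go at write time.
 * Returns an open fd, or -1 with errno set. */
int journal_create(const char *path, off_t size, int preallocate)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    int r = preallocate ? fallocate(fd, 0, 0, size)
                        : ftruncate(fd, size);
    if (r < 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/* Step 1, file-journal analogue (assumption): release the blocks behind
 * already-committed journal entries so later writes to that range get
 * freshly allocated blocks.  The file size is left unchanged.
 * Returns 0 on success, -1 with errno set otherwise. */
int journal_discard_range(int fd, off_t offset, off_t len)
{
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}
```

For the comparison in step 3, the "new" configuration would roughly be
journal_create(path, size, 0), followed by an optional
journal_set_extsize_hint() before the first write and
journal_discard_range() on each region as it is trimmed.  The default
configuration is journal_create(path, size, 1) with no discards.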

sage

> 
> Mark
> 
> > 
> > 
> > > 
> > > Thanks!
> > > Jianpeng
> > > 
> > > 2014-12-01 2:46 GMT+08:00 Sage Weil <sweil@xxxxxxxxxx>:
> > > > Currently, when an OSD journal is stored as a file, we preallocate it as a
> > > > large contiguous extent.  That means that for every journal write we're
> > > > seeking back to wherever the journal is.  That's possibly not ideal for
> > > > writes.  For reads it's great, but that's the last thing we care about
> > > > optimizing (we only read the journal after a failure, which is very rare).
> > > > 
> > > > I wonder if we would do better if we:
> > > > 
> > > >   1- trim/discard the old journal contents,
> > > >   2- posix_fadvise RANDOM
> > > > 
> > > > I'm not sure what the XFS behavior is in this case, but ideally it seems
> > > > what we want it to do is write the journal wherever on disk it is most
> > > > convenient... ideally contiguous with some other write that it is already
> > > > doing.  If fadvise random doesn't do that, perhaps there is another
> > > > allocator hint we can give it that will get us that behavior...
> > > > 
> > > > sage
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html