Re: file journal fadvise

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 2 Dec 2014 11:32:39 +1100

On Mon, Dec 01, 2014 at 04:12:03PM -0800, Sage Weil wrote:
> On Tue, 2 Dec 2014, Dave Chinner wrote:
> > On Mon, Dec 01, 2014 at 04:31:18PM -0600, Mark Nelson wrote:
> > > 
> > > 
> > > On 12/01/2014 01:23 PM, Sage Weil wrote:
> > > >On Mon, 1 Dec 2014, Mark Nelson wrote:
> > > >>On 11/30/2014 09:26 PM, Sage Weil wrote:
> > > >>>On Mon, 1 Dec 2014, ??? wrote:
> > > >>>>Hi sage:
> > > >>>>   fadvise_random only changes the file's readahead. I think it makes
> > > >>>>no sense for xfs,
> > > >>>>because xfs isn't like btrfs: journal writes always go to the old place
> > > >>>>(where first allocated). We can only make those places contiguous.
> > > >>>
> > > >>>I'm thinking of the OSD journal, which can be a regular file.  I guess it
> > > >>>would probably be an allocator mode, set via a XFS_XFLAG_* flag passed to
> > > >>>an ioctl, which makes the delayed allocation especially unconcerned with
> > > >>>keeping blocks contiguous.  It would need to be combined with the discard
> > > >>>ioctl so that any journal write can be allocated wherever it is most
> > > >>>convenient (hopefully contiguous to some other write).
> > > >>>
> > > >>>sage
> > > >>
> > > >>Hi Sage,
> > > >>
> > > >>Could you quickly write down the steps you are thinking we'd take to implement
> > > >>this?  I'm concerned about the amount of overhead this could cause but I want
> > > >>to make sure I'm thinking about it correctly. Especially when trim happens and
> > > >>what you think/expect to happen at the FS and device levels.
> > > >
> > > >1- set journal_discard = true
> > > >2- add journal_preallocate = true config option, set it to false, and make
> > > >the fallocate(2) call on journal create conditional on that.
> > > >3- test with defaults (discard = false, preallocate = true) and
> > > >compare it to discard = true + preallocate = false (with file journal).
> > > >4- possibly add a call to set extsize to something small on the journal
> > > >file.  Or give xfs some other appropriate hint, if one exists.
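
Steps 2 and 3 above could be prototyped roughly as follows. This is only a
minimal sketch in Python, not the OSD's actual code: `create_journal` and its
`preallocate` argument are hypothetical stand-ins for the proposed
journal_preallocate option, and the discard side (step 1) is not shown.

```python
import os
import tempfile


def create_journal(path, size, preallocate=True):
    """Create a journal file of `size` bytes and return its stat result.

    `preallocate` is a hypothetical stand-in for the proposed
    journal_preallocate option:
      True  -> reserve the blocks up front with posix_fallocate
      False -> only extend the file (sparse), leaving block placement
               to the filesystem at write time
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        if preallocate:
            os.posix_fallocate(fd, 0, size)
        else:
            os.ftruncate(fd, size)
    finally:
        os.close(fd)
    return os.stat(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        for flag in (True, False):
            st = create_journal(os.path.join(d, f"journal-{flag}"), 1 << 20, flag)
            # st_blocks shows how much space was actually reserved.
            print(flag, st.st_size, st.st_blocks)
```
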
> > 
> > What behaviour are you wanting for a journal file? It sounds like
> > you want it to behave like a wandering log: automatically allocating
> > its next block wherever the previous write of any kind occurred?
> 
> Precisely.  Well, as long as it is adjacent to *some* other scheduled 
> write, it would save us a seek.  The real question, I guess, is whether 
> there is an XFS allocation mode that makes no attempt to avoid 
> fragmentation for the file and that chooses something adjacent to other 
> small, newly-written data during delayed allocation.

Ok, so what is the most common underlying storage you need to
optimise for? Is it raid5/6 where a small write will trigger a
larger RMW cycle and so proximity rather than exact adjacency
matters, or is it raid 0/1/jbod where exact adjacency is the only
way to avoid a seek?

I suspect that we can play certain tricks to trigger unaligned,
discontiguous allocation (i.e. no target allocation block), but the
question is whether we can determine sufficient
allocation/writeback context to enable delayed allocation to make
sensible "next written block" decisions.

> > We can't actually do that in XFS - we have no idea where the last
> > write IO occurred because that's several layers down the IO stack.
> > We could store where the last allocation was, but that doesn't
> > guarantee we can allocate another block contiguously to that. Even
> > if we do, that then fragments whatever file the journal block now
> > sits adjacent to.
> > 
> > The other issue is that block allocation is divided up into
> > allocation groups, and allocation is mostly siloed to avoid randomly
> > allocating a file into different AGs. Just randomly allocating
> > blocks to a file is the polar opposite of everything the XFS
> > allocation strategies do, hence a bit more clarity on what the
> > overall goal is would be helpful. ;)
> 
> It's a circular file, usually a few GB in size, written sequentially with 
> a range of small to large (block-aligned) write sizes, and (for all 
> intents and purposes) is never read.  We periodically overwrite the first 
> block with recent start and end pointers and other metadata.

Ok, so it's just another typical WAL file. ;)
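
For reference, the write pattern described above looks roughly like the
sketch below. It is a minimal illustration under stated assumptions, not
Ceph's FileJournal: the class name, the 4096-byte block size, and the header
layout (two little-endian 64-bit offsets) are all hypothetical, and there is
no check that the writer overtakes `start`.

```python
import os
import struct
import tempfile

BLOCK = 4096                       # assumed block size
HEADER = struct.Struct("<QQ")      # hypothetical layout: start, end offsets


class CircularJournal:
    """Circular, block-aligned, write-only journal with a header in block 0."""

    def __init__(self, path, size):
        assert size % BLOCK == 0 and size >= 2 * BLOCK
        self.size = size
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        os.ftruncate(self.fd, size)
        self.start = self.end = BLOCK  # data region begins after the header
        self.write_header()

    def write_header(self):
        # Periodic overwrite of the first block with the current pointers.
        buf = HEADER.pack(self.start, self.end).ljust(BLOCK, b"\0")
        os.pwrite(self.fd, buf, 0)

    def append(self, payload):
        # Pad to a whole number of blocks, then write sequentially,
        # wrapping past the end of the file back to the first data block.
        nblocks = -(-len(payload) // BLOCK)
        padded = payload.ljust(nblocks * BLOCK, b"\0")
        pos = self.end
        for i in range(0, len(padded), BLOCK):
            os.pwrite(self.fd, padded[i:i + BLOCK], pos)
            pos += BLOCK
            if pos == self.size:
                pos = BLOCK        # wrap, skipping the header block
        self.end = pos
        return pos

    def read_header(self):
        return HEADER.unpack(os.pread(self.fd, HEADER.size, 0))

    def close(self):
        os.close(self.fd)
```

The only non-sequential write is the header update in block 0; everything
else advances monotonically until it wraps.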

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx