On Fri, Nov 05, 2021 at 05:19:47PM +0100, Nikola Ciprich wrote:
> > > > ok, thanks for the clarification.
>
> no problem... in the meantime, xfs_bmap finished as well,
> resulting output has 1.5GB, showing total of 25354643 groups :-O

Yeah, that'll do it. If you are on spinning disks, at ~250 extents
per btree block you're talking about a hundred thousand IOs to read
in the extent list on first access to the file after mount.

> > Though I've never heard of streaming video writes that weren't sequential ...
> > have you actually observed that via strace or whatnot?
> those are streams from many cameras, somehow multiplexed by processing software.
> The guy I communicate with, whos responsible unfortunately does not know
> many details

The multiplexing is the problem here. Look at the allocation pattern
in the trace.

680367: [872751104..872759863]: 870787280..870796039
680368: [872759864..872760423]: 870799440..870799999
680369: [872760424..872761527]: 870921888..870922991
680370: [872761528..872762079]: 870959584..870960135
680371: [872762080..872763631]: 871192144..871193695
680372: [872763632..872763647]: 871183760..871183775
680373: [872763648..872767487]: hole
680374: [872767488..872768687]: 870796040..870797239
680375: [872768688..872769887]: 870800000..870801199
680376: [872769888..872772367]: 870922992..870925471
680377: [872772368..872773559]: 870989000..870990191
680378: [872773560..872775639]: 871193696..871195775
680379: [872775640..872775679]: hole
680380: [872775680..872776231]: 870797240..870797791
680381: [872776232..872776775]: 870801200..870801743
680382: [872776776..872777847]: 870870440..870871511
680383: [872777848..872778383]: 870990192..870990727
680384: [872778384..872779727]: 871195776..871197119
680385: [872779728..872779791]: 871175064..871175127
680386: [872779792..872783871]: hole
680387: [872783872..872785519]: 870797792..870799439
680388: [872785520..872786927]: 870801744..870803151
680389: [872786928..872789671]: 870925472..870928215
680390: [872789672..872791087]: 870990728..870992143
680391: [872791088..872791991]: 871197120..871198023
680392: [872791992..872792063]: hole

Let's lay that out into sequential blocks:

Stream 1:
680367: [872751104..872759863]: 870787280..870796039
680374: [872767488..872768687]: 870796040..870797239
680380: [872775680..872776231]: 870797240..870797791
680387: [872783872..872785519]: 870797792..870799439

Stream 2:
680368: [872759864..872760423]: 870799440..870799999
680375: [872768688..872769887]: 870800000..870801199
680381: [872776232..872776775]: 870801200..870801743
680388: [872785520..872786927]: 870801744..870803151

Stream 3:
680369: [872760424..872761527]: 870921888..870922991
680376: [872769888..872772367]: 870922992..870925471
680382: [872776776..872777847]: 870870440..870871511 (discontig)
680389: [872786928..872789671]: 870925472..870928215

Stream 4:
680370: [872761528..872762079]: 870959584..870960135
680377: [872772368..872773559]: 870989000..870990191
680383: [872777848..872778383]: 870990192..870990727
680390: [872789672..872791087]: 870990728..870992143

Stream 5:
680371: [872762080..872763631]: 871192144..871193695
680378: [872773560..872775639]: 871193696..871195775
680384: [872778384..872779727]: 871195776..871197119
680391: [872791088..872791991]: 871197120..871198023

Stream 6:
680372: [872763632..872763647]: 871183760..871183775
680373: [872763648..872767487]: hole (contig with 680372)
680379: [872775640..872775679]: hole
680385: [872779728..872779791]: 871175064..871175127
680386: [872779792..872783871]: hole (contig with 680385)
680392: [872791992..872792063]: hole
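That grouping can also be scripted as a quick check - below is a
rough, illustrative-only sketch (the script name and output labels
are made up for the example) that reads plain xfs_bmap output on
stdin and chains together extents whose physical start continues a
previous extent's physical end. Holes are skipped, and a genuinely
discontiguous extent (like the one marked "(discontig)" in Stream 3)
simply starts a new chain:

#!/usr/bin/env python3
# Rough sketch: chain physically contiguous extents from plain
# "xfs_bmap <file>" output, reproducing the per-stream layout above.
import re
import sys

# matches e.g. "680367: [872751104..872759863]: 870787280..870796039"
EXTENT = re.compile(r'^\s*\d+:\s*\[\d+\.\.\d+\]:\s*(\d+)\.\.(\d+)')

chains = []           # each chain is a list of raw xfs_bmap lines
next_start = {}       # physical block a chain expects to continue at -> chain

for line in sys.stdin:
    m = EXTENT.match(line)
    if not m:
        continue      # skip holes and header lines for this illustration
    pstart, pend = int(m.group(1)), int(m.group(2))
    chain = next_start.pop(pstart, None)
    if chain is None:
        chain = []    # no chain ends where this extent starts: new stream
        chains.append(chain)
    chain.append(line.rstrip())
    next_start[pend + 1] = chain

for i, chain in enumerate(chains, 1):
    print("Stream %d:" % i)
    for ext in chain:
        print("  " + ext)

Run it as something like "xfs_bmap /path/to/file | python3
chain_extents.py" (the file name is arbitrary).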
The reason I point this out is that the way the XFS allocator works
is that it peels off a chunk of the longest free extent on every new
physical allocation for non-contiguous file offsets. Hence when we
see this physical allocation pattern:

680367: [872751104..872759863]: 870787280..870796039
680374: [872767488..872768687]: 870796040..870797239
680380: [872775680..872776231]: 870797240..870797791
680387: [872783872..872785519]: 870797792..870799439

it indicates the order in which the writes are occurring. Hence it
would appear that the application is doing sparse writes for chunks
in the file, and that it then goes back and partially fills the
holes later with another run of sparse writes. Eventually all the
holes are filled, but you end up with a fragmented file. This is
actually by design - the XFS allocator is optimised for efficient
write IO (i.e. it sequentialises writes as much as possible) rather
than optimal read IO.

From the allocation pattern, I suspect there are 6 cameras in this
multiplexer setup: each sample time that needs to store an image has
a frame from each camera, and a series of frames is written per
camera before writing the next set of frames from the next camera.
Hence the allocation pattern on disk is effectively sequential for
each camera stream as it is written, but when viewed as a
multiplexed file it's extremely fragmented because the individual
camera streams are interleaved.

> > What might be happening is that if you are streaming multiple
> > files into a single directory at the same time, it competes for
> > the allocator, and they will interleave.
> >
> > XFS has an allocator mode called "filestreams" which was
> > designed just for this (video ingest).

Won't do anything - that's for ensuring "file per frame" video
ingest places all the files for a given video stream contiguously in
an AG. This looks like "multiple cameras and many frames per file",
which means the filestreams code will not trigger or do anything
different here.

> anyways I'll rather preallocate files fully for now, it takes a
> lot of time, but should be the safest way before we know what
> exactly is wrong..

That may well cause serious problems for camera data ingest, because
it forces the ingest write IO pattern to be non-contiguous rather
than sequential: with the file fully preallocated, the interleaved
logical offsets are pinned to fixed physical locations instead of
being sequentialised by the allocator. Hence instead of larger,
sequentialised writes per incoming data set as the above pattern
suggests, preallocation will turn the workload into many more
smaller, sparse write IOs that cannot merge. This will increase
write IO latency and reduce the amount of data that can be written
to disk. The likely result of this is that it will reduce the number
of cameras that can be supported per spinning disk.

I would suggest that the best solution is to rotate camera data
files at a much smaller size so that the extent list doesn't get too
large. e.g. max file size is 1TB, keep historic records in 500x1TB
files instead of one single 500TB file...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
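For illustration only, a minimal sketch of the size-based rotation
suggested above; the RollingWriter class and its names are invented
for the example, not an existing API, and a real ingest application
would do this inside its own write path:

#!/usr/bin/env python3
# Rough sketch of size-based rotation: append incoming data to
# numbered files and start a new file once the current one reaches
# max_bytes, so no single file accumulates a huge extent list.
import os

class RollingWriter:
    def __init__(self, basename, max_bytes=1 << 40):   # 1 TiB cap by default
        self.basename = basename
        self.max_bytes = max_bytes
        self.index = 0
        self.written = 0
        self.fd = self._open_next()

    def _open_next(self):
        path = "%s.%06d" % (self.basename, self.index)
        self.index += 1
        self.written = 0
        return os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

    def write(self, buf):
        # rotate before the write that would push us past the cap
        if self.written and self.written + len(buf) > self.max_bytes:
            os.close(self.fd)
            self.fd = self._open_next()
        n = os.write(self.fd, buf)
        self.written += n
        return n

    def close(self):
        os.close(self.fd)

The ingest process would keep one such writer per output stream
(e.g. RollingWriter("/data/cam-mux/stream")) and just call write();
the cap bounds how large any single file's extent list can grow
while keeping the write pattern itself purely append/sequential.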