On Mon, Jul 09, 2018 at 03:00:27PM +0200, Michael Niewöhner wrote:
> I think both problems can be solved by keeping track of used blocks by upper
> layer in a bitmap-like structure in metadata.

It's similar to the write-intent bitmap, which already provides logic to
sync only some areas when re-adding drives. Maybe it can be re-used, or
the two combined into one.

However, that bitmap tends to have rather low resolution (e.g. a 64M
chunk) and then you have a problem: if you write 1M of data, you have to
set the bit; but if you trim 1M of data, you can't clear the bit, because
you don't know whether the rest of that chunk was trimmed too.

So the question is how often filesystems issue small-ish TRIM requests.
I expect that to happen when using the discard mount option, or when the
filesystem optimizes the fstrim case to not re-trim previously trimmed
areas. So even if the filesystem sees a large free area, it might still
trim only a small part of it, and you can't clear bits. So you need a
high-resolution bitmap, or keep outright blocklists, or just hope that
large TRIM requests will come along that help you clear out those bits.

> One problem I see is that every write will mean two writes: bitmap and data.
> Maybe the bitmap could be hold in-memory and synced to disk periodically e.g.
> every 5 seconds? Other ideas welcome..

Well, you have to set the bit immediately, or you're going to lose data
if there is a power failure while the bitmap still says "no data there".
The write-intent bitmap sets the bit first, then writes the data.
Perhaps clearing the bit could be delayed... not sure if that helps.

You could build a RAID on top of loop devices backed by a filesystem
that supports hole punching. In that case, reading zeroes from trimmed
areas would not be physical I/O. But it would give you a ton of
filesystem overhead...

You can skip the bitmap idea altogether if there is a way to force all
filesystems to trim all free space in one shot.
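To make the set/clear asymmetry concrete, here is a minimal sketch of a
coarse "used blocks" bitmap. This is my own illustration, not md/raid
code: the 64M chunk size matches the example above, but the class name,
the per-chunk trimmed-byte counter, and all method names are made up.

```python
# Illustration only: a chunk-granularity "used" bitmap showing why a
# small TRIM cannot clear a bit that a small write was allowed to set.

CHUNK = 64 * 1024 * 1024  # 64M per bit, like a coarse write-intent bitmap


class UsedBitmap:
    def __init__(self, device_size):
        nchunks = (device_size + CHUNK - 1) // CHUNK
        self.bits = [False] * nchunks
        # Bytes known to be trimmed per chunk. Real code would need
        # proper range tracking; this counter ignores overlapping trims.
        self.trimmed = [0] * nchunks

    def write(self, offset, length):
        # The bit must be set *before* the data write hits the disk,
        # otherwise a power failure can leave data on disk that the
        # bitmap claims is absent.
        for c in range(offset // CHUNK, (offset + length - 1) // CHUNK + 1):
            self.bits[c] = True
            self.trimmed[c] = 0  # chunk is dirty again

    def trim(self, offset, length):
        for c in range(offset // CHUNK, (offset + length - 1) // CHUNK + 1):
            lo = max(offset, c * CHUNK)
            hi = min(offset + length, (c + 1) * CHUNK)
            self.trimmed[c] += hi - lo
            # Only when the *entire* chunk is known trimmed may the bit
            # clear; a 1M trim says nothing about the other 63M.
            if self.trimmed[c] >= CHUNK:
                self.bits[c] = False
```

A 1M write flips the chunk's bit on immediately, but the matching 1M trim
leaves it on; only a trim covering the whole 64M chunk ever clears it,
which is exactly the "hope that large TRIM requests will come" situation.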
Since you only need the information on re-sync, you could start the
re-sync first, then trim everything, and skip the areas that were just
trimmed. But this requires all of the RAID to be mounted and trimmed
while the re-sync is happening, so all filesystems need to cooperate,
and storage layers like LVM need special treatment too, to trim free PV
space. This is unusual, since by default RAID does not involve itself
with whatever is stored on top...

Regards
Andreas Klauer