On Mon, 2018-07-09 at 15:50 +0200, Andreas Klauer wrote:
> On Mon, Jul 09, 2018 at 03:00:27PM +0200, Michael Niewöhner wrote:
> > I think both problems can be solved by keeping track of used blocks by upper
> > layer in a bitmap-like structure in metadata.
>
> It's similar to the write-intent bitmap, which already provides logic
> to sync only some areas when re-adding drives. Maybe it can be re-used
> or combined into one.

Yes, that was one of my ideas, too.

> However, that bitmap tends to have rather low resolution (e.g. 64M chunk)
> and then you may have a problem.
>
> If you write 1M of data, you have to set the bit;
> but if you trim 1M of data, you can't clear the bit.
> Because you don't know, was the same 1M data trimmed?
>
> So the question is how often do filesystems issue small-ish TRIM requests,
> I expect that to happen when using discard flag, or filesystem optimized
> the fstrim case to not re-trim previously trimmed areas.

fstrim keeps trimmed blocks in an in-memory cache. After a reboot it will
trim already-trimmed blocks again.

> So even if the filesystem sees a large free area it might still trim
> only a small part of it and you can't clear bits.
>
> So you need a high resolution bitmap or keep outright blocklists
> or just hope that large TRIM requests will come that help you
> clear out those bits.

IMHO only a blocklist / 1:1 bitmap would make sense in this case.
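To make the resolution problem concrete, here is a toy sketch (Python; the
names and the 64M chunk size are made up for illustration, this is not the
actual write-intent bitmap code). A 1M write always sets the chunk's bit, but
a 1M trim can never clear it, because the rest of the chunk may still hold
data - only a trim covering the whole chunk could:

CHUNK = 64 * 1024 * 1024  # one bit per 64M, like a coarse write-intent bitmap

class CoarseBitmap:
    def __init__(self, dev_size):
        self.bits = [False] * ((dev_size + CHUNK - 1) // CHUNK)

    def write(self, offset, length):
        # any write, however small, dirties every chunk it touches
        for c in range(offset // CHUNK, (offset + length - 1) // CHUNK + 1):
            self.bits[c] = True

    def trim(self, offset, length):
        # a bit may only be cleared when the trim covers the whole chunk;
        # otherwise we cannot know whether the rest of the chunk holds data
        for c in range(offset // CHUNK, (offset + length) // CHUNK):
            if offset <= c * CHUNK and (c + 1) * CHUNK <= offset + length:
                self.bits[c] = False

bm = CoarseBitmap(1024 * 1024 * 1024)
bm.write(0, 1024 * 1024)   # 1M write: bit 0 gets set
bm.trim(0, 1024 * 1024)    # 1M trim:  bit 0 has to stay set
print(bm.bits[0])          # True - chunk 0 still counts as "in use"
bm.trim(0, CHUNK)          # only trimming the whole chunk clears it
print(bm.bits[0])          # False

With a 1:1 bitmap the trim above could simply clear the bits of the trimmed
blocks, but then every data write also means a bitmap write next to it, which
is exactly the two-writes problem quoted below.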
> > One problem I see is that every write will mean two writes: bitmap and data.
> > Maybe the bitmap could be hold in-memory and synced to disk periodically
> > e.g.
> > every 5 seconds? Other ideas welcome..
>
> Well, you have to set the bit immediately or you're going to lose data,
> if there is a power failure but bitmap still says "no data there".
> The write-intent bitmap sets the bit first, then writes data.
> Perhaps clearing the bit could be delayed... not sure if it helps.

One would lose data when the power failure happens between both writes - that
is the same problem as the raid5 write-hole, isn't it?

> You could build a RAID on top of loop devices backed by filesystem
> that supports punchhole. In that case reading zeroes from trimmed
> areas would not be physical I/O.
>
> But it would give you a ton of filesystem overhead...

My current setup is:  ZFS -> ZVOL -> luks -> lvm -> ext4
I want to have:       md-integrity -> md-raid5 -> lvm -> ext4
Your setup would be:  md-integrity -> ext4 -> loop -> md-raid5 -> lvm -> ext4

IMHO that is a very bad idea regarding performance. (A rough sketch of the
punch-hole mechanism is appended at the end of this mail.)

> You can skip the bitmap idea altogether if there is a way to force
> all filesystems to trim all free space in one shot. Since you only
> need the information on re-sync, you could start the re-sync first,
> then trim everything, and skip the areas that were just trimmed.

That would require a bitmap, too - at least an in-memory one. md-raid needs
to know which blocks need to be synced.
BUT: trimming all the free space in one shot on a degraded array is a VERY
BAD idea since it increases the probability of a second disk failure and with
it the risk of losing all data! discard, or at least a periodic fstrim, would
be the "way to go".

> But this requires all of the RAID to be mounted and trimmed as
> the re-sync is happening, so all filesystems need to cooperate,
> and storage layers like LVM need special treatment too to trim
> free PV space.

LVM passes trim down to the lower layers, so no problem here.

> This is unusual since by default, RAID does not involve itself
> with whatever is stored on top...

Full ACK. I'd like to keep this feature as independent as possible from the
upper layer(s).

> Regards
> Andreas Klauer
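PS: regarding the loop device / punch-hole idea above, here is a rough sketch
of the hole punching itself (Python via ctypes; assumes Linux with glibc and
a backing filesystem that supports FALLOC_FL_PUNCH_HOLE like ext4 or xfs; the
path and the offsets are made up). A TRIM hitting the loop device could be
turned into such a hole punch, so reads of the trimmed range return zeroes
without physical I/O:

import ctypes
import ctypes.util
import os

FALLOC_FL_KEEP_SIZE  = 0x01   # keep the apparent file size
FALLOC_FL_PUNCH_HOLE = 0x02   # deallocate the byte range

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

def punch_hole(path, offset, length):
    # deallocate [offset, offset+length) in the loop backing file;
    # later reads of that range come back as zeroes from the hole
    fd = os.open(path, os.O_RDWR)
    try:
        ret = libc.fallocate(fd,
                             FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                             ctypes.c_int64(offset),
                             ctypes.c_int64(length))
        if ret != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        os.close(fd)

# hypothetical backing file of one RAID member: 64M trimmed at offset 1M
punch_hole("/srv/backing/disk0.img", 1024 * 1024, 64 * 1024 * 1024)

Each member would live on its own filesystem then, so I still think the
overhead is not worth it compared to teaching md about trimmed regions
directly.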