On Mon, Jul 09, 2018 at 03:00:27PM +0200, Michael Niewöhner wrote:
> I think both problems can be solved by keeping track of used blocks by upper
> layer in a bitmap-like structure in metadata.

It's similar to the write-intent bitmap, which already provides logic to
sync only some areas when re-adding drives. Maybe it can be re-used, or
the two combined into one.

However, that bitmap tends to have rather low resolution (e.g. a 64M
chunk) and then you have a problem: if you write 1M of data, you have to
set the bit; but if you trim 1M of data, you can't clear the bit, because
you don't know whether the rest of that chunk was trimmed too.

So the question is how often filesystems issue small-ish TRIM requests.
I expect that to happen when using the discard mount option, or when the
filesystem optimizes the fstrim case to not re-trim previously trimmed
areas. So even if the filesystem sees a large free area, it might still
trim only a small part of it, and you can't clear bits. So you need a
high-resolution bitmap, or keep outright blocklists, or just hope that
large TRIM requests will come along that help you clear out those bits.

> One problem I see is that every write will mean two writes: bitmap and data.
> Maybe the bitmap could be hold in-memory and synced to disk periodically e.g.
> every 5 seconds? Other ideas welcome..

Well, you have to set the bit immediately, or you're going to lose data
if there is a power failure while the bitmap still says "no data there".
The write-intent bitmap sets the bit first, then writes the data.
Perhaps clearing the bit could be delayed... not sure if that helps.

You could build a RAID on top of loop devices backed by a filesystem
that supports hole punching. In that case, reading zeroes from trimmed
areas would not be physical I/O. But it would give you a ton of
filesystem overhead...

You can skip the bitmap idea altogether if there is a way to force all
filesystems to trim all free space in one shot.
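To make the set/clear asymmetry concrete, here is a minimal sketch of a
coarse "used blocks" bitmap. This is my own illustration, not md/raid
code: the 64M chunk size matches the example above, but the class name,
the per-chunk trimmed-byte counter, and all method names are made up.

```python
# Illustration only: a chunk-granularity "used" bitmap showing why a
# small TRIM cannot clear a bit that a small write was allowed to set.

CHUNK = 64 * 1024 * 1024  # 64M per bit, like a coarse write-intent bitmap


class UsedBitmap:
    def __init__(self, device_size):
        nchunks = (device_size + CHUNK - 1) // CHUNK
        self.bits = [False] * nchunks
        # Bytes known to be trimmed per chunk. Real code would need
        # proper range tracking; this counter ignores overlapping trims.
        self.trimmed = [0] * nchunks

    def write(self, offset, length):
        # The bit must be set *before* the data write hits the disk,
        # otherwise a power failure can leave data on disk that the
        # bitmap claims is absent.
        for c in range(offset // CHUNK, (offset + length - 1) // CHUNK + 1):
            self.bits[c] = True
            self.trimmed[c] = 0  # chunk is dirty again

    def trim(self, offset, length):
        for c in range(offset // CHUNK, (offset + length - 1) // CHUNK + 1):
            lo = max(offset, c * CHUNK)
            hi = min(offset + length, (c + 1) * CHUNK)
            self.trimmed[c] += hi - lo
            # Only when the *entire* chunk is known trimmed may the bit
            # clear; a 1M trim says nothing about the other 63M.
            if self.trimmed[c] >= CHUNK:
                self.bits[c] = False
```

A 1M write flips the chunk's bit on immediately, but the matching 1M trim
leaves it on; only a trim covering the whole 64M chunk ever clears it,
which is exactly the "hope that large TRIM requests will come" situation.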
Since you only need the information on re-sync, you could start the
re-sync first, then trim everything, and skip the areas that were just
trimmed. But this requires all of the RAID to be mounted and trimmed
while the re-sync is happening, so all filesystems need to cooperate,
and storage layers like LVM need special treatment too, to trim free PV
space. This is unusual, since by default RAID does not involve itself
with whatever is stored on top...

Regards
Andreas Klauer