On Mon, 2018-07-09 at 15:50 +0200, Andreas Klauer wrote:
> On Mon, Jul 09, 2018 at 03:00:27PM +0200, Michael Niewöhner wrote:
> > I think both problems can be solved by keeping track of used blocks by upper
> > layer in a bitmap-like structure in metadata.
>
> It's similar to the write-intent bitmap, which already provides logic
> to sync only some areas when re-adding drives. Maybe it can be re-used
> or combined into one.

Yes, that was one of my ideas, too.

> However, that bitmap tends to have rather low resolution (e.g. 64M chunk)
> and then you may have a problem.
>
> If you write 1M of data, you have to set the bit;
> but if you trim 1M of data, you can't clear the bit.
> Because you don't know, was the same 1M data trimmed?
>
> So the question is how often do filesystems issue small-ish TRIM requests,
> I expect that to happen when using discard flag, or filesystem optimized
> the fstrim case to not re-trim previously trimmed areas.

fstrim keeps trimmed blocks in an in-memory cache. After a reboot it will
trim already-trimmed blocks again.

> So even if the filesystem sees a large free area it might still trim
> only a small part of it and you can't clear bits.
>
> So you need a high resolution bitmap or keep outright blocklists
> or just hope that large TRIM requests will come that help you
> clear out those bits.

IMHO only a blocklist / 1:1 bitmap would make sense in this case.
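To make the resolution problem concrete, here is a toy sketch (Python; the
names and the 64M chunk size are made up for illustration, this is not the
actual write-intent bitmap code). A 1M write always sets the chunk's bit, but
a 1M trim can never clear it, because the rest of the chunk may still hold
data - only a trim covering the whole chunk could:

CHUNK = 64 * 1024 * 1024  # one bit per 64M, like a coarse write-intent bitmap

class CoarseBitmap:
    def __init__(self, dev_size):
        self.bits = [False] * ((dev_size + CHUNK - 1) // CHUNK)

    def write(self, offset, length):
        # any write, however small, dirties every chunk it touches
        for c in range(offset // CHUNK, (offset + length - 1) // CHUNK + 1):
            self.bits[c] = True

    def trim(self, offset, length):
        # a bit may only be cleared when the trim covers the whole chunk;
        # otherwise we cannot know whether the rest of the chunk holds data
        for c in range(offset // CHUNK, (offset + length) // CHUNK):
            if offset <= c * CHUNK and (c + 1) * CHUNK <= offset + length:
                self.bits[c] = False

bm = CoarseBitmap(1024 * 1024 * 1024)
bm.write(0, 1024 * 1024)   # 1M write: bit 0 gets set
bm.trim(0, 1024 * 1024)    # 1M trim:  bit 0 has to stay set
print(bm.bits[0])          # True - chunk 0 still counts as "in use"
bm.trim(0, CHUNK)          # only trimming the whole chunk clears it
print(bm.bits[0])          # False

With a 1:1 bitmap the trim above could simply clear the bits of the trimmed
blocks, but then every data write also means a bitmap write next to it, which
is exactly the two-writes problem quoted below.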
> > One problem I see is that every write will mean two writes: bitmap and data.
> > Maybe the bitmap could be hold in-memory and synced to disk periodically
> > e.g.
> > every 5 seconds? Other ideas welcome..
>
> Well, you have to set the bit immediately or you're going to lose data,
> if there is a power failure but bitmap still says "no data there".
> The write-intent bitmap sets the bit first, then writes data.
> Perhaps clearing the bit could be delayed... not sure if it helps.

One would lose data when the power failure happens between both writes - that
is the same problem as the raid5 write-hole, isn't it?

> You could build a RAID on top of loop devices backed by filesystem
> that supports punchhole. In that case reading zeroes from trimmed
> areas would not be physical I/O.
>
> But it would give you a ton of filesystem overhead...

My current setup is:  ZFS -> ZVOL -> luks -> lvm -> ext4
I want to have:       md-integrity -> md-raid5 -> lvm -> ext4
Your setup would be:  md-integrity -> ext4 -> loop -> md-raid5 -> lvm -> ext4

IMHO that is a very bad idea regarding performance. (A rough sketch of the
punch-hole mechanism is appended at the end of this mail.)

> You can skip the bitmap idea altogether if there is a way to force
> all filesystems to trim all free space in one shot. Since you only
> need the information on re-sync, you could start the re-sync first,
> then trim everything, and skip the areas that were just trimmed.

That would require a bitmap, too - at least an in-memory one. md-raid needs
to know which blocks need to be synced.
BUT: trimming all the free space in one shot on a degraded array is a VERY
BAD idea since it increases the probability of a second disk failure and with
it the risk of losing all data! discard, or at least a periodic fstrim, would
be the "way to go".

> But this requires all of the RAID to be mounted and trimmed as
> the re-sync is happening, so all filesystems need to cooperate,
> and storage layers like LVM need special treatment too to trim
> free PV space.

LVM passes trim down to the lower layers, so no problem here.

> This is unusual since by default, RAID does not involve itself
> with whatever is stored on top...

Full ACK. I'd like to keep this feature as independent as possible from the
upper layer(s).

> Regards
> Andreas Klauer
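PS: regarding the loop device / punch-hole idea above, here is a rough sketch
of the hole punching itself (Python via ctypes; assumes Linux with glibc and
a backing filesystem that supports FALLOC_FL_PUNCH_HOLE like ext4 or xfs; the
path and the offsets are made up). A TRIM hitting the loop device could be
turned into such a hole punch, so reads of the trimmed range return zeroes
without physical I/O:

import ctypes
import ctypes.util
import os

FALLOC_FL_KEEP_SIZE  = 0x01   # keep the apparent file size
FALLOC_FL_PUNCH_HOLE = 0x02   # deallocate the byte range

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

def punch_hole(path, offset, length):
    # deallocate [offset, offset+length) in the loop backing file;
    # later reads of that range come back as zeroes from the hole
    fd = os.open(path, os.O_RDWR)
    try:
        ret = libc.fallocate(fd,
                             FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                             ctypes.c_int64(offset),
                             ctypes.c_int64(length))
        if ret != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        os.close(fd)

# hypothetical backing file of one RAID member: 64M trimmed at offset 1M
punch_hole("/srv/backing/disk0.img", 1024 * 1024, 64 * 1024 * 1024)

Each member would live on its own filesystem then, so I still think the
overhead is not worth it compared to teaching md about trimmed regions
directly.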