On 1/5/25 2:52 AM, Vishnu ks wrote:
> Thank you all for your valuable feedback. I'd like to provide more
> technical context about our implementation and the specific challenges
> we're facing.
>
> System Architecture:
> We've built a block-level continuous data protection system that:
> 1. Uses eBPF to monitor the block_rq_complete tracepoint to track modified sectors
> 2. Captures sector numbers (not data) of changed blocks in real-time
> 3. Periodically syncs the actual data from these sectors based on a
> configurable RPO
> 4. Layers these incremental changes on top of base snapshots
>
> Current Implementation:
> - eBPF program attached to block_rq_complete tracks sector ranges from
> bio requests
> - Changed sector numbers are transmitted to a central dispatcher via websocket
> - Dispatcher initiates periodic data sync (1-2 min intervals)
> requesting data from tracked sectors
> - Base snapshot + incremental changes provide point-in-time recovery capability
>
> @Christoph: Regarding stability concerns - we're not using tracepoints
> for data integrity, but rather for change detection. The actual data
> synchronization happens through standard block device reads.
>
> Technical Challenge:
> The core issue we've identified is the gap between write completion
> notification and data availability:
> - block_rq_complete tracepoint triggers before data is actually
> persisted to disk

Then do a flush, or disable the write cache on the device (which can
totally kill write performance depending on the device). Nothing new
here. File systems have journaling for this reason (among others).

> - Reading sectors immediately after block_rq_complete often returns stale data

That is what POSIX mandates and also what most storage protocols
specify (SCSI, ATA, NVMe): reading sectors that were just written gives
you back what you just wrote, regardless of the actual location of the
data on the device (persisted to non-volatile media or not).

> - Observed delay between completion and actual disk persistence ranges
> from 3-7 minutes

That depends on how often/when/how the drive flushes its write cache,
which you cannot know from the host. If you want to reduce this,
explicitly flush the device write cache more often (execute
blkdev_issue_flush() or similar).

> - Data becomes immediately available only after unmount/sync/reboot

?? You can read data that was written even without a sync/flush.

> Proposed Enhancement:
> We're looking for ways to:
> 1. Detect when data is actually flushed to disk

If you have the write cache enabled on the device, there is no device
interface that notifies this. This simply does not exist. If you want
to guarantee data persistence to non-volatile media on the device,
issue a synchronize cache command (which blkdev_issue_flush() does), or
sync your file system if you are using one. Or, as mentioned already,
disable the device write cache.

> 2. Track the relationship between bio requests and cache flushes

That is up to you to do. File systems do so for sync()/fsync(). Note
that data persistence guarantees are always for write requests that
have already completed.

> 3. Potentially add tracepoints around such operations

As Christoph said, tracepoints are not a stable ABI. So relying on
tracepoints for tracking data persistence is really not a good idea.

--
Damien Le Moal
Western Digital Research
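
To make the flush advice above concrete, here is a minimal user-space
sketch (not code from this thread; the device path is a placeholder).
On Linux, fsync() on an open block device file descriptor writes out
any cached pages for the device and then issues a device cache flush
(blkdev_issue_flush() in the kernel), so writes that had already
completed are on non-volatile media once fsync() returns.

/* flush_bdev.c - minimal sketch, not from this thread.
 * Force a device write cache flush from user space: fsync() on a
 * block device fd ends up in blkdev_issue_flush() in the kernel.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/sdX", O_RDWR);	/* placeholder device */

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Writes that completed before this point are persisted to
	 * non-volatile media once fsync() returns successfully.
	 */
	if (fsync(fd) < 0) {
		perror("fsync");
		close(fd);
		return 1;
	}

	close(fd);
	return 0;
}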
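
On point 2 (tracking flushes): one way to observe when cache flushes
actually complete at the device is to watch the same block_rq_complete
tracepoint for requests whose rwbs field starts with 'F'
(flush/preflush requests). The sketch below is illustrative only; the
context struct layout is copied from the tracepoint's format file
(/sys/kernel/debug/tracing/events/block/block_rq_complete/format) on
one kernel version and must be re-checked on every target kernel,
since, as said above, tracepoints are not a stable ABI.

/* flush_complete.bpf.c - hedged sketch, not code from this thread.
 * Logs block_rq_complete events for flush/preflush requests.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Layout taken from the tracepoint format file; verify per kernel. */
struct block_rq_complete_args {
	unsigned long long __common;	/* common_* tracepoint fields */
	unsigned int dev;
	unsigned int __pad;
	unsigned long long sector;
	unsigned int nr_sector;
	int error;
	char rwbs[8];
	/* __data_loc char[] cmd follows, unused here */
};

SEC("tracepoint/block/block_rq_complete")
int log_flush_completions(struct block_rq_complete_args *ctx)
{
	/* 'F' as the first rwbs character marks a (pre)flush request. */
	if (ctx->rwbs[0] == 'F')
		bpf_printk("flush completed: dev=%u error=%d",
			   ctx->dev, ctx->error);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";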