Thank you for the detailed explanation about write cache behavior and
data persistence. I understand now that:

1. Without explicit flush commands, there is no reliable way to know
   when data is actually persisted.
2. The behavior we observed (a 3-7 minute delay) is due to the device's
   write cache policy.
3. For guaranteed persistence, we need to either:
   - use explicit flush commands (though this impacts performance),
   - disable the write cache (with a significant performance impact), or
   - rely on filesystem-level journaling.

We'll explore using filesystem sync operations for critical consistency
points while maintaining the write cache for general operations.
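
Concretely, what we have in mind is something like the sketch below: an
explicit consistency point that flushes the device write cache before we
read back the tracked sectors. This is illustrative only (the device
path is a placeholder and error handling is trimmed). Our understanding
is that on Linux, fsync() on a block device fd writes back any dirty
pages and then issues a cache flush to the device, i.e. the same
blkdev_issue_flush() path you mention below.

/*
 * Illustrative sketch: force a consistency point on a block device
 * before reading the tracked sectors for a recovery point.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "/dev/sdX"; /* placeholder */
	int fd = open(dev, O_RDONLY);

	if (fd < 0) {
		perror("open");
		return EXIT_FAILURE;
	}

	/*
	 * Consistency point: once fsync() returns, every write that had
	 * already completed on this device should be on non-volatile
	 * media.
	 */
	if (fsync(fd) < 0) {
		perror("fsync");
		close(fd);
		return EXIT_FAILURE;
	}

	/* ... pread() the tracked sector ranges here ... */

	close(fd);
	return 0;
}

That way the write cache stays enabled for normal operation and we only
pay the flush cost at each RPO checkpoint.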
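
For context, the change-detection hook described below in the thread is
conceptually along the lines of this CO-RE/libbpf sketch. Again,
illustrative only and not our production program: it assumes a generated
vmlinux.h, and since tracepoint arguments and struct request internals
are not a stable ABI, it is kernel-version dependent, a risk we accept
for change detection as opposed to data integrity.

/*
 * Sketch: record completed sector ranges from the block_rq_complete
 * tracepoint into a ring buffer drained by the userspace dispatcher.
 * NOTE: reads kernel-internal struct request fields via CO-RE; this
 * is not a stable ABI and may differ across kernel versions.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

struct sector_range {
	__u64 sector;		/* first sector of the completed request */
	__u32 nr_sectors;	/* number of 512-byte sectors */
};

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 1 << 20);
} events SEC(".maps");

SEC("tp_btf/block_rq_complete")
int BPF_PROG(handle_rq_complete, struct request *rq, blk_status_t error,
	     unsigned int nr_bytes)
{
	struct sector_range *e;

	if (error)	/* ignore failed requests */
		return 0;

	/* A full tracker would also filter on the request op (writes only). */
	e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
	if (!e)
		return 0;

	e->sector = BPF_CORE_READ(rq, __sector);
	e->nr_sectors = nr_bytes >> 9;	/* bytes -> 512-byte sectors */
	bpf_ringbuf_submit(e, 0);
	return 0;
}

The dispatcher coalesces overlapping ranges and reads the actual data at
the next sync interval.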

On Mon, 6 Jan 2025 at 07:24, Damien Le Moal <dlemoal@xxxxxxxxxx> wrote:
>
> On 1/5/25 2:52 AM, Vishnu ks wrote:
> > Thank you all for your valuable feedback. I'd like to provide more
> > technical context about our implementation and the specific challenges
> > we're facing.
> >
> > System Architecture:
> > We've built a block-level continuous data protection system that:
> > 1. Uses eBPF to monitor the block_rq_complete tracepoint to track
> >    modified sectors
> > 2. Captures the sector numbers (not the data) of changed blocks in
> >    real time
> > 3. Periodically syncs the actual data from these sectors based on a
> >    configurable RPO
> > 4. Layers these incremental changes on top of base snapshots
> >
> > Current Implementation:
> > - An eBPF program attached to block_rq_complete tracks sector ranges
> >   from bio requests
> > - Changed sector numbers are transmitted to a central dispatcher via
> >   websocket
> > - The dispatcher initiates a periodic data sync (1-2 minute intervals),
> >   requesting the data from the tracked sectors
> > - The base snapshot plus incremental changes provide point-in-time
> >   recovery capability
> >
> > @Christoph: Regarding stability concerns - we're not using tracepoints
> > for data integrity, but rather for change detection. The actual data
> > synchronization happens through standard block device reads.
> >
> > Technical Challenge:
> > The core issue we've identified is the gap between write completion
> > notification and data availability:
> > - The block_rq_complete tracepoint triggers before data is actually
> >   persisted to disk
>
> Then do a flush, or disable the write cache on the device (which can
> totally kill write performance depending on the device). Nothing new
> here. File systems have journaling for this reason (among others).
>
> > - Reading sectors immediately after block_rq_complete often returns
> >   stale data
>
> That is what POSIX mandates and also what most storage protocols
> specify (SCSI, ATA, NVMe): reading sectors that were just written gives
> you back what you just wrote, regardless of the actual location of the
> data on the device (persisted to non-volatile media or not).
>
> > - The observed delay between completion and actual disk persistence
> >   ranges from 3 to 7 minutes
>
> That depends on how often/when/how the drive flushes its write cache,
> which you cannot know from the host. If you want to reduce this,
> explicitly flush the device write cache more often (execute
> blkdev_issue_flush() or similar).
>
> > - Data becomes immediately available only after an unmount/sync/reboot
>
> ??
>
> You can read data that was written even without a sync/flush.
>
> > Proposed Enhancement:
> > We're looking for ways to:
> > 1. Detect when data is actually flushed to disk
>
> If you have the write cache enabled on the device, there is no device
> interface that notifies this. This simply does not exist. If you want
> to guarantee data persistence to non-volatile media on the device,
> issue a synchronize cache command (which blkdev_issue_flush() does), or
> sync your file system if you are using one. Or, as mentioned already,
> disable the device write cache.
>
> > 2. Track the relationship between bio requests and cache flushes
>
> That is up to you to do. File systems do so for sync()/fsync(). Note
> that data persistence guarantees are always for write requests that
> have already completed.
>
> > 3. Potentially add tracepoints around such operations
>
> As Christoph said, tracepoints are not a stable ABI. So relying on
> tracepoints for tracking data persistence is really not a good idea.
>
> --
> Damien Le Moal
> Western Digital Research

--
Vishnu KS,
Opensource contributor and researcher,
https://iamvishnuks.com