Re: [LSF/MM/BPF TOPIC] Improving Block Layer Tracepoints for Next-Generation Backup Systems

Thank you for the detailed explanation about write cache behavior and
data persistence.
I understand now that:

1. Without explicit flush commands, there's no reliable way to know
when data is actually persisted
2. The delay we observed (3-7 minutes) is due to the device's
write cache policy
3. For guaranteed persistence, we need to either:
   - Use explicit flush commands (though this impacts performance)
   - Disable write cache (with significant performance impact)
   - Rely on filesystem-level journaling

We'll explore using filesystem sync operations for critical
consistency points while maintaining the write cache for general
operations.
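
As a first step, the rough idea is to force a filesystem-wide sync at
each consistency point before reading the tracked sectors, along these
lines (a minimal sketch only, not our final implementation; the mount
point path is a placeholder and error handling is simplified):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Force a filesystem-wide sync at a consistency point before reading
 * the tracked sectors.  "/mnt/data" below is a placeholder for the
 * protected mount point.
 */
static int sync_consistency_point(const char *mountpoint)
{
        int fd = open(mountpoint, O_RDONLY | O_DIRECTORY);

        if (fd < 0) {
                perror("open");
                return -1;
        }

        /*
         * syncfs() writes back dirty data and metadata for this
         * filesystem; with a journaling filesystem and barriers
         * enabled this should also result in a device cache flush
         * (our assumption, to be verified per filesystem).
         */
        if (syncfs(fd) < 0) {
                perror("syncfs");
                close(fd);
                return -1;
        }

        close(fd);
        return 0;
}

int main(void)
{
        return sync_consistency_point("/mnt/data") ? 1 : 0;
}

Where a consistency point only involves a handful of files,
fsync()/fdatasync() on those files would be the lighter-weight option.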


On Mon, 6 Jan 2025 at 07:24, Damien Le Moal <dlemoal@xxxxxxxxxx> wrote:
>
> On 1/5/25 2:52 AM, Vishnu ks wrote:
> > Thank you all for your valuable feedback. I'd like to provide more
> > technical context about our implementation and the specific challenges
> > we're facing.
> >
> > System Architecture:
> > We've built a block-level continuous data protection system that:
> > 1. Uses eBPF to monitor block_rq_complete tracepoint to track modified sectors
> > 2. Captures sector numbers (not data) of changed blocks in real-time
> > 3. Periodically syncs the actual data from these sectors based on
> > configurable RPO
> > 4. Layers these incremental changes on top of base snapshots
> >
> > Current Implementation:
> > - eBPF program attached to block_rq_complete tracks sector ranges from
> > bio requests
> > - Changed sector numbers are transmitted to a central dispatcher via websocket
> > - Dispatcher initiates periodic data sync (1-2 min intervals)
> > requesting data from tracked sectors
> > - Base snapshot + incremental changes provide point-in-time recovery capability
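
For concreteness, the change-tracking hook described above is roughly
the following (a minimal CO-RE sketch, not our production code; it
assumes a libbpf build with a generated vmlinux.h, and the map name and
event layout are placeholders):

/* change_track.bpf.c */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

struct sector_event {
        u64 sector;
        u32 nr_sectors;
};

struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 1 << 20);
} events SEC(".maps");

SEC("tp_btf/block_rq_complete")
int BPF_PROG(on_rq_complete, struct request *rq, blk_status_t error,
             unsigned int nr_bytes)
{
        struct sector_event *e;

        if (error)
                return 0;

        e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
        if (!e)
                return 0;

        /* Starting sector of the completed request. */
        e->sector = BPF_CORE_READ(rq, __sector);
        /* Bytes completed by this event, in 512-byte sectors. */
        e->nr_sectors = nr_bytes >> 9;

        bpf_ringbuf_submit(e, 0);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";

The userspace side (not shown) drains the ring buffer and forwards the
sector ranges to the dispatcher.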
> >
> > @Christoph: Regarding stability concerns - we're not using tracepoints
> > for data integrity, but rather for change detection. The actual data
> > synchronization happens through standard block device reads.
> >
> > Technical Challenge:
> > The core issue we've identified is the gap between write completion
> > notification and data availability:
> > - block_rq_complete tracepoint triggers before data is actually
> > persisted to disk
>
> Then do a flush, or disable the write cache on the device (which can totally
> kill write performance depending on the device). Nothing new here. File systems
> have journaling for this reason (among others).
>
> > - Reading sectors immediately after block_rq_complete often returns stale data
>
> That is what POSIX mandates and also what most storage protocols specify (SCSI,
> ATA, NVMe): reading sectors that were just written gives you back what you just
> wrote, regardless of the actual location of the data on the device (persisted
> to non-volatile media or not).
>
> > - Observed delay between completion and actual disk persistence ranges
> > from 3-7 minutes
>
> That depends on how often/when/how the drive flushes its write cache, which you
> cannot know from the host. If you want to reduce this, explicitly flush the
> device write cache more often (execute blkdev_issue_flush() or similar).
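
Understood. For our userspace agent, the closest equivalent we can see
is an fsync() on a file descriptor opened on the block device itself,
which as far as we understand ends up issuing the cache flush via
blkdev_issue_flush(). A minimal sketch (the device path is a
placeholder):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Flush the device write cache from userspace.  fsync() on an fd
 * opened on the block device itself should reach blkdev_issue_flush()
 * in the kernel.  "/dev/sdX" is a placeholder.
 */
static int flush_device_cache(const char *bdev_path)
{
        int fd = open(bdev_path, O_RDONLY);

        if (fd < 0) {
                perror("open");
                return -1;
        }

        if (fsync(fd) < 0) {
                perror("fsync");
                close(fd);
                return -1;
        }

        close(fd);
        return 0;
}

int main(void)
{
        return flush_device_cache("/dev/sdX") ? 1 : 0;
}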
>
> > - Data becomes immediately available only after unmount/sync/reboot
>
> ??
>
> You can read data that was written even without a sync/flush.
>
> > Proposed Enhancement:
> > We're looking for ways to:
> > 1. Detect when data is actually flushed to disk
>
> If you have the write cache enabled on the device, there is no device interface
> that notifies this. This simply does not exist. If you want to guarantee data
> persistence to non-volatile media on the device, issue a synchronize cache
> command (which blkdev_issue_flush() does), or sync your file system if you are
> using one. Or as mentioned already, disable the device write cache.
>
> > 2. Track the relationship between bio requests and cache flushes
>
> That is up to you. File systems do so for sync()/fsync(). Note that
> data persistence guarantees are always for write requests that have already
> completed.
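
Right. One thing we are experimenting with is tagging flush completions
in the same eBPF hook sketched above, so the dispatcher can correlate
them with earlier write completions. We still need to verify that pure
flush requests are actually visible at block_rq_complete on our
kernels, so this is only a sketch (REQ_OP_MASK is mirrored from the
kernel headers; REQ_OP_FLUSH comes from vmlinux.h):

/* To be added to the sketch above (same includes). */
#define REQ_OP_MASK 0xff    /* mirrors include/linux/blk_types.h */

static __always_inline bool rq_is_flush(struct request *rq)
{
        u32 cmd_flags = BPF_CORE_READ(rq, cmd_flags);

        /* True for standalone flush requests (REQ_OP_FLUSH). */
        return (cmd_flags & REQ_OP_MASK) == REQ_OP_FLUSH;
}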
>
> > 3. Potentially add tracepoints around such operations
>
> As Christoph said, tracepoints are not a stable ABI. So relying on tracepoints
> for tracking data persistence is really not a good idea.
>
>
> --
> Damien Le Moal
> Western Digital Research

-- 
Vishnu KS,
Opensource contributor and researcher,
https://iamvishnuks.com



