On 1/5/25 2:52 AM, Vishnu ks wrote:
> Thank you all for your valuable feedback. I'd like to provide more
> technical context about our implementation and the specific challenges
> we're facing.
>
> System Architecture:
> We've built a block-level continuous data protection system that:
> 1. Uses eBPF to monitor the block_rq_complete tracepoint to track modified sectors
> 2. Captures sector numbers (not data) of changed blocks in real-time
> 3. Periodically syncs the actual data from these sectors based on a
> configurable RPO
> 4. Layers these incremental changes on top of base snapshots
>
> Current Implementation:
> - eBPF program attached to block_rq_complete tracks sector ranges from
> bio requests
> - Changed sector numbers are transmitted to a central dispatcher via websocket
> - Dispatcher initiates periodic data sync (1-2 min intervals)
> requesting data from tracked sectors
> - Base snapshot + incremental changes provide point-in-time recovery capability
>
> @Christoph: Regarding stability concerns - we're not using tracepoints
> for data integrity, but rather for change detection. The actual data
> synchronization happens through standard block device reads.
>
> Technical Challenge:
> The core issue we've identified is the gap between write completion
> notification and data availability:
> - block_rq_complete tracepoint triggers before data is actually
> persisted to disk

Then do a flush, or disable the write cache on the device (which can
totally kill write performance depending on the device). Nothing new
here. File systems have journaling for this reason (among others).

> - Reading sectors immediately after block_rq_complete often returns stale data

That is what POSIX mandates and also what most storage protocols
specify (SCSI, ATA, NVMe): reading sectors that were just written gives
you back what you just wrote, regardless of the actual location of the
data on the device (persisted to non-volatile media or not).

> - Observed delay between completion and actual disk persistence ranges
> from 3-7 minutes

That depends on how often/when/how the drive flushes its write cache,
which you cannot know from the host. If you want to reduce this,
explicitly flush the device write cache more often (execute
blkdev_issue_flush() or similar).

> - Data becomes immediately available only after unmount/sync/reboot

?? You can read data that was written even without a sync/flush.

> Proposed Enhancement:
> We're looking for ways to:
> 1. Detect when data is actually flushed to disk

If you have the write cache enabled on the device, there is no device
interface that notifies this. This simply does not exist. If you want
to guarantee data persistence to non-volatile media on the device,
issue a synchronize cache command (which blkdev_issue_flush() does), or
sync your file system if you are using one. Or, as mentioned already,
disable the device write cache.

> 2. Track the relationship between bio requests and cache flushes

That is up to you to do. File systems do so for sync()/fsync(). Note
that data persistence guarantees are always for write requests that
have already completed.

> 3. Potentially add tracepoints around such operations

As Christoph said, tracepoints are not a stable ABI. So relying on
tracepoints for tracking data persistence is really not a good idea.

--
Damien Le Moal
Western Digital Research
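
To make the flush advice above concrete, here is a minimal user-space
sketch (not code from this thread; the device path is a placeholder).
On Linux, fsync() on an open block device file descriptor writes out
any cached pages for the device and then issues a device cache flush
(blkdev_issue_flush() in the kernel), so writes that had already
completed are on non-volatile media once fsync() returns.

/* flush_bdev.c - minimal sketch, not from this thread.
 * Force a device write cache flush from user space: fsync() on a
 * block device fd ends up in blkdev_issue_flush() in the kernel.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/sdX", O_RDWR);	/* placeholder device */

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Writes that completed before this point are persisted to
	 * non-volatile media once fsync() returns successfully.
	 */
	if (fsync(fd) < 0) {
		perror("fsync");
		close(fd);
		return 1;
	}

	close(fd);
	return 0;
}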
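
On point 2 (tracking flushes): one way to observe when cache flushes
actually complete at the device is to watch the same block_rq_complete
tracepoint for requests whose rwbs field starts with 'F'
(flush/preflush requests). The sketch below is illustrative only; the
context struct layout is copied from the tracepoint's format file
(/sys/kernel/debug/tracing/events/block/block_rq_complete/format) on
one kernel version and must be re-checked on every target kernel,
since, as said above, tracepoints are not a stable ABI.

/* flush_complete.bpf.c - hedged sketch, not code from this thread.
 * Logs block_rq_complete events for flush/preflush requests.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Layout taken from the tracepoint format file; verify per kernel. */
struct block_rq_complete_args {
	unsigned long long __common;	/* common_* tracepoint fields */
	unsigned int dev;
	unsigned int __pad;
	unsigned long long sector;
	unsigned int nr_sector;
	int error;
	char rwbs[8];
	/* __data_loc char[] cmd follows, unused here */
};

SEC("tracepoint/block/block_rq_complete")
int log_flush_completions(struct block_rq_complete_args *ctx)
{
	/* 'F' as the first rwbs character marks a (pre)flush request. */
	if (ctx->rwbs[0] == 'F')
		bpf_printk("flush completed: dev=%u error=%d",
			   ctx->dev, ctx->error);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";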