Thank you all for your valuable feedback. I'd like to provide more technical
context about our implementation and the specific challenges we're facing.

System Architecture:

We've built a block-level continuous data protection system that:
1. Uses eBPF to monitor the block_rq_complete tracepoint and track modified sectors
2. Captures the sector numbers (not the data) of changed blocks in real time
3. Periodically syncs the actual data from those sectors based on a configurable RPO
4. Layers these incremental changes on top of base snapshots

Current Implementation:

- An eBPF program attached to block_rq_complete tracks sector ranges from bio
  requests (a simplified sketch is appended below my sign-off)
- Changed sector numbers are transmitted to a central dispatcher over a websocket
- The dispatcher initiates a periodic data sync (1-2 minute intervals) that reads
  the data from the tracked sectors
- A base snapshot plus the incremental changes provides point-in-time recovery

@Christoph: Regarding the stability concerns - we're not using tracepoints for
data integrity, only for change detection. The actual data synchronization
happens through standard block device reads.

Technical Challenge:

The core issue we've identified is the gap between the write-completion
notification and data availability:
- The block_rq_complete tracepoint fires before the data is actually persisted
  to disk
- Reading sectors immediately after block_rq_complete often returns stale data
- The observed delay between completion and actual on-disk persistence ranges
  from 3 to 7 minutes
- The new data becomes immediately readable only after an unmount, sync, or reboot

@Song: Our approach differs fundamentally from md/raid in several ways:

1. Network-based vs. local:
   - Our system operates over the network, allowing replication across
     geographically distributed systems
   - md/raid works only with locally attached storage devices
2. Replication model:
   - We use asynchronous replication with configurable RPO windows
   - md/raid requires synchronous, immediate mirroring of data
3. Recovery capabilities:
   - We provide point-in-time recovery through incremental sector tracking
   - md/raid focuses on immediate redundancy without historical state

@Zhu: The eBPF performance impact is minimal because we only track sector
numbers, not the actual data. The main overhead comes from the periodic data
sync operations.

Proposed Enhancement:

We're looking for ways to:
1. Detect when data is actually flushed to disk
2. Track the relationship between bio requests and cache flushes
3. Potentially add tracepoints around those flush operations

Questions for the community:

1. Are there existing tracepoints that could help track actual disk persistence?
2. Would adding tracepoints in the page cache writeback path be feasible?
3. Are there alternative approaches to detecting when data is actually persisted?

I would love to hear the community's thoughts on this specific challenge and on
potential approaches to addressing it.

Best regards,
Vishnu KS
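P.S. For concreteness, here is a trimmed-down sketch of the kind of tracepoint
handler we attach. This is illustrative only, not our production code: the
context struct is hand-written from the block_rq_complete format file on our
test kernel, and the file name, struct and map names, buffer sizing, and the
simple write filter are all assumptions that would need re-checking elsewhere.

/*
 * cdp_track.bpf.c - minimal illustrative sketch (clang + libbpf).
 * The field offsets mirror
 * /sys/kernel/debug/tracing/events/block/block_rq_complete/format
 * on our test kernel; verify them before reusing this.
 */
#include <linux/types.h>
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

/* Hand-written layout of the block_rq_complete tracepoint context. */
struct block_rq_complete_ctx {
        __u64 pad_common;    /* common_type/flags/preempt_count/pid (8 bytes) */
        __u32 dev;           /* offset 8  */
        __u64 sector;        /* offset 16 */
        __u32 nr_sector;     /* offset 24 */
        __s32 error;         /* offset 28 */
        char  rwbs[8];       /* offset 32 */
};

/* One completed write: device and sector range only, never the data itself. */
struct sector_range {
        __u32 dev;
        __u32 nr_sector;
        __u64 sector;
};

/* Ring buffer drained by the user-space agent that feeds the dispatcher. */
struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 1 << 20);
} changed_ranges SEC(".maps");

SEC("tracepoint/block/block_rq_complete")
int handle_block_rq_complete(struct block_rq_complete_ctx *ctx)
{
        struct sector_range *r;

        /* Skip failed requests and anything without data sectors. */
        if (ctx->error || !ctx->nr_sector)
                return 0;

        /* Writes carry 'W' as the first or second rwbs character
         * ("W", "WS", "FW", "FWFS", ...); reads, discards, etc. are skipped. */
        if (ctx->rwbs[0] != 'W' && ctx->rwbs[1] != 'W')
                return 0;

        r = bpf_ringbuf_reserve(&changed_ranges, sizeof(*r), 0);
        if (!r)
                return 0;    /* buffer full: user space has to notice and resync */

        r->dev = ctx->dev;
        r->sector = ctx->sector;
        r->nr_sector = ctx->nr_sector;
        bpf_ringbuf_submit(r, 0);
        return 0;
}

In this sketch the user-space side would simply drain the ring buffer
(ring_buffer__poll() in libbpf) and forward the ranges to the dispatcher over
the websocket. Since only dev/sector/nr_sector cross the kernel boundary, the
per-write overhead stays small, which is why the periodic sector reads dominate
the cost, as noted above.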
On Sat, 4 Jan 2025 at 06:41, Song Liu <song@xxxxxxxxxx> wrote:
>
> Hi Vishnu,
>
> On Tue, Dec 31, 2024 at 10:35 PM Vishnu ks <ksvishnu56@xxxxxxxxx> wrote:
> >
> > Dear Community,
> >
> > I would like to propose a discussion topic regarding the enhancement
> > of block layer tracepoints, which could fundamentally transform how
> > backup and recovery systems operate on Linux.
> >
> > Current Scenario:
> >
> > - I'm developing a continuous data protection system using eBPF to
> > monitor block request completions
>
> This makes little sense. It is not clear how this works.
>
> > - The system aims to achieve reliable live data replication for block devices
> > Current tracepoints present challenges in capturing the complete
> > lifecycle of write operations
>
> What's the difference between this approach and existing data
> replication solutions, such as md/raid?
>
> >
> > Potential Impact:
> >
> > - Transform Linux Backup Systems:
> > - Enable true continuous data protection at block level
> > - Eliminate backup windows by capturing changes in real-time
> > - Reduce recovery point objectives (RPO) to near-zero
> > - Allow point-in-time recovery at block granularity
> >
> > Current Technical Limitations:
> >
> > - Inconsistent visibility into write operation completion
> > - Gaps between write operations and actual data flushes
> > - Potential missing instrumentation points
>
> If a tracepoint is missing or misplaced, we can fix it in a patch.
>
> > - Challenges in ensuring data consistency across replicated volumes
> >
> > Proposed Improvements:
> >
> > - Additional tracepoints for better write operation visibility
> > - Optimal placement of existing tracepoints
> > - New instrumentation points for reliable block-level monitoring
>
> Some details in these would help this topic proposal.
>
> Thanks,
> Song

--
Vishnu KS,
Opensource contributor and researcher,
https://iamvishnuks.com