Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

Andreas Dilger <adilger@xxxxxxxxx> · Mon, 1 Oct 2018 14:34:08 -0600

On Sep 20, 2018, at 10:40 PM, Zygo Blaxell <ce3g8jdj@xxxxxxxxxxxxxxxxxxxxx> wrote:
> 
> On Fri, Sep 21, 2018 at 12:59:31PM +1000, Dave Chinner wrote:
>> On Wed, Sep 19, 2018 at 12:12:03AM -0400, Zygo Blaxell wrote:
> [...]
>> With no DMAPI in the future, people with custom HSM-like interfaces
>> based on dmapi are starting to turn to fanotify and friends to
>> provide them with the change notifications they require....
> 
> I had a fanotify-based scanner once, before I noticed btrfs effectively
> had timestamps all over its metadata.
> 
> fanotify won't tell me which parts of a file were modified (unless it
> got that feature in the last few years?).  fanotify was pretty useless
> when the only file on the system that was being modified was a 13TB
> VM image.  Or even a little 16GB one.  Has to scan the whole file to
> find the one new byte.  Even on desktops the poor thing spends most of
> its time looping over /var/log/messages.  It was sad.
> 
> If fanotify gave me (inode, offset, length) tuples of dirty pages in
> cache, I could look them up and use a dedupe_file_range call to replace
> the dirty pages with a reference to an existing disk block.  If my
> listener can do that fast enough, it's in-band dedupe; if it doesn't,
> the data gets flushed to disk as normal, and I fall back to a scan of
> the filesystem to clean it up later.
> 
>>>> e.g. a soft requirement is that we need to scan the entire fs at
>>>> least once a month.
>>> 
>>> I have to scan and dedupe multiple times per hour.  OK, the first-ever
>>> scan of a non-empty filesystem is allowed to take much longer, but after
>>> that, if you have enough spare iops for continuous autodefrag you should
>>> also have spare iops for continuous dedupe.
>> 
>> Yup, but using notifications avoids the need for even these scans - you'd
>> know exactly what data has changed, when it changed, and know
>> exactly what you needed to read to calculate the new hashes.
> 
> ...if the scanner can keep up with the notifications; otherwise, the
> notification receiver has to log them somewhere for the scanner to
> catch up.  If there are missed or dropped notifications--or 23 hours a
> day we're not listening for notifications because we only have an hour
> a day maintenance window--some kind of filesystem scan has to be done
> after the fact anyway.

It is worthwhile to mention that Lustre has a persistent Changelog record
that is generated atomically with the filesystem transaction that the event
happened in.

Once there is a Changelog consumer that registers itself with the filesystem,
along with a mask of the event types that it is interested in, the Changelog
begins recording all such events to disk (e.g. create, mkdir, setattr, etc.).
The Changelog consumer periodically notifies the filesystem when it has
processed events up to X, so that it can purge old events from the log.  It
is possible to have multiple consumers registered, and the log is only purged
up to the slowest consumer.

If a consumer hasn't processed logs in some (relatively long) time (e.g. many
days or weeks), or if the filesystem is otherwise going to run out of space,
then the consumer is deregistered and the old log records are cleaned up.  This
also notifies the consumer that it is no longer active, and it has to do a
full scan to update its state for the events that it missed.
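
As a rough illustration of the register/consume/purge cycle described above,
here is a toy in-memory model in Python.  It is only a sketch and not Lustre
code or its interface: the class and method names (ToyChangelog, register,
record, read, clear) are made up for this example, a real changelog is
persistent and written in the same transaction as the event, and the
"max_records" limit is just a stand-in for running out of space.

```python
class ToyChangelog:
    """Toy model: a log that is purged only up to its slowest consumer."""

    def __init__(self, max_records):
        self.max_records = max_records  # hypothetical stand-in for a space limit
        self.records = []               # (index, event_type, name), oldest first
        self.next_index = 1
        self.consumers = {}             # id -> {"last": index, "mask": set}

    def register(self, cid, mask):
        # A new consumer starts at the current end of the log.
        self.consumers[cid] = {"last": self.next_index - 1, "mask": set(mask)}

    def record(self, event_type, name):
        # Events are only logged if some registered consumer asked for them.
        if not any(event_type in c["mask"] for c in self.consumers.values()):
            return
        self.records.append((self.next_index, event_type, name))
        self.next_index += 1
        # Out of room: deregister the slowest consumers until the log fits.
        while len(self.records) > self.max_records and self.consumers:
            slowest = min(self.consumers, key=lambda c: self.consumers[c]["last"])
            del self.consumers[slowest]
            self._purge()

    def read(self, cid):
        # A deregistered consumer has missed events and must do a full scan.
        if cid not in self.consumers:
            raise LookupError("consumer %r deregistered, full scan needed" % cid)
        c = self.consumers[cid]
        return [r for r in self.records if r[0] > c["last"] and r[1] in c["mask"]]

    def clear(self, cid, upto):
        # The consumer reports that it has processed events up to index 'upto'.
        c = self.consumers[cid]
        c["last"] = max(c["last"], upto)
        self._purge()

    def _purge(self):
        # Keep only records that the slowest remaining consumer still needs.
        if not self.consumers:
            self.records = []
            return
        low = min(c["last"] for c in self.consumers.values())
        self.records = [r for r in self.records if r[0] > low]


log = ToyChangelog(max_records=3)
log.register("dedupe", {"create", "mkdir"})
log.register("backup", {"create"})
log.record("create", "a")
log.record("mkdir", "d")
for index, event_type, name in log.read("dedupe"):
    print(index, event_type, name)
log.clear("dedupe", 2)   # nothing is purged yet, "backup" is still at 0
```

In this model a consumer that falls too far behind simply disappears from
"consumers", and its next read() fails, which is its cue to rescan.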

Having a persistent changelog is useful for all kinds of event processing,
and avoids the need to do real-time processing.  If the userspace daemon fails,
or the system is restarted, etc. then there is no need to rescan the whole
filesystem, which is important when there are many billions of files therein.

Cheers, Andreas
