Re: data corruption issue with "rbd export-diff/import-diff"

On 2018-09-12 19:49:16-07:00 Jason Dillaman wrote:
 
On Wed, Sep 12, 2018 at 10:15 PM <patrick.mclean@xxxxxxxx> wrote:
>
> On 2018-09-12 17:35:16-07:00 Jason Dillaman wrote:
>
>
> Any chance you know the LBA or byte offset of the corruption so I can
> compare it against the log?
>
> The LBAs of the corruption are 0xA74F000 through 175435776

Are you saying the corruption starts at byte offset 175435776 from the
start of the RBD image? If so, that would correspond to object 0x29:

Yes, that is where we are seeing the corruption. We have also noticed that different runs of export-diff seem to corrupt the data in different ways.
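
A quick cross-check of the object/offset arithmetic, assuming the default 4 MiB RBD object size (which matches the 0~4194304 object extents in the log below); a minimal sketch:

# Map a logical byte offset in an RBD image to (object index, in-object offset),
# assuming the default 4 MiB object size.
OBJECT_SIZE = 4 * 1024 * 1024  # 4194304 bytes

def offset_to_object(byte_offset, object_size=OBJECT_SIZE):
    return byte_offset // object_size, byte_offset % object_size

obj, off = offset_to_object(175435776)
print(hex(obj), off)  # prints: 0x29 3469312 -- matches the "overlap extent 3469312~4096" line below
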
2018-09-12 21:22:17.117246 7f268928f0c0 20 librbd::DiffIterate: object
rbd_data.4b383f1e836edc.0000000000000029: list_snaps complete
2018-09-12 21:22:17.117249 7f268928f0c0 20 librbd::DiffIterate:   diff
[499712~4096,552960~4096,589824~4096,3338240~4096,3371008~4096,3469312~4096,3502080~4096,3534848~4096,3633152~4096]
end_exists=1
2018-09-12 21:22:17.117251 7f268928f0c0 20 librbd::DiffIterate:
diff_iterate object rbd_data.4b383f1e836edc.0000000000000029 extent
0~4194304 from [0,4194304]
2018-09-12 21:22:17.117268 7f268928f0c0 20 librbd::DiffIterate:  opos
0 buf 0~4194304 overlap
[499712~4096,552960~4096,589824~4096,3338240~4096,3371008~4096,3469312~4096,3502080~4096,3534848~4096,3633152~4096]
2018-09-12 21:22:17.117270 7f268928f0c0 20 librbd::DiffIterate:
overlap extent 499712~4096 logical 172466176~4096
2018-09-12 21:22:17.117271 7f268928f0c0 20 librbd::DiffIterate:
overlap extent 552960~4096 logical 172519424~4096
2018-09-12 21:22:17.117272 7f268928f0c0 20 librbd::DiffIterate:
overlap extent 589824~4096 logical 172556288~4096
2018-09-12 21:22:17.117273 7f268928f0c0 20 librbd::DiffIterate:
overlap extent 3338240~4096 logical 175304704~4096
2018-09-12 21:22:17.117274 7f268928f0c0 20 librbd::DiffIterate:
overlap extent 3371008~4096 logical 175337472~4096
2018-09-12 21:22:17.117275 7f268928f0c0 20 librbd::DiffIterate:
overlap extent 3469312~4096 logical 175435776~4096  <-------
2018-09-12 21:22:17.117276 7f268928f0c0 20 librbd::DiffIterate:
overlap extent 3502080~4096 logical 175468544~4096
2018-09-12 21:22:17.117276 7f268928f0c0 20 librbd::DiffIterate:
overlap extent 3534848~4096 logical 175501312~4096
2018-09-12 21:22:17.117277 7f268928f0c0 20 librbd::DiffIterate:
overlap extent 3633152~4096 logical 175599616~4096

... and I can see it being imported ...

2018-09-12 22:07:38.698380 7f23ab2ec0c0 20 librbd::io::ObjectRequest:
0x5615cb507da0 send: write rbd_data.38abe96b8b4567.0000000000000029
3469312~4096

Therefore, I don't see anything structurally wrong w/ the
export/import behavior. Just to be clear, did you freeze/coalesce the
filesystem before you took the snapshot?

The filesystem was unmounted at the time of the export; our system is designed to only work on unmounted filesystems.
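
For the mounted-filesystem case Jason is asking about, the usual sequence is freeze, snapshot, thaw; a minimal sketch, where the mount point and pool/image@snapshot names are made up for illustration and are not from this thread:

# Quiesce a mounted filesystem before taking an RBD snapshot, then thaw it.
# The mount point and snapshot spec below are hypothetical examples.
import subprocess

MOUNT_POINT = "/mnt/rbd0"          # hypothetical mount point
SNAP_SPEC = "rbd/myimage@backup1"  # hypothetical pool/image@snapshot

subprocess.run(["fsfreeze", "--freeze", MOUNT_POINT], check=True)
try:
    subprocess.run(["rbd", "snap", "create", SNAP_SPEC], check=True)
finally:
    subprocess.run(["fsfreeze", "--unfreeze", MOUNT_POINT], check=True)
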
> On Wed, Sep 12, 2018 at 8:32 PM <patrick.mclean@xxxxxxxx> wrote:
> >
> > Hi Jason,
> >
> > On 2018-09-10 11:15:45-07:00 ceph-users wrote:
> >
> > On 2018-09-10 11:04:20-07:00 Jason Dillaman wrote:
> >
> >
> > > In addition to this, we are seeing a similar type of corruption in another use case when we migrate RBDs and snapshots across pools. In this case we clone a version of an RBD (e.g. HEAD-3) to a new pool and rely on 'rbd export-diff/import-diff' to restore the last 3 snapshots on top. Here too we see cases of fsck and RBD checksum failures.
> > > We maintain various metrics and logs. Looking back at our data we have seen the issue at a small scale for a while on Jewel, but the frequency increased recently. The timing may have coincided with a move to Luminous, but this may be coincidence. We are currently on Ceph 12.2.5.
> > > We are wondering if people are experiencing similar issues with 'rbd export-diff / import-diff'. I'm sure many people use it to keep backups in sync, and since these are backups, many people may not inspect the data often. In our use case, we use this mechanism to keep data in sync and actually need the data in the other location often. We are wondering if anyone else has encountered any issues; it's quite possible that many people have this issue but simply don't realize it. We are likely hitting it much more frequently due to the scale of our operation (tens of thousands of syncs a day).
> >
> > If you are able to recreate this reliably without tiering, it would
> > assist in debugging if you could capture RBD debug logs during the
> > export along w/ the LBA of the filesystem corruption to compare
> > against.
> >
> > We haven't been able to reproduce this reliably as of yet; we haven't actually figured out the exact conditions that cause it, we have just been seeing it happen on some percentage of export/import-diff operations.
> >
> >
> > Logs from both export-diff and import-diff in a case where the result gets corrupted are attached. Please let me know if you need any more information.
> >
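
For reference, the export-diff/import-diff sync described earlier in the thread boils down to piping an incremental diff between two snapshots into the destination image and then comparing full-image checksums of the resulting snapshot on both sides. A minimal sketch with made-up pool, image and snapshot names, not the poster's actual tooling; it assumes the starting snapshot already exists on both sides:

# Sync one snapshot from a source image to a destination image by piping
# "rbd export-diff" into "rbd import-diff", then compare checksums of the
# snapshot on both sides. All names here are illustrative only.
import hashlib
import subprocess

SRC = "rbd/myimage"      # hypothetical source pool/image
DST = "backup/myimage"   # hypothetical destination pool/image
PREV_SNAP = "snap1"      # snapshot assumed present on both sides
NEW_SNAP = "snap2"       # snapshot to transfer

def sync_snapshot(src, dst, prev_snap, new_snap):
    # Stream the diff between prev_snap and new_snap straight into the destination.
    export = subprocess.Popen(
        ["rbd", "export-diff", "--from-snap", prev_snap, f"{src}@{new_snap}", "-"],
        stdout=subprocess.PIPE)
    subprocess.run(["rbd", "import-diff", "-", dst], stdin=export.stdout, check=True)
    export.stdout.close()
    if export.wait() != 0:
        raise RuntimeError("rbd export-diff failed")

def image_sha256(image_spec):
    # Stream a full image export to stdout and hash it.
    proc = subprocess.Popen(["rbd", "export", image_spec, "-"], stdout=subprocess.PIPE)
    digest = hashlib.sha256()
    for chunk in iter(lambda: proc.stdout.read(4 * 1024 * 1024), b""):
        digest.update(chunk)
    if proc.wait() != 0:
        raise RuntimeError("rbd export failed")
    return digest.hexdigest()

sync_snapshot(SRC, DST, PREV_SNAP, NEW_SNAP)
assert image_sha256(f"{SRC}@{NEW_SNAP}") == image_sha256(f"{DST}@{NEW_SNAP}"), "checksum mismatch"
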
