On Tue, 2018-07-24 at 11:18 -0400, Mike Snitzer wrote:
> On Tue, Jul 24 2018 at 9:57am -0400,
> Laurence Oberman <loberman@xxxxxxxxxx> wrote:
> 
> > On Tue, 2018-07-24 at 15:51 +0200, Hannes Reinecke wrote:
> > > 
> > > _Actually_, I would've done it the other way around; after all,
> > > where's the point in running dm-multipath on a partition?
> > > Anything running on the other partitions would suffer from the
> > > issues dm-multipath is designed to handle (temporary path loss
> > > etc.), so I'm not quite sure what you are trying to achieve with
> > > your testcase. Can you enlighten me?
> > > 
> > > Cheers,
> > > 
> > > Hannes
> 
> I wasn't looking to deploy this (multipath on partition) in
> production or suggest it to others. It was a means to experiment.
> More below.
> 
> > This came about because a customer is using nvme for a dm-cache
> > device and created multiple partitions so as to use the same nvme
> > to cache multiple different "slower" devices. The corruption was
> > noticed in XFS and I engaged Mike to assist in figuring out what
> > was going on.
> 
> Yes, so the topology for the customer's setup is:
> 
> 1) MD raid1 on 2 NVMe partitions (from separate NVMe devices).
> 2) Then DM cache's "fast" and "metadata" devices layered on a
>    dm-linear mapping on top of the MD raid1.
> 3) Then Ceph's rbd for DM cache's slow device.
> 
> I was just looking to simplify the stack to try to assess why XFS
> corruption was being seen, without all the insanity.
> 
> One issue was corruption due to incorrect shutdown order (the network
> was getting shut down out from underneath rbd, and in turn DM cache
> couldn't complete its IO migrations during cache_postsuspend()).
> 
> So I elected to try using DM multipath with queue_if_no_path to try
> to replicate rbd losing the network _without_ needing a full Ceph/rbd
> setup.
> 
> The rest is history... a rat-hole of corruption that is likely very
> different than the customer's setup.
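For anyone wanting to reproduce the queue_if_no_path part of that experiment, the path-loss emulation looks roughly like the sketch below. This is only an illustration of the standard multipath-tools/dmsetup interfaces, not the exact commands Mike ran; the map name is taken from the tree further down and the sdX path names are made up.

```shell
# Queue I/O indefinitely when all paths are down. Persistent form
# (multipath.conf):
#   defaults {
#       no_path_retry queue    # same effect as "queue_if_no_path"
#   }
#
# Or toggle it at runtime on an existing multipath map:
dmsetup message 3600140508da66c2c9ee4cc6aface1bab 0 "queue_if_no_path"

# Fail all paths by hand to simulate total path loss (like rbd losing
# its network); I/O then queues instead of erroring. Path names are
# examples only:
multipathd fail path sdh
multipathd fail path sdm

# Later, restore the paths and stop queueing:
multipathd reinstate path sdh
multipathd reinstate path sdm
dmsetup message 3600140508da66c2c9ee4cc6aface1bab 0 "fail_if_no_path"
```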
> 
> Mike

Not to muddy the waters here: as Mike said, the issue he tripped over
may not be the direct issue we originally started with. In the lab
reproducer with rbd as the slow device we do not have an MD-raided
nvme for the dm-cache, but we still see the corruption, and only in
the rbd-based test.

We used a partitioned nvme (but no raid) to try caching F/C
device-mapper-multipath LUNs via dm-cache. The last test we ran where
we did not see corruption used a partitioned nvme whose second
partition cached the F/C LUNs:

nvme0n1                             259:0    0 372.6G  0 disk
├─nvme0n1p1                         259:1    0   150G  0 part
└─nvme0n1p2                         259:2    0   150G  0 part
  ├─cache_FC-nvme_blk_cache_cdata   253:42   0    20G  0 lvm
  │ └─cache_FC-fc_disk              253:45   0    48G  0 lvm  /cache_FC
  └─cache_FC-nvme_blk_cache_cmeta   253:43   0    40M  0 lvm
    └─cache_FC-fc_disk              253:45   0    48G  0 lvm  /cache_FC

cache_FC-fc_disk (253:45)
 ├─cache_FC-fc_disk_corig (253:44)
 │  └─3600140508da66c2c9ee4cc6aface1bab (253:36)  Multipath
 │     ├─ (68:224)
 │     ├─ (69:240)
 │     ├─ (8:192)
 │     └─ (8:64)
 ├─cache_FC-nvme_blk_cache_cdata (253:42)
 │  └─ (259:2)
 └─cache_FC-nvme_blk_cache_cmeta (253:43)
    └─ (259:2)
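For completeness, a stack shaped like the trees above can be assembled with lvm2's cache support along these lines. This is a sketch, not our exact provisioning script; the VG/LV names match the tree, but the sizes and the use of lvconvert here are illustrative (lvm2 creates the _cdata/_cmeta sub-LVs and the _corig origin automatically).

```shell
# PVs: the multipathed F/C LUN (slow) and the NVMe partition (fast):
pvcreate /dev/mapper/3600140508da66c2c9ee4cc6aface1bab /dev/nvme0n1p2
vgcreate cache_FC /dev/mapper/3600140508da66c2c9ee4cc6aface1bab /dev/nvme0n1p2

# Origin LV on the multipathed LUN:
lvcreate -n fc_disk -L 48G cache_FC \
    /dev/mapper/3600140508da66c2c9ee4cc6aface1bab

# Cache pool on the NVMe partition; lvm2 splits it into the
# nvme_blk_cache_cdata / _cmeta sub-LVs seen in lsblk:
lvcreate --type cache-pool -n nvme_blk_cache -L 20G cache_FC /dev/nvme0n1p2

# Attach the pool to the origin; the original fc_disk mapping becomes
# cache_FC-fc_disk_corig in the dmsetup tree:
lvconvert --type cache --cachepool cache_FC/nvme_blk_cache cache_FC/fc_disk
```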