Hi Mathias,

I am glad that you could find it's a client-related issue and figured a way around it.

I too could reproduce the issue locally, i.e. a client which was initially copying the snapshot still has access to it even after it got deleted from the other client. I think this needs further investigation. I will raise a tracker for the same and share it here.

Thanks and Regards,
Kotresh H R

On Tue, Jan 3, 2023 at 3:23 PM Kuhring, Mathias <mathias.kuhring@xxxxxxxxxxxxxx> wrote:

> Trying to exclude clusters and/or clients might have gotten me on the right track. It might have been a client issue or actually a snapshot retention issue. As it turned out, when I tried other routes for the data using a different client, the data was not available anymore since the snapshot had been trimmed.
>
> We got behind syncing our snapshots a while ago (due to other issues). And now we are somewhere in between our weekly (16 weeks) and daily (30 days) snapshots. So, I assume before we catch up with the daily ones (<30 days), there is a general risk that snapshots disappear while we are syncing them.
>
> The funny/weird thing is though (and why I didn't catch on to this), the particular file (and potentially others) of this trimmed snapshot was apparently still available to the client I initially used for the transfer. I'm wondering if the client somehow cached the data until the snapshot got trimmed, and then just retried copying the incompletely cached data.
>
> Continuing with the next available snapshot, mirroring/syncing is now catching up again. I expect it might happen again once we catch up to the 30-day threshold, if the time point of snapshot trimming falls into the syncing time frame. But then I know to just cancel/skip the current snapshot and continue with the next one. Syncing time is short enough to get me over the hill before the next trimming.
>
> Note to myself: Next time something similar happens, check whether different clients AND different snapshots or the original data behave the same.
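A minimal sketch of that kind of cross-check, assuming kernel CephFS mounts on both clients; the mount point, directory, and snapshot name below are purely illustrative placeholders, not taken from the thread. Run it on each client (the one doing the sync and an independent one) and compare the output:

    MNT=/mnt/cephfs                       # CephFS mount point on this client (placeholder)
    DIR=my/path                           # directory containing the file (placeholder)
    FILE=LB22_2764_dragen.bam             # the file that gets stuck while copying
    SNAP=scheduled-2022-12-18-00_00_00    # snapshot name (illustrative)

    # 1. Is the snapshot still listed at all, or has it already been trimmed?
    ls "$MNT/$DIR/.snap/"

    # 2. Size and checksum of the file in the snapshot vs. the live tree.
    for SRC in "$MNT/$DIR/.snap/$SNAP/$FILE" "$MNT/$DIR/$FILE"; do
        echo "== $SRC"
        stat --format='size=%s mtime=%y' "$SRC"
        sha256sum "$SRC"
    done

    # 3. Snapshot schedule and retention for the scheduled directory
    #    (snap_schedule mgr module; adjust the path to your setup).
    ceph fs snap-schedule status "/$DIR"

If the snapshot has already disappeared from .snap/ but one client still serves data for it while another returns "No such file or directory", that points at the client-side behaviour described above.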
> On 12/22/2022 4:27 PM, Kuhring, Mathias wrote:
> > Dear ceph community,
> >
> > We have two Ceph clusters of equal size, one main and one mirror, both using cephadm and on version
> > ceph version 17.2.1 (ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable)
> >
> > We are stuck with copying a large file (~64G) between the CephFS file systems of the two clusters.
> > The source path is a snapshot (i.e. something like /my/path/.snap/schedule_some-date/…). But I don't think that should make any difference.
> >
> > First, I was thinking that I need to adapt some rsync parameters to work better with bigger files on CephFS. But when confirming by just copying the file with cp, the transfer also gets stuck. Without any error message, the process just keeps running (rsync or cp). But at some point (almost 85%) the file size on the target doesn't increase anymore.
> >
> > Main:
> > -rw------- 1 cockpit-ws printadmin 68360698297 16. Nov 13:40 LB22_2764_dragen.bam
> >
> > Mirror:
> > -rw------- 1 root root 58099499008 22. Dez 15:54 LB22_2764_dragen.bam
> >
> > Our CephFS file size limit of 10 TB is more than generous. And as far as I know from clients, there are indeed files in the TB range on the cluster without issues.
> >
> > I don't know if this is the file's fault or if this is some issue with either of the CephFS file systems or clusters. And I don't know how to investigate and troubleshoot this.
> > Can anybody give me a tip where I can start looking to debug this kind of issue?
> >
> > Thank you very much.
> >
> > Best Wishes,
> > Mathias
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
> --
> Mathias Kuhring
>
> Dr. rer. nat.
> Bioinformatician
> HPC & Core Unit Bioinformatics
> Berlin Institute of Health at Charité (BIH)
>
> E-Mail: mathias.kuhring@xxxxxxxxxxxxxx
> Mobile: +49 172 3475576
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
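For the original question of where to start looking when a copy from CephFS hangs, a rough checklist, assuming a kernel CephFS client; the MDS daemon name and paths are placeholders, not specific to this cluster:

    # 1. Overall cluster state: MDS/OSD warnings, slow requests, full pools?
    ceph status
    ceph health detail

    # 2. On the client: stuck in-flight requests to the MDS or OSDs
    #    (kernel client debug files; requires debugfs to be mounted).
    cat /sys/kernel/debug/ceph/*/mdsc    # pending metadata (MDS) requests
    cat /sys/kernel/debug/ceph/*/osdc    # pending data (OSD) requests

    # 3. Kernel messages from the ceph client (lost connections, caps problems).
    dmesg | grep -i ceph | tail -n 50

    # 4. On the MDS host/container: operations currently stuck in flight.
    #    Replace <name> with your MDS daemon name.
    ceph daemon mds.<name> dump_ops_in_flight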