Re: [ext] Copying large file stuck, two cephfs-2 mounts on two cluster

"Kuhring, Mathias" <mathias.kuhring@xxxxxxxxxxxxxx> · Tue, 3 Jan 2023 09:52:26 +0000

Trying to exclude clusters and/or clients might have gotten me on the right track. It might have been a client issue or actually a snapshot retention issue. As it turned out when I tried other routes for the data using a different client, the data was not available anymore since the snapshot had been trimmed.

We fell behind on syncing our snapshots a while ago (due to other issues), and we are now somewhere between our weekly (16 weeks) and daily (30 days) snapshots. So I assume that until we catch up to the daily snapshots (<30 days), there is a general risk that snapshots disappear while we are syncing them.
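
To make the risk above concrete, here is a minimal sketch of the arithmetic. It assumes trimming is purely age-based (a daily snapshot is removed once it is older than 30 days), which may not match how the snapshot scheduler actually prunes. The function names and the example dates are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: a daily snapshot is trimmed as soon as it is older than this.
DAILY_RETENTION = timedelta(days=30)


def time_until_trim(snapshot_time, now, retention=DAILY_RETENTION):
    """Remaining lifetime of a snapshot under purely age-based retention."""
    return snapshot_time + retention - now


def at_risk(snapshot_time, now, sync_duration, retention=DAILY_RETENTION):
    """True if the snapshot would be trimmed before a sync started now finishes."""
    return time_until_trim(snapshot_time, now, retention) < sync_duration


if __name__ == "__main__":
    now = datetime(2023, 1, 3)
    old_snapshot = datetime(2022, 12, 5)  # hypothetical, 29 days old
    print(time_until_trim(old_snapshot, now))
    print(at_risk(old_snapshot, now, sync_duration=timedelta(days=2)))
```

A snapshot that is flagged this way is a candidate for skipping in favour of a newer one.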

The funny/weird thing, though (and the reason I didn't catch on to this), is that the particular file (and potentially others) of this trimmed snapshot was apparently still available to the client I initially used for the transfer. I'm wondering if the client somehow cached the data until the snapshot got trimmed and then just retried copying the incompletely cached data.

Continuing with the next available snapshot, mirroring/syncing is now catching up again. I expect this might happen again once we catch up to the 30-day threshold, if the point at which a snapshot is trimmed falls into the syncing time frame. But then I know to just cancel/skip the current snapshot and continue with the next one. Syncing time is then short enough to get me over the hill before the next trimming.
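
The cancel/skip approach could be sketched roughly as follows. This is only an illustration: `sync_available_snapshots` and the `sync_one` callback (standing in for the rsync or cp call) are hypothetical names, and a plain existence check on the syncing client may be misled by cached data, as happened in my case, so it is best run from a different client.

```python
import os
from typing import Callable, Iterable, List


def sync_available_snapshots(
    snapshot_dirs: Iterable[str],
    sync_one: Callable[[str], None],
) -> List[str]:
    """Sync each snapshot directory, skipping any that are trimmed meanwhile.

    Returns the snapshots that still existed after their sync finished.
    """
    synced = []
    for snap in snapshot_dirs:
        if not os.path.isdir(snap):
            # Already trimmed before we got to it: skip.
            continue
        sync_one(snap)
        if not os.path.isdir(snap):
            # Trimmed while syncing: the copy may be incomplete, so do not
            # count it as synced and move on to the next snapshot.
            continue
        synced.append(snap)
    return synced
```

Anything missing from the returned list would need its partial copy on the mirror cleaned up separately.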

Note to myself: Next time something similar happens, check whether different clients AND different snapshots or the original data behave the same.

On 12/22/2022 4:27 PM, Kuhring, Mathias wrote:

Dear ceph community,

We have two Ceph clusters of equal size, one main and one mirror, both using cephadm and on version

ceph version 17.2.1 (ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable)

We are stuck with copying a large file (~ 64G) between the CephFS file systems of the two clusters.

The source path is a snapshot (i.e. something like /my/path/.snap/schedule_some-date/…).
But I don't think that should make any difference.

First, I was thinking that I need to adapt some rsync parameters to work better with bigger files on CephFS.

But when I tried to confirm this by just copying the file with cp, the transfer also gets stuck.

Without any error message, the process just keeps running (rsync or cp).

But at some point (at almost 85%), the file size on the target stops increasing.

Main:

-rw------- 1 cockpit-ws printadmin 68360698297 16. Nov 13:40 LB22_2764_dragen.bam

Mirror:

-rw------- 1 root root 58099499008 22. Dez 15:54 LB22_2764_dragen.bam
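
For reference, the progress figure follows directly from the two sizes above, and the stall itself can be detected by polling the target file. The following is a minimal sketch; `progress` and `wait_for_stall` are hypothetical helper names, and the polling interval and number of checks are arbitrary assumptions.

```python
import os
import time


def progress(source_size, target_size):
    """Fraction of the source that has arrived on the target."""
    return target_size / source_size


def wait_for_stall(path, interval=60.0, checks=5,
                   getsize=os.path.getsize, sleep=time.sleep):
    """Poll `path` until its size is unchanged for `checks` consecutive polls.

    Returns the size at which growth stopped. Note that this also returns
    when the copy has simply finished, so compare against the source size.
    """
    last = getsize(path)
    unchanged = 0
    while unchanged < checks:
        sleep(interval)
        size = getsize(path)
        if size == last:
            unchanged += 1
        else:
            unchanged = 0
            last = size
    return last


if __name__ == "__main__":
    # Sizes taken from the two listings above.
    print(f"{progress(68360698297, 58099499008):.1%}")  # 85.0%
```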

At 10 TB, our CephFS file size limit is more than generous.
And as far as I know from our clients, there are indeed files in the TB range on the cluster without issues.

I don't know if this is the file's fault or if this is some issue with either of the CephFS file systems or clusters.

And I don't know where to look or how to troubleshoot this.

Can anybody give me a tip on where I can start looking and how to debug this kind of issue?

Thank you very much.

Best Wishes,

Mathias
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Mathias Kuhring

Dr. rer. nat.
Bioinformatician
HPC & Core Unit Bioinformatics
Berlin Institute of Health at Charité (BIH)

E-Mail: mathias.kuhring@xxxxxxxxxxxxxx
Mobile: +49 172 3475576