On Thu, Apr 9, 2020 at 11:01 AM Cyril Servant <cyril.servant@xxxxxxxxx> wrote:
>
> > On 9 Apr 2020 at 00:34, Nico Kadel-Garcia <nkadel@xxxxxxxxx> wrote:
> >
> > On Wed, Apr 8, 2020 at 11:31 AM Cyril Servant <cyril.servant@xxxxxxxxx> wrote:
> >>
> >> Hello, I'd like to share with you an evolution I made on sftp.
> >
> > It *sounds* like you should be using parallelized rsync over xargs.
> > Partial sftp or scp transfers are almost inevitable in bulk transfers
> > over a crowded network, and sftp does not have good support for
> > "mirroring", only for copying content.
> >
> > See https://stackoverflow.com/questions/24058544/speed-up-rsync-with-simultaneous-concurrent-file-transfers
>
> This solution is perfect for parallel sending of a lot of files. But in the
> case of sending one really big file, it does not improve transfer speed.

It's helpful because it allows you to retry where the last transmission
failed, and it does not leave a partial upload sitting there tempting
people. It uploads to .filename-hash, and moves the upload into place
when the individual file upload is completed.

> >> I'm working at CEA (Commissariat à l'énergie atomique et aux énergies
> >> alternatives) in France. We have a compute cluster complex, and our customers
> >> regularly need to transfer big files from and to the cluster. Each of our front
> >> nodes has an outgoing bandwidth limit (let's say 1Gb/s each, generally more
> >> limited by the CPU than by the network bandwidth), but the total interconnection
> >> to the customer is higher (let's say 10Gb/s). Each front node shares a
> >> distributed file system on an internal high bandwidth network. So the contention
> >> point is the 1Gb/s limit of a connection. If the customer wants to use more than
> >> 1Gb/s, he currently uses GridFTP. We want to provide a solution based on ssh to
> >> our customers.
> >>
> >> 2. The solution
> >>
> >> I made some changes in the sftp client.
> >> The new option "-n" (defaults to 0) sets
> >> the number of extra channels. There is one main ssh channel, and n extra
> >> channels. The main ssh channel does everything except the put and get commands.
> >> Put and get commands are parallelized on the n extra channels. Thanks to this,
> >> when the customer uses "-n 5", he can transfer his files at up to 5Gb/s. There
> >> is no server-side change. Everything is done on the client side.
> >
> > While the option sounds useful for niche cases, I'd be leery of
> > partial transfers and being compelled to replicate content to handle
> > partial transfers. rsync has been very good, for years, at completing
> > partial transfers.
>
> I can fully understand this. In our case, the network is not really crowded, as
> customers are generally using research / educational links. Indeed, this is
> totally a niche case, but still a need for us. The main use case is putting data
> you want to process into the cluster, and when the job is finished, getting the
> output of the process. There is rarely a need for synchronising files, except
> for the code you want to execute on the cluster, which is considered small
> compared to the data. rsync is the obvious choice for synchronising the code,
> but not for putting / getting huge amounts of data.
>
> The only other ssh-based tool that can speed up the transfer of one big file is
> lftp, and it only works for get commands, not for put commands.

Yeah, lftp can also support ftps. ftps is supported by the vsftpd FTP
server, and I use it in places where I do not want OpenSSH server's
tendency to let people with access look around the rest of the
filesystem.

_______________________________________________
openssh-unix-dev mailing list
openssh-unix-dev@xxxxxxxxxxx
https://lists.mindrot.org/mailman/listinfo/openssh-unix-dev
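As an aside, the idea behind spreading one big file across several channels
(what lftp's pget does for downloads, and what the proposed "-n" option
does for both directions) can be demonstrated locally: split the file into
N byte ranges, copy each range concurrently into the same destination at
its own offset, and the result is byte-identical to a serial copy. This is
only a local illustration of the concept with dd standing in for the
per-channel transfers; all paths and sizes are made up for the demo, and
it is not the actual sftp patch.

```shell
# Hypothetical demo: range-parallel copy of one file with 4 workers.
SRC=/tmp/bigfile
DST=/tmp/bigfile.copy
N=4

# Sample 1 MiB source file so the demo can run.
head -c 1048576 /dev/urandom > "$SRC"

size=$(wc -c < "$SRC")
chunk=$(( (size + N - 1) / N ))   # bytes per worker, rounded up
: > "$DST"                        # create/truncate the destination

# Each dd reads one chunk at offset i*chunk and writes it at the
# same offset; conv=notrunc keeps the other workers' ranges intact.
for i in $(seq 0 $((N - 1))); do
  dd if="$SRC" of="$DST" bs="$chunk" skip="$i" seek="$i" count=1 \
     conv=notrunc 2>/dev/null &
done
wait

cmp "$SRC" "$DST"
```

Because the ranges are disjoint, the workers never overlap; over ssh, each
range would simply flow through its own connection instead of a local dd.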