speeding up EC recovery

Hi,

On a large cluster with ~1600 OSDs on 60 servers, using 16+3 erasure-coded pools, recovery after an OSD (HDD) failure is quite slow. Typical values are 4 GB/s at 125 ops/s with 32 MB objects, so recovery takes 6-8 hours, and the PGs stay degraded for that whole time. I tried to speed it up with:

  osd         advanced  osd_max_backfills 32
  osd         advanced  osd_recovery_max_active 10
  osd         advanced  osd_recovery_op_priority 63
  osd         advanced  osd_recovery_sleep_hdd 0.000000

which at least kept the recovery ops/s at a constant level. The recovery does not seem to be CPU or memory bound. Is there any way to speed it up? When I tested recovery on replicated pools, it reached 50 GB/s.

In contrast, replacing the failed drive with a new one and re-adding the OSD is quite fast, with a 1 GB/s recovery rate for misplaced PGs, or ~120 MB/s average HDD write speed, which is not far from raw HDD throughput.
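
As a quick sanity check on the figures quoted above, the following minimal Python sketch confirms that 125 ops/s at 32 MB per object matches the reported 4 GB/s, and estimates how much data would move in 6-8 hours at that rate. The function names are made up for illustration. It assumes decimal units (1 GB = 1000 MB) and a rate sustained for the whole recovery window, neither of which is stated explicitly above.

```python
def recovery_throughput_gb_s(ops_per_s: float, object_size_mb: float) -> float:
    """Throughput in GB/s if every recovery op moves one full object.

    Assumption: decimal units, 1 GB = 1000 MB.
    """
    return ops_per_s * object_size_mb / 1000.0


def data_moved_tb(rate_gb_s: float, hours: float) -> float:
    """Data moved in TB at a constant rate over the given number of hours.

    Assumption: the rate is sustained for the whole window.
    """
    return rate_gb_s * hours * 3600 / 1000.0


if __name__ == "__main__":
    rate = recovery_throughput_gb_s(ops_per_s=125, object_size_mb=32)
    print(f"recovery rate: {rate:.1f} GB/s")
    for hours in (6, 8):
        moved = data_moved_tb(rate, hours)
        print(f"{hours} h at {rate:.1f} GB/s: {moved:.1f} TB")
```

So the ops/s and GB/s figures are consistent with each other, and under these assumptions a 6-8 hour recovery corresponds to roughly 86-115 TB reported as recovered.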

Regards,
Andrej

--
_____________________________________________________________
   prof. dr. Andrej Filipcic,   E-mail: Andrej.Filipcic@xxxxxx
   Department of Experimental High Energy Physics - F9
   Jozef Stefan Institute, Jamova 39, P.o.Box 3000
   SI-1001 Ljubljana, Slovenia
   Tel.: +386-1-477-3674    Fax: +386-1-425-7074
-------------------------------------------------------------