Re: Deep-scrub much slower than HDD speed

Hi Niklas,

> 
> > 100MB/s is sequential, your scrubbing is random. afaik everything is
> random.
> 
> Is there any docs that explain this, any code, or other definitive
> answer?

Do a fio[1] test on the disk to see how it performs under certain conditions, or look at atop during scrubbing; it will give you an impression of what percentage of your disk's performance is being used.
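For a quick first impression, something like this (a rough sketch; replace /dev/sdX with your disk, both jobs are read-only so they do not touch data):

# large sequential reads -- roughly the "100MB/s" case
fio --name=seqread --readonly --filename=/dev/sdX --direct=1 --rw=read --bs=4M --runtime=60 --time_based

# small random reads -- much closer to a seek-bound, scrub-like pattern
fio --name=randread --readonly --filename=/dev/sdX --direct=1 --rw=randread --bs=4k --runtime=60 --time_based

The difference between the two results shows how much an HDD loses once the access pattern becomes random.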

> Also wouldn't it make sense that for scrubbing to be able to read the
> disk linearly, at least to some significant extent?

I would also think so, but I have no idea how this is implemented.

> >> Changing scrubbing settings does not help (see below).
> >>
> >
> > I think you should be able to use the full performance of the disk
> when
> > ceph tell osd.* injectargs '--osd_max_scrubs=X'.
> 
> In my post I already showed that increasing `osd_max_scrubs` e.g. by 3x
> does not help.
> 
> Also, what would be the logic how it could?

I would argue it can, because an individual scrub does not use all of the disk's resources. When you allow 2 scrub sessions on the same disk, they use 2x the IOs, which of course comes at the cost of available client IO.
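For example (values only for illustration, not a recommendation):

# allow 2 concurrent scrubs per OSD at runtime
ceph tell osd.* injectargs '--osd_max_scrubs=2'

# or, if I remember correctly, persistently via the config database
ceph config set osd osd_max_scrubs 2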

> If random IO is thrashing disk seeks, how could querying more concurrent
> disk seeks help?

It is, but a single scrub session does not take all of your disk IO. None of the recovery procedures do, afaik, because the cluster prefers to serve client IO first. The larger the cluster, the more often some part of it is doing recovery.

> > ceph tell osd.* injectargs '--osd_recovery_sleep_hdd=0.100000'
> 
> There is no recovery going on in the cluster.

Yes, I know, but this is a throttling factor; maybe something similar exists for scrubbing.
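If I remember correctly the scrubbing equivalent is osd_scrub_sleep (the pause between scrub chunks), so something like the following might be worth a try, but please verify against the docs for your release:

# reduce (or remove) the pause between scrub chunks
ceph tell osd.* injectargs '--osd_scrub_sleep=0'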

The question you should ask yourself is why you want to change/investigate this. I also like to have a well-performing cluster, but I have never looked at scrubbing, except for turning it off before a reboot/update or so.
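For reference, that is just:

# disable scrubbing cluster-wide before maintenance
ceph osd set noscrub
ceph osd set nodeep-scrub

# and re-enable it afterwards
ceph osd unset noscrub
ceph osd unset nodeep-scrub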



[1]
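# fio job file: save it (e.g. as fio-bench.fio) and run with: fio fio-bench.fio
# WARNING: the write jobs overwrite /dev/sdX directly -- only point filename=
# at a disk whose data you do not need, or use the fio-bench.img variant below.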
[global]
ioengine=libaio
#ioengine=posixaio
invalidate=1
ramp_time=30
iodepth=1
runtime=180
time_based
direct=1
filename=/dev/sdX
#filename=/mnt/disk/fio-bench.img

[write-4k-seq]
stonewall
bs=4k
rw=write

[randwrite-4k-seq]
stonewall
bs=4k
rw=randwrite
fsync=1

[read-4k-seq]
stonewall
bs=4k
rw=read

[randread-4k-seq]
stonewall
bs=4k
rw=randread
fsync=1

[rw-4k-seq]
stonewall
bs=4k
rw=rw

[randrw-4k-seq]
stonewall
bs=4k
rw=randrw

[randrw-4k-d4-seq]
stonewall
bs=4k
rw=randrw
iodepth=4

[randread-4k-d32-seq]
stonewall
bs=4k
rw=randread
iodepth=32

[randwrite-4k-d32-seq]
stonewall
bs=4k
rw=randwrite
iodepth=32

[write-128k-seq]
stonewall
bs=128k
rw=write

[randwrite-128k-seq]
stonewall
bs=128k
rw=randwrite

[read-128k-seq]
stonewall
bs=128k
rw=read

[randread-128k-seq]
stonewall
bs=128k
rw=randread

[rw-128k-seq]
stonewall
bs=128k
rw=rw

[randrw-128k-seq]
stonewall
bs=128k
rw=randrw

[write-1024k-seq]
stonewall
bs=1024k
rw=write

[randwrite-1024k-seq]
stonewall
bs=1024k
rw=randwrite

[read-1024k-seq]
stonewall
bs=1024k
rw=read

[randread-1024k-seq]
stonewall
bs=1024k
rw=randread

[rw-1024k-seq]
stonewall
bs=1024k
rw=rw

[randrw-1024k-seq]
stonewall
bs=1024k
rw=randrw

[write-4096k-seq]
stonewall
bs=4096k
rw=write

[write-4096k-d16-seq]
stonewall
bs=4M
rw=write
iodepth=16

[randwrite-4096k-seq]
stonewall
bs=4096k
rw=randwrite

[read-4096k-seq]
stonewall
bs=4096k
rw=read

[read-4096k-d16-seq]
stonewall
bs=4M
rw=read
iodepth=16

[randread-4096k-seq]
stonewall
bs=4096k
rw=randread

[rw-4096k-seq]
stonewall
bs=4096k
rw=rw

[randrw-4096k-seq]
stonewall
bs=4096k
rw=randrw



