Re: Time Estimation for cephfs-data-scan scan_links

pg@xxxxxxxxxxxxxxxxxxxx (Peter Grandi) · Fri, 13 Oct 2023 18:11:31 +0100

>> However, I've observed that the cephfs-data-scan scan_links step has
>> been running for over 24 hours on 35 TB of data, which is replicated
>> across 3 OSDs, resulting in more than 100 TB of raw data.

What matters is the number of "inodes" (and secondarily their
size), that is the number of metadata objects, which is
proportional to the number of files and directories in the
CephFS instance.

>> Does anyone have an estimation on the duration for this step?

> scan_links has to iterate through every object in the metadata pool
> and for each object iterate over the omap key/values - so this step
> scales to the amount of objects in the metadata pool, i.e., the number
> of directories and files in the file system.

>> pools:   12 pools, 1475 pgs
>> objects: 50.89M objects, 72 TiB
>> usage:   207 TiB used, 148 TiB / 355 TiB avail
>> pgs:     579358/152674596 objects misplaced (0.379%)

About 51m objects between data and metadata, average object space
used about 4MiB, average metadata per object 230KiB (looks like
3-way replication as per default).
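
A rough check of those averages, as a minimal sketch that uses only
the figures from the quoted status output (the variable names are
mine, and treating "207 TiB used" as raw space including replicas
is an assumption):

```python
# Figures copied from the quoted 'ceph status' output.
TIB = 2**40
MIB = 2**20

objects = 50.89e6               # "50.89M objects"
raw_used = 207 * TIB            # "207 TiB used" (assumed raw, replicas included)
object_copies = 152_674_596     # denominator of the "objects misplaced" line

# Average raw space consumed per logical object.
avg_raw_per_object_mib = raw_used / objects / MIB

# Object copies per logical object; close to 3 suggests 3-way replication.
copies_per_object = object_copies / objects

print(f"average raw space per object: {avg_raw_per_object_mib:.1f} MiB")
print(f"copies per object: {copies_per_object:.2f}")
```

The copies-per-object ratio is what suggests 3-way replication.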

>> POOL          TYPE     USED  AVAIL
>> cephfs_metadata   metadata  1045G  35.6T
>> cephfs.c3sl.data    data     114T  35.6T
[...]
>> POOL          TYPE     USED  AVAIL
>> cephfs.c3sl.meta  metadata  28.2G  35.6T
>> cephfs.c3sl.data    data     114T  35.6T

Total between data and metadata 142TiB, so CephFS uses around 2/3
of the 207TiB stored in this Ceph instance, so perhaps 2/3 of
the objects too, so maybe 35m objects in the CephFS instance.
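
The same guess as a small sketch; it assumes, as above, that the
object count splits in the same proportion as the space used, which
need not hold if object sizes differ a lot between pools:

```python
# Figures as used in the paragraph above.
cephfs_used_tib = 142       # data plus metadata total taken from the post
cluster_used_tib = 207      # "207 TiB used" from the status output
cluster_objects = 50.89e6   # "50.89M objects"

# Assumption: objects are distributed like space used.
cephfs_share = cephfs_used_tib / cluster_used_tib
cephfs_objects = cephfs_share * cluster_objects

print(f"CephFS share of used space: {cephfs_share:.2f}")
print(f"estimated CephFS objects: {cephfs_objects / 1e6:.0f}m")
```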

What is being done is a serial tree walk and copy in 3 replicas
of all objects in the CephFS metadata pool, so it depends on
both the read and write IOPS rates of the metadata pools, but
mostly on the write IOPS.

Note: it is somewhat like an 'fsck' but an 'fsck' that makes 3
copies of each inode.

I wonder whether the source (presumably 'cephfs_metadata') and
target (presumably 'cephfs.c3sl.meta') pools are on the same
physical devices, whether those are SSDs with high small-write
rates or not, and what the physical storage properties are.

Wild guess: metadata is on 10x 3.84TB SSDs without persistent
cache, data is on 48x 8TB devices probably HDDs. Very cost
effective :-).

>      mds: 0/10 daemons up (10 failed), 9 standby
>      osd: 48 osds: 48 up (since 32h), 48 in (since 2M); 22 remapped pgs

Overall it looks like 1 day copied 1TB out of 28TB in metadata,
so it looks like it will take a month.

1TB of metadata means 1.5m 230KiB metadata objects processed in 1
day, so around 15 metadata objects read and written in 3 copies
per second, with a 12MB/s metadata storage write rate, which are
plausible numbers for a metadata pool on SSDs with
non-persistent cache, so the estimate of just 3-4 more weeks
looks plausible again.
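
The arithmetic behind those numbers, as a hedged sketch: it assumes
decimal TB, that the 1TB per day counts all 3 replicas, and that the
rate stays constant, none of which is confirmed:

```python
KIB = 2**10
TB = 10**12                 # assumption: decimal terabytes
SECONDS_PER_DAY = 86_400

copied_per_day = 1 * TB     # written in the first day, all replicas (from the post)
total_metadata = 28 * TB    # total metadata figure used in the post
replicas = 3
avg_object = 230 * KIB      # average metadata per object (from the post)

# Duration at a constant rate.
days_total = total_metadata / copied_per_day

# Raw write rate and logical objects processed.
write_rate_mb_s = copied_per_day / SECONDS_PER_DAY / 1e6
objects_per_day = copied_per_day / replicas / avg_object
objects_per_s = objects_per_day / SECONDS_PER_DAY

print(f"total duration: {days_total:.0f} days")
print(f"write rate: {write_rate_mb_s:.1f} MB/s")
print(f"objects per day: {objects_per_day / 1e6:.1f}m")
print(f"objects per second: {objects_per_s:.0f}")
```

This gives roughly the same figures as above; the small differences
are just rounding.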

Running something like 'iostat -dk -zy 1' on one of the servers
with metadata drives might also help get an idea.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx