>> However, I've observed that the cephfs-data-scan scan_links step has
>> been running for over 24 hours on 35 TB of data, which is replicated
>> across 3 OSDs, resulting in more than 100 TB of raw data.

What matters is the number of "inodes" (and secondarily their size),
that is the number of metadata objects, which is proportional to the
number of files and directories in the CephFS instance.

>> Does anyone have an estimation on the duration for this step?

> scan_links has to iterate through every object in the metadata pool
> and for each object iterate over the omap key/values - so this step
> scales to the amount of objects in the metadata pool, i.e., the number
> of directories and files in the file system.

>> pools:   12 pools, 1475 pgs
>> objects: 50.89M objects, 72 TiB
>> usage:   207 TiB used, 148 TiB / 355 TiB avail
>> pgs:     579358/152674596 objects misplaced (0.379%)

51m objects between data and metadata, average object space used 4MiB,
average metadata per object 230KiB (looks like 3-way replication as per
the default).

>> POOL              TYPE      USED   AVAIL
>> cephfs_metadata   metadata  1045G  35.6T
>> cephfs.c3sl.data  data       114T  35.6T

[...]

>> POOL              TYPE      USED   AVAIL
>> cephfs.c3sl.meta  metadata  28.2G  35.6T
>> cephfs.c3sl.data  data       114T  35.6T

The total between data and metadata is 142TiB, so CephFS holds around
2/3 of the 207TiB stored in this Ceph instance, so perhaps 2/3 of the
objects too, that is maybe 35m objects in the CephFS instance.

What is being done is a serial tree walk and copy, in 3 replicas, of
all objects in the CephFS metadata pool, so the duration depends on
both the read and the write IOPS rates of the metadata pools, but
mostly on the write IOPS. Note: it is somewhat like an 'fsck', but an
'fsck' that makes 3 copies of each inode.

I wonder whether the source (presumably 'cephfs_metadata') and target
(presumably 'cephfs.c3sl.meta') pools are on the same physical
devices, whether they are SSDs with high small-write rates, and what
the physical storage properties are in general.
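The "2/3 of the objects" guess above is just arithmetic on the quoted
'ceph status' figures; a minimal sketch of it (variable names are mine,
the numbers are copied from the output above):

```python
# Back-of-envelope numbers from the quoted 'ceph status' output
# (a sketch of the estimate, not a tool; all inputs assumed from above).

TiB = 1024**4
MiB = 1024**2

objects = 50.89e6        # 50.89M objects cluster-wide
raw_used = 207 * TiB     # 207 TiB raw used in the whole cluster
cephfs_raw = 142 * TiB   # rough CephFS share of raw usage, as estimated above

avg_raw_per_object = raw_used / objects      # ~4.3 MiB raw per object
cephfs_fraction = cephfs_raw / raw_used      # ~0.69, i.e. about 2/3
cephfs_objects = objects * cephfs_fraction   # ~35m objects in CephFS

print(f"avg raw space per object: {avg_raw_per_object / MiB:.1f} MiB")
print(f"CephFS share of raw used: {cephfs_fraction:.0%}")
print(f"estimated CephFS objects: {cephfs_objects / 1e6:.0f}m")
```

Crude, but it is the same reasoning: if CephFS holds about 2/3 of the
raw bytes, assume it holds about 2/3 of the 51m objects too.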
Wild guess: metadata is on 10x 3.84TB SSDs without persistent cache,
data is on 48x 8TB devices, probably HDDs. Very cost effective :-).

> mds: 0/10 daemons up (10 failed), 9 standby
> osd: 48 osds: 48 up (since 32h), 48 in (since 2M); 22 remapped pgs

Overall it looks like 1 day copied 1TB out of 28TB of metadata, so it
looks like it will take a month. 1TB of metadata means 1.5m 230KiB
metadata objects processed in 1 day, that is around 15 metadata
objects read and written in 3 copies per second, with a 12MB/s
metadata storage write rate. These are plausible numbers for a
metadata pool on SSDs with a non-persistent cache, so the estimate of
just 3-4 more weeks looks plausible again.

Running something like 'iostat -dk -zy 1' on one of the servers with
metadata drives might also help get an idea.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
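The day-rate estimate in the reply above can be written out explicitly
(a sketch under the same assumptions stated there: ~1TB of raw metadata
copied per day, ~28TB total, 230KiB per metadata object, 3 replicas):

```python
# Rough ETA and per-second rates for scan_links, using the figures
# assumed in the message above (not measured values).

TB = 1e12      # decimal TB; close enough for an order-of-magnitude guess
KiB = 1024
DAY = 86_400   # seconds per day

copied_per_day = 1 * TB     # raw bytes copied in the first day
metadata_total = 28 * TB    # assumed total metadata to process
object_size = 230 * KiB     # average metadata object size from above
replicas = 3                # 3-way replication: each object written 3x

days_left = metadata_total / copied_per_day                   # ~28 days
objects_per_day = copied_per_day / (replicas * object_size)   # ~1.5m/day
objects_per_sec = objects_per_day / DAY                       # ~15-17/s
write_rate_mb_s = copied_per_day / DAY / 1e6                  # ~12 MB/s
```

That is, roughly 15 object read-plus-triple-write cycles per second at
about 12 MB/s of raw metadata writes, which is why a month-long run is
a plausible figure for SSDs without a persistent write cache.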