On Thu, Apr 16, 2020 at 3:27 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
> On Thu, Apr 16, 2020 at 3:53 AM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> >
> > On Thu, Apr 16, 2020 at 12:15 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> > >
> > > On Wed, Apr 15, 2020 at 5:13 PM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> > > >
> > > > On Wed, Apr 15, 2020 at 2:33 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > Following some cephfs issues today we have a stable cluster but the
> > > > > num_strays is incorrect.
> > > > > After starting the mds, the values are reasonable, but they very soon
> > > > > underflow and start showing 18E (2^64 - a few)
> > > > >
> > > > > ---------------mds---------------- --mds_cache--- ------mds_log------ -mds_mem- ----mds_server----- mds_ ---objecter---
> > > > >  req rlat  fwd inos caps  exi  imi|stry recy recd|subm evts segs repl| ino   dn| hcr  hcs  hsr  cre|sess|actv   rd   wr
> > > > >  129    0    0 253k 1.9k    0    0| 246    0    0|   5 4.0k    5    0|253k 254k| 129    0    0    0| 119|   3    1    2
> > > > > 8.2k    0    0 253k 1.9k    0    0| 129    0    0| 395 4.4k    7    0|253k 254k|8.2k   11    0    0| 119|   0   33  517
> > > > > 9.7k    0    0 253k 1.8k    0    0| 181    0    0| 302 4.7k    7    0|253k 254k|9.7k    5    0    0| 119|   1   44  297
> > > > >  10k    0    0 253k 1.8k    0    0| 217    0    0| 382 5.1k    7    0|253k 254k| 10k   11    0    0| 119|   0   54  405
> > > > > 9.0k    0    0 253k 1.7k    0    0| 205    0    0| 386 5.5k    8    0|253k 254k|9.0k    4    0    0| 119|   1   46  431
> > > > > 8.2k    0    0 253k 1.7k    0    0| 161    0    0| 326 5.8k    8    0|253k 254k|8.2k    6    0    0| 119|   1   37  397
> > > > > 8.0k    0    0 253k 1.6k    0    0| 135    0    0| 279 6.1k    8    0|253k 254k|8.0k    4    0    0| 119|   1   31  317
> > > > > 9.2k    0    0 253k 1.6k    0    0| 18E    0    0| 153 6.2k    8    0|253k 254k|9.2k    6    0    0| 119|   1    2  265
> > > > > 8.2k    0    0 253k 1.7k    0    0| 18E    0    0|  40 6.3k    8    0|253k 254k|8.2k    5    0    0| 119|   3    3   17
> > > > >
> > > > > Is there a way to reset the num_strays to the correct number of strays ?
> > > > >
> > > >
> > > > try command 'ceph daemon <mds of rank 0> scrub_path '~mdsdir' force recursive repair'
> > >
> > > thanks for the reply. Here's the ceph log from this repair:
> > >
> > > https://termbin.com/o8tc
> > >
> > > The active mds (single active only) still showed 18E, so I failed over
> > > to a standby and it seems a bit better, but still occasionally
> > > dropping below zero to 18E.
> > > I ran scrub_path a few times and it finds errors each time...
> > >
> >
> > do you mean scrub fixed the error, but the stat error keeps happening?
> > which version of mds do you use?
>
> yes that's correct -- the num_strays is still going negative. This
> cluster is running v14.2.8.
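As an aside, a quick way to catch the moment the counter wraps is to
poll it in a loop, something like this (untested sketch; adjust the
mds daemon name to your active mds):

  # while sleep 5; do date +%T; ceph daemon mds.`hostname -s` perf dump | jq .mds_cache.num_strays; done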
>
> Here is another attempt to fix:
>
> # ceph daemon mds.`hostname -s` perf dump | jq .mds_cache.num_strays
> 1
> # ceph daemon mds.`hostname -s` perf dump | jq .mds_cache.num_strays
> 0
> # ceph daemon mds.`hostname -s` perf dump | jq .mds_cache.num_strays
> 2
> # ceph daemon mds.`hostname -s` perf dump | jq .mds_cache.num_strays
> 2
> # ceph daemon mds.`hostname -s` perf dump | jq .mds_cache.num_strays
> 18446744073709552000
> # ceph daemon mds.`hostname -s` scrub_path '~mdsdir' force recursive repair
> {
>     "return_code": 0,
>     "scrub_tag": "c33061e2-01c2-46f3-9d42-3d65408067bc",
>     "mode": "asynchronous"
> }
> # ceph daemon mds.`hostname -s` perf dump | jq .mds_cache.num_strays
> 18446744073709552000
> # ceph daemon mds.`hostname -s` perf dump | jq .mds_cache.num_strays
> 18446744073709552000
> # ceph daemon mds.`hostname -s` perf dump | jq .mds_cache.num_strays
> 1
> # ceph daemon mds.`hostname -s` perf dump | jq .mds_cache.num_strays
> 18446744073709552000
>
> Here is the only log from that scrub instance -- nothing logged about
> the stats this time:
>
> 2020-04-16 09:13:35.589 7fe16ef2a700  1 mds.cephdwightmds2 asok_command: scrub_path (starting...)
> 2020-04-16 09:13:35.590 7fe16ef2a700  1 mds.cephdwightmds2 asok_command: scrub_path (complete)
> 2020-04-16 09:13:35.590 7fe167491700  0 log_channel(cluster) log [WRN] : bad backtrace on inode 0x1003c43f025(~mds0/stray1/1003c43f025), rewriting it
> 2020-04-16 09:13:35.591 7fe167491700  0 log_channel(cluster) log [INF] : Scrub repaired inode 0x1003c43f025 (~mds0/stray1/1003c43f025)
> 2020-04-16 09:13:35.591 7fe167491700 -1 mds.0.scrubstack _validate_inode_done scrub error on inode [inode 0x1003c43f025 [2,head] ~mds0/stray1/1003c43f025 auth v2590028892 dirtyparent s=0 nl=0 n(v0 rc2020-04-16 09:13:32.052106 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={897496551=0-4194304@1} caps={897496551=pAsxLsXsxFsxcrwb/pAsxXsxFxwb@5},l=897496551 | ptrwaiter=0 request=0 lock=0 caps=1 dirtyparent=1 scrubqueue=0 dirty=1 waiter=0 authpin=0 0x55a773512700]: {"performed_validation":true,"passed_validation":false,"backtrace":{"checked":true,"passed":false,"read_ret_val":-2,"ondisk_value":"(-1)0x0:[]//","memoryvalue":"(173)0x1003c43f025:[<0x601/1003c43f025 v2590028892>,<0x100/stray1 v2461787128>]//","error_str":"failed to decode on-disk backtrace (0 bytes)!"},"raw_stats":{"checked":false,"passed":false,"read_ret_val":0,"ondisk_value.dirstat":"f()","ondisk_value.rstat":"n()","memory_value.dirrstat":"f()","memory_value.rstat":"n()","error_str":""},"return_code":-2}
> 2020-04-16 09:13:35.591 7fe167491700  0 log_channel(cluster) log [WRN] : bad backtrace on inode 0x1003c43f023(~mds0/stray1/1003c43f023), rewriting it
> 2020-04-16 09:13:35.591 7fe167491700  0 log_channel(cluster) log [INF] : Scrub repaired inode 0x1003c43f023 (~mds0/stray1/1003c43f023)
> 2020-04-16 09:13:35.591 7fe167491700 -1 mds.0.scrubstack _validate_inode_done scrub error on inode [inode 0x1003c43f023 [2,head] ~mds0/stray1/1003c43f023 auth v2590028888 dirtyparent s=0 nl=0 n(v0 rc2020-04-16 09:13:31.925471 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={897496551=0-4194304@1} caps={897496551=pAsxLsXsxFsxcrwb/pAsxXsxFxwb@5},l=897496551 | ptrwaiter=0 request=0 lock=0 caps=1 dirtyparent=1 scrubqueue=0 dirty=1 waiter=0 authpin=0 0x55a773511800]: {"performed_validation":true,"passed_validation":false,"backtrace":{"checked":true,"passed":false,"read_ret_val":-2,"ondisk_value":"(-1)0x0:[]//","memoryvalue":"(173)0x1003c43f023:[<0x601/1003c43f023 v2590028888>,<0x100/stray1 v2461787128>]//","error_str":"failed to decode on-disk backtrace (0 bytes)!"},"raw_stats":{"checked":false,"passed":false,"read_ret_val":0,"ondisk_value.dirstat":"f()","ondisk_value.rstat":"n()","memory_value.dirrstat":"f()","memory_value.rstat":"n()","error_str":""},"return_code":-2}
> 2020-04-16 09:13:35.637 7fe167491700  0 log_channel(cluster) log [INF] : scrub complete with tag 'c33061e2-01c2-46f3-9d42-3d65408067bc'

It looks like the dirfrags' stats are fixed. Restarting the mds should fix
the incorrect 'num_strays' perf counter.

> Cheers, Dan