Dear Ceph / CephFS gurus...

Bear with me while I give you a bit of context; the questions appear at the end.

1) I am currently running ceph 9.0.3, which I installed to test the cephfs recovery tools.

2) I've created a situation where I deliberately lost some data and metadata (check annex 1 after the main email).

3) I've stopped the mds and waited to see how the cluster reacted. After some time, as expected, the cluster reported an ERROR state, with a lot of PGs degraded and stuck.
# ceph -s

4) I've unmounted the cephfs clients ('umount -l' worked for me this time, but I have already had situations where 'umount' would simply hang, and the only viable solution was to reboot the client).

5) I've recovered the ceph cluster by (details on the recovery operations are in annex 2 after the main email):
- declaring the osds lost
- removing the osds from the crush map
- letting the cluster stabilize and letting all the recovery I/O finish
- identifying stuck PGs
- checking if they existed, and if not, recreating them.

6) I've restarted the MDS. Initially the mds cluster was considered degraded, but after a short time that message disappeared. The WARNING status was only because of "too many PGs per OSD (409 > max 300)".
# ceph -s
# ceph -s

7) I was able to mount the cephfs filesystem in a client. When I tried to read a file made of some lost objects, I got holes in part of the file (compare with the same operation in annex 1).
# od /cephfs/goncalo/5Gbytes_029.txt | head

Finally, the questions:

a./ In a situation such as the one described above, how can we safely terminate cephfs in the clients? I have had situations where umount simply hangs and there is no real way to unblock the situation unless I reboot the client. If we have hundreds of clients, I would like to avoid that.

b./ I was expecting to have lost metadata information, since I've wiped OSDs where the metadata for the /cephfs/goncalo/5Gbytes_029.txt file was stored. I was a bit surprised that '/cephfs/goncalo/5Gbytes_029.txt' was still properly referenced, without me having to run any recovery tool. What am I missing?

c./ After recovering the cluster, I thought I was in a cephfs situation where I had
    c.1 files with holes (because of lost PGs and objects in the data pool)
    c.2 files without metadata (because of lost PGs and objects in the metadata pool)
    c.3 metadata without associated files (because of lost PGs and objects in the data pool)
I've tried to run the recovery tools, but I have several doubts which I did not find described in the documentation. Is there a specific order / a way to run the tools for the c.1, c.2 and c.3 cases I mentioned?

d./ Since I was testing, I simply ran the following sequence, but I am not sure what the commands are doing, nor if the sequence is correct (see the sketch after the questions). I think an example use case should be documented. Especially, cephfs-data-scan did not return any output or information, so I am not sure whether anything happened at all.
# cephfs-table-tool 0 reset session
# cephfs-journal-tool --rank=0 journal reset
# cephfs-data-scan scan_extents --force-pool cephfs_mt   (doesn't seem to work)

e./ After running the cephfs tools, everything seemed to be in exactly the same state. No visible changes or errors at the filesystem level. So, at this point I am not sure what to conclude...
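For comparison, the upstream CephFS disaster-recovery documentation describes roughly the following order. This is a minimal sketch only, not verified against 9.0.3, and the data pool name cephfs_dt is taken from annex 1. Note that cephfs-data-scan rebuilds metadata by scanning the data pool, so pointing scan_extents at the metadata pool cephfs_mt, as in question d, may be why it appeared to do nothing:

# cephfs-journal-tool journal export backup.bin          <- back up the journal before modifying anything
# cephfs-journal-tool event recover_dentries summary     <- salvage dentries still readable from the journal
# cephfs-journal-tool journal reset
# cephfs-table-tool all reset session
# cephfs-data-scan scan_extents cephfs_dt                <- data pool; must finish before scan_inodes
# cephfs-data-scan scan_inodes cephfs_dt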
Thank you in advance for your responses.
Cheers
Goncalo

# #####################
# ANNEX 1: GENERATE DATA LOSS #
# #####################

1) Check a file
# ls -l /cephfs/goncalo/5Gbytes_029.txt
-rw-r--r-- 1 root root 5368709120 Sep 8 03:55 /cephfs/goncalo/5Gbytes_029.txt

--- * ---

2) See its contents
# od /cephfs/goncalo/5Gbytes_029.txt | head
0000000 150343 117016 156040 100553 154377 174521 137643 047440
0000020 006310 013157 064422 136662 145623 116101 137007 031237
0000040 111570 010104 103540 126335 014632 053445 006114 047003
0000060 123201 170045 042771 036561 152363 017716 000405 053556
0000100 102524 106517 066114 071112 144366 011405 074170 032621
0000120 047761 177217 103414 000774 174320 122332 110323 065706
0000140 042467 035356 132363 067446 145351 155277 177533 062050
0000160 016303 030741 066567 043517 172655 176016 017304 033342
0000200 177440 130510 163707 060513 055027 107702 023012 130435
0000220 022342 011762 035372 044033 152230 043424 004062 177461

--- * ---

3) Get its inode, and convert it to hex
# ls -li /cephfs/goncalo/5Gbytes_029.txt
1099511627812 -rw-r--r-- 1 root root 5368709120 Sep 8 03:55 /cephfs/goncalo/5Gbytes_029.txt
(1099511627812)_base10 = (10000000024)_base16

--- * ---

4) Get the osd pool details
# ceph osd pool ls detail
pool 1 'cephfs_dt' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 196 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 2 'cephfs_mt' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 182 flags hashpspool stripe_width 0

--- * ---

5) Get the file / PG / OSD mapping
# ceph osd map cephfs_dt 10000000024.00000000
osdmap e479 pool 'cephfs_dt' (1) object '10000000024.00000000' -> pg 1.c18fbb6f (1.36f) -> up ([19,15,6], p19) acting ([19,15,6], p19)
# ceph osd map cephfs_mt 10000000024.00000000
osdmap e479 pool 'cephfs_mt' (2) object '10000000024.00000000' -> pg 2.c18fbb6f (2.36f) -> up ([27,23,13], p27) acting ([27,23,13], p27)

--- * ---

6) Kill the relevant osd daemons, umount the osd partitions and delete the partitions
[root@server1 ~]# for o in 6; do dev=`df /var/lib/ceph/osd/ceph-$o | tail -n 1 | awk '{print $1}'`; /etc/init.d/ceph stop osd.$o; umount /var/lib/ceph/osd/ceph-$o; parted -s ${dev::8} rm 1; parted -s ${dev::8} rm 2; partprobe; done
[root@server2 ~]# for o in 13 15; do dev=`df /var/lib/ceph/osd/ceph-$o | tail -n 1 | awk '{print $1}'`; /etc/init.d/ceph stop osd.$o; umount /var/lib/ceph/osd/ceph-$o; parted -s ${dev::8} rm 1; parted -s ${dev::8} rm 2; partprobe; done
[root@server3 ~]# for o in 19 23; do dev=`df /var/lib/ceph/osd/ceph-$o | tail -n 1 | awk '{print $1}'`; /etc/init.d/ceph stop osd.$o; umount /var/lib/ceph/osd/ceph-$o; parted -s ${dev::8} rm 1; parted -s ${dev::8} rm 2; partprobe; done
[root@server4 ~]# for o in 27; do dev=`df /var/lib/ceph/osd/ceph-$o | tail -n 1 | awk '{print $1}'`; /etc/init.d/ceph stop osd.$o; umount /var/lib/ceph/osd/ceph-$o; parted -s ${dev::8} rm 1; parted -s ${dev::8} rm 2; partprobe; done
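The object names used in steps 3 and 5 follow the default CephFS naming scheme, <inode in hex>.<object index as 8 hex digits>. A minimal sketch of how to derive them and list all of the file's objects (rados ls walks the whole pool, so it can be slow on large pools):

# ino=1099511627812
# printf '%x.%08x\n' $ino 0
10000000024.00000000
# rados -p cephfs_dt ls | grep '^10000000024\.'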
# #######################
# ANNEX 2: RECOVER CEPH CLUSTER #
# #######################

1) Declare the OSDs lost
# for o in 6 13 15 19 23 27; do ceph osd lost $o --yes-i-really-mean-it; done
marked osd lost in epoch 480
marked osd lost in epoch 482
marked osd lost in epoch 487
marked osd lost in epoch 483
marked osd lost in epoch 489
marked osd lost in epoch 485

--- * ---

2) Remove the OSDs from the CRUSH map
# for o in 6 13 15 19 23 27; do ceph osd crush remove osd.$o; ceph osd down $o; ceph osd rm $o; ceph auth del osd.$o; done
removed item id 6 name 'osd.6' from crush map
osd.6 is already down.
removed osd.6
updated
removed item id 13 name 'osd.13' from crush map
osd.13 is already down.
removed osd.13
updated
removed item id 15 name 'osd.15' from crush map
osd.15 is already down.
removed osd.15
updated
removed item id 19 name 'osd.19' from crush map
osd.19 is already down.
removed osd.19
updated
removed item id 23 name 'osd.23' from crush map
osd.23 is already down.
removed osd.23
updated
removed item id 27 name 'osd.27' from crush map
osd.27 is already down.
removed osd.27
updated

--- * ---

3) Give the cluster time to react, and let the recovery I/O finish.

--- * ---

4) Check which PGs are still stale
# ceph pg dump_stuck stale
ok
pg_stat state up up_primary acting acting_primary
1.23 stale+undersized+degraded+peered [23] 23 [23] 23
2.38b stale+undersized+degraded+peered [23] 23 [23] 23
(...)

--- * ---

5) Try to query those stale PGs
# for pg in `ceph pg dump_stuck stale | grep ^[12] | awk '{print $1}'`; do ceph pg $pg query; done
ok
Error ENOENT: i don't have pgid 1.23
Error ENOENT: i don't have pgid 2.38b
(...)

--- * ---

6) Create the non-existing PGs
# for pg in `ceph pg dump_stuck stale | grep ^[12] | awk '{print $1}'`; do ceph pg force_create_pg $pg; done
ok
pg 1.23 now creating, ok
pg 2.38b now creating, ok
(...)

--- * ---

7) At this point, for the PGs to leave the 'creating' status, I had to restart all the remaining OSDs. Otherwise those PGs stayed in the creating state forever.
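To see exactly which stripes of /cephfs/goncalo/5Gbytes_029.txt were lost, the expected objects can be stat'ed directly in the data pool. A minimal sketch, assuming the default layout of 4 MiB objects, so the 5 GiB file spans 1280 objects (indices 00000000 through 000004ff):

# for i in $(seq 0 1279); do obj=$(printf '10000000024.%08x' $i); rados -p cephfs_dt stat $obj > /dev/null 2>&1 || echo "missing: $obj"; done

Each missing object corresponds to a 4 MiB range of the file that now reads back as zeros, which matches the holes observed in step 7 of the main email.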
--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW 2006
T: +61 2 93511937 |

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com