Hi Michael,

that sounds like a big step forward.

I would probably remove the data pool from the ceph fs first before doing anything on it. Is the new pool set as data pool on the root of the entire ceph fs? If so, I see no reason for not detaching the pool from the ceph fs right away. Also to confirm that this goes without issues. Your choice though. This is a decisive moment and I would sweat as well. A shot of whiskey for the nerves maybe :)

If you manage to "ceph fs rm_data_pool <fs_name> fs.data.archive.frames" the pool without problems, you are then safe to play with it. I think it might be a good idea to keep the broken pool for a while for debugging and not destroy any objects in it (or dump the objects/pool before changing).

I'm a bit surprised that no developers seem to show interest in this case. The pool reduced to problematic objects only should hold interesting information about the original cause of the degradation.

In a way, I really wonder how the pool delete will go. I guess there is still the problem with the OSD map that has broken PG information. The pool delete would be the last step that can lead to hiccups. I hope it goes away without taking anything with it.

Best regards and good luck,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Michael Thomas <wart@xxxxxxxxxxx>
Sent: 15 December 2020 21:00:12
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re: Re: multiple OSD crash, unfound objects

Hi Frank,

I was able to migrate the data off of the "broken" pool (fs.data.archive.frames) and onto the new one (fs.data.archive.newframes).
I verified that no useful data is left on the "broken" pool:

* 'find + getfattr -n ceph.file.layout.pool' shows no files on the bad pool
* 'find + getfattr -n ceph.dir.layout.pool' shows no future files will land on the bad pool
* 'ceph -s' shows some misplaced/degraded/unfound objects on the bad pool:

  data:
    pools:   14 pools, 3492 pgs
    objects: 111.94M objects, 425 TiB
    usage:   587 TiB used, 525 TiB / 1.1 PiB avail
    pgs:     68/893408279 objects degraded (0.000%)
             35/893408279 objects misplaced (0.000%)
             24/111943463 objects unfound (0.000%)
             3480 active+clean
             5    active+recovery_unfound+degraded+remapped
             4    active+clean+scrubbing+deep
             2    active+recovery_unfound+undersized+degraded+remapped
             1    active+recovery_unfound+degraded

* 'rados ls --pool fs.data.archive.frames' shows these orphaned objects. I extracted the first component of the rados object names (e.g. 10000020fa1 in 10000020fa1.00000030) and ran 'find /ceph -inum XXX' to verify that none of these objects maps back to a known file in the cephfs filesystem.

Here are the next steps that I plan to perform:

* 'rados rm --pool fs.data.archive.frames <obj_id>' on a couple of objects to see how ceph handles it.
* 'rados purge fs.data.archive.frames' to purge all objects in the "broken" pool
* 'ceph fs rm_data_pool <fs_name> fs.data.archive.frames'

Is there anything else you think I ought to check before finalizing the removal of this broken pool?

--Mike

On 11/22/20 1:59 PM, Frank Schilder wrote:
> Dear Michael,
>
> yes, your plan will work if the temporary space requirement can be addressed. Good luck!
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Michael Thomas <wart@xxxxxxxxxxx>
> Sent: 22 November 2020 20:14:09
> To: Frank Schilder; ceph-users@xxxxxxx
> Subject: Re: Re: multiple OSD crash, unfound objects
>
> Hi Frank,
>
> From my understanding, with my current filesystem layout, I should be
> able to remove the "broken" pool once the data has been moved off of it.
> This is because the "broken" pool is not the default data pool.
> According to the documentation[1]:
>
> fs rm_data_pool <file system name> <pool name/id>
>
> "This command removes the specified pool from the list of data pools for
> the file system. If any files have layouts for the removed data pool,
> the file data will become unavailable. The default data pool (when
> creating the file system) cannot be removed."
>
> My default data pool (triply replicated on SSD) is still healthy. The
> "broken" pool is EC on HDD, and while it holds a majority of the
> filesystem data (~400TB), it is not the root of the filesystem.
>
> My plan would be:
>
> * Create a new data pool matching the "broken" pool
> * Create a parallel directory tree matching the directories that are
> mapped to the "broken" pool. eg Broken: /ceph/frames/..., New:
> /ceph/frames.new/...
> * Use 'setfattr -n ceph.dir.layout.pool' on this parallel directory tree
> to map the content to the new data pool
> * Use parallel+rsync to copy data from the broken pool to the new pool.
> * After each directory gets filled in the new pool, mv/rename the old
> and new directories so that users start accessing the data from the new
> pool.
> * Delete data from the renamed old pool directories as they are
> replaced, to keep the OSDs from filling up
> * After all data is moved off of the old pool (verified by checking
> ceph.dir.layout.pool and ceph.file.layout.pool on all files in the fs,
> as well as rados ls, ceph df), remove the pool from the fs.
>
> This is effectively the same strategy I used when moving frequently
> accessed directories from the EC pool to a replicated SSD pool, except
> that in the previous situation I didn't need to remove any pools at the
> end. It's time consuming, because every file on the "broken" pool needs
> to be copied, but it minimizes downtime. Being able to add some
> temporary new OSDs to the new pool (but not the "broken" pool) would
> reduce some pressure of filling up the OSDs. If the old and new pools
> use the same crush rule, would disabling backfilling+rebalancing keep
> the OSDs from being used in the old pool until the old pool is deleted
> (with the exception of the occasional new file)?
>
> --Mike
> [1]https://docs.ceph.com/en/latest/cephfs/administration/#file-systems
>
> On 11/22/20 12:19 PM, Frank Schilder wrote:
>> Dear Michael,
>>
>> I was also wondering whether deleting the broken pool could clean up everything. The difficulty is that, while migrating a pool to new devices is easy via a crush rule change, migrating data between pools is not so easy, in particular if you can't afford downtime.
>>
>> In case you can afford some downtime, it might be possible to migrate fast by creating a new pool and using the pool copy command to migrate the data (rados cppool ...). It's important that the FS is shut down (no MDS active) during this copy process. After the copy, one could either rename the pools to have the copy match the fs data pool name, or change the data pool at the top level directory. You might need to set some pool meta data by hand, notably, the fs tag.
>>
>> Having said that, I have no idea how a ceph fs reacts if presented with a replacement data pool. Although I don't believe that meta data contains the pool IDs, I cannot exclude that complication. The copy pool variant should be tested with an isolated FS first.
>>
>> The other option is what you describe: create a new data pool, place the fs root on this pool and copy every file onto itself. This should also do the trick. However, with this method you will not be able to get rid of the broken pool. After the copy, you could, however, reduce the number of PGs to below the unhealthy one and the broken PG(s) might get deleted cleanly. Then you still have a surplus pool, but at least all PGs are clean.
>>
>> I hope one of these will work. Please post your experience here.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Michael Thomas <wart@xxxxxxxxxxx>
>> Sent: 22 November 2020 18:29:16
>> To: Frank Schilder; ceph-users@xxxxxxx
>> Subject: Re: Re: multiple OSD crash, unfound objects
>>
>> On 10/23/20 3:07 AM, Frank Schilder wrote:
>>> Hi Michael.
>>>
>>>> I still don't see any traffic to the pool, though I'm also unsure how much traffic is to be expected.
>>>
>>> Probably not much. If ceph df shows that the pool contains some objects, I guess that's sorted.
>>>
>>> That osdmaptool crashes indicates that your cluster runs with corrupted internal data. I tested your crush map and you should get complete PGs for the fs data pool. That you don't and that osdmaptool crashes points at a corruption of internal data. I'm afraid this is the point where you need support from ceph developers and should file a tracker report (https://tracker.ceph.com/projects/ceph/issues). A short description of the origin of the situation with the osdmaptool output and a reference to this thread linked in should be sufficient. Please post a link to the ticket here.
>>
>> https://tracker.ceph.com/issues/48059
>>
>>> In parallel, you should probably open a new thread focussed on the osd map corruption. Maybe there are low-level commands to repair it.
>>
>> Will do.
>>
>>> You should wait with trying to clean up the unfound objects until this is resolved. Not sure about adding further storage either. To me, this sounds quite serious.
>>
>> Another approach that I'm considering is to create a new pool using the
>> same set of OSDs, adding it to the set of cephfs data pools, and
>> migrating the data from the "broken" pool to the new pool.
>>
>> I have some additional unused storage that I could add to this new pool,
>> if I can figure out the right crush rules to make sure they don't get
>> used for the "broken" pool too.
>>
>> --Mike
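
Note on the orphan check from the 15 December message: CephFS data objects are named <inode in hex>.<object index in hex>, while 'find -inum' expects a decimal inode number, so the first component has to be converted before searching. The following is only a rough sketch of that conversion, not the script that was actually used in this thread. The function names are made up for illustration, and the input is assumed to be plain 'rados ls' output with one object name per line.

```python
import sys


def inode_from_object_name(name: str) -> int:
    """Return the decimal inode number encoded in a CephFS data object name.

    The part before the first '.' is the inode number in hex, for example
    '10000020fa1' in '10000020fa1.00000030'.
    """
    ino_hex, _, _index = name.strip().partition(".")
    return int(ino_hex, 16)


def unique_inodes(names):
    """Return the sorted, de-duplicated inode numbers for a list of object names."""
    inodes = set()
    for name in names:
        if name.strip():  # skip blank lines in the listing
            inodes.add(inode_from_object_name(name))
    return sorted(inodes)


if __name__ == "__main__":
    # Hypothetical usage: rados ls --pool <pool> | python3 this_script.py
    # Each printed number can then be passed to: find /ceph -inum <number>
    for ino in unique_inodes(sys.stdin):
        print(ino)
```

Several objects usually belong to the same file, so de-duplicating the inode numbers first keeps the number of 'find' runs down. An inode for which 'find' prints nothing has no corresponding file under the searched mount point.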