I also have a question: among the 9 OSDs, some were never restarted. In theory, those OSDs should still retain the object's information (metadata, pg_log, etc.), even if the object itself cannot be recovered. I have sorted out the boot logs of the OSDs where the object should be located, together with the PG peering process:
Original Email
Sent: Wednesday, 31 July 2024 05:49
To: ceph-users
Subject: [ceph-users] Please guide us in identifying the cause of the data miss in EC pool
1. The incident occurred at 04:55. At 05:01 we set the noout, nobackfill, and norecover flags, and at 06:22 we executed `ceph osd pause`. By 07:23 all OSDs were up and in, and we then executed `ceph osd unpause` (the exact commands are sketched after this list).
2. We randomly selected a problematic file and attempted to download it via the S3 API. The RGW returned "No such key".
3. The RGW log for the download showed op status=-2, http_status=200. We also checked the upload logs, which showed that at 2024-07-13 04:19:20.052 the upload completed with op status=0, http_status=200.
4. We set debug_rgw=20 and attempted to download the file again. We found that one 4M chunk (the file is 64M) failed to be retrieved.
5. Using rados get for this chunk returned: "No such file or directory".
6. Setting debug_osd=20, we observed get_object_context: obc NOT found in cache.
7. Setting debug_bluestore=20, we saw get_onode oid xxx, key xxx != '0xfffffffffffffffeffffffffffffffff'o'.
8. We stopped the primary OSD and tried to get the file again, but the result was the same. The object’s corresponding PG state was active+recovery_wait+degraded.
9. Using `ceph-objectstore-tool` with `--op list` and `--op log`, we could not find any information about the object. Dumping keys with `ceph-kvstore-tool rocksdb` did not reveal anything new either (see the sketch after this list).
10. If an OSD had lost data, the PG state should have been unfound or inconsistent.
11. We started reanalyzing the startup logs of the OSDs related to the PG. The pool uses erasure coding 6+3, so the PG spans 9 OSDs. Six of these OSDs had restarted, and after peering the PG state became active.
12. We sorted the lost files by upload time; all of them were uploaded before the failure occurred. The earliest upload was around 1 a.m., and the successful upload records can be found in the RGW logs.
13. We have submitted an issue on the Ceph tracker (https://tracker.ceph.com/issues/66942), which includes the original logs needed for troubleshooting. However, four days have passed without any response, so in desperation we are sending this email, hoping that someone from the Ceph team can guide us as soon as possible.
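For reference, the commands behind steps 1 and 4-7 were along the following lines. This is only a sketch: the OSD ids, the pool name, and the object name are placeholders, not the actual values from our cluster.

  # step 1: cluster flags set right after the incident, reverted afterwards
  ceph osd set noout; ceph osd set nobackfill; ceph osd set norecover
  ceph osd pause                                    # reverted later with: ceph osd unpause
  # step 4: raise RGW debug logging, then retry the S3 download
  ceph config set client.rgw debug_rgw 20
  # step 5: try to fetch the missing 4M chunk directly from RADOS
  rados -p <data-pool> get <chunk-object-name> /tmp/chunk.out
  # steps 6-7: raise OSD and BlueStore debug logging on the acting set
  ceph tell osd.<id> config set debug_osd 20
  ceph tell osd.<id> config set debug_bluestore 20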
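The offline checks in steps 8 and 9 looked roughly like this; again only a sketch, with the OSD id, data path, and PG id as placeholders, and the kvstore path depending on the deployment.

  # step 8: check the PG state, then stop the primary OSD
  ceph pg <pgid> query
  systemctl stop ceph-osd@<id>
  # step 9: look for the object and its pg_log entries on the stopped OSD
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> --pgid <pgid> --op list
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> --pgid <pgid> --op log
  ceph-kvstore-tool rocksdb <path-to-db> list       # or: bluestore-kv <osd-data-path> list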
We are currently in a difficult situation and hope you can provide guidance. Thank you.
Best regards.
wu_chulin@xxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx