Re: Please guide us in identifying the cause of the data miss in EC pool


 



Hi Chulin,

When it comes to data consistency, Ceph is generally regarded as an undefeated master.

Considering the very few (~100) rados objects that were completely lost (data and metadata), and the fact that you're using colocated HDD OSDs whose volatile disk buffers cache RocksDB metadata as well as BlueStore data and metadata, I have little doubt that volatile disk buffers were involved in the data loss, whatever the logs say or don't say about which 6 of the 9 OSDs were in the acting set at the moment of the power outage.

Unless you're OK with facing data loss again, I'd advise you to fix the initial design flaws if you can: stop using non-persistent caches/buffers along the IO path, raise min_size to k+1, and reconsider data placement with regard to the risks of network partitioning, power outage, and fire. Also, considering the ceph status output, make sure you don't run out of disk space.
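
As a minimal sketch of that last point, this is roughly how the volatile write cache could be checked and disabled on the HDDs (the device name is a placeholder; the setting may not survive a power cycle on every model, so it usually has to be reapplied at boot, e.g. via a udev rule):

  # Check whether the drive's volatile write cache is currently enabled
  smartctl -g wcache /dev/sdX
  hdparm -W /dev/sdX

  # Disable it on a SATA drive
  hdparm -W 0 /dev/sdX

  # On SAS drives, the equivalent would be:
  # sdparm --clear WCE /dev/sdX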

Best regards,
Frédéric.

________________________________
From: Best Regards <wu_chulin@xxxxxx>
Sent: Thursday, 8 August 2024 11:32
To: Frédéric Nass
Cc: ceph-users
Subject: Re: Re: Re: Re: Please guide us in identifying the cause of the data miss in EC pool

Hi Frédéric,


Sorry, I may not have expressed it clearly before. The epoch and OSD up/down timeline was extracted and merged from the logs of the 9 OSDs. I analyzed the peering process of PG 9.11b6. OSDs 494, 1169 and 1057 fully recorded the down/up events of the other OSDs, and I also checked the logs of the other 6 OSDs: the role changes during peering were as expected and no abnormalities were found. I also checked the status of the monitors. One of the 5 monitors lost power and was powered back on after about 40 minutes; its log showed that its rank value was relatively high and it did not become the leader.
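
For reference, a minimal sketch of how the acting/up sets and the peering history of that PG could be cross-checked on a live cluster (assuming PG id 9.11b6; the pool name is a placeholder):

  # Up/acting sets, peer info and past intervals of the PG
  ceph pg 9.11b6 query

  # Quick summary line for the same PG
  ceph pg ls-by-pool <pool-name> | grep ^9.11b6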

Let's talk about the failure domain. The failure domain we set is the host level, but in fact the hosts are spread across 2 buildings; the original designer did not consider building-level failures.



In this case the OSDs could have suffered a split brain, but from the logs it does not appear to have happened.



Best regards.



wu_chulin@xxxxxx






                       
Original Email

From:"Frédéric Nass"< frederic.nass@xxxxxxxxxxxxxxxx &gt;;

Sent Time:2024/8/8 15:40

To:"Best Regards"< wu_chulin@xxxxxx &gt;;

Cc recipient:"ceph-users"< ceph-users@xxxxxxx &gt;;

Subject:Re: Re: Re: Please guide us inidentifying thecauseofthedata miss in EC pool


`ceph osd pause` imposes a lot of constraints from an operational perspective. :-)


Host uptime and service running time are one thing, but they don't mean that these 3 OSDs were in the acting set when the power outage occurred.


Since OSDs 494, 1169 and 1057 did not crash, I assume they're in the same failure domain. Is that right? 


Being isolated, along with their local MON(s), from the other MONs and the other 6 OSDs, there's a fair chance that one of the 6 other OSDs in the other failure domains took the lead, sent 5 chunks around and acknowledged the write to the RGW client. Then all of them crashed.


Your thoughts?


Frédéric.






From: Best Regards <wu_chulin@xxxxxx>
Sent: Thursday, 8 August 2024 09:16
To: Frédéric Nass
Cc: ceph-users
Subject: Re: Re: Please guide us in identifying the cause of the data miss in EC pool






Hi Frédéric,


Yes. I checked the uptime of the hosts where those OSDs live and the running time of the OSD services. They were only stopped when I ran `ceph-objectstore-tool`; they had been running before that.


Because we need to maintain the hardware frequently (it is quite old), min_size is set to the lowest value. When a failure occurs, we set the read/write pause flag. During the failure there were no PUT operations on the S3 keys to which these objects belong.
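
For reference, a sketch of the flags described here (the same ones listed later in this thread), which pause client IO and keep OSDs from being marked out during maintenance:

  # Before maintenance
  ceph osd set noout
  ceph osd set nobackfill
  ceph osd set norecover
  ceph osd pause        # sets pauserd,pausewr: all client reads and writes stop

  # After maintenance
  ceph osd unpause
  ceph osd unset norecover
  ceph osd unset nobackfill
  ceph osd unset noout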








Best Regards,
wu_chulin@xxxxxx






                       
Original Email

From: "Frédéric Nass" <frederic.nass@xxxxxxxxxxxxxxxx>
Sent: 2024/8/8 14:40
To: "Best Regards" <wu_chulin@xxxxxx>
Cc: "ceph-users" <ceph-users@xxxxxxx>
Subject: Re: Please guide us in identifying the cause of the data miss in EC pool


Hi Chulin,

Are you 100% sure that 494, 1169 and 1057 (which did not restart) were in the acting set at the exact moment the power outage occurred?

I'm asking because min_size 6 would have allowed the data to be written to only 6 OSDs, possibly the 6 that ended up crashing.

Bests,
Frédéric.


________________________________
From: Best Regards
Sent: Thursday, 8 August 2024 08:10
To: Frédéric Nass
Cc: ceph-users
Subject: Re: Re: Re: Re: Re: Re: Re: Re: Please guide us in identifying the cause of the data miss in EC pool

Hi Frédéric,


Thank you for your continued attention and guidance. Let's analyze and verify this issue from different perspectives.


The reason we have not stopped the investigation is that we are trying to find other ways to avoid the losses caused by this kind of sudden failure. Turning off the disk cache is the last option; of course, that operation will only be carried out after finding definite evidence.

I also have a question: among the 9 OSDs, some have not been restarted. In theory, these OSDs should retain the object info (metadata, pg_log, etc.), even if the object itself cannot be recovered. I went through the boot logs of the OSDs where the object should be located and the PG peering process:


OSDs 494, 1169 and 1057 have been running the whole time, and osd.494 was the primary of the acting set during the failure. However, no record of the object was found using `ceph-objectstore-tool --op list` or `--op log`, so losing data because of lost disk cache does not seem to be the full explanation (perhaps there is some processing logic we have not paid attention to).
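
For completeness, a sketch of the offline check described above (OSD id, data path and shard suffix are placeholders; the OSD has to be stopped while ceph-objectstore-tool runs against it):

  systemctl stop ceph-osd@494

  # List the objects this OSD holds for the PG (EC shards are suffixed s0, s1, ...)
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-494 --pgid 9.11b6s0 --op list

  # Dump the PG log kept by this OSD for the same PG
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-494 --pgid 9.11b6s0 --op log

  systemctl start ceph-osd@494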





Best Regards,



Woo
wu_chulin@xxxxxx






                       
Original Email

From: "Frédéric Nass" <frederic.nass@xxxxxxxxxxxxxxxx>
Sent: 2024/8/8 4:01
To: "wu_chulin" <wu_chulin@xxxxxx>
Subject: Re: Re: Re: Re: Re: Re: Re: Please guide us in identifying the cause of the data miss in EC pool


Hey Chulin,


Looks clearer now.
 


Non-persistent cache for KV metadata and Bluestore metadata certainly explains how data was lost without the cluster even noticing.


What's unexpected is data staying for so long in the disk buffers and not being written to persistent sectors at all.


Anyways, thank you for sharing your use case and investigation. It was nice chatting with you.


If you can, share this on the ceph-users list. It will surely benefit everyone in the community.


Best regards,
Frédéric.


PS: Note that using min_size >= k+1 on EC pools is recommended (as is min_size >= 2 on replicated size-3 pools) because you don't want to write data without any parity chunks.
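
A minimal sketch of that change (the EC pool name is a placeholder):

  # Current value (6 in this thread)
  ceph osd pool get <ec-pool-name> min_size

  # Raise it to k+1 so writes are never acknowledged without at least one parity chunk
  ceph osd pool set <ec-pool-name> min_size 7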









From: wu_chulin@xxxxxx
Sent: Wednesday, 7 August 2024 11:30
To: Frédéric Nass
Subject: Re: Re: Re: Re: Re: Re: Please guide us in identifying the cause of the data miss in EC pool




Hi,
Yes, after the file -> object -> PG -> OSD correspondence is worked out, the object record can be found on the specified OSD with `ceph-objectstore-tool --op list`.
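
For readers reproducing this, a hedged sketch of that file -> object -> PG -> OSD correspondence (bucket, key and pool names are placeholders; the RGW data pool is often default.rgw.buckets.data but may differ):

  # The manifest in the output lists the rados objects backing the S3 object
  radosgw-admin object stat --bucket=<bucket> --object=<s3-key>

  # Map one of those rados objects to its PG and its current up/acting OSDs
  ceph osd map <rgw-data-pool> <rados-object-name>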

The pool min_size is 6


The business department reported more than 30 lost files, but we proactively identified more than 100. The upload times of the lost files fall mostly within the 3 hours before the failure, and these files had been successfully downloaded after being uploaded (per the RGW logs).


One OSD corresponds to one disk, and no separate space is allocated for WAL/DB.


The HDD cache is at its default (enabled by default on SATA); we had not forcibly turned off the disk cache, because of performance concerns.


The loss of OSD data due to the loss of the hard disk cache was our initial inference, and the initial explanation provided to the business department was the same. When the cluster was restored, Ceph reported 12 unfound objects, which is acceptable; after all, most devices were powered off abnormally, and it is difficult to ensure the integrity of all data. So far our team has not located how the data was lost. In the past, when hard disk hardware was damaged, either the OSD could not start because key data was damaged, or some objects were read incorrectly after the OSD started, which could be repaired. Now deep-scrub cannot find the problem, which may be related to the loss (or deletion) of the object metadata. After all, deep-scrub needs the object list of the current PG; if those 9 OSDs do not have the object metadata, deep-scrub does not know the object exists.



wu_chulin@xxxxxx








Original Email

From: "Frédéric Nass" <frederic.nass@xxxxxxxxxxxxxxxx>
Sent: 2024/8/6 20:40
To: "wu_chulin" <wu_chulin@xxxxxx>
Subject: Re: Re: Re: Re: Re: Please guide us in identifying the cause of the data miss in EC pool




That's interesting.


Have you tried to correlate an existing retrievable object to its PG id and OSD mapping, in order to verify the presence of each of that object's shards with ceph-objectstore-tool on each of its acting OSDs, for a previously and successfully written S3 object?


This would help verify that the command you've run trying to find the missing shard was good.


Also, what min_size is this pool using?

How many S3 objects like these were reported missing by the business unit? Have you or they made an inventory of unretrievable/missing objects?


Are WAL/RocksDB colocated on HDD-only OSDs, or are these OSDs using SSDs for WAL/RocksDB?


Did you disable the HDD buffers (also known as disk cache)? HDD buffers are non-persistent.


I know nothing about the intensity of your workloads, but if you're looking for a few tens or a few hundreds of unwritten S3 objects, there might be a situation with non-persistent cache (like volatile disk buffers) where Ceph would consider the data written when it actually was not at the moment of the power outage. Especially if you kept writing data with fewer shards (min_size) than k+1 (no parity at all). That sounds like a possibility.




Also, what I'm thinking right now is: if you can identify which shard out of 9 is wrong, then you may use ceph-objectstore-tool or ceph-kvstore-tool to destroy this particular shard, then deep-scrub the PG so that the missing shard is detected and rebuilt.


Never tried this myself though.
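
Purely as an illustration of that idea (equally untested, destructive, and with every name a placeholder; exporting the shard first would be prudent):

  systemctl stop ceph-osd@<id>

  # Save the suspect shard's data before touching it
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
      --pgid <pgid>s<shard> '<object>' get-bytes /tmp/shard.bak

  # Remove the suspect shard from this OSD only
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
      --pgid <pgid>s<shard> '<object>' remove

  systemctl start ceph-osd@<id>

  # Let scrub notice the missing shard and repair it from the remaining ones
  ceph pg deep-scrub <pgid>
  ceph pg repair <pgid>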


Best regards,
Frédéric.






From: wu_chulin@xxxxxx
Sent: Tuesday, 6 August 2024 12:15
To: Frédéric Nass
Subject: Re: Re: Re: Re: Please guide us in identifying the cause of the data miss in EC pool









Hi,
Thank you for your attention to this matter.


1. Manually deep-scrubbing the PG did not report any errors. I checked the OSD logs and did not see any errors detected or fixed by the OSDs. Ceph health was also normal, and the OS hosting the OSDs did not report any IO errors.

2. At first, I also suspected that the objects had not been distributed on these OSDs before the failure, so I used `ceph osd getmap` to retrieve the OSD map from before the failure and confirm where the objects mapped at that time (a sketch of this kind of check follows the cluster status below).

3. Our Ceph version is 13.2.10, which does not have the PG autoscaler; lifecycle policies are not set.


4. We have 90+ hosts, most of which are Dell R720xd; most of the hard disks are 3.5-inch/5400 rpm/10 TB Western Digital, and most of the controllers are PERC H330 Mini. This is the current cluster status:

  cluster:
    id:     f990db28-9604-4d49-9733-b17155887e3b
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum cz-ceph-01,cz-ceph-02,cz-ceph-03,cz-ceph-07,cz-ceph-13
    mgr: cz-ceph-01(active), standbys: cz-ceph-03, cz-ceph-02
    osd: 1172 osds: 1172 up, 1172 in
    rgw: 9 daemons active

  data:
    pools:   16 pools, 25752 pgs
    objects: 2.23 G objects, 6.1 PiB
    usage:   9.4 PiB used, 2.5 PiB / 12 PiB avail
    pgs:     25686 active+clean
             64    active+clean+scrubbing+deep+repair
             2     active+clean+scrubbing+deep
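
Regarding point 2 above, a minimal sketch of that kind of historical-mapping check (epoch, pool id and object name are placeholders):

  # Fetch the OSD map as it was at the epoch of interest
  ceph osd getmap <epoch> -o /tmp/osdmap.<epoch>

  # Replay the CRUSH mapping of the object against that historical map
  osdmaptool /tmp/osdmap.<epoch> --test-map-object <rados-object-name> --pool <pool-id>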


Best regards.



Original Email



From:"Frédéric Nass"< frederic.nass@xxxxxxxxxxxxxxxx &gt;;

Sent Time:2024/8/6 15:34

To:"wu_chulin"< wu_chulin@xxxxxx &gt;;

Subject:Re: Re:Re:  Please guide us in identifying the cause ofthedata miss in EC pool


Hi,


Did the deep-scrub report any errors? (any inconsistencies should show errors after deep-scrubbing the PG.)
Were these errors fixed by the PG repair?


Is it possible that you looked at the wrong PG or OSD when trying to list these objects with ceph-objectstore-tool?


Was the PG autoscaler running at that time? Are you using S3 lifecycle policies that could have moved this object to another placement pool, and so another PG?


Can you give details about this cluster? Hardware, disks, controller, etc.


Cheers,
Frédéric.











From: wu_chulin@xxxxxx
Sent: Monday, 5 August 2024 10:09
To: Frédéric Nass
Subject: Re: Re: Please guide us in identifying the cause of the data miss in EC pool




Hi,
Thank you for your reply. I apologize for the omission in the previous email. Please disregard the previous email and refer to this one instead.
After the failure, we executed the repair and deep-scrub commands on some of the PGs that lost data, and their status was active+clean after completion, but the objects still could not be retrieved.
Our erasure code parameters are k=6, m=3. Theoretically, the data on the three OSDs lost due to the power failure should be recoverable. However, we stopped all nine OSDs and exported the object lists, and could not find the lost object information. What puzzled us was that some OSDs were never powered off and were still running, yet their object lists did not have the information either.




Best regards.




wu_chulin@xxxxxx








Original Email



From:"Frédéric Nass"< frederic.nass@xxxxxxxxxxxxxxxx &gt;;

Sent Time:2024/8/3 15:11

To:"wu_chulin"< wu_chulin@xxxxxx &gt;;"ceph-users"< ceph-users@xxxxxxx &gt;;

Subject:Re:  Please guide us in identifying the cause of thedata miss in EC pool


Hi,


The first thing that comes to mind when it comes to data unavailability or inconsistencies after a power outage is that some dirty data may have been lost along the IO path before reaching persistent storage. This can happen, for example, with non-enterprise-grade SSDs using non-persistent cache, or with HDD disk buffers left enabled.


With that said, have you tried to deep-scrub the PG from which you can't retrieve data? What's the status of this PG now? Did it recover?
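
For reference, a minimal sketch of that check (the PG id is a placeholder):

  # Force a deep scrub, then look at what it recorded
  ceph pg deep-scrub <pgid>
  ceph health detail
  rados list-inconsistent-obj <pgid> --format=json-pretty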


Regards,
Frédéric.






From: wu_chulin@xxxxxx
Sent: Wednesday, 31 July 2024 05:49
To: ceph-users
Subject: Please guide us in identifying the cause of the data miss in EC pool




Dear Ceph team,

On July 13th at 4:55 AM, our Ceph cluster experienced a significant power outage in the data center, causing a large number of OSDs to power off and restart (total: 1172, down: 821). Approximately two hours later, all OSDs had successfully started and the cluster resumed its services. However, around 6 PM, the business department reported that some files which had been successfully written (via the RGW service) were failing to download, and the number of such files was quite significant. Consequently, we began a series of investigations:


 1. The incident occurred at 04:55. At 05:01, we executed noout, nobackfill, and norecover. At 06:22, we executed `ceph osd pause`. By 07:23, all OSDs were UP&IN, and subsequently we executed `ceph osd unpause`.


 2. We randomly selected a problematic file and attempted to download it via the S3 API. The RGW returned "No such key". 


 3. The RGW logs showed op status=-2, http status=200. We also checked the upload logs, which indicated 2024-07-13 04:19:20.052, op status=0, http_status=200. 


 4. We set debug_rgw=20 and attempted to download the file again. It was found that a 4M chunk (this file is 64M) failed to GET (a sketch of this kind of debug capture follows this list).


 5. Using rados get for this chunk returned: "No such file or directory". 


 6. Setting debug_osd=20, we observed get_object_context: obc NOT found in cache. 


 7. Setting debug_bluestore=20, we saw get_onode oid xxx, key xxx != '0xfffffffffffffffeffffffffffffffff'o'. 


 8. We stopped the primary OSD and tried to get the file again, but the result was the same. The object’s corresponding PG state was active+recovery_wait+degraded. 


 9. Using ceph-objectstore-tool --op list && --op log, we could not find the object information. The ceph-kvstore-tool rocksdb command also did not reveal anything new.


 10. If an OSD had lost data, the PG state should have been unfound or inconsistent.


 11. We started reanalyzing the startup logs of the OSDs related to the PG. The pool is erasure-coded 6+3 across 9 OSDs. Six of these OSDs had restarted, and after peering the PG state became ACTIVE.


 12. We went through the lost files; their upload times were all before the failure occurred. The earliest upload time was around 1 AM, and the successful upload records could be found in the RGW logs.


 13. We have submitted an issue on the Ceph issue tracker: https://tracker.ceph.com/issues/66942; it includes the original logs needed for troubleshooting. However, four days have passed without any response. In desperation, we are sending this email, hoping that someone from the Ceph team can guide us as soon as possible.
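
Regarding steps 4 to 7 above, a hedged sketch of how that kind of debug trace can be captured (OSD id, pool and object names are placeholders; debug_rgw would be raised similarly on the radosgw daemon):

  # Raise OSD/BlueStore debug levels at runtime on the primary OSD involved
  ceph tell osd.<id> injectargs '--debug_osd 20/20 --debug_bluestore 20/20'

  # Retry the failing chunk directly from RADOS and watch the OSD log
  rados -p <rgw-data-pool> get <rados-object-name> /tmp/chunk.bin

  # Lower the debug levels again afterwards (defaults may differ per release)
  ceph tell osd.<id> injectargs '--debug_osd 1/5 --debug_bluestore 1/5'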


 We are currently in a difficult situation and hope you can provide guidance. Thank you. 



 Best regards. 





 wu_chulin@xxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



