Re: Rocksdb: Corruption: missing start of fragmented record(1)

Hi Dan,

I can try to find the thread and the link again. I should mention that my inbox is a mess and the search function in our Outlook 365 app is, well, don't mention the war. Is there a "list by thread" option on lists.ceph.io? I can go through two years of threads, but not through every message.

> ceph could disable the write cache itself

I thought the newer versions were doing that already, but it looks like only a udev rule is recommended: https://github.com/ceph/ceph/pull/43848/files. I think the write-cache issue is mostly relevant for consumer-grade or low-end datacenter hardware, which needs to simulate performance with cheap components. I have never seen an enterprise SAS drive with the write cache enabled.
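
For anyone who wants to apply this by hand, a rule along the following lines is what is meant (my own sketch, not the exact text from the PR; the file name and match keys are illustrative and may need adjusting for your hardware):

    # /etc/udev/rules.d/99-disable-volatile-write-cache.rules (illustrative name)
    # Put SAS/SATA disks into write-through mode, i.e. disable the volatile write cache.
    ACTION=="add|change", SUBSYSTEM=="scsi_disk", ATTR{cache_type}="write through"

    # Reload and apply without rebooting:
    #   udevadm control --reload-rules && udevadm trigger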

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dan van der Ster <dan@xxxxxxxxxxxxxx>
Sent: 01 December 2021 11:28:03
To: Frank Schilder
Cc: huxiaoyu@xxxxxxxxxxxx; YiteGu; ceph-users
Subject: Re:  Re: Rocksdb: Corruption: missing start of fragmented record(1)

Hi Frank,

I'd be interested to read that paper, if you can find it again. I
don't understand why the volatile cache + fsync might be dangerous due
to buggy firmware, yet we should trust that the same firmware respects
FUA when the volatile cache is disabled.

In https://github.com/ceph/ceph/pull/43848 we're documenting the
implications of WCE -- but in the context of performance, not safety.
If write through / volatile cache off is required for safety too, then
we should take a different approach (e.g. ceph could disable the write
cache itself).

Cheers, dan



On Tue, Nov 30, 2021 at 9:36 AM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Dan.
>
> > ... however, it is not unsafe to leave the cache enabled -- ceph uses
> > fsync appropriately to make the writes durable.
>
> Actually, it is. You rely on the drive's firmware to implement this correctly, and that is, unfortunately, less than a given. Within the last one to two years somebody posted a link to a very interesting research paper to this list, where drives were tested under realistic conditions. It turns out that "fsync to make writes persistent" is very vulnerable to power loss if the volatile write cache is enabled. If I remember correctly, about 1-2% of drives ended up with data loss every time. In other words, for every drive with the volatile write cache enabled, every 100 power-loss events will give you 1-2 data-loss events (in certain situations, the drive acknowledges the flush before the volatile cache is actually flushed). I think even PLP did not prevent data loss in all cases.
>
> It's all down to firmware bugs that fail to catch all corner cases and internal race conditions in ops scheduling. Vendors very often prioritise performance over fixing a rare race condition, and I will neither take chances nor recommend taking them.
>
> I think this kind of advice should really not be given in a ceph context without also mentioning the prerequisite: perfect firmware. Ceph is a scale-out system, and any large cluster will have enough drives to see low-probability events on a regular basis. At the very least, recommend testing this thoroughly, that is, performing power-loss tests under load, and I mean many power-loss events per drive, with randomised intervals, under different load patterns.
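
As a rough illustration of the kind of load one might run while cutting power (a sketch only; fio is just one option and all parameters and the device name below are arbitrary placeholders):

    # Sustained small synchronous writes against the device under test.
    # WARNING: this overwrites data on /dev/sdX (placeholder name).
    fio --name=powerloss-load --filename=/dev/sdX --rw=randwrite --bs=4k \
        --ioengine=sync --fsync=1 --time_based --runtime=600

    # Cut power at random points while this runs; verifying afterwards that
    # every acknowledged write actually survived is the hard part and needs
    # its own tooling.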
>
> The same applies to disk controllers with cache. Nobody recommends using the controller cache, because of firmware bugs that seem to be present in all models. We have seen enough cases on this list of data loss after power loss where the controller cache was the issue. The recommendation is to enable HBA mode and write-through. Do the same with your disk firmware and get better sleep and better performance in one go.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dan van der Ster <dan@xxxxxxxxxxxxxx>
> Sent: 29 November 2021 09:24:29
> To: Frank Schilder
> Cc: huxiaoyu@xxxxxxxxxxxx; YiteGu; ceph-users
> Subject: Re:  Re: Rocksdb: Corruption: missing start of fragmented record(1)
>
> Hi Frank,
>
> That's true from the performance perspective; however, it is not unsafe
> to leave the cache enabled -- ceph uses fsync appropriately to make
> the writes durable.
>
> This issue looks rather to be related to concurrent hardware failure.
>
> Cheers, Dan
>
> On Mon, Nov 29, 2021 at 9:21 AM Frank Schilder <frans@xxxxxx> wrote:
> >
> > This may sound counter-intuitive, but you need to disable the write cache to enable the PLP-protected cache only. SSDs with PLP usually have two types of cache, volatile and non-volatile. The volatile cache will lose its contents on power loss. It is the volatile cache that gets disabled when issuing the hdparm/sdparm/smartctl command to switch it off. In many cases this can increase the non-volatile cache and also the performance.
> >
> > It is the non-volatile cache you want your writes to go to directly.
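
For reference, the commands being alluded to look roughly like this (illustrative only; device names are placeholders and the right tool depends on whether the drive is ATA/SATA or SCSI/SAS):

    # Query the current write-cache setting.
    hdparm -W /dev/sdX            # ATA/SATA
    sdparm --get=WCE /dev/sdX     # SCSI/SAS

    # Disable the volatile write cache.
    hdparm -W 0 /dev/sdX
    sdparm --clear=WCE /dev/sdX      # add --save to persist it, if the drive supports saved pages
    smartctl -s wcache,off /dev/sdX  # recent smartmontools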
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: huxiaoyu@xxxxxxxxxxxx <huxiaoyu@xxxxxxxxxxxx>
> > Sent: 26 November 2021 22:41:10
> > To: YiteGu; ceph-users
> > Subject:  Re: Rocksdb: Corruption: missing start of fragmented record(1)
> >
> > wal/db are on Intel S4610 960GB SSDs, with PLP and the write-back cache on
> >
> >
> >
> > huxiaoyu@xxxxxxxxxxxx
> >
> > From: YiteGu
> > Date: 2021-11-26 11:32
> > To: huxiaoyu@xxxxxxxxxxxx; ceph-users
> > Subject: Re: Rocksdb: Corruption: missing start of fragmented record(1)
> > It looks like your wal/db device lost data.
> > Please check whether your wal/db device has a write-back cache; a power loss would then cause data loss, and the log replay fails when RocksDB restarts.
> >
> >
> >
> > YiteGu
> > ess_gyt@xxxxxx
> >
> >
> >
> > ------------------ Original ------------------
> > From: "huxiaoyu@xxxxxxxxxxxx" <huxiaoyu@xxxxxxxxxxxx>;
> > Date: Fri, Nov 26, 2021 06:02 PM
> > To: "ceph-users"<ceph-users@xxxxxxx>;
> > Subject:  Rocksdb: Corruption: missing start of fragmented record(1)
> >
> > Dear Cephers,
> >
> > I just had one Ceph OSD node (Luminous 12.2.13) lose power unexpectedly, and after restarting that node, two OSDs out of 10 cannot be started, issuing the following errors (see the image below). In particular, I see
> >
> > Rocksdb: Corruption: missing start of fragmented record(1)
> > Bluestore(/var/lib/ceph/osd/osd-21) _open_db erroring opening db:
> > ...
> > **ERROR: OSD init failed: (5)  Input/output error
> >
> > I checked the db/wal SSDs, and they are working fine. So I am wondering the following:
> > 1) Is there a method to restore the OSDs?
> > 2) What could be the potential causes of the corrupted db/wal? The db/wal SSDs have PLP and were not damaged during the power loss.
> >
> > Your help would be highly appreciated.
> >
> > best regards,
> >
> > samuel
> >
> >
> >
> >
> > huxiaoyu@xxxxxxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



