Ok, create a ticket with a timeline and all of this information, I'll try to
look into it more tomorrow.
-Sam

On Thu, Aug 20, 2015 at 4:25 PM, Voloshanenko Igor
<igor.voloshanenko@xxxxxxxxx> wrote:

  Exactly

On Friday, August 21, 2015, Samuel Just wrote:

  And you adjusted the journals by removing the osd, recreating it with a
  larger journal, and reinserting it?
  -Sam

On Thu, Aug 20, 2015 at 4:24 PM, Voloshanenko Igor
<igor.voloshanenko@xxxxxxxxx> wrote:

  Right (but there was also a rebalancing cycle 2 days before the pgs got
  corrupted).

2015-08-21 2:23 GMT+03:00 Samuel Just <sjust@xxxxxxxxxx>:

  Specifically, the snap behavior (we already know that the pgs went
  inconsistent while the pool was in writeback mode, right?).
  -Sam

On Thu, Aug 20, 2015 at 4:22 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:

  Yeah, I'm trying to confirm that the issues did happen in writeback mode.
  -Sam

On Thu, Aug 20, 2015 at 4:21 PM, Voloshanenko Igor
<igor.voloshanenko@xxxxxxxxx> wrote:

  Right. But that's when the issues started...

2015-08-21 2:20 GMT+03:00 Samuel Just <sjust@xxxxxxxxxx>:

  But that was still in writeback mode, right?
  -Sam

On Thu, Aug 20, 2015 at 4:18 PM, Voloshanenko Igor
<igor.voloshanenko@xxxxxxxxx> wrote:

  We hadn't set values for max_bytes / max_objects, so all data initially
  was written only to the cache layer and not flushed to the cold layer at
  all.

  Then we received a notification from monitoring that we had collected
  about 750GB in the hot pool. So I changed the max-bytes value to 0.9 of
  the disk size... and then evicting/flushing started...

  And the issue with snapshots arrived.

2015-08-21 2:15 GMT+03:00 Samuel Just <sjust@xxxxxxxxxx>:

  Not sure what you mean by:

    "but it's stop to work in same moment, when cache layer fulfilled with
    data and evict/flush started..."

  -Sam

On Thu, Aug 20, 2015 at 4:11 PM, Voloshanenko Igor
<igor.voloshanenko@xxxxxxxxx> wrote:

  No, when we started draining the cache, the bad pgs were already in
  place... We had a big rebalance (disk by disk, to change the journal size
  on both the hot and cold layers). All was OK, but after 2 days the scrub
  errors arrived and 2 pgs went inconsistent...

  In writeback mode - yes, snapshots looked like they worked fine, but they
  stopped working at the same moment the cache layer filled up with data
  and evict/flush started...

2015-08-21 2:09 GMT+03:00 Samuel Just <sjust@xxxxxxxxxx>:

  So you started draining the cache pool before you saw either the
  inconsistent pgs or the anomalous snap behavior? (That is, writeback mode
  was working correctly?)
  -Sam
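For reference, the sizing knobs discussed above are pool properties (the
actual keys are target_max_bytes / target_max_objects). A minimal sketch of
setting them up front, so flushing/eviction starts gradually instead of all
at once when the tier fills; the pool name "hot-storage" and the numbers are
illustrative only:

    # Cap the cache tier at ~900 GB (e.g. 0.9 of the usable SSD capacity).
    ceph osd pool set hot-storage target_max_bytes 966367641600

    # Optionally cap the object count as well.
    ceph osd pool set hot-storage target_max_objects 1000000

    # Start flushing dirty objects at 40% of the target and evicting clean
    # objects at 80%, rather than waiting until the tier is full.
    ceph osd pool set hot-storage cache_target_dirty_ratio 0.4
    ceph osd pool set hot-storage cache_target_full_ratio 0.8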
On Thu, Aug 20, 2015 at 4:07 PM, Voloshanenko Igor
<igor.voloshanenko@xxxxxxxxx> wrote:

  Good joke )))))))))

2015-08-21 2:06 GMT+03:00 Samuel Just <sjust@xxxxxxxxxx>:

  Certainly, don't reproduce this with a cluster you care about :).
  -Sam

On Thu, Aug 20, 2015 at 4:02 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:

  What's supposed to happen is that the client transparently directs all
  requests to the cache pool rather than the cold pool when there is a
  cache pool. If the kernel is sending requests to the cold pool, that's
  probably where the bug is. Odd. It could also be a bug specific to
  'forward' mode, either in the client or on the osd. Why did you have it
  in that mode?
  -Sam

On Thu, Aug 20, 2015 at 3:58 PM, Voloshanenko Igor
<igor.voloshanenko@xxxxxxxxx> wrote:

  We used the 4.x branch, as we have "very good" Samsung 850 Pro drives in
  production, and they don't support ncq_trim...

  And 4.x is the first branch which includes the exceptions for this in
  libata-core.c.

  Sure, we could backport this one line to the 3.x branch, but we prefer
  not to go that deep if a package for a newer kernel exists.

2015-08-21 1:56 GMT+03:00 Voloshanenko Igor <igor.voloshanenko@xxxxxxxxx>:

  root@test:~# uname -a
  Linux ix-s5 4.0.4-040004-generic #201505171336 SMP Sun May 17 17:37:22
  UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

2015-08-21 1:54 GMT+03:00 Samuel Just <sjust@xxxxxxxxxx>:

  Also, can you include the kernel version?
  -Sam

On Thu, Aug 20, 2015 at 3:51 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:

  Snapshotting with cache/tiering *is* supposed to work. Can you open a
  bug?
  -Sam
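For anyone following along: the tier mode in question is a pool-level
setting, and moving it back from 'forward' to 'writeback' (and confirming
the change) would look roughly like this; "hot-storage" is again a
placeholder pool name:

    # Put the cache tier back into writeback mode.
    ceph osd tier cache-mode hot-storage writeback

    # Confirm the mode actually took effect.
    ceph osd dump | grep hot-storage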
On Thu, Aug 20, 2015 at 3:36 PM, Andrija Panic
<andrija.panic@xxxxxxxxx> wrote:

  This was related to the caching layer, which doesn't support snapshotting
  per the docs... for the sake of closing the thread.

  --
  Andrija Panić

On 17 August 2015 at 21:15, Voloshanenko Igor
<igor.voloshanenko@xxxxxxxxx> wrote:

  Hi all, can you please help me with an unexplained situation...

  All snapshots inside ceph are broken...

  So, as an example, we have a VM template, as an rbd image inside ceph.
  We can map it and mount it to check that all is OK with it:

  root@test:~# rbd map cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5
  /dev/rbd0
  root@test:~# parted /dev/rbd0 print
  Model: Unknown (unknown)
  Disk /dev/rbd0: 10.7GB
  Sector size (logical/physical): 512B/512B
  Partition Table: msdos

  Number  Start   End     Size    Type     File system  Flags
   1      1049kB  525MB   524MB   primary  ext4         boot
   2      525MB   10.7GB  10.2GB  primary               lvm

  Then I want to create a snap, so I do:

  root@test:~# rbd snap create
  cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5@new_snap

  And now I want to map it:

  root@test:~# rbd map
  cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5@new_snap
  /dev/rbd1
  root@test:~# parted /dev/rbd1 print
  Warning: Unable to open /dev/rbd1 read-write (Read-only file system).
  /dev/rbd1 has been opened read-only.
  Warning: Unable to open /dev/rbd1 read-write (Read-only file system).
  /dev/rbd1 has been opened read-only.
  Error: /dev/rbd1: unrecognised disk label

  Even the md5 sums differ...

  root@ix-s2:~# md5sum /dev/rbd0
  9a47797a07fee3a3d71316e22891d752  /dev/rbd0
  root@ix-s2:~# md5sum /dev/rbd1
  e450f50b9ffa0073fae940ee858a43ce  /dev/rbd1
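A compact way to rerun that check end to end (a sketch only; it assumes
nothing is writing to the image while the checksums are taken, since
in-flight writes would legitimately change the md5 of the head):

    IMG=cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5

    rbd snap create "$IMG@new_snap"
    DEV_HEAD=$(rbd map "$IMG")            # e.g. /dev/rbd0
    DEV_SNAP=$(rbd map "$IMG@new_snap")   # e.g. /dev/rbd1

    # With the image quiesced, the head and a fresh snapshot of it
    # should hash identically.
    md5sum "$DEV_HEAD" "$DEV_SNAP"

    rbd unmap "$DEV_SNAP"
    rbd unmap "$DEV_HEAD"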
  OK, now I protect the snap and create a clone... but same thing... the
  md5 for the clone is the same as for the snap:

  root@test:~# rbd unmap /dev/rbd1
  root@test:~# rbd snap protect
  cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5@new_snap
  root@test:~# rbd clone
  cold-storage/0e23c701-401d-4465-b9b4-c02939d57bb5@new_snap
  cold-storage/test-image
  root@test:~# rbd map cold-storage/test-image
  /dev/rbd1
  root@test:~# md5sum /dev/rbd1
  e450f50b9ffa0073fae940ee858a43ce  /dev/rbd1

  ... but it's broken...

  root@test:~# parted /dev/rbd1 print
  Error: /dev/rbd1: unrecognised disk label

  =========

  Tech details:

  root@test:~# ceph -v
  ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)

  We have 2 inconsistent pgs, but none of the images are placed on those
  pgs...

  root@test:~# ceph health detail
  HEALTH_ERR 2 pgs inconsistent; 18 scrub errors
  pg 2.490 is active+clean+inconsistent, acting [56,15,29]
  pg 2.c4 is active+clean+inconsistent, acting [56,10,42]
  18 scrub errors
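Regarding those scrub errors: the usual first response is a per-pg
deep-scrub and repair, with one caution that on hammer (0.94.x) repair
generally favors the primary's copy, so if the primary holds the bad
replica, a repair can cement the corruption. A sketch, using the two pg ids
shown above:

    # Re-verify first: a deep scrub rechecks object checksums and sizes.
    ceph pg deep-scrub 2.490
    ceph pg deep-scrub 2.c4

    # Then, only once you are confident the primary's data is good:
    ceph pg repair 2.490
    ceph pg repair 2.c4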
  ============

  root@test:~# ceph osd map cold-storage
  0e23c701-401d-4465-b9b4-c02939d57bb5
  osdmap e16770 pool 'cold-storage' (2) object
  '0e23c701-401d-4465-b9b4-c02939d57bb5' -> pg 2.74458f70 (2.770) -> up
  ([37,15,14], p37) acting ([37,15,14], p37)
  root@test:~# ceph osd map cold-storage
  0e23c701-401d-4465-b9b4-c02939d57bb5@snap
  osdmap e16770 pool 'cold-storage' (2) object
  '0e23c701-401d-4465-b9b4-c02939d57bb5@snap' -> pg 2.793cd4a3 (2.4a3) ->
  up ([12,23,17], p12) acting ([12,23,17], p12)
  root@test:~# ceph osd map cold-storage
  0e23c701-401d-4465-b9b4-c02939d57bb5@test-image
  osdmap e16770 pool 'cold-storage' (2) object
  '0e23c701-401d-4465-b9b4-c02939d57bb5@test-image' -> pg 2.9519c2a9
  (2.2a9) -> up ([12,44,23], p12) acting ([12,44,23], p12)

  Also, we use a cache layer, which at the current moment is in forward
  mode...

  Can you please help me with this? My brain has stopped understanding
  what is going on...

  Thanks in advance!
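One caveat on the osd map checks above: an RBD image's contents are striped
across many rbd_data.* objects, so mapping only the image-name object does
not prove the image avoids the bad pgs. A rough sketch (slow, as it walks
the whole pool) to list which objects actually fall into the two
inconsistent pgs:

    BAD_PGS=" 2.490 2.c4 "

    rados -p cold-storage ls | while read -r obj; do
        # "ceph osd map" prints e.g. "... -> pg 2.74458f70 (2.770) -> up ..."
        pg=$(ceph osd map cold-storage "$obj" |
             sed -n 's/.*pg [0-9a-f.]* (\([^)]*\)).*/\1/p')
        case "$BAD_PGS" in
            *" $pg "*) echo "$obj -> $pg" ;;
        esac
    done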
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com