On Thu, 31 Jan 2013, sheng qiu wrote:
> Hi Sage,
>
> Thanks for your reply. Sorry to bother you again.
>
> I like your suggestion "create different pools with different types of
> storage devices". Can you give me a quick guide: if I want to implement
> this function, which code files do I need to review on the Ceph server
> side? I tried to figure it out myself, but it did not quite work out.
> The code base is very complicated to me as a newcomer to Ceph.
>
> I reviewed the code on the kernel client side. Generally it first
> calculates the object id, the pgid, and the pg pool id, then calls
> ceph_osdc_start_request(), which uses CRUSH to map to the actual list
> of OSDs and sends the request through the socket connection.
> Some concepts are not clear to me:
>
> a. What is the pg pool used for? Is it a logical group of PGs? If so,
> how do you group the PGs?
> b. How does the client/monitor know which OSDs are alive? I suppose
> Ceph maintains a table or list with information about all connected
> OSDs. If so, can you tell me what the data structure is in the code,
> and how you fill/update this table? I suppose you can get the initial
> OSDs from the configuration file during setup.

'ceph osd dump' to see the structure.

> If I want to implement my feature, I guess I need to do the following:
>
> a. Tag each object with the pool it should go to (i.e. SSD or HDD).
> b. Put the object into the proper PG and pool.
> c. When mapping PGs to OSDs, map them to the proper OSDs.

The fs client does all of that for you.  All you need to do is tag a
directory with a new data pool so that new files are tagged; the rest is
done for you.  It's something like

 ceph osd pool create foo <num pgs>
 ceph mds add_data_pool foo
 ceph osd dump | grep foo
 cephfs /mnt/ceph/something -p <numeric id for pool foo>

>
> Do you think this is the correct path for it?
>
> Thanks,
> Sheng
>
> On Fri, Jan 25, 2013 at 10:46 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> > On Fri, 25 Jan 2013, sheng qiu wrote:
> >> Hi Sage,
> >>
> >> I appreciate your reply.
> >>
> >> From my understanding of the client code, I think Ceph allocates
> >> messages per file. In other words, if one client is updating
> >> different files (each doing small writes/updates, i.e. 4 KB), Ceph
> >> has to compose separate messages for each file and send them to the
> >> corresponding OSDs.
> >
> > Right.
> >
> >> If there are messages targeted at the same OSD, can they be merged
> >> on the client side?
> >
> > No..
> >
> >> This may help if the network bandwidth is not sufficient, although I
> >> do not know how likely it is that they fall onto the same OSDs.
> >
> > It would save on the header/footer msg overhead, but increase complexity
> > on both sides (esp the OSD side).  I don't think it's worth it.
> >
> >> If my understanding is not correct, please point it out. I am doing
> >> research on distributed file systems, and am quite interested in
> >> Ceph. Have you considered managing a hybrid storage pool, which may
> >> be composed of some faster devices such as NVRAM/SSD and some slower
> >> devices such as HDD, and making Ceph aware of this so it can better
> >> place/distribute data instead of treating everything as flat?
> >
> > You can create different pools with different types of storage, and
> > distribute uniformly ("flat") across each pool.  Eventually, we'd like
> > cephfs to migrate files between pools based on temperature.
> >
> > Alternatively, you can build hybrid OSDs that combine SSDs and HDDs
> > (and/or NVRAM) and do the tiering inside each OSD.
> >
> > sage
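For reference, directing two pools at different device types is normally
done in the CRUSH map itself rather than in client code.  A rough sketch of
the crushtool workflow (the 'ssd' bucket/rule names, pool name, pg count,
and ruleset id below are only illustrative; adjust them to your own map):

 # dump and decompile the current CRUSH map
 ceph osd getcrushmap -o crushmap.bin
 crushtool -d crushmap.bin -o crushmap.txt

 # edit crushmap.txt: add e.g. a 'root ssd' bucket containing the
 # SSD-backed OSD hosts, plus a rule that draws from it, roughly:
 #   rule ssd {
 #           ruleset 3
 #           type replicated
 #           min_size 1
 #           max_size 10
 #           step take ssd
 #           step chooseleaf firstn 0 type host
 #           step emit
 #   }

 # recompile and inject the edited map
 crushtool -c crushmap.txt -o crushmap.new
 ceph osd setcrushmap -i crushmap.new

 # create a pool and point it at the SSD rule (ruleset 3 in this sketch)
 ceph osd pool create ssd-pool 128
 ceph osd pool set ssd-pool crush_ruleset 3

From there, the add_data_pool and cephfs steps shown above are what point a
directory's new files at whichever pool was created; an equivalent 'hdd'
root and rule keeps the other pool on spinning disks.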
> >
> >> Thanks,
> >> Sheng
> >>
> >> On Thu, Jan 24, 2013 at 11:00 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> >> > Hi Sheng,
> >> >
> >> > On Thu, 24 Jan 2013, sheng qiu wrote:
> >> >> I am trying to understand the Ceph code on the client side.
> >> >> For the write path, if it's aio_write, ceph_write_begin() allocates
> >> >> pages in the page cache to buffer the written data; however, I did
> >> >> not see it allocate any space on the remote OSDs (for a local fs
> >> >> such as ext2, get_block() does this). I suppose that is done later,
> >> >> when the kernel flushing process is invoked to write back the
> >> >> dirty pages.
> >> >
> >> > Right.  Objects are instantiated and written to the OSDs when the write
> >> > operations are sent over the network, normally during writeback (via
> >> > the ->writepages() op in addr.c).
> >> >
> >> >> I checked ceph_writepages_start(); it seems to organize the dirty
> >> >> data and prepare the requests to send to the OSDs. For newly
> >> >> allocated written data, how is it mapped to the OSDs, and where is
> >> >> that done? Is it done in ceph_osdc_new_request()?
> >> >
> >> > It happens later, when the actual request is ready to go over the wire.
> >> > The target OSD may change in the meantime, or the request may have to
> >> > be resent to another OSD.  As far as the upper layers are concerned,
> >> > though, they are writing to the object, without caring where the object
> >> > happens to currently live.
> >> >
> >> >> If the transfer unit is not limited to the object size, I suppose
> >> >> Ceph needs to pack several pieces of data (each smaller than one
> >> >> object) together so that there is no internal fragmentation within
> >> >> an object. Who does this job, and which parts of the source code
> >> >> are related to this?
> >> >
> >> > Each file is striped over a different sequence of objects.  Small
> >> > files mean small objects.  Large files stripe over (by default) 4
> >> > MB objects.  It's the OSDs' job to store these efficiently.  We just
> >> > use a local file system.  btrfs is great about packing small files
> >> > inline in the btree; xfs and ext4 are more conventional fs's and do
> >> > pretty well.
> >> >
> >> > sage
> >> >
> >> >> I really want to get a deep understanding of the code, so I raised
> >> >> these questions. If my understanding is not correct, please point
> >> >> it out. I would really appreciate it.
> >> >>
> >> >> Thanks,
> >> >> Sheng
>
> --
> Sheng Qiu
> Texas A & M University
> Room 332B Wisenbaker
> email: herbert1984106@xxxxxxxxx
> College Station, TX 77843-3259
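As a quick way to see the striping and CRUSH mapping described above from
the command line, something like the following works against the default
'data' pool; the object name is only an example of the <inode hex>.<object
number hex> names the kernel client generates, so substitute one that
'rados ls' actually reports:

 # list the RADOS objects backing cephfs file data in the default data pool
 rados -p data ls

 # show which PG, and which OSDs, one 4 MB stripe object currently maps to
 ceph osd map data 10000000000.00000000

The mapping reported by 'ceph osd map' is computed from the current osdmap,
which matches the point above that the target OSD is only resolved when the
request actually goes over the wire and may change in the meantime.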