Re: optimizing space for array of booleans

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Feb 25, 2009 at 12:12 AM, leledumbo <leledumbo_cool@xxxxxxxxxxx> wrote:
>
>> Generally relationships like the one you describe are stored in three
>> separate and related tables: Students, Courses, and Enrollment. The
>> latter is a n:m association between the first two. The advantage this
>> approach has with regard to storage is that it is a sparse matrix.
>
> I've done that and I'm playing around with the last one, this is where the
> problem arise. It's a table with two columns, user id & the 'set' that
> stores the enroll information. Since SQL doesn't have 'nice set data type'
> (MySQL has it but it's restricted to 64 entries, and it's difficult to use),
> I need to cover the problem from PHP side.

That's because AFAIK MySQL implements the internal storage for a SET
or ENUM column as a 4-byte unsigned value. A lot of people like MySQL
because it offers those two data types, but this is the limitation. If
you want to add more options to the list, you have to figure out a way
to extend the functionality manually.

>> Most students will only enroll in a handful of courses. With this
>> approach, you only store a row in the Enrollment table if a student is
>> enrolled in a course. Otherwise, you don't need to store anything.
>
> Do you mean that, for example, if student A enrolls course x,y,z the table
> would be:
>
> user_id | course
> -------+-------
>    A    |    x
>    A    |    y
>    A    |    z

Exactly.


> Students often forget (or are lazy) to unenroll after they've finished
> studying. Wouldn't the database be bloated?

That would be up to you and how much you need to keep in the table.
You can add other columns as needed to help maintain the table. For
instance, you could put start and end dates in the relation table.
Then you could either leave the rows in there and maintain a history
of when students took specific courses, or use those dates to delete
associations that are older than you need to store in your
application.

And believe me - I'd rather deal with the bloat than with the overhead
I expect you'll see with this custom bitmask solution you are trying
to build.

> And what would be the primary key?

The primary key is usually a two-column key. (In some cases such as
when you allow a user to retake the same course during multiple date
periods, you might have to toss some additional columns into the
primary key.) In the example above, the primary key would be on
(user_id, course). The benefit you get, other than speeding up joins
to this table, is that it prevents user_id A from being enrolled in
course x more than once.

I also usually add another index on just course. This uses a little
more storage space, but I can query the table efficiently from either
side of the relationship.


>> And no, 1.2MB is not that big of a deal. A single floppy high-density
>> drive could hold that much data 18 years ago!
>
> Yeah, but now I'm also considering the speed. So, I decided to build a class
> that implements bitset operations and I'm gonna store the contents in
> database as string. Here's the code:

Try what you will, I doubt you'll beat a well-designed relational
database for speed when it comes to problems like this. For example,
how do you plan to optimize a query to find all students whose 23rd
bit is 1 or whose 2nd bit is 0? Even with an integer column, that type
of query doesn't use an index because it can't be answered by finding
a value within a range of values. Instead, it has to perform a bitwise
operation on every value in the column to see if the row matches.
That's a table scan, and they are not very efficient in a database.
They are even less efficient if you have to fetch the entire table
into PHP and perform a manual table scan there. And now that you're
looking at using strings to overcome the limitation of 64 values in a
SET column, the problem gets even more expensive to solve.

> class bitset {
>  private $bits;
>
>  private function index($bit) { return $bit / 32; }
>  private function offset($bit) { return $bit % 32; }
>
>  private function binstr() {
>    $temp_bits = $this->bits;
>    $str = "";
>    for ($i=0;$i<32;$i++) {
>      $str = strval($temp_bits & 1) . $str;
>      $temp_bits >>= 1;
>    }
>    return $str;
>  }
>
>  public function __construct() {
>    $this->bits = array(0,0,0,0,0,0,0,0);
>  }
>
>  public function tostring() {
>    $str = $this->binstr($this->bits[0]);
>    for ($i=1;$i<8;$i++)
>      $str .= "," . $this->binstr($this->bits[$i]);
>    return $str;
>  }
>
>  public function set($bit) {
>    $index  = $this->index($bit);
>    $offset = $this->offset($bit);
>    $this->bits[$index] |= 1 << $offset;
>  }
>
>  public function clear($bit) {
>    $index  = $this->index($bit);
>    $offset = $this->offset($bit);
>    $this->bits[$index] &= ~(1 << $offset);
>  }
>
>  public function test($bit) {
>    $index  = $this->index($bit);
>    $offset = $this->offset($bit);
>    return $this->bits[$index] & (1 << $offset);
>  }
> }
>
> little problem: tostring() method always returns 00..001,00..001,..
> regardless $bits contents. binstr() has been tested and works fine, what
> might be wrong?
>

Without testing it (it's late here), your binstr() function doesn't
accept parameters, so it would always return the same result each time
it's called, regardless of what you pass into it.

Andrew

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



[Index of Archives]     [PHP Home]     [Apache Users]     [PHP on Windows]     [Kernel Newbies]     [PHP Install]     [PHP Classes]     [Pear]     [Postgresql]     [Postgresql PHP]     [PHP on Windows]     [PHP Database Programming]     [PHP SOAP]

  Powered by Linux