Personal:
   About Me
  +My CV
   Friends
  +My favourite
Interests:
  +Linux
  +Java
  +Python
   Open Source
  +Epplets
Projects:
   Freenet
   Cloud Maps
   Subsets
   WMgrav

Last modified:
13-01-2000 11:55
by Ian Clarke


BACK TO MY HOMEPAGE 

Testing Subsets Using Prime Numbers

Ian Clarke March-May 1997

During a recent project which I was pondering, the problem arose of how to test whether one set is a subset of another, and it had to be fast.
To illustrate things, let us create three sets, set A consists of the whole numbers {17,12,54,81,55}, set B consists of the whole numbers {16,19,12}, and set C consists of the whole numbers {54,17,55}.  How can we design an algorithm to test which, if any, of sets B and C is a subset of set A?
The obvious algorithm is to, for each element of the smaller set, test to see whether it is in the larger set, and accept the smaller set as a subset if all of its elements can be found in the larger set.  This algorithm works fine, with one exception, it is slooooooow.  Let us say our two sets have 3000 and 2000 elements respectively (obviously we are testing to see whether the smaller set is a subset of the larger set).  We must do 3000 X 2000 comparisons ie. 6000000 comparisons!  This is definitely not acceptable.  Back to the drawing board.

I eventually came up with an idea.  Let us say we are dealing only with whole numbers between 1 and 100 (although the same idea will work with any finite domain of possible elements).  We shall assign a unique prime number to each element, as follows:

Number  Prime 

11 

13 


...all the way up to 100.  When we do this for the elements we are dealing with above, we get the table:

Number  12  16  17  19  54  55  81 
Prime  37  53  59  61  251  257  419 


Now, for each set, obtain the corresponding prime for each element, and find their product:

Set A:

Number  17  12  54  81  55 

Product 

Prime  59  37  251  419  257  59003069239 


Set B:

Number  16  19  12  Product 
Prime  53  61  37 

119621 


Set C:

Number  54  17  55  Product 
Prime  251  59  257 

3805913 

So we now have three numbers corresponding to our three sets - what now?
Let us try an experiment, what happens if we divide the number for set A by the numbers for sets B and C?

Well,
    A/B = 59003069239 / 119621 = 493250.0919

Nothing very remarkable about that, but...
    A/C = 59003069239 / 3805913 = 15503

Wow!  It divides in evenly ie. there is no remainder.

In actual fact, if we do this with any two subsets, and we discover that the smaller one divides evenly into the larger one, it is a subset, if it does not, it isn't!

But how fast is this really?  Well, if we used the simple algorithm, we would have had to do 15 comparisons for each of B and C, a total of 30 operations.  Here we did 11 multiplications and 2 divisions, a total of 13 operations!

Great, so where is the snag?  Well, I have glossed over something.  The numbers we are dealing with get very big very quickly, and multiplying large numbers can slow things down (or stop them compeltely if you are restricted to low (ie.less than 4294967296 for 32 bit machines).  If we are dealing with a couple of hundred elements per set, we start getting very VERY large numbers, and any speed advantage is out the window.

So what can we do to keep the numbers smaller?  Well, there are some options.  Let us say that we want to calculate an equation such as the following:

A x B x C x D x E x F
G x H x I x J 

While keeping the numbers small.  Well, we could also look at this equation as:

(A/G) x (B/H) x (C/I) x (D/J) x E x F 

So now we have the product of some fractions... at this stage we are still going to get big numbers, unless, after calculating each fraction, we reduce it to its smallest form.  This is done by dividing the numerator and denominator by their Greatest Common Divisor (GCD).  There are some very very efficient algorithms around for caclulating GCDs so speed is not much of a worry.  We could even have the algorithm only reduce the fractions if they get to large thus making things yet more efficient.  But what is going on when we do this?  Well, essentially when we compute the GCD, we are pairing off any primes which appear both on the top of the fraction and on the bottom, and removing them as we go along.  With this knowledge, we also know that we can increase the chances of primes matching up in this way by sorting both sets in order - as a matter of fact, there is a very good chance they will be in order anyway as this is the most efficient way to deal with sets if you don't want repetition.

Before I finish, I should briefly describe some of the competition. One of these was explained to me by Alexios Chouchoulas, the other by Steve Faulkner.

The first algorithm is as follows: Given a domain of N elements, we will assign each element a bit in a bit-string. We then take the larger would-be superset and express it by flipping the corresponding bit for each element in the set. We do the same for the smaller set. By way of an example, let us take a domain of numbers from 1 to 8, and two sets [1,4,6,7] and [1,4,7]. They would be expressed as follows:

10010110 [1,4,6,7] and
10010010 [1,4,7]

We now perform a bitwise AND on these two bit-strings, a function which is second-nature to computers, the string we get is:

10010010

If we find that this is the same as the smaller set then we can conclude that the smaller set is indeed a subset of the larger set.

This algorithm requires space and time proportional to the size of the domain, which may be problematic.

The second algorithm is also best illustrated with an example:

Let us take two sets, in the form of words, let's say ALPHABET and BAT. First we sort them producing AABEHLPT and ABT. We now take the first letter of the larger string and compare it to the first letter of the other string, they match, we now move onto the second unique letter of the larger string, and the second letter of the smaller string, they match too, we now move onto the fourth letter of the larger string, an 'E', and we find that this does not match the next letter of the smaller string, we then move on to the next, and the next, until we reach 'T' which matches. We have now found all of the letters in the smaller string and so we can conclude that it does indeed match.

The main overhead with this algorithm is the initial sorting of the sets which in the worst case can take time proportional to N.(log2N)*, where N is the length of the set. Having said that, once the set has been sorted the algorithm is extremely quick, taking time proportional to the length of the larger set.

These two methods are both preferable in various different circumstances, the former is preferable in a situation where the domain is reasonably small, the second where the sets come already sorted. One of the questions I have not yet answered is where my prime algorithm is most useful, and whether it will be more efficient than both of the other algorithms I have described here.


If you would like more information on Prime numbers take a look at The Prime Page.

* Thanks to Dick Christoph for this piece of information.