|
BACK
TO MY HOMEPAGE
Testing Subsets Using Prime Numbers
Ian Clarke March-May 1997
During a recent project which I was pondering, the problem arose of
how to test whether one set is a subset of another, and it had to be fast.
To illustrate things, let us create three sets, set A consists of the whole
numbers {17,12,54,81,55}, set B consists of the whole numbers {16,19,12},
and set C consists of the whole numbers {54,17,55}. How can we design
an algorithm to test which, if any, of sets B and C is a subset of set
A?
The obvious algorithm is to, for each element of the smaller set, test
to see whether it is in the larger set, and accept the smaller set as a
subset if all of its elements can be found in the larger set. This
algorithm works fine, with one exception, it is slooooooow. Let us
say our two sets have 3000 and 2000 elements respectively (obviously we
are testing to see whether the smaller set is a subset of the larger set).
We must do 3000 X 2000 comparisons ie. 6000000 comparisons! This
is definitely not acceptable. Back to the drawing board.
I eventually came up with an idea. Let us say we are dealing only
with whole numbers between 1 and 100 (although the same idea will work
with any finite domain of possible elements). We shall assign a unique
prime number to each element, as follows:
| Number |
Prime |
|
1
|
2
|
|
2
|
3
|
|
3
|
5
|
|
4
|
7
|
|
5
|
11
|
|
6
|
13
|
...all the way up to 100. When we do this for the elements we are
dealing with above, we get the table:
| Number |
12 |
16 |
17 |
19 |
54 |
55 |
81 |
| Prime |
37 |
53 |
59 |
61 |
251 |
257 |
419 |
Now, for each set, obtain the corresponding prime for each element, and
find their product:
Set A:
| Number |
17 |
12 |
54 |
81 |
55 |
Product
|
| Prime |
59 |
37 |
251 |
419 |
257 |
59003069239 |
Set B:
| Number |
16 |
19 |
12 |
Product |
| Prime |
53 |
61 |
37 |
119621
|
Set C:
| Number |
54 |
17 |
55 |
Product |
| Prime |
251 |
59 |
257 |
3805913
|
So we now have three numbers corresponding to our three sets - what
now?
Let us try an experiment, what happens if we divide the number for set
A by the numbers for sets B and C?
Well,
A/B = 59003069239 /
119621 = 493250.0919
Nothing very remarkable about that, but...
A/C = 59003069239 /
3805913 = 15503
Wow! It divides in evenly ie. there is no remainder.
In actual fact, if we do this with any two subsets, and we discover that
the smaller one divides evenly into the larger one, it is a subset, if
it does not, it isn't!
But how fast is this really? Well, if we used the simple algorithm,
we would have had to do 15 comparisons for each of B and C, a total of
30 operations. Here we did 11 multiplications and 2 divisions, a
total of 13 operations!
Great, so where is the snag? Well, I have glossed over something.
The numbers we are dealing with get very big very quickly, and multiplying
large numbers can slow things down (or stop them compeltely if you are
restricted to low (ie.less than 4294967296 for 32 bit machines).
If we are dealing with a couple of hundred elements per set, we start getting
very VERY large numbers, and any speed advantage is out the window.
So what can we do to keep the numbers smaller? Well, there are some
options. Let us say that we want to calculate an equation such as
the following:
A x B x C x D x E x F
G x H x I x J
While keeping the numbers small. Well, we could also look at this
equation as:
(A/G) x (B/H) x (C/I) x (D/J) x E x F
So now we have the product of some fractions... at this stage we are still
going to get big numbers, unless, after calculating each fraction, we reduce
it to its smallest form. This is done by dividing the numerator and
denominator by their Greatest Common Divisor (GCD). There are some
very very efficient algorithms around for caclulating GCDs so speed is
not much of a worry. We could even have the algorithm only reduce
the fractions if they get to large thus making things yet more efficient.
But what is going on when we do this? Well, essentially when we compute
the GCD, we are pairing off any primes which appear both on the top of
the fraction and on the bottom, and removing them as we go along.
With this knowledge, we also know that we can increase the chances of primes
matching up in this way by sorting both sets in order - as a matter of
fact, there is a very good chance they will be in order anyway as this
is the most efficient way to deal with sets if you don't want repetition.
Before I finish, I should briefly describe some of the competition.
One of these was explained to me by Alexios
Chouchoulas, the other by Steve
Faulkner.
The first algorithm is as follows: Given a domain of N elements, we
will assign each element a bit in a bit-string. We then take the larger
would-be superset and express it by flipping the corresponding bit for
each element in the set. We do the same for the smaller set. By way of
an example, let us take a domain of numbers from 1 to 8, and two sets [1,4,6,7]
and [1,4,7]. They would be expressed as follows:
10010110 [1,4,6,7] and
10010010 [1,4,7]
We now perform a bitwise AND on these two bit-strings, a function which
is second-nature to computers, the string we get is:
10010010
If we find that this is the same as the smaller set then we can conclude
that the smaller set is indeed a subset of the larger set.
This algorithm requires space and time proportional to the size of the
domain, which may be problematic.
The second algorithm is also best illustrated with an example:
Let us take two sets, in the form of words, let's say ALPHABET and BAT.
First we sort them producing AABEHLPT and ABT. We now take the first letter
of the larger string and compare it to the first letter of the other string,
they match, we now move onto the second unique letter of the larger string,
and the second letter of the smaller string, they match too, we now move
onto the fourth letter of the larger string, an 'E', and we find that this
does not match the next letter of the smaller string, we then move on to
the next, and the next, until we reach 'T' which matches. We have now found
all of the letters in the smaller string and so we can conclude that it
does indeed match.
The main overhead with this algorithm is the initial sorting of the
sets which in the worst case can take time proportional to N.(log2N)*,
where N is the length of the set. Having said that, once the set has been
sorted the algorithm is extremely quick, taking time proportional to the
length of the larger set.
These two methods are both preferable in various different circumstances,
the former is preferable in a situation where the domain is reasonably
small, the second where the sets come already sorted. One of the questions
I have not yet answered is where my prime algorithm is most useful, and
whether it will be more efficient than both of the other algorithms I have
described here.
If you would like more information on Prime numbers take a look at The
Prime Page.
* Thanks to Dick
Christoph for this piece of information.
|