The
set of
prime numbers is certainly not a
regular language.
Unfortunately this has not been proven yet and will probably never be because of the
countless ways existing to represent a
number.
There is indeed a great difference between the concept of number and the
characters used
to represent them (22
decimal = 16
hexadecimal = 10110
binary = twenty two = vingt-deux = ...).
Not to mention that they can be represented as a
product of
primes, which makes
primality testing very very easy, except that writing them in that form isn't that easy.
Using finite state machines to deal with prime numbers sounds interesting.
Determining the acceptance of a string by a deterministic finite automaton (DFA) is a really
fast process.
Besides that, this process has the great property that it can be inverted :
every path linking an entry point to an exit point of the state machine gives an accepted string,
thus the same state machine could be used for prime number generation and primality testing,
with the great advantage that the process is in O(n) for both ways.
There aren't many papers out there which explain what doesn't work. Yet, understanding
why things are intrinsically impossible is as important as understanding why others work.
As you'll see, finding out why finite state machines are inefficient to deal with prime
numbers points out interesting results on the set of prime numbers and deterministic finite
automata.
Prime numbers and base 1
A prime number p written in base 1 consists of p times the digit 1, and therefore its length
is the prime number p. But unfortunately this language, as shown in "1^p" p prime language
is not a regular language. The proof points out that a state machine with an infinite number
of states is required to match this language.
Nevertheless, it takes only a few modifications to have a state-machine-like algorithm for
the language. One idea is to introduce a capture buffer : as the token advances in the state
machine, a buffer is filled with characters. The buffer can then be tested against the
remaining characters.
The idea behind this is to fill the capture buffer with at least two characters and test
if this sequence is repeated at least two times to represent the number. It is therefore
possible to do primality testing with Perl regexs. But these are not state machines.
When acceptance of the string fails, it means that the
choice of the buffer was bad. The algorithm has to start over with a new buffer until there
are no more cases left. This method is in O(n2) and is only of theoretical interest.
The important thing shown here is that state machines do not know how to count.
This means that they can only count up to a finite given value, but not an arbitrarily
big one. For example there exists state machines that will recognize odd numbers (eg.
"1(11)*"), even numbers (eg. "(11)*") or numbers which are equal to three modulus 5
(eg. "(11111)*111"), because in those cases they only need to count up to 1, 2, 3 and 5.
For prime numbers, they would have to count up to that prime number (or its square root, etc...)
and it can be arbitrarily big. One can only build a state machine that tests divisibility by a finite set of numbers, for example "(11)*|(111)*|(11111)*|(1111111)*" will accept any integer divisible by 2, 3, 5 or 7.
Prime numbers and base 2
State machines are meant to recognize patterns.
In base 2 (aka. binary), prime numbers are represented by a succession of '0's and '1's
and therefore this base might be more adapted to primality testing since it creates "patterns".
Say I write a dot (.) for each 0 of the binary representation of the prime, and a X for
each 1. If I write that way the prime numbers from 3 to 100, I get an interesting diagram
which is reproduced below.
Two things are important to notice about it :
can repeating patterns be found and how does its width expand.
XX
X.X
XXX
XX.X
X.XX
X...X
XX..X
XXX.X
X.XXX
XXXXX
X.X..X
X..X.X
XX.X.X
XXXX.X
X.X.XX
XX.XXX
X.XXXX
XX....X
XXX...X
X..X..X
XXXX..X
XX..X.X
X..XX.X
X....XX
X.X..XX
XXX..XX
XX.X.XX
X.XX.XX
X...XXX
XXXXXXX
XX.....X
X..X...X
XX.X...X
X.X.X..X
XXX.X..X
X.XXX..X
XX...X.X
XXX..X.X
X.XX.X.X
XX..XX.X
X.X.XX.X
XXXXXX.X
X.....XX
X.X...XX
XXX...XX
XX..X.XX
XXXXX.XX
XX...XXX
X.X..XXX
X..X.XXX
XXXX.XXX
X...XXXX
XX.XXXXX
X.......X
XXX.....X
X.XX....X
XXXX....X
X.X.X...X
X..XX...X
XX.XX...X
X.X..X..X
XX..XX..X
XXX.XX..X
X..XXX..X
X.XXXX..X
XX.X..X.X
X...X.X.X
XX.XX.X.X
X.XXX.X.X
X....XX.X
XXX..XX.X
XXXX.XX.X
X.X.XXX.X
XXXXXXX.X
X.X....XX
X.XX...XX
X...X..XX
X..XX..XX
XX...X.XX
X.X..X.XX
XXXX.X.XX
X...XX.XX
XXX.XX.XX
XX.XXX.XX
X.....XXX
X..X..XXX
X.XX..XXX
XXXX..XXX
XX..X.XXX
XXXXX.XXX
XXX..XXXX
XX.X.XXXX
XX..XXXXX
XXX.XXXXX
X.XXXXXXX
X..X.....X
XX.X.....X
The characters . (dot) and X have been chosen because they contrast better than
0 and 1. The digits are written backwards : 19 should be written
X..XX instead of XX..X
The diagram will widen indefinitely because the set of prime numbers is infinite.
Gauss stated that Π(n), the number of primes ≤ n, is equivalent to n/ln(n).
Note that Π-1(k) gives the kth prime number.
Gauss's equivalent is very good, indeed 0.9*n/ln(n) < Π(n) < 1.1*n/ln(n)
for not so large values of n (n ≥ 100).
There are Π(2k+1) - Π(2k)
prime numbers between 2k and 2k+1, ie. that have k binary digits.
That represents roughly 2k/ln(2)*(2/(k+1)-1/k) primes, which is equivalent to
2k/(k*ln(2)) for large values of k.
But this is not likely to be the number of different words with a given length of a regular
language. Say φ(y) is the number of words of a regular language, formed
of y letters.
It can be proven that φ(y) is a recurrent linear sequence (see regular language),
and thus the (real) solutions are linear combinations of terms of the form P(y)*exp(λ*y),
where P is a polynom and λ a real number. It will never be equivalent to
2k/(k*ln(2)).
This yields an interesting result : the growth of a regular language is
either equivalent to a linear function of positive or null slope,
or equivalent to a function of the form
μ*ykexp(λ*y), k integer ≥ 0 and λ, μ real number > 0.
The growth of the set of prime numbers is not compatible with that of a regular language,
therefore it isn't a regular language.
Prime numbers and base n
It would be quite complex to prove that the set formed of the base n representations
of prime numbers isn't a regular language using the pumping lemma as is is done in
"1^p" p prime language and in "p in binary" p prime language because it involves heavy arithmetic. But it is reasonable to think that is can be done.
However, the estimation of words of length k remains
valid for other bases; in base n
it is equivalent to nk/(k*ln(n)), which is not the growth of
a regular language. Prime numbers written in any base never form a regular language.
Subsets of prime numbers and prime number generation
Suppose I wish to generate prime numbers. I don't need a function that is able to
generate every prime number, but an infinite subset of primes will do. Hence the question :
is it possible to build a deterministic finite automaton that will generate an
infinite set of primes ? The answer is unfortunately no.
The idea behind this method would be to take a bunch of prime numbers and to mix them up in order to generate a new prime number. Since the set of base n representations of prime numbers isn't a regular language, thanks to the pumping lemma it is possible to show that none of the prime numbers can be decomposed in the form uvw such that uviw is a prime number for every given integer i. It is proven for bases 1 and 2 and it is reasonable to think that it is true for other bases.
A method is to use an algorithm that gives numbers which are mostly primes and use a probabilistic test to check out the bad ones. For example it is known that many numbers of the form 22n-1 and 22n+1 (2^(2^n) +/- 1) are prime numbers.
Closing words
Deterministic finite automata aren't suited to deal with prime numbers. Even though methods can be derived from them, they lack performance results and are only of theoretical interest. If you want to play with prime numbers, forget state machines, but remember that you can learn from things that don't work.
This probably means that our usual "base n" representation of numbers isn't well suited to deal with prime numbers. Nevertheless it show for sure that there is more to numbers than their representation. Writing a number in base n is only performing a simple bijection between the set of numbers and the set of their base n representation. Maybe great improvements will be achieved in prime number theory when a better representation technique is found for numbers.
Primality testing and prime number generation are topical subjects because it's the heart of cryptography, the only way to transmit secured documents today. The most efficient way to accomplish these two tasks nowadays is to use probabilistic methods. State machines are useful to find logic in apparently random languages, still they are very limited.