Data Science Method to Discover Large Prime Numbers
Large prime numbers have been a topic of considerable research, for its own mathematical beauty, as well as to develop more powerful cryptographic applications and random number generators. In this article, we show how big data, statistical science (more specifically, pattern recognition) and the use of new efficient, distributed algorithms, could lead to an original research path to discover large primes. Here we also discuss new mathematical conjectures related to our methodology.
Much of the focus so far has been on discovering raw large primes: Any time a new one, bigger than all predecessors, is found, it gets a lot of attention even beyond the mathematical community, see here. Here we explore a different path: finding numbers (usually not primes) that have a very large prime factor. In short, we are looking for special integer-valued functions f(n) such that f(n) has a prime factor bigger than n, hopefully much bigger than n, for most values of n.
The distribution of the largest prime factor has been studied extensively, see here and here and also here. If we choose a function that grows fast enough, one would expect that the largest prime factor of f(n) will always be larger than n. However, this would lead to intractable factoring issues to find the large primes in question. So in practice, we are interested in functions f(n) that do not grow too fast. The problem is that many, if not most very large integers, are friable : their largest prime factor is a relatively small prime. I like to call them porous numbers. So the challenge is to find a function f(n) that is not growing too fast, and that somehow produces very few friable numbers as n becomes extremely large.
Results for f(n) with a fast growth
Let’s start with two functions f(n) that are not good candidates (because they grow too fast), but are interesting for historical reasons, and also because they provide a simple illustration: f(n) = 2^n + 3, and f(n) = 2^n + 1. Here 2^n is two at power n, for instance 2^3 = 2 * 2 * 2 = 8. The results are shown in Figure 1 below.
Figure 1: Largest prime factor associated with two simple, fast growing functions f(n)
Interestingly, in Figure 1, except in one instance, the largest prime factor of f(n) is always bigger than n, at least up to n = 24. The numbers highlighted in green are actually prime factors identical to f(n), and thus much bigger than n. Note that for f(n) = 2^n + 1, the number f(n) is prime itself if and only if n = 2, 4, 8, or 16, according to a famous conjecture. Likewise, for f(n) = 2^n – 1 to be a prime number (called Mersenne prime), n must be prime, though this condition is far from being enough. The largest prime numbers known to date are all Mersenne primes, and have been discovered using HPC (high performance computing) techniques, the largest one currently being 2^74207281 – 1. The search for such large primes is actually used to test the performance of some HPC systems.
Here is an interesting conjecture: For n > 3, is the largest prime factor of f(n) always bigger than n? Note that on average (actually if you consider the median, not the average), the largest prime factor of f(n) based on the above table is 18.7 times larger than n if f(n) = 2^n +3, and 22.0 times larger than n if f(n) = 2^n +1. I conjecture that for very large n, the differences between the two functions smooth out, and eventually disappear as n gets big enough. The function f(n) = 2^n + 3 seems to produce more raw primes (highlighted in green), at least up to n = 24.
Results for functions f(n) with a slow growth
Here we explore the following functions: f(n) = n^2 +1, f(n) = n^2 – 2, and f(n) = n^2 +3. While they are too simple to produce large prime factors useful for cryptosystems, the methodology can be used to build random number generators.
Figure 2: Largest prime factor associated with f(n) = n^2 +1 (sample)
In Figure 2, the numbers highlighted in red are largest prime factors that are smaller than n. Overall, about 80% of the largest prime factors of f(n), for n between 2 and 3,900, are larger than n. However this proportion drops as n gets bigger, eventually dropping to about 0% for very large values of n. Thus, it is interesting to study which values of n are most likely to produce large prime factors for f(n), or to find better functions for f(n).
We found the following results, for n is between 2 and 3,900:
- f(n) = n^2 + 1: About 80% of the largest prime factors of f(n) are larger than n. The median ratio for the largest prime factor of f(n), divided by n, is 13.
- f(n) = n^2 – 2: About 83% of the largest prime factors of f(n) are larger than n. The median ratio for the largest prime factor of f(n), divided by n, is 17.
- f(n) = n^2 + 3: About 76% of the largest prime factors of f(n) are larger than n. The median ratio for the largest prime factor of f(n), divided by n, is 9.
So among these three functions, the second one, f(n) = n^2 – 2, has the best performance, at least for small values of n up to n = 3,900. Note that to perform these tests, we used a table of the first million prime numbers, available here (the full table actually lists the first 100 million primes.) We use a brute force, rudimentary algorithm to find the largest prime factors. See next section to find out how to significantly improve the methodology.
We did a bit of exploratory analysis (pattern analysis), and we found that:
- If f(n) = n^2 + 1 and Mod(n, 10) = 4 (Mod is the modulo function) then the proportion of largest prime factors of f(n) that are greater than n, increases from 80% to 89% (for n between 2 and 3,900.)
- If f(n) = n^2 + 1 and Mod(n, 10) = 7, then the proportion decreases from 80% to 71%.
- If f(n) = n^2 – 2 and Mod(n, 10) = 8, then the proportion decreases from 83% to 77%
- If f(n) = n^2 – 2 and Mod(n, 10) = 9, then the proportion increases from 83% to 88%
- If f(n) = n^2 + 3 and Mod(n, 10) = 3, then the proportion decreases from 76% to 65%
- If f(n) = n^2 + 3 and Mod(n, 10) = 4, then the proportion increases from 76% to 84%
So finding a prime number greater than n can be performed by checking a few values (say n, n +10, n + 20) where (say) Mod(n, 10) =4, looking for the largest prime factor of f(n) until you find one that is bigger than n, usually after just one iteration. Factoring f(n) is not more complicated than factoring any large number, as it might have some small prime factors too. It might not be easier either, and some suggestions are provided in the next section, that involve partial factoring.
Improvements, potential research areas
Here is an interesting project to hone your data science kills: Extending my analysis to work with much larger n, using more sophisticated pattern analysis, testing various functions f(n), and developing a good algorithm to compute the largest prime factors. The first step could be to use the full table of 100 million primes, rather than only the first million, and possibly integrate this data using a distributed architecture.
Some algorithms to compute the largest prime factors can be found here, and this is just a starting point. A possible way to factor f(n) is to look for all primes smaller than (say) n that divide f(n), and check if the leftover of the division (the unfactored part) is a prime, using a primality test. Such tests are very efficient. Some very fast tests can tell you with an extremely high probability of being correct, whether the number in question is prime or not, without doing any real factoring. The test actually detects probable primes.
- Services: Hire a Data Scientist | Search DSC | Classifieds | Find a Job
- Contributors: Post a Blog | Ask a Question
- Follow us: @DataScienceCtrl | @AnalyticBridge
- Difference between Machine Learning, Data Science, AI, Deep Learning, and Statistics
- What is Data Science? 24 Fundamental Articles Answering This Question
- Hitchhiker’s Guide to Data Science, Machine Learning, R, Python
- Advanced Machine Learning with Basic Excel