Inspired by the recent googlewhacking fad, I thought that by using the overlap of two searches, you should be able to estimate the size of the thing you are searching in. This assumes lots of unreasonable things about distribution, both in what you search for and in what exists on the web.
Here is the technical part. The results are at the end.
Let T = the total number of pages indexed by Google.
Define g(x) = the number of hits Google finds for x.
Define g(x,y) = the number of hits Google finds for x y.
Define g(x,y,z) = the number of hits Google finds for x y z.
etc.
Let x_f be the frequency with which the term x appears on any page indexed by Google (it's from 0 to 1). I.e. if the term "x" appears on 50% of all pages, then x_f = 0.5.
So,
g(x) = x_f * T
g(x,y) = x_f * y_f * T
g(x,y,z) = x_f * y_f * z_f * T
Solve for T in the second equation:
T = g(x,y) / (x_f * y_f)
T = g(x,y) / ((g(x)/T) * (g(y)/T))   (since x_f = g(x)/T)
T = g(x,y) / ((g(x) * g(y)) / T^2)   (simplifying)
T = g(x,y) * T^2 / (g(x) * g(y))   (simplifying more)
Solve for T and you get
Result 1: T = g(x) * g(y) / g(x,y)
Since you can get values of g straight from Google, you can get trial values for T whenever you want!
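To make that concrete, here is a minimal Python sketch of Result 1 (the function name is mine; the numbers are the Test 1 hit counts from further down):

```python
def estimate_index_size(g_x, g_y, g_xy):
    """Result 1: estimate the total number of indexed pages T
    from two single-term hit counts and their joint hit count."""
    if g_xy == 0:
        raise ValueError("no overlap: the estimate is undefined")
    return g_x * g_y / g_xy

# Test 1 numbers: g(house), g(Estrangle), g(house, Estrangle)
print(estimate_index_size(46_100_000, 94, 24))  # ~1.8e8
```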
You can do a nastier version of the above to find that with n terms x_1, x_2, ..., x_n,
Result 2: T = ((g(x_1) * g(x_2) * ... * g(x_n)) / g(x_1,x_2,...,x_n))^(1/(n-1))
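And a minimal sketch of Result 2 in Python (again, the function name is mine; math.prod needs Python 3.8+):

```python
from math import prod

def estimate_index_size_n(single_hits, joint_hits):
    """Result 2: estimate T from n single-term hit counts and the
    joint hit count for all n terms searched together.
    For n = 2 this reduces to Result 1."""
    n = len(single_hits)
    if n < 2:
        raise ValueError("need at least two terms")
    if joint_hits == 0:
        raise ValueError("no overlap: the estimate is undefined")
    return (prod(single_hits) / joint_hits) ** (1 / (n - 1))

# Test 5 numbers: g(condom), g(open), g(Failing), and the triple
print(estimate_index_size_n([689_000, 60_200_000, 2_200_000], 5_410))  # ~1.3e8
```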
Test Results!
Ok, now the tests! I will do 3 tests with pairs, and 3 tests with triplets. I will get the test words by doing "random node" and taking the first word besides "the" in the node title.
T is the number of pages indexed by Google.
g(x) is the number of hits for "x" on Google.
Test 1
g(house) = 46,100,000
g(Estrangle) = 94
g(house, Estrangle) = 24
T = 1.8e8
Test 2
g(Hostess) = 450,000
g(gene) = 6,240,000
g(Hostess, gene) = 7,310
T = 3.8e8
Test 3
g(Toronto) = 7,050,000
g(Brassclaw) = 836
g(Toronto, Brassclaw) = 2
T = 2.9e9
Test 4
g(Unlikeness) = 3,890
g(Ophidiophobia) = 956
g(Adobe) = 7,160,000
g(Unlikeness, Ophidiophobia, Adobe) = 0
T = ?
Test 5
g(condom) = 689,000
g(open) = 60,200,000
g(Failing) = 2,200,000
g(condom, open, Failing) = 5,410
T = 1.3e8
Test 6
g(Two) = 112,000,000
g(Genesis) = 2,880,000
g(brilliant) = 2,560,000
g(Two, Genesis, brilliant) = 43,000
T = 1.4e8
Results
I got 5 results for T: 1.8e8, 3.8e8, 2.9e9, 1.3e8, and 1.4e8. The average is 7.6e8 = 760,000,000.
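For the record, here is a quick script that reproduces that average from the raw hit counts above (Test 4 is skipped because its joint count is zero):

```python
# (g(x), g(y), g(x,y)) for Tests 1-3
pairs = [
    (46_100_000, 94, 24),          # house, Estrangle
    (450_000, 6_240_000, 7_310),   # Hostess, gene
    (7_050_000, 836, 2),           # Toronto, Brassclaw
]
# ((g(x), g(y), g(z)), g(x,y,z)) for Tests 5-6
triples = [
    ((689_000, 60_200_000, 2_200_000), 5_410),      # condom, open, Failing
    ((112_000_000, 2_880_000, 2_560_000), 43_000),  # Two, Genesis, brilliant
]
estimates = [x * y / xy for x, y, xy in pairs]
estimates += [(x * y * z / xyz) ** 0.5 for (x, y, z), xyz in triples]
print(sum(estimates) / len(estimates))  # ~7.6e8
```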
After doing all this, someone told me that Google actually claims to index 2e9 = 2,000,000,000 pages. So, the accuracy is not too horrible.
I am sure this has already been done in real math, and has a real name. Does anyone know it?
I know that Google publicizes the number of pages they claim to index. However, this technique can be used on other search engines as well, whose claims may be in doubt.