History of Cyc
In 1984, virtually every university with a
Computer Science department had one or two
Artificial Intelligence projects, some so ambitious that their
leaders were positive they would surpass human reasoning in a matter
of years. Dr. Douglas B. Lenat was a professor in Stanford's CS department
who had his own radical ideas about the exciting new field, ideas
based on the current state of the art rather than on the creation of
something entirely new. Lenat envisioned a twenty-year,
twenty-five-million-dollar program that would produce something
unprecedented in the world of AI: a program that had common sense.
He named the project Cyc, after encyclopedia.
Since Dr. Lenat
knew that no university would be willing to take on such a large
investment in time and money, he decided to look for money outside of
academia. He found his funding in the Microelectronics and Computer
Technology Corp. (MCC), a research consortium backed by a number of
large corporations. MCC gave Lenat a long-term
contract and he began work on Cyc, eventually hiring a staff of
fifteen people to work on the knowledge base. Many of
the ambitious claims about AI turned out to be false, and the
bottom dropped out of the market in the late 80s. Fortunately, Dr.
Lenat's contract outlasted that depression, and he was able
to continue work on Cyc all the way through it.
By 1995 Lenat's project was up to thirty employees, from areas as
diverse as linguistics, philosophy, and anthropology, all adding facts about
human existence to the database daily. He decided to create a company,
Cycorp, to market Cyc-based products and help integrate them with customers'
systems. Today Cycorp is still working hard on the knowledge base, having
entered over one million facts in the past fourteen years. Lenat
is also at work finding new frontiers to which his software can be
applied, and designing the system extensions that will allow it to go
there.
How Cyc Works
The Cyc knowledge
base is built entirely out of what Cycorp calls constants, which are
written in CycL, a proprietary language that's based on first-order
predicate calculus. Constants are simultaneously the rules which the
knowledge base uses in inferencing, and the subjects about which
inferences are made. That is to say, each constant is
self-descriptive, but it also has relations to all of the other
constants, so the Cyc evaluator knows how it relates to the rest of
the constants in the knowledge base. CycL constants represent
things, collections, attributes, relationships, functions, and so
on; taken together they make up the sum of Cyc's knowledge about the
real world, and are the root of its apparent common sense.
Since CycL constants are written in a form of predicate calculus, they
relate to one another with strict logical formality. This allows well-defined
relationships to be found between any two constants in the knowledge
base. It also lets Cyc draw logical inferences about the constants,
producing information that was never explicitly entered into the
knowledge base. Inferencing is the basis of Cyc's apparent ability to
reason, and because it strictly follows the rules of logic (modus
ponens, resolvable existential quantification, etc.), every new
piece of information can be traced back through the steps that
produced it and understood by a human if necessary.
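The idea of deriving facts that were never entered, while keeping each derivation traceable, can be sketched in a few lines of Python. This is only an illustration and not Cyc's inference engine: the tuple representation, the two hard-coded rules, and names such as Fido and Dog are assumptions made for the example. Each rule is an implication applied by modus ponens, and every derived fact remembers the premises it came from.

```python
# Facts are (predicate, subject, object) tuples. The names are illustrative,
# not actual Cyc constants.
facts = {
    ("isa", "Fido", "Dog"),
    ("genls", "Dog", "Mammal"),
    ("genls", "Mammal", "Animal"),
}

def forward_chain(facts):
    """Apply two rules until nothing new appears, recording why each fact holds.

    Rule 1: isa(a, b)   and genls(b, d)  =>  isa(a, d)
    Rule 2: genls(a, b) and genls(b, d)  =>  genls(a, d)
    """
    known = set(facts)
    why = {}  # derived fact -> the pair of premises that produced it
    changed = True
    while changed:
        changed = False
        for (p1, a, b) in list(known):
            if p1 not in ("isa", "genls"):
                continue
            for (p2, c, d) in list(known):
                if p2 != "genls" or b != c:
                    continue
                new = (p1, a, d)
                if new not in known:
                    known.add(new)
                    why[new] = ((p1, a, b), (p2, c, d))
                    changed = True
    return known, why

known, why = forward_chain(facts)
print(("isa", "Fido", "Animal") in known)  # True
print(why[("isa", "Fido", "Animal")])      # the two premises used
```

The `why` table is what makes the result inspectable: a person can follow any derived fact back to the entered facts that support it.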
For a new concept to be introduced to Cyc, it
must be stated as a CycL constant. Once entered, that constant is
integrated into the rest of the database, again using the rules of
logic. Resolution is run against all of the pre-existing knowledge to
make sure the new constant is at least a prima facie valid
description of reality, that is, that it does not contradict what Cyc
already knows. If resolution succeeds, the constant becomes a part of
the knowledge base, and thus can itself be used in further inferences.
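A much-simplified version of this admission check can be written with propositional resolution. The sketch below is an assumption-laden toy, not Cycorp's procedure: clauses are sets of string literals, "~" marks negation, and the bird example is invented. A new clause is accepted only if adding it does not allow the empty clause (a contradiction) to be derived.

```python
from itertools import combinations

def negate(lit):
    """Flip a literal: 'Bird' <-> '~Bird'."""
    return lit[1:] if lit.startswith("~") else "~" + lit

def resolve(c1, c2):
    """Return every resolvent of two clauses (frozensets of literals)."""
    out = []
    for lit in c1:
        if negate(lit) in c2:
            out.append(frozenset((c1 - {lit}) | (c2 - {negate(lit)})))
    return out

def is_consistent(clauses):
    """Saturate under resolution; deriving the empty clause means contradiction."""
    clauses = set(clauses)
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            for r in resolve(c1, c2):
                if not r:
                    return False
                new.add(r)
        if new <= clauses:  # nothing new can be derived
            return True
        clauses |= new

def try_assert(kb, clause):
    """Add the clause to the knowledge base only if the result stays consistent."""
    if is_consistent(kb + [clause]):
        kb.append(clause)
        return True
    return False

kb = [
    frozenset({"~Bird", "CanFly"}),  # Bird implies CanFly
    frozenset({"Bird"}),
]
print(try_assert(kb, frozenset({"~CanFly"})))      # False: contradicts the KB
print(try_assert(kb, frozenset({"HasFeathers"})))  # True: accepted and stored
```

A first-order system would also need unification and some way to bound the search, both of which this propositional toy leaves out.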
When information must be synthesized
for resolution (i.e., when the new constant was created from a
question), it exists as an existential quantification within the
knowledge base. This quantification is automatically skolemized and
resolved, and the result is returned to the user.
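The skolemization step on its own is mechanical, and a small sketch may make it concrete. The representation here is an assumption for illustration (a list of quantifiers followed by a formula written as nested tuples, with invented predicate names); it shows only how existential variables are replaced, not how Cyc then resolves the result.

```python
from itertools import count

def skolemize(prefix, matrix):
    """Remove existential quantifiers from a formula in prenex form.

    prefix: list of ("forall" | "exists", variable_name) pairs, outermost first
    matrix: nested tuples such as ("loves", "x", "y"); element 0 is the
            predicate or function name, the rest are arguments
    Returns (universal_variables, rewritten_matrix).
    """
    fresh = count(1)
    universals = []
    replacement = {}
    for quantifier, var in prefix:
        if quantifier == "forall":
            universals.append(var)
        else:
            name = f"sk{next(fresh)}"
            # A Skolem constant when no universal encloses the variable,
            # otherwise a Skolem function of the enclosing universals.
            replacement[var] = (name, *universals) if universals else name

    def substitute(term):
        if isinstance(term, tuple):
            return (term[0], *(substitute(arg) for arg in term[1:]))
        return replacement.get(term, term)

    return universals, substitute(matrix)

# "There exists an x that is a Dog" becomes a statement about a constant sk1.
print(skolemize([("exists", "x")], ("isa", "x", "Dog")))
# ([], ('isa', 'sk1', 'Dog'))

# "Everyone loves someone": y depends on x, so it becomes sk1(x).
print(skolemize([("forall", "x"), ("exists", "y")], ("loves", "x", "y")))
# (['x'], ('loves', 'x', ('sk1', 'x')))
```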
Applications of Cyc
Cyc's ability to use human-like sensibilities to interpret data makes
it well suited to the currently popular field of Data Mining. Data
Mining involves searching a database that's much too large to be
evaluated by a human, and extracting only the small portion that is
meaningful for the task at hand. Traditionally, this is done by a programmer, who
writes SQL queries that look through the database and match one or
two features to the rules of what's needed. While this approach
works well for simply defined categories of data (e.g. doctors with
last name Smith, patients over the age of 60), it doesn't begin to
work with more abstract rules.
Cyc already has knowledge about the real world. When it is also given
knowledge about the database to be mined (usually through an interface
programmed by Cycorp), it can combine the two and return database
results that make sense in reality. The data flow
might look something like this: A user puts a question about the
database to the Cyc front end in plain English. Cyc's Natural Language
model converts the query into a statement in CycL. That statement is
back-chained through the rest of the knowledge base until it matches
information that's actually in the database. All of the necessary
fields in the database are then retrieved using SQL queries built into
Cycorp's interface. Finally, the information is checked against the
rest of Cyc's knowledge base to make sure the results are consistent
with reality, and returned to the user.
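The middle of that data flow, from an abstract goal down to an SQL query, can be sketched as follows. Everything specific here is hypothetical: the rule table, the patients schema, and the AtRiskPatient concept are invented for the example, the natural-language step is skipped, and Cycorp's actual interface is not shown. The point is only that back-chaining keeps expanding a goal until each piece corresponds to something the database can test.

```python
import sqlite3

# Hypothetical rules: a goal holds when all of its sub-goals hold.
RULES = {
    "AtRiskPatient": ["ElderlyPatient", "Smoker"],
}
# Hypothetical mapping from leaf goals to conditions on real columns.
DB_CONDITIONS = {
    "ElderlyPatient": "age > 60",
    "Smoker": "smoker = 1",
}

def back_chain(goal):
    """Expand a goal until every leaf is a condition the database can test."""
    if goal in DB_CONDITIONS:
        return [DB_CONDITIONS[goal]]
    if goal not in RULES:
        raise KeyError(f"no rule or database mapping for {goal}")
    conditions = []
    for sub_goal in RULES[goal]:
        conditions.extend(back_chain(sub_goal))
    return conditions

def build_sql(goal):
    """Turn the expanded conditions into a single SQL query."""
    where = " AND ".join(back_chain(goal))
    return f"SELECT name FROM patients WHERE {where} ORDER BY name"

# A tiny in-memory database standing in for the one being mined.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (name TEXT, age INTEGER, smoker INTEGER)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [("Smith", 72, 1), ("Jones", 45, 1), ("Lee", 66, 0)],
)

names = [row[0] for row in conn.execute(build_sql("AtRiskPatient"))]
print(names)  # ['Smith']
```

The user never has to know that "at risk" means two column tests; that knowledge lives in the rules, which is the part Cyc is meant to supply.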
Another, related
application for Cyc is classifying opaque information: media such as
sound clips, video, pictures, or any other non-text data. In the past
this has been done by captioning, in the old-fashioned manner of a
card catalog. Each item was
given a small text description (a caption), and the descriptions were
cross-referenced based on each word. Thus, a reporter looking for
stock pictures of bad used cars for a newspaper article would search
for, say, Mazda, and then leaf through the pictures, picking out all
of the clunkers on file. The reporter would probably go on to look
through all the Fords, Chryslers, and so forth to find all the
pictures he or she needed for the article.
If the data were
stored in a database with Cyc as a front end, this search would be
much easier for the reporter. Instead of doing a handful of searches
and leafing through hundreds of photographs, the journalist could
simply make a query like "Used cars that you wouldn't want to
buy." As with the Data Mining application, Cyc would turn this
into a query in its own language, evaluate it against the reality of
bad used cars, and return a list of photographs from the database.
Since the photographs had already been captioned for the
old-fashioned system, relatively little work would have to be done to
create the database that Cyc would use.
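A toy version of concept-based caption search shows why the existing captions would be enough to start with. The mini-ontology, file names, and captions below are all invented for illustration, and a real system would reason far more richly than this word lookup. Each caption word is mapped to a broader concept, and a photograph matches when its caption covers every concept in the query.

```python
# Hypothetical mini-ontology: caption word -> broader concept.
GENLS = {
    "Mazda": "Car", "Ford": "Car", "Chrysler": "Car",
    "rusty": "PoorCondition", "dented": "PoorCondition",
}

# Hypothetical captions carried over from the old card-catalog system.
CAPTIONS = {
    "img001.jpg": "rusty Mazda in a parking lot",
    "img002.jpg": "new Ford at a dealership",
    "img003.jpg": "dented Chrysler with a flat tire",
    "img004.jpg": "rusty bicycle",
}

def concepts_in(caption):
    """Collect the broader concepts mentioned by a caption's words."""
    return {GENLS[word] for word in caption.split() if word in GENLS}

def search(required_concepts):
    """Return the files whose captions cover every required concept."""
    required = set(required_concepts)
    return sorted(
        name for name, caption in CAPTIONS.items()
        if required <= concepts_in(caption)
    )

# "Used cars you wouldn't want to buy", reduced here to two concepts.
print(search(["Car", "PoorCondition"]))  # ['img001.jpg', 'img003.jpg']
```

One query replaces the separate Mazda, Ford, and Chrysler searches, because the make names are all known to be kinds of car.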
These two applications are the ones Dr. Lenat is advocating
most heavily for Cycorp's business, but there are plenty of other possibilities.
Indeed, Lenat sees an eventual use for Cyc in every situation
where a human needs to communicate with a computer. With a copy of Cyc
running behind each application and overseeing its interface with the
user, software could, virtually overnight, start making much more
sense and become more intuitive to use.