January 28, 1999; CS200 Undergraduate Colloqium
Roy Goldman is a PhD student
in Professor Jennifer Widom's database group working on the Lore XML
Database Management System (DBMS) Project. Mr. Goldman did his undergraduate
studies at Berkeley and originally planned on pursuing research in operating
systems, but found databases to be much more suitable than working with
low-level OS internals. The Lore project was started 3 years ago at Stanford
and consists of Prof. Widom, Mr. Goldman, and Jason McHugh among many others.
Today, Mr. Goldman discussed the history of relational databases, the new
XML standard and the hype surrounding it, the LORE DBMS, and also his own
research called DataGuides, which is an aid for searching and indexing
databases. He then discusses the problems and solutions involved in providing
users with the capability to search for keywords in XML databases and closes
with the direction he believes that XML and Lore will take in the future.
Database Overview
-
Relational databases (RDBs) have been the dominant type of databse in the
industry since they were first introduced by Ted Codd in 1970. Their popularity
is due to their ability to manage large amounts of information in a clean
mathematical model. The core companies in the RDB industry are Oracle,
Microsoft, IBM, Informix and Sybase, although he commented that the latter
two have been having difficulty recently.
-
What is a relation? It is a named table with column headings which is linked
with other such tables to form databases. For example, a table of books
which contains the names of the book authors could reduce storage by maintaining
joins
for each book by a certain author to another table of authors where the
name would be stored. The standard language for any RDBMS is SQL.
-
RDBs are generally only good for well-structured data and private data
within an organization. The relation model breaks down when applied to
irregular or fast-changing structures or when databases with different
schema try to interoperate.
XML
-
XML is a standard for describing data on the Web and its model is based
on nested, tagged data. The hype surrounding it is enormous,
considering the simplicity of the concept. XML databases can be thought
of as graph with links betweeen nodes, because it allows users to define
their own tags which can operate in forward or reverse. Advantages include
its simple format, ease of agreeing on a format and its flexibility when
applied to irregular or fast-changing data. Its disadvantages are that
tag definitions must still be agreed upon and that there is no existing
DB system for it. Enter Lore...
Lore
-
Lore is a DBMS that allows storage, management and querying of XML. It
focuses on defining a query language for XML called Lorel and performing
interactive searches and queries. Lorel (that's an "el" not a "one" on
the end) is similar to the SQL language and performs queries based on paths
rather than tables.
-
Research Challenges: Defining the Lorel language: Lorel was presented at
the W3C conference and was very well-received over other language "hacks"
such as XSL and XQL which have been pushed by the market influence of Microsoft.
(not an editorial comment- it was in the talk :-)
-
Query processing: It's been perfected in RBDs for years, but is a difficult
challenge for XML. This area of user interaction is where Goldman is pursuing
research.
Research
-
DataGuides: small piece of XML that summarizes all the data in a
DB. It differes from an RDB in that the data comes after the schema, whereas
RDBs model the schema after the data. In DataGuides, updates to the database
are reflected instantly in the DataGuide.
-
Expensive to create? For a tree structure, the computational cost is linear,
but it can reach exponential in the worst case. Goldman notes that the
group has a relatively small 5 MB file that maxes out their systems' capabilities.
-
Example:
<BookID>
<Title>
<Author ID>
<Name/>
<Bio/>
<Hometown/>
<Picture/>
</Author>
<Price/>
</Book>
-
Advantages: exact definition: every path in the DB exists exactly once
in DataGuide; good for indexing; good for storage of stats
for query processing
-
Keyword Search:
-
On the Web, people like to do keyowrd searches. However, this is awkward
for XML for a few reasons: XML has fewer, larger files compared to HTML,
which uses smaller files in greater numbers. The main problem is
that performing a proximity search on multiple words in an XML file does
not work because the words' relative position does not always correspond
to their meaning. Therefore, the XML data is conceptualized as a weighted
graph with interconnected data objects and weighted edges. Formal defintion:
Find X near Y. In graph G, rank nodes in 1 set (X) based on the shortest
path to objects in another set (Y). Example: Find Movie near Harrison.
Future of XML and Lore
-
XML is already being used in many organizations and Lore is a robust system
of over 100,000 lines of code.
-
If people can agree on tags and structures, and data is modeled in XML
as graphs instead of merely an encoding of another model, then XML and
Lore have the potential to become very successful.
Web Demo
-
Demo on Lore website: Using a database of people profiles of the members
of the Stanford Database group, one can find the people who like to vacation
near their homes. Lore finds their home and vacation place strings and
determines if one is a substring of the other (h grep v, v grep h). The
search was successful when it found Roy Goldman, who does vacation near
his home.
-
When Picture and China are looked up, it does not look for pictures of
China specifically, but finds pages that mention China and displays the
picture on that page. Not surprisingly, it brought up several pictures
of Chinese students. The first non-Chinese person was Prof. Widom, because
she has 3 Chinese students mentioned on her page.
January 28, 1999