OpenMechanics.org    

Architecture of OpenDBX, an Open Source Native XML DBMS

Ivan Mikushin, OpenMechanics / Ural State Technical University

OpenDBX is based on BerkeleyDB, an embedded database library written in C. BerkeleyDB is essentially an implementation of low-level data access methods such as B-tree and Hash. It also provides automatic caching and manages transactions and concurrency. It comes with a set of APIs that make it usable from a number of programming languages, including C++, Java, Tcl and Perl. Just as important, it is Open Source software distributed as source code that compiles on a very wide range of platforms, from UNIX-like systems to Windows to VxWorks. The result is a powerful toolset for writing custom database management systems, for example one for native XML storage.

OpenDBX conforms to Core Level 0 of the XML:DB API specification [3] and is written in the Java programming language. This allows deployment on any platform on which BerkeleyDB compiles and for which a Java 2 compliant JVM exists.

In XML:DB there are Collections of Resources. Resources in XML:DB Core Level 0 represent DOM Document objects. They may be retrieved from Collections by their IDs as XML text, DOM Nodes or as SAX events (reported to a client’s SAX ContentHandler). OpenDBX stores all Resources accessible from one Collection as a single BerkeleyDB database file in a directory corresponding to the Collection. Thus a Collection’s directory contains a file with accessible Resources and may contain other Collections’ directories.
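The mapping from Collections to directories can be sketched as follows. This is only an illustration of the layout described above, not OpenDBX code: the class name CollectionLayout and the file name "collection.db" are assumptions, since the article does not say what the database file is called.

```java
import java.nio.file.Path;

// Hypothetical helper: maps an XML:DB collection path to its on-disk location.
final class CollectionLayout {
    // Assumed file name; the article does not name the database file.
    static final String DB_FILE = "collection.db";

    // Directory of a collection such as "orders/2002" below the storage root.
    // Child collections become subdirectories of their parent's directory.
    static Path directoryOf(Path root, String collectionPath) {
        Path dir = root;
        for (String part : collectionPath.split("/")) {
            if (!part.isEmpty()) {
                dir = dir.resolve(part);
            }
        }
        return dir;
    }

    // The single BerkeleyDB file holding all Resources of the collection.
    static Path databaseFileOf(Path root, String collectionPath) {
        return directoryOf(root, collectionPath).resolve(DB_FILE);
    }
}
```

With a storage root of "data", the collection "orders/2002" would then live in the directory data/orders/2002, next to any child collections' directories.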

One of the design goals of OpenDBX is the ability to store large XML documents and provide access to them via DOM interfaces. This can be used, for example, for building content management systems and enterprise applications with large hierarchical structures. Another goal is to provide the capability to query a set of XML documents stored in one XML:DB Collection. This is useful for processing large quantities of uniform documents such as purchase orders.

OpenDBX aims to be fully compliant with W3C specifications such as DOM [2] and XML Information Set [4], that is, to be able to store every type of DOM object or XML Information Item. For the first release, however, this is not the primary goal, and Information Items such as Processing Instruction, Document Type Declaration, Notation and Unparsed Entity are not supported (this limitation will be removed in future releases). Since OpenDBX stores already parsed XML, there is no need for Unexpanded Entity Reference Information Items.

OpenDBX therefore stores the following Information Items and provides DOM Level 2 Core interfaces to access them: Document, Element, Attribute, Character, Comment and Namespace. This is sufficient to build a broad class of applications; most of them never need the remaining items.

Internal database structure

To store this information OpenDBX uses a set of BerkeleyDB databases. Each XML:DB Collection has the following databases (stored together in one database file):

Document root elements database

There is only one database of this kind per Collection. It is identified by the name “ d ” and stores data about Document Information Items. Since the first version leaves out most properties of the Document Information Item, the only information needed is its document element property. Thus the data structure is the following:

            document_id               |           element_id

BerkeleyDB stores data as key-value pairs. The keys are document identifiers (external, unique within the containing Collection) and the values are the identifiers of these documents’ root elements (internal to OpenDBX). Document identifiers are used to address XML documents in a given Collection.

The access method is B-tree, since the keys may be arbitrary strings: clients can specify them when storing an XML document in an XML:DB Collection.
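A minimal sketch of this lookup is given below. It uses a java.util.TreeMap as an in-memory stand-in for the B-tree database “ d ”, not the BerkeleyDB API itself. The class name DocumentRoots and the numeric element identifiers are assumptions made for illustration.

```java
import java.util.TreeMap;

// In-memory stand-in for the "d" database: document_id -> root element_id.
final class DocumentRoots {
    // TreeMap keeps its keys sorted, loosely mimicking a B-tree over string keys.
    private final TreeMap<String, String> roots = new TreeMap<>();

    void put(String documentId, String rootElementId) {
        roots.put(documentId, rootElementId);
    }

    // Returns the internal id of the root element, or null if the document is unknown.
    String rootOf(String documentId) {
        return roots.get(documentId);
    }

    // Name of the element database ("e" + id) that holds the document's root element.
    String rootElementDatabase(String documentId) {
        String id = roots.get(documentId);
        return id == null ? null : "e" + id;
    }
}
```

Resolving a document thus takes one keyed read in “ d ”, after which the root element's own database can be opened by name.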

Element nodes databases

A Collection has one database of this kind for every Element Information Item in the documents it stores. Each database stores the data of one Element Information Item and is identified by a name of the form “ eid ”, where “ id ” is the identifier (internal to OpenDBX) of that Element Information Item. The data structure is the following:

            1                         |           parent_id {nsURI}local_name
            num                       |           eid
            num                       |           ttext
            num                       |           ctext

Databases of this kind use the Recno (record number) access method with mutable record numbers. That is, the keys are positive integers (starting from ‘1’). “Mutable record numbers” means that the record numbers may change as records are added to and deleted from the database. The deletion of record number 4 causes any records numbered 5 and higher to be renumbered downward by 1; the addition of a new record after record number 4 causes any records numbered 5 and higher to be renumbered upward by 1.
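The renumbering behaviour can be illustrated with a plain list. The sketch below is an in-memory stand-in written for this article, not the BerkeleyDB Recno implementation; the class name MutableRecno and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory stand-in for a Recno database with mutable record numbers.
final class MutableRecno {
    private final List<String> records = new ArrayList<>();

    // Record numbers are 1-based; the list index is recno - 1.
    String get(int recno) {
        return records.get(recno - 1);
    }

    // Appends a record and returns its record number.
    int append(String value) {
        records.add(value);
        return records.size();
    }

    // Inserts a new record after recno; all later records move up by one.
    void insertAfter(int recno, String value) {
        records.add(recno, value);
    }

    // Deletes a record; all later records move down by one.
    void delete(int recno) {
        records.remove(recno - 1);
    }

    int size() {
        return records.size();
    }
}
```

For an element database this means that the record numbers of the child records always reflect the current document order of the children, whatever insertions and deletions came before.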

The value of the first record (with key ‘1’) represents the parent, namespace name and local name properties of the Element Information Item. As the layout above shows, it consists of parent_id followed by the namespace URI in braces and the local name.

If parent_id is in the form “ did ”, it means that this element is the root element of the document with identifier id. If parent_id is in the form “ eid ”, it means that this element’s parent element is (internally to OpenDBX) identified by id.
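Decoding the first record might look like the sketch below. The exact byte layout is not given in the article, so several details are assumptions: a single space between parent_id and the brace, empty braces for an element without a namespace, and the class name ElementHeader itself.

```java
// Decoded form of record 1 of an element database (hypothetical helper).
final class ElementHeader {
    final boolean parentIsDocument; // true for "d" + id, false for "e" + id
    final String parentId;          // document id or internal element id
    final String namespaceUri;      // empty when there is no namespace (assumed)
    final String localName;

    private ElementHeader(boolean parentIsDocument, String parentId,
                          String namespaceUri, String localName) {
        this.parentIsDocument = parentIsDocument;
        this.parentId = parentId;
        this.namespaceUri = namespaceUri;
        this.localName = localName;
    }

    // Parses a value such as "e12 {http://example.org/po}item".
    static ElementHeader parse(String value) {
        // Search from the right: document ids may be arbitrary strings,
        // while local names cannot contain braces or spaces.
        int close = value.lastIndexOf('}');
        int open = close < 0 ? -1 : value.lastIndexOf(" {", close);
        if (open < 1) {
            throw new IllegalArgumentException("malformed header: " + value);
        }
        char kind = value.charAt(0);
        if (kind != 'd' && kind != 'e') {
            throw new IllegalArgumentException("unknown parent kind: " + kind);
        }
        return new ElementHeader(kind == 'd',
                value.substring(1, open),
                value.substring(open + 2, close),
                value.substring(close + 1));
    }
}
```

Following parentId upward from any element, one header at a time, eventually reaches a header whose parent is a document, which identifies the document the element belongs to.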

Other records (with keys greater than or equal to ‘2’) represent the children property of the Element Information Item represented by the database. Their values are either pointers to other Element Information Items (“ eid ”), values representing sequences of Character Information Items (“ ttext ”, Text nodes in terms of DOM) or values representing Comment Information Items (“ ctext ”, Comment nodes in terms of DOM).
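A child record can be decoded by looking at its first character. The sketch below assumes, based on the layout shown earlier, that the value starts with a one-letter tag (‘e’, ‘t’ or ‘c’) followed directly by the element identifier or the text; the class name ChildRecord is hypothetical.

```java
// One child record (key >= 2) of an element database (hypothetical helper).
final class ChildRecord {
    enum Kind { ELEMENT, TEXT, COMMENT }

    final Kind kind;
    final String payload; // element id, text content or comment content

    private ChildRecord(Kind kind, String payload) {
        this.kind = kind;
        this.payload = payload;
    }

    // The first character selects the kind; the rest is the payload.
    static ChildRecord decode(String value) {
        if (value.isEmpty()) {
            throw new IllegalArgumentException("empty child record");
        }
        String payload = value.substring(1);
        switch (value.charAt(0)) {
            case 'e': return new ChildRecord(Kind.ELEMENT, payload);
            case 't': return new ChildRecord(Kind.TEXT, payload);
            case 'c': return new ChildRecord(Kind.COMMENT, payload);
            default:
                throw new IllegalArgumentException(
                        "unknown child kind: " + value.charAt(0));
        }
    }
}
```

A DOM implementation on top of this would only open the database “ eid ” of a child element when the client actually navigates into it, which is what makes large documents manageable.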

Element attributes databases

A Collection has one database of this kind for every Element Information Item in the documents it stores. Each database stores the value of the attributes property (Attribute Information Items) of one Element Information Item and is identified by a name of the form “ aid ”, where “ id ” is the identifier (internal to OpenDBX) of that Element Information Item. The data structure is the following:

                        {nsURI}local_name            |           text

Databases of this kind use the B-tree access method. As the layout above shows, each key consists of the attribute’s namespace URI in braces followed by its local name; the value is the attribute’s text.
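Judging by the layout above, an attribute key is the namespace URI in braces followed by the local name. A minimal sketch of such a store is given below, again with a TreeMap standing in for the B-tree database. The class name AttributeStore and the use of empty braces for attributes without a namespace are assumptions.

```java
import java.util.TreeMap;

// In-memory stand-in for one "a" + id database: {nsURI}local_name -> text.
final class AttributeStore {
    private final TreeMap<String, String> attributes = new TreeMap<>();

    // Builds the key; a missing namespace is written as "{}" (assumed).
    static String key(String namespaceUri, String localName) {
        return "{" + (namespaceUri == null ? "" : namespaceUri) + "}" + localName;
    }

    void set(String namespaceUri, String localName, String text) {
        attributes.put(key(namespaceUri, localName), text);
    }

    // Returns the attribute's text, or null if the attribute is not present.
    String get(String namespaceUri, String localName) {
        return attributes.get(key(namespaceUri, localName));
    }
}
```

Keying on the expanded name rather than on prefix:local_name means a lookup does not depend on which prefix the original document happened to use.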

Element namespaces databases

A Collection has one database of this kind for every Element Information Item in the documents it stores. Each database stores the value of the in-scope namespaces property (Namespace Information Items) of one Element Information Item and is identified by a name of the form “ nid ”, where “ id ” is the identifier (internal to OpenDBX) of that Element Information Item. The data structure is the following:

                        nsURI       |           prefix

Databases of this kind use the B-tree access method. The keys are namespace URIs and the values are the prefixes bound to them.
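One use of this database is to turn a stored {nsURI}local_name back into a qualified name when a document is serialized. The sketch below shows the idea with a TreeMap standing in for the B-tree database. The class name NamespaceTable and the convention that an empty prefix denotes the default namespace are assumptions.

```java
import java.util.TreeMap;

// In-memory stand-in for one "n" + id database: nsURI -> prefix.
final class NamespaceTable {
    private final TreeMap<String, String> prefixByUri = new TreeMap<>();

    // An empty prefix stands for the default namespace (assumed convention).
    void declare(String namespaceUri, String prefix) {
        prefixByUri.put(namespaceUri, prefix);
    }

    // Builds the name written out when an element or attribute is serialized.
    String qualifiedName(String namespaceUri, String localName) {
        String prefix = prefixByUri.get(namespaceUri);
        if (prefix == null || prefix.isEmpty()) {
            return localName;
        }
        return prefix + ":" + localName;
    }
}
```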

References

  1. D. Knuth. The Art of Computer Programming, vol.3 Sorting and Searching, 2nd edition – Addison-Wesley, 1998

  2. Document Object Model (DOM) Level 2 Core Specification. – World Wide Web Consortium, 13 November 2000.

  3. XML Database API Specification. XML:DB Working Draft. – XML:DB Organization, 2001.

  4. XML Information Set. – World Wide Web Consortium, 24 October 2001.

© 2002/2010 OpenMechanics.org