RESOURCE CENTER

Really [Solid] Architecture

by Joseph Barnes

Often, publishers can't find a content management product that meets enough of their requirements to bet the ranch on. This leaves little choice but to design and build a custom solution. And once the commitment is made to build rather than buy, the fun begins. Now the talk shifts from "will they?" and "could they?" to "now we have to" and "how do we?" Following a detailed requirements analysis, your next step involves the architecture and design of the system.

Your organization probably already has some architectural and platform preferences, such as whether to use J2EE or .NET, and which relational database to use. Regardless of those choices, you need to decide exactly how to store both data and documents along with their relationships to one another. Also, you need to determine how to reflect those relationships in user interfaces and editing tools.

Repository Needs

The core of any content management system is the storage architecture. What differentiates content management repositories from other storage systems is that stored documents are treated as at least as important as data. In this instance, "data" refers to information that is relational in nature and further describes various aspects of the documents in your system; in the context of content management systems, this data is often called metadata.

"Documents" can be hard to define, but most people instinctively know that they are different from data. One differentiator is that it typically makes sense to mark up documents in some fashion, e.g., with XML tags or Word styles. Also, people typically want to see (read, edit, write, etc.) an entire document, not just its parts. Data is typically much more granular than this: a single value in a field. In reality, the distinction between data and documents can often be ambiguous (and even arbitrary, based on organizational history: "we've always done it that way"). For example, is an article's title data or part of the article document? Or is it both? The need to easily support the "both" option is one of many good reasons for managing documents with XML, where the distinction between data (highly fielded information) and documents (less fielded information) can be blurred. From this point on, this article focuses primarily on XML documents as opposed to other types; the architecture choices for non-XML documents are limited due to their unpredictability.

Finding a way to implement the relationships between data and documents is what makes content management system architecture an interesting challenge. As part of requirements analysis, before choosing a storage architecture, it's a good idea to start a conversation about which content fits best under the document label and which fits best under the data label. Our approach is to list all of your content types under these two headings and then explore the relationships between the two groups. For example, do you need to capture metadata (e.g., assign index terms) at locations within a document? This is also a good time to explore versioning requirements, which you can think of as assigning metadata (dates and so on) to documents or data.
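
To make the data-versus-document question concrete, here is a minimal sketch in Python. The element and attribute names (article, title, para, indexterm) are hypothetical and chosen only for illustration. It shows one XML document serving both views: the title read as a single fielded value, the full text read as a document, and index terms attached at locations within the document.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the vocabulary is an assumption, not a standard.
ARTICLE_XML = """<article>
  <title>Choosing a Repository</title>
  <body>
    <para id="p1">Documents are read and edited as a whole.</para>
    <para id="p2" indexterm="versioning">Each revision carries a date.</para>
  </body>
</article>"""


def title_as_data(xml_text):
    """The "data" view: the title as one fielded value."""
    return ET.fromstring(xml_text).findtext("title", default="").strip()


def document_text(xml_text):
    """The "document" view: all text in reading order, title included."""
    root = ET.fromstring(xml_text)
    return " ".join("".join(root.itertext()).split())


def indexed_locations(xml_text):
    """Metadata captured at locations within the document."""
    root = ET.fromstring(xml_text)
    return [
        (para.get("id"), para.get("indexterm"))
        for para in root.iter("para")
        if para.get("indexterm")
    ]


print(title_as_data(ARTICLE_XML))
print(indexed_locations(ARTICLE_XML))
```

The same title element answers both questions, which is the "both" case: nothing has to be copied into a separate field for the title to be usable as data.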

Repository Options

There are many architecture choices for capturing data and documents. The most commonly implemented can be grouped into these three primary categories:

  1. Separate storage: Data in a relational database and documents as files in the database, in a file system, or in an XML repository
  2. Combined storage in a relational database: Both data and documents in a relational database (i.e., the XML documents are broken down into their constituent elements for storage in database fields; if you also have non-XML documents, they would be stored according to the first option)
  3. Combined storage in an XML repository: All data is embedded in XML documents, which are stored in a file system or an XML repository.

Each of these is viable, but none is appropriate for every set of requirements. The types of requirements that might lead you to Options 2 or 3 are:

  • A need for enhanced (XML-aware) searching
  • A need to re-use parts of documents (not just entire documents)
  • A need to edit parts of documents outside the context of the larger document
  • A need for version control of document parts
  • A need to assign metadata to document parts and track it
  • A need for intensive reporting against document parts
  • A need to support complex rearrangement of document parts or use of only some document parts on output (without a huge performance hit)

Option 1 is the architecture assumed by most software developers. It is a natural fit for many organizations for two reasons. First, most companies' internal IT staff are very familiar with at least one relational database product, and the organization has the staff and infrastructure to support that product. Second, the XML support now offered in all the major relational database platforms makes managing the content itself a much easier task than it was even a couple of years ago. These are mature products, and there is a wealth of development tools for working with them, so it is no surprise that they would be the default choice. In addition, most developers don't have any reason to question their assumption that Option 1 is the way to go because they aren't familiar with the nature of documents and the process of creating them. They don't understand that users might want to work with documents in a way that isn't easily supported by systems that consider them to be "blobs." Nor does it help that users, and those responsible for creating new products from documents, have a hard time articulating their needs in a way that means something to the developers.

But, of course, sometimes there is no real need for document handling beyond what is easily supported through storage in a database field. In this case, Option 1 is appropriate. If you are teetering on the edge of the choice for or against Option 1 and your organization prefers not to be on the bleeding edge of technology, then stick with Option 1 and find a way to scale your requirements appropriately. The other options will feel much less risky in just a few years.

Note that when a system is implemented with Option 1 architecture, some content elements (e.g., document title or author) usually end up getting duplicated in both the data and the documents.
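
A minimal sketch of Option 1, assuming SQLite as a stand-in for whatever relational database you use; the table, column, and element names are hypothetical. The whole document goes into one field, and the title and author are copied into their own columns so they can be queried, which is exactly the duplication described above.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical document; element names are illustrative only.
DOC = (
    "<article><title>Storage Options</title><author>J. Doe</author>"
    "<body><para>Text.</para></body></article>"
)


def store_option1(conn, doc_id, xml_text):
    """Store the document whole and copy selected elements into columns."""
    root = ET.fromstring(xml_text)
    conn.execute(
        "INSERT INTO documents (id, title, author, content) VALUES (?, ?, ?, ?)",
        (doc_id, root.findtext("title"), root.findtext("author"), xml_text),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY, title TEXT, author TEXT, content TEXT)"
)
store_option1(conn, 1, DOC)

# The title now lives in two places: the column and the stored XML.
row = conn.execute(
    "SELECT title, author, content FROM documents WHERE id = 1"
).fetchone()
print(row[0], "|", row[1])
```

Any edit to the title inside the document has to be written back to the column as well (or the reverse), so a real system needs a rule for which copy wins.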

Also, adding an XML repository to this system architecture specifically for document storage still results in an Option 1 scenario, but one with the potential for major improvements in the areas of search/reporting and content (document) delivery. Most vendors producing XML repositories offer a number of ways to connect their repositories to other disparate data sources including relational databases.

Option 2, breaking XML documents down into their parts for storage, is a solution sometimes proposed by developers who see that there is a real need to access document parts as well as entire documents. In our experience, this approach should be taken only with careful assessment of the risks and long-term costs. It is simply harder than it would seem on the surface to decompose XML documents for storage, and system performance often suffers. (Check out the archives of some of the XML discussion lists for more information on this topic.)
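
To give a feel for what decomposition involves, here is a deliberately simplified sketch that stores one row per element in SQLite and rebuilds the document from those rows. The schema is hypothetical, and the sketch ignores attributes, namespaces, comments, and processing instructions, all of which a real system would have to handle.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical schema: one row per element, with order and mixed content kept.
SCHEMA = """
CREATE TABLE nodes (
    id        INTEGER PRIMARY KEY,
    doc_id    INTEGER NOT NULL,
    parent_id INTEGER,
    position  INTEGER NOT NULL,
    tag       TEXT NOT NULL,
    text      TEXT,
    tail      TEXT
)
"""


def shred(conn, doc_id, elem, parent_id=None, position=0):
    """Recursively store an element and its children as rows."""
    cur = conn.execute(
        "INSERT INTO nodes (doc_id, parent_id, position, tag, text, tail) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (doc_id, parent_id, position, elem.tag, elem.text, elem.tail),
    )
    node_id = cur.lastrowid
    for i, child in enumerate(elem):
        shred(conn, doc_id, child, node_id, i)
    return node_id


def rebuild(conn, node_id):
    """Reassemble an element tree from its rows, one query per node."""
    tag, text, tail = conn.execute(
        "SELECT tag, text, tail FROM nodes WHERE id = ?", (node_id,)
    ).fetchone()
    elem = ET.Element(tag)
    elem.text, elem.tail = text, tail
    children = conn.execute(
        "SELECT id FROM nodes WHERE parent_id = ? ORDER BY position", (node_id,)
    ).fetchall()
    for (child_id,) in children:
        elem.append(rebuild(conn, child_id))
    return elem


DOC = (
    "<article><title>Shredding</title><body>"
    "<para>Some <emph>mixed</emph> content.</para></body></article>"
)
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
root_id = shred(conn, 1, ET.fromstring(DOC))
node_count = conn.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
round_trip = ET.tostring(rebuild(conn, root_id), encoding="unicode")
print(node_count, round_trip == DOC)
```

Even in this toy version, rebuilding issues two queries per element, which hints at why performance often suffers once documents grow to thousands of elements.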

There are, however, some situations where this approach or a close alternative makes sense. For example, some publishers only need access to a predetermined list of document parts (e.g., tables and figures) and are willing to live with a system in which it's difficult to add new parts to the list of those that can be managed on their own. Such a system is easier to build than one in which the document needs to be completely broken down to the smallest element. An alternative architecture is to store some metadata about document components (e.g., categorical index terms, titles, etc.) in the database, but to actually store the document as a whole entity. In this case, the system must read through the document and create/read the appropriate database metadata, but the toughest complexities of document decomposition are avoided. Finally, it should also be noted that there are content management products and custom systems that use this approach and make it work. It just takes a commitment of both dollars and time to solve the nuttier issues.
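
The alternative described above, storing the document whole while recording metadata about a predetermined list of parts, might look like the following sketch. SQLite, the table layout, and the element names (figure, table, title, and the id attribute) are all assumptions made for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# The predetermined list of parts the system manages on their own (assumed).
MANAGED_PARTS = ("table", "figure")


def index_components(conn, doc_id, xml_text):
    """Store the document intact, then record metadata for managed parts."""
    conn.execute(
        "INSERT INTO documents (id, content) VALUES (?, ?)", (doc_id, xml_text)
    )
    root = ET.fromstring(xml_text)
    for elem in root.iter():  # document order
        if elem.tag in MANAGED_PARTS:
            conn.execute(
                "INSERT INTO components (doc_id, kind, part_id, title) "
                "VALUES (?, ?, ?, ?)",
                (doc_id, elem.tag, elem.get("id"), elem.findtext("title")),
            )


DOC = (
    '<article><title>Q3 Report</title><body>'
    '<figure id="f1"><title>Revenue</title></figure>'
    '<para>Text</para>'
    '<table id="t1"><title>Costs</title></table>'
    '</body></article>'
)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, content TEXT)")
conn.execute(
    "CREATE TABLE components (doc_id INTEGER, kind TEXT, part_id TEXT, title TEXT)"
)
index_components(conn, 1, DOC)
components = conn.execute(
    "SELECT kind, part_id, title FROM components WHERE doc_id = 1 ORDER BY rowid"
).fetchall()
print(components)
```

Adding a new kind of managed part means changing MANAGED_PARTS and re-reading every stored document, which is the trade-off the text mentions.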

Option 3, storing all content in an XML repository, can be very appropriate for publishers with the kinds of requirements listed earlier as pointing toward Options 2 or 3. With Option 3, you are in effect doing something very similar to what's described for Option 2, but you are using a product that is specifically designed to support the storage of XML and its hierarchical relationships rather than the relational data relationships supported by relational databases. The built-in XML awareness of these products can make development for XML document manipulation significantly easier. This choice tends to feel very familiar to long-time users of XML and SGML on file systems, because that group is used to storing metadata in their documents (and without the benefits of a database). But it is also the least familiar choice to most software developers, so it often feels riskier to them. This concern is exacerbated by the considerable inconsistency in the terminology and capabilities of the XML repositories available today (some call themselves XML databases, some XML servers, and so on), which makes it difficult to get your arms around the nature of this kind of product.
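
As a rough illustration of the Option 3 idea, the sketch below keeps metadata inside the documents themselves and finds document parts with a path expression. It uses Python's standard ElementTree, which supports only a small subset of XPath, as a stand-in for a real XML repository; the in-memory collection, file names, and element names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Stand-in for an XML repository: file name -> document text (assumed data).
COLLECTION = {
    "a.xml": (
        '<article status="approved"><title>Pricing Guide</title>'
        '<section id="s1"><indexterm>pricing</indexterm><para>Rates.</para></section>'
        '<section id="s2"><indexterm>support</indexterm><para>Help.</para></section>'
        '</article>'
    ),
    "b.xml": (
        '<article status="draft"><title>Renewals</title>'
        '<section id="s1"><indexterm>pricing</indexterm><para>Terms.</para></section>'
        '</article>'
    ),
}


def sections_for_term(collection, term):
    """Find sections whose embedded index term matches.

    The term is interpolated directly into the path expression, so this
    sketch assumes it contains no quote characters.
    """
    hits = []
    for name, xml_text in sorted(collection.items()):
        root = ET.fromstring(xml_text)
        for section in root.findall(f".//section[indexterm='{term}']"):
            hits.append((name, section.get("id")))
    return hits


def documents_with_status(collection, status):
    """Document-level metadata is just an attribute on the root element."""
    return [
        name
        for name, xml_text in sorted(collection.items())
        if ET.fromstring(xml_text).get("status") == status
    ]


print(sections_for_term(COLLECTION, "pricing"))
print(documents_with_status(COLLECTION, "approved"))
```

A dedicated XML repository would typically index such queries rather than parse every document on each request, but the query style is similar.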

In particular, developers struggle to understand how data can be properly managed in an XML repository. For example, how would one implement the usage of a multi-level controlled list of index terms in the assignment of document metadata? Using a relational database, most developers would end up with similar approaches. But, with XML repositories, it turns out there are almost as many different answers as there are products. Whether you can find a satisfactory answer to this question will be a determining factor in whether you choose this approach.
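
For comparison, here is one common relational way to model a multi-level controlled list of index terms: a self-referencing table plus a join table linking documents to terms. It is a sketch under assumed names and sample data, using SQLite, and is not the only relational design. The XML-repository equivalent is left out on purpose, since it varies from product to product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE index_terms (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES index_terms(id),
    label     TEXT NOT NULL
);
CREATE TABLE document_terms (
    doc_id  INTEGER NOT NULL,
    term_id INTEGER NOT NULL REFERENCES index_terms(id)
);
-- Hypothetical three-level list: Sports > Baseball > Playoffs, plus Business.
INSERT INTO index_terms VALUES
    (1, NULL, 'Sports'), (2, 1, 'Baseball'), (3, 2, 'Playoffs'),
    (4, NULL, 'Business');
INSERT INTO document_terms VALUES (100, 3), (101, 2), (102, 4);
""")


def docs_under(conn, term_id):
    """Return documents tagged with a term or any of its descendants."""
    rows = conn.execute(
        """
        WITH RECURSIVE subtree(id) AS (
            SELECT id FROM index_terms WHERE id = ?
            UNION ALL
            SELECT t.id FROM index_terms t JOIN subtree s ON t.parent_id = s.id
        )
        SELECT DISTINCT doc_id FROM document_terms
        WHERE term_id IN (SELECT id FROM subtree)
        ORDER BY doc_id
        """,
        (term_id,),
    ).fetchall()
    return [row[0] for row in rows]


print(docs_under(conn, 1))
```

When evaluating an XML repository, a useful test is to ask how the same query (all documents under a broad term, including its narrower terms) would be expressed in that product.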

Finally, it should be noted that while there is currently a fairly clear distinction between relational databases and native XML databases, this distinction is likely to become less and less clear over time as each type of product adds more of the features typically associated with the other.

User Interfaces and Editorial Tools

When you're doing requirements analysis and as you head into system design, avoid assuming that how you store your content will limit your choices for how users interact with it. It doesn't have to. For example, it is possible to store your data and documents together, either as XML or in a relational database, but to present them separately in multiple forms and editing tools; alternatively, you can store data and documents separately but present them together in either a form or a desktop editing tool. The challenge at this point is pretty clear: how to manage two distinct content types, data and documents, in a way that facilitates both the editorial and output processes your content eventually feeds.
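
As a small illustration of "store separately, present together," the sketch below takes metadata as it might come back from a database row and a document stored elsewhere, and merges them into a single XML view that an editing tool could open. The wrapper element names (editable, metadata, field) are hypothetical.

```python
import xml.etree.ElementTree as ET


def combined_view(metadata, body_xml):
    """Wrap separately stored metadata and a document in one editable view."""
    wrapper = ET.Element("editable")
    meta = ET.SubElement(wrapper, "metadata")
    for name, value in metadata.items():
        field = ET.SubElement(meta, "field", {"name": name})
        field.text = str(value)
    wrapper.append(ET.fromstring(body_xml))
    return ET.tostring(wrapper, encoding="unicode")


# Assumed inputs: a metadata row from the database and a stored document.
view = combined_view(
    {"status": "draft", "version": 3},
    "<article><title>Storage Options</title><para>Text.</para></article>",
)
print(view)
```

On save, the tool would split the view again and write each half back to its own store, so the storage decision stays invisible to the editor.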

Bottom Line

As you can see, there are a number of options available for building a custom content management repository. The key (as always) is to clearly define your requirements up front and then cross-reference those requirements against possible architectures. Take into consideration cost, internal staff, and future extensibility (among other things) to help you end up with a choice that's right for you.
