RESOURCE CENTER

Standardizing XML Data Structures for Content Reuse

So, you've made a business case to enable content reuse by converting all your publications to XML. You realize that to achieve this you need more than just well-formed XML: you need a DTD or schema to define the content structures. Great! Those are big steps and good decisions for many publishers.

Define your data

You start writing that DTD and you begin trying to figure out how to accommodate all of the various content structures that exist in your publications.

For example:

  • In book A, the chapters begin with a title followed by the author name, a summary and then the body text of the chapter.
  • In book B, the chapters begin with a title followed by the author name, but the summary comes at the end, after the body text of the chapter.
  • In book C, the chapters begin with a title followed by the body text of the chapter and then a summary and the author name.

No problem. These variations (and more) are easily expressed in a DTD with a simple element declaration like the following:

<!ELEMENT Chapter (Title, (Author|Summary|Body)+) >

This declaration means that a Chapter must begin with a Title, followed by one or more Author, Summary, and Body elements in any order and any combination.
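To see how permissive this declaration is, here is a minimal Python sketch that checks the three chapter layouts against it. It is not a DTD validator: it assumes the content model can be approximated by a regular expression over element names, and the names LOOSE_MODEL and matches_model are made up for this illustration.

```python
import re

# The content model (Title, (Author|Summary|Body)+) written as a regular
# expression over element names, each name followed by one space.
# Assumption: this stands in for a real validating parser.
LOOSE_MODEL = re.compile(r"Title (?:(?:Author|Summary|Body) )+")


def matches_model(child_names, model=LOOSE_MODEL):
    """Return True if the sequence of child element names fits the model."""
    encoded = "".join(name + " " for name in child_names)
    return model.fullmatch(encoded) is not None


# Chapter layouts from the three example books.
book_a = ["Title", "Author", "Summary", "Body"]
book_b = ["Title", "Author", "Body", "Summary"]
book_c = ["Title", "Body", "Summary", "Author"]

for label, chapter in [("A", book_a), ("B", book_b), ("C", book_c)]:
    print(label, matches_model(chapter))  # prints True for all three books
```

All three layouts pass, which is exactly what the declaration was written to allow.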

All right. You finish your DTD and you have all your books converted to XML and maybe all that XML data is stored in a content management system that makes it easy to find.

Now you're ready to start reusing that content. You see an opportunity to create a new publication by pulling together a half-dozen chapters from book A, a couple from books B and C, and a few newly authored chapters. You have the new content developed and you pull it together with all the existing chapters that you'll reuse, and you flow the content into a desktop publishing tool to format it.

But wait. The placement of the author names and chapter summaries in your new publication is inconsistent. In the chapters from book A the summary comes before the body of the chapter; those from book B have the summary after the body; and those from book C have both the summary and the authors after the body. You want all the chapters in the new book to look like book B.

XSLT to the rescue?

Once again, there's a solution, this time in the form of XSLT (Extensible Stylesheet Language Transformations, a language for transforming XML documents into different XML documents or other formats). XSLT can easily manipulate these content structures to reorder things so that all of the chapters in the new book look the same. The transformations are pretty simple and can be invoked just before the content is flowed into your desktop publishing tool.
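To make the reordering concrete, here is a rough sketch of the same idea in Python. It is a stand-in, not XSLT: Python's standard library has no XSLT processor, so the sketch uses xml.etree.ElementTree instead. The function name reorder_chapter and the sample chapter are invented for illustration; the target order is the book B layout.

```python
import xml.etree.ElementTree as ET

# Target order taken from book B: Title, Author(s), Body, Summary.
BOOK_B_ORDER = ["Title", "Author", "Body", "Summary"]


def reorder_chapter(chapter, order=BOOK_B_ORDER):
    """Build a new Chapter element whose children follow the given order.

    The new element shares the original child elements. Children whose
    names are not listed in `order` are placed at the end so that no
    content is silently dropped.
    """
    rank = {name: i for i, name in enumerate(order)}
    result = ET.Element(chapter.tag, dict(chapter.attrib))
    # sorted() is stable, so repeated elements (for example several
    # Author elements) keep their original relative order.
    result.extend(sorted(chapter, key=lambda el: rank.get(el.tag, len(order))))
    return result


# Hypothetical chapter in the book C layout.
book_c_chapter = ET.fromstring(
    "<Chapter><Title>T</Title><Body>Text</Body>"
    "<Summary>S</Summary><Author>A. Writer</Author></Chapter>"
)
print([el.tag for el in reorder_chapter(book_c_chapter)])
# prints ['Title', 'Author', 'Body', 'Summary']
```

An XSLT stylesheet would do the equivalent job declaratively, with one template that copies the children of Chapter in the desired sequence.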

OK. Another problem bites the dust. Whenever you want to reuse content in a new publication you just write some XSLT scripts. Sounds pretty easy. But even in the simple example given here you'll have to write one transformation to convert the book A content to the new style and another for the book C content. Multiply that by any number of other potential transformations, consider that a new product might reuse content from 5, 10, or more different sources, and this can quickly become unmanageable.

In fact, the problems resulting from inconsistent content can be much worse than the example above suggests. Suppose you plan to publish all your book chapters to the web. On this website, you want to let users search for chapters by author name and browse chapters by scrolling through their summaries. The problem is that the model you used to tag your content not only allowed Author, Summary, and Body in any order, it also allowed you to skip any two of the three elements. So when you go to publish your chapters to the web, some of them might be missing Author or Summary elements entirely.
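A quick audit can show how exposed you are. The Python sketch below lists which of the elements the website depends on are absent from a chapter. The names REQUIRED_FOR_WEB and missing_elements and the sample chapter are hypothetical, chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

# Elements the website features (author search, summary browsing) rely on.
REQUIRED_FOR_WEB = ("Author", "Summary")


def missing_elements(chapter, required=REQUIRED_FOR_WEB):
    """List the required child element names that a Chapter lacks."""
    present = {child.tag for child in chapter}
    return [name for name in required if name not in present]


# Hypothetical chapter: valid under (Title, (Author|Summary|Body)+),
# yet it has neither an Author nor a Summary.
body_only = ET.fromstring("<Chapter><Title>T</Title><Body>Text</Body></Chapter>")
print(missing_elements(body_only))  # prints ['Author', 'Summary']
```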

Normalize your data

A better approach would be to settle on a single, consistent, more carefully controlled data structure for the XML for all your books. When a variant structure is desired in a particular print product, an XSLT script can be used to create it. This makes it much easier to know what to expect when sharing content across publications. It also places the costs of the variations with the product, making it easier to do a cost/benefit analysis of continuing to support these alternate output formats.

For example, a better model for our example chapters might be:

<!ELEMENT Chapter (Title, Author+, Summary, Body) >

In this model, the order of the elements is fixed, at least one Author must be included, and exactly one Summary and one Body are required.
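As a rough check of what this tighter model accepts, the Python sketch below approximates it with a regular expression over element names. This is an illustration under that assumption, not a DTD validator, and the names STRICT_MODEL and fits_strict_model are invented here.

```python
import re

# (Title, Author+, Summary, Body) as a regular expression over element
# names, each followed by one space. An approximation, not a validator.
STRICT_MODEL = re.compile(r"Title (?:Author )+Summary Body ")


def fits_strict_model(child_names):
    """Return True if the child element names match the strict model."""
    encoded = "".join(name + " " for name in child_names)
    return STRICT_MODEL.fullmatch(encoded) is not None


print(fits_strict_model(["Title", "Author", "Summary", "Body"]))  # True
print(fits_strict_model(["Title", "Author", "Body", "Summary"]))  # False
print(fits_strict_model(["Title", "Body"]))                       # False
```

Only the first layout passes. Chapters stored in any other arrangement, or missing an Author or Summary, would have to be cleaned up before they are stored.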

So, how do you settle on a single, consistent data structure for the XML for all your publications? In relational database design the first step is to normalize the data: to analyze it and identify and eliminate the inconsistencies. These same principles can be applied to modeling XML content structures. Analyze the patterns of data structures used in your publications. See what the most common patterns are and make decisions about what will be supported and what should be standardized.

Some of the key steps in this process are:

  1. Determine the set of elements that require markup. This is driven by the requirements of the electronic and print products that will be created from the XML, and sometimes by the technical characteristics of the systems that produce the products (your desktop publishing system, for example).
  2. Analyze each publication and record the different element patterns found for each major content type you encounter (book chapters, journal articles, and so on).
  3. For each type, determine which patterns are most common and which publications vary from those patterns. Select a standardized structure that will be used to store all of your XML content.
  4. Do some cost-benefit analysis to determine which variant structures should continue to be supported and which should be eliminated. Be sure to look at consistency within each publication as well as across the set of publications.
  5. Clean up your publications. Eliminate any unintended variations as well as those that can't be justified.
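Steps 2 and 3 lend themselves to a little automation. The Python sketch below tallies the child-element patterns found in a set of chapters so that the most common one stands out. The function names and the three sample chapters are hypothetical; a real analysis would read the chapters from your converted files.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def child_pattern(chapter):
    """Return the sequence of child element names as a tuple."""
    return tuple(child.tag for child in chapter)


def pattern_counts(chapters):
    """Count how often each child-element pattern occurs."""
    return Counter(child_pattern(ch) for ch in chapters)


# Hypothetical sample: two chapters in the book A layout, one in book B's.
sample = [
    ET.fromstring("<Chapter><Title/><Author/><Summary/><Body/></Chapter>"),
    ET.fromstring("<Chapter><Title/><Author/><Summary/><Body/></Chapter>"),
    ET.fromstring("<Chapter><Title/><Author/><Body/><Summary/></Chapter>"),
]

# Most common pattern first.
for pattern, count in pattern_counts(sample).most_common():
    print(count, " ".join(pattern))
```

The most frequent pattern is a natural candidate for the standardized structure, and the less frequent ones are the variants to weigh in step 4.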

Some of the benefits of standardized XML data structures are:

  • Makes content reuse easier.
  • Makes content development simpler—editors only need to learn the standard structure.
  • Places the costs of variant structures with product development, making it easier to do cost/benefit analysis.
  • Separates content development from product development.

Wrapping up

Modeling your XML data to accommodate all of the variations found in your published content, although relatively easy to do, can be a costly mistake, especially if one of your goals is to reuse or repurpose content. Some of these content anomalies are intended; some may not be. But if you take a good hard look, you might find that even the intended variations in content structures, designed to differentiate print products and make them more interesting, have now become obstacles to reusing and activating your content in the digital world.

The bottom line: XML as an enabling technology can transform publishing processes. But not if you use the new technology to re-create the status quo.
