RESOURCE CENTER

The "X" in XPress Does Not Stand for XML

Part 1: Translating XPress documents to XML

XML and Quark XPress are integral parts of many publishers' workflows, but they just don't play nicely together yet.

Let's dream a little

In the ideal world, users could choose a "Save As XML" function from within XPress and have the desired XML file(s) show up wherever the user wanted them to. They could also do the equivalent in batch mode (working with many files at a time). If the world were really perfect, this would be possible even with old and inconsistently styled XPress documents.

Wake up!

There are methods for achieving these dreams in some contexts and with some content, but it's typically more work than you'd expect before having studied the issues. The problem can't fairly be blamed on Quark XPress, although there's certainly room for improvements in XPress itself. At its heart, the problem is that the purpose of using XPress (or any other page design tool) is different from the purpose of using XML. There are many nuances to this difference, but it boils down to the fact that XPress documents are designed to be interpreted by human beings who are capable of inferring meaning and relationships from the content, placement, and design of text and graphics on a page. Compare that to XML, which is designed to make meanings and relationships explicit so they can be processed by XML-aware software.

When translating XPress documents to XML, publishers quickly realize that XML tagging structures are tree-like and linear, while designed documents are certainly not tree-like and are only somewhat linear. It is often difficult and not necessarily meaningful to define an exact order for the elements on a page, but it's required for XML. On a formatted page, the fact that a sidebar is next to a main article and has similar content is sufficient for a person to identify their relationship to each other. This is not true for most software, and so conversion becomes painful. The problem is even more extreme when graphical content is involved.
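To make the contrast concrete, here is a minimal sketch of what "explicit" means on the XML side. The element names (page, article, sidebar) and the relates-to attribute are hypothetical and chosen only for illustration; a real project would use whatever its DTD defines.

```python
import xml.etree.ElementTree as ET

# Hypothetical element and attribute names; not taken from any real DTD.
page = ET.Element("page")

article = ET.SubElement(page, "article", id="a1")
ET.SubElement(article, "headline").text = "City Council Approves Budget"
ET.SubElement(article, "para").text = "The council voted on Tuesday."

# On the printed page the sidebar simply sits beside the article.
# In XML the relationship has to be stated (here with a 'relates-to'
# attribute), and the sidebar must take a definite place in document order.
sidebar = ET.SubElement(page, "sidebar", {"relates-to": "a1"})
ET.SubElement(sidebar, "para").text = "Budget at a glance"

xml_text = ET.tostring(page, encoding="unicode")
print(xml_text)
```

Nothing in the XPress file corresponds directly to the relates-to attribute or to the decision that the sidebar follows the article. A person or a conversion rule has to supply both.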

A second major problem, especially with older XPress files, is that XPress users have been more worried about how their pages look than about consistent usage of style sheets, which reduces a programmer's ability to write reliable conversion routines based on XPress style sheet names.

It would certainly be possible to mitigate these issues through changes to XPress itself, and the next version promises to do at least some of that. Meanwhile, here are some more details to help define and tackle this problem as it exists for you today. As you'll see, there is no one right approach; it depends on your content and your workflow. Adjusting either one can sometimes help you achieve your overall goal.

Some approaches

These are the most widely used XPress-to-XML translation methods:

  1. Products like Easypress's Atomik that enable XPress style sheets and content to be mapped to an XML DTD. The content is then exported to files. These are interactive tools that must be set up by an expert and then can be used by an XPress user following training. Some have the added benefit of enabling users to re-import the XML files into XPress documents.
  2. Programmatic conversion from an xpresstags file export (xpresstags is the text file format natively supported by XPress). Note that it is possible to create an xpresstags file for each of the XPress document's text boxes; an article, a sidebar, each image, and each table that is not in-line with the article will be created as separate documents. Also note that the graphics are not exported, and so links to them must be established through programmatic or human means.
  3. Use of other automated or semi-automated conversion methods from a PDF or print version of the XPress documents. These range from OCR-based methods to processes that depend on artificial intelligence software that mimics how human beings might interpret the content structure of a page. Regardless of the specifics, there is typically a need for a training period in which content and layout variations are encountered and then understood by the people and/or software involved.
  4. Manual re-creation of content as XML.
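The second method in the list above can be sketched roughly as follows. This is a simplified illustration, not a complete converter. It assumes an export in which a paragraph style is applied with a leading "@StyleName:", style sheet definitions start with "@StyleName=", a paragraph with no leading tag keeps the previous style, and everything in angle brackets is a formatting code that can be discarded. The style names, the element names, and the xpresstags_to_xml function are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical mapping from XPress style sheet names to XML element names.
STYLE_TO_ELEMENT = {
    "Headline": "headline",
    "Byline": "byline",
    "Body Text": "para",
}

DEFINITION = re.compile(r"^@[^:=]+=")          # style sheet definition line
PARA_STYLE = re.compile(r"^@([^:=]+):(.*)$")   # e.g. "@Headline:Some text"
INLINE_CODE = re.compile(r"<[^>]*>")           # e.g. "<B>" or "<v1.70>"


def xpresstags_to_xml(tagged_text, root_name="article"):
    """Convert simplified tagged text to an XML element.

    Returns (root, unmapped), where unmapped is the set of style names
    that have no entry in STYLE_TO_ELEMENT.
    """
    root = ET.Element(root_name)
    unmapped = set()
    current_style = None
    for line in tagged_text.splitlines():
        if DEFINITION.match(line):
            continue  # definitions carry formatting, not content
        match = PARA_STYLE.match(line)
        if match:
            current_style, line = match.group(1), match.group(2)
        # Assumption: all inline formatting codes can simply be dropped.
        text = INLINE_CODE.sub("", line).strip()
        if not text:
            continue
        element_name = STYLE_TO_ELEMENT.get(current_style)
        if element_name is None:
            # Keep the text, but mark it for a person to review.
            unmapped.add(current_style or "(no style)")
            child = ET.SubElement(root, "unknown", style=current_style or "")
        else:
            child = ET.SubElement(root, element_name)
        child.text = text
    return root, unmapped


sample = """<v1.70><e0>
@Headline:City Council Approves Budget
@Byline:By <I>A. Reporter<I>
@Body Text:The council voted on Tuesday.
A second paragraph in the same style.
@Pull Quote:An unmapped style."""

root, unmapped = xpresstags_to_xml(sample)
print(ET.tostring(root, encoding="unicode"))
print(sorted(unmapped))
```

The sketch keeps unmapped paragraphs rather than dropping them, because inconsistently styled documents will almost always contain style names the mapping did not anticipate. It does nothing about graphics or about the relationship between separately exported text boxes; those links still have to be established by other means.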

The first two options can be augmented by the use of a separate product or custom script that enables files to be manipulated in batch (so entire directories of XPress documents can be converted, for example).
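A batch wrapper of this kind can be quite small. The sketch below assumes the tagged text files have already been exported from XPress into one directory; the ".xtg" extension, the UTF-8 default encoding, and the convert_directory function are assumptions made for illustration, and the convert argument stands for whatever conversion routine you use.

```python
from pathlib import Path


def convert_directory(source_dir, target_dir, convert,
                      pattern="*.xtg", encoding="utf-8"):
    """Run convert(text) -> str on every matching file in source_dir.

    Writes one .xml file per input file into target_dir and returns a
    list of (filename, error message) pairs for the files that failed,
    so that one bad file does not stop the whole batch.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    failures = []
    for path in sorted(source_dir.glob(pattern)):
        try:
            xml_text = convert(path.read_text(encoding=encoding))
            out_path = target_dir / (path.stem + ".xml")
            out_path.write_text(xml_text, encoding="utf-8")
        except Exception as error:
            failures.append((path.name, str(error)))
    return failures
```

Collecting failures instead of stopping at the first one suits this kind of work, where a few files in any large directory are likely to need human attention.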

And, of course, some publishers realize that they can avoid the problem altogether by using a different desktop publishing tool, like FrameMaker, that is more XML-aware and also acceptable as a layout tool for their products.

Content factors

How much content: If you have a small amount of content, consider manual re-creation. If you have very large volumes of content, consider talking to vendors with proven processes for conversion via OCR or similar methods that don't depend on the XPress markup. Even the methods using relatively unsophisticated software can be affordable if performed by an experienced vendor.

How much variability in the content and its layout: High-variability content is a problem for all the approaches.

Layout, especially the use of sidebars, figures, tables, and other content that is not inline with the primary text flow: The scripted approach cannot typically account for layouts that include unpredictable non-linear elements, and so requires manual post-processing. (If, for example, you always have your sidebars in the same location, this might not matter.) The other approaches typically require people to identify the relationships among non-linear objects, either before the content is converted to XML (as with products like Atomik) or afterward.

Page jumps: If content flows from page to page through linked text boxes, then most approaches will be able to maintain the continuity of those text flows without human involvement. This may not be the case for approaches that involve scanning.

Use of complex tables: XPress (today) has weak table handling, and there is little that can be done programmatically to address this issue for complex tables (tables with spanned rows and columns, for example).

Use of style sheets: Inconsistent use of style sheets makes programmatic conversion based on xpresstags, or the use of a product, much more difficult than it would otherwise be.

Complexity and specificity of target DTD: A complex target DTD (e.g., one where different sections require different tag sets even though the content formatting is identical) is difficult for any automated approach. Human intervention might be needed.

Presence in the XPress files of all the needed content: Sometimes XML content (e.g., attribute values or an image's dimensions) needs to be derived from the XPress content. This typically requires programmatic or human effort.
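One way to gauge the style sheet factor before committing to an approach is to count the style names that actually occur in a sample of exported files. The sketch below assumes the same simplified tagged-text convention, in which a paragraph style is applied with a leading "@StyleName:". The audit_styles function and the sample style names are hypothetical.

```python
import re
from collections import Counter

PARA_STYLE = re.compile(r"^@([^:=]+):", re.MULTILINE)


def audit_styles(tagged_text, expected_styles):
    """Count paragraph style names and report those outside the expected set.

    For each unexpected name, 'probably_means' holds the expected style that
    differs only in capitalization or spaces, or None if there is no such match.
    """
    counts = Counter(PARA_STYLE.findall(tagged_text))
    expected_by_key = {
        name.casefold().replace(" ", ""): name for name in expected_styles
    }
    report = {}
    for style, count in counts.items():
        if style in expected_styles:
            continue
        key = style.casefold().replace(" ", "")
        report[style] = {
            "count": count,
            "probably_means": expected_by_key.get(key),
        }
    return counts, report


sample = """@Headline:Budget approved
@Body Text:First paragraph.
@body text:Second paragraph, styled with a near-duplicate sheet.
@BodyText:Third paragraph.
@Normal:Hand-formatted subhead"""

counts, report = audit_styles(sample, {"Headline", "Body Text"})
for style, details in sorted(report.items()):
    print(style, details)
```

A long report with many one-off style names suggests that a conversion based on style sheet names will need substantial cleanup or manual review, which may tip the decision toward one of the other approaches.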

Workflow factors

The effect of workflow factors tends to be more complicated to evaluate. Here are some questions to ask yourself:

  1. Who is responsible for the XML markup? Is it acceptable or appropriate to have the same people who create the XPress documents be responsible for XML tagging? If they are different people, do the XML experts want to work with XPress?
  2. When in your process do you need XML to be available? Can it wait until an entire publication is done, or do you need the content for other purposes as soon as it's available? If it can wait, how long can it wait? Long enough to use an external vendor?
  3. What level of perfection do you expect from your automated processes? How will translation errors be resolved and by whom?

Choosing

When weighing the options, be sure to get help from people who have experience with your preferred approach, or who can help you choose an approach in the first place. XPress-to-XML translation is full of tiny pitfalls that are easy to avoid once you've been through the process.

Stay tuned for upcoming issues, when we will look at importing XML documents into XPress.
