A good source for rich data content is statistical and tabular data. This type of data can generally be manipulated in interesting ways and combined with text documents to enrich content offerings for customers. Think of batting stats in a sidebar for a baseball article or a filmography listing for an article about a well-known director.
Some publishers have untapped sources of this type of content; however, it may be "locked" in print or archive formats, not readily available to deliver in interesting online and rich data formats. It needs to be brought into some other neutral format, such as a relational database or XML, to make it available for repurposing.
We've worked with publishers on several projects taking many years' worth of printed statistical data, converting it into a ready-to-use format, and presenting it in interesting ways in online products.
Descriptions of those products follow, along with some key areas to consider when taking on this type of activity.
CQ Press, a D.C.-based political science reference publisher, took more than 200 years' worth of election data found in its America Votes book series and other sources for its online Voting and Elections Collection. Besides offering some interesting query capabilities, the product displays color-coded maps based on election outcomes in the country, states, counties, and congressional districts.
In addition to the very familiar blue and red U.S. maps we've all seen from the nightly news, the Voting and Elections Collection offers a further visual breakdown of data, such as congressional district results in each state. Hovering a mouse over a specific district will present the voting results for each candidate.
To build the maps, CQ Press loaded all of its election data, previously found in book content, into a relational database. The site queries the database and uses Adobe's SVG technology to build the maps on the fly. The map data itself is "content." Sure, the U.S. map would be relatively easy to create, but the CQ Press product also displays each state with congressional and county lines. There are 435 districts in the 50 states, and more than 3,000 counties. CQ Press licensed the map boundary data and combined it with its own history of election data to present valuable voting information to the elections scholar and student.
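The query-then-render approach described above can be sketched roughly as follows. This is only an illustration, not the CQ Press implementation: the table names, column names, district IDs, colors, and boundary paths are all hypothetical, and the vote counts are made up. It joins boundary shapes to results, colors each district by its top vote-getter, and puts the per-candidate results in an SVG `<title>` element, which browsers typically show on hover.

```python
import sqlite3
from xml.sax.saxutils import escape, quoteattr

# Hypothetical party-to-color mapping; not taken from any real product.
PARTY_COLORS = {"D": "#2166ac", "R": "#b2182b"}
DEFAULT_COLOR = "#cccccc"


def build_demo_db():
    """Create an in-memory database with made-up results and toy boundaries."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE district_shape (district_id TEXT PRIMARY KEY, svg_path TEXT);
        CREATE TABLE result (
            district_id TEXT, year INTEGER, candidate TEXT, party TEXT, votes INTEGER
        );
        """
    )
    conn.executemany(
        "INSERT INTO district_shape VALUES (?, ?)",
        [("XX-01", "M0 0 H50 V50 H0 Z"), ("XX-02", "M50 0 H100 V50 H50 Z")],
    )
    conn.executemany(
        "INSERT INTO result VALUES (?, ?, ?, ?, ?)",
        [
            ("XX-01", 2004, "Candidate A", "D", 120),
            ("XX-01", 2004, "Candidate B", "R", 80),
            ("XX-02", 2004, "Candidate C", "D", 60),
            ("XX-02", 2004, "Candidate D", "R", 90),
        ],
    )
    return conn


def render_map(conn, year):
    """Return an SVG document with one colored path per district."""
    rows = conn.execute(
        """
        SELECT s.district_id, s.svg_path, r.candidate, r.party, r.votes
        FROM district_shape AS s
        JOIN result AS r ON r.district_id = s.district_id
        WHERE r.year = ?
        ORDER BY s.district_id, r.votes DESC
        """,
        (year,),
    ).fetchall()

    # Group rows by district; the first result per district has the most votes.
    districts = {}
    for district_id, path, candidate, party, votes in rows:
        entry = districts.setdefault(district_id, {"path": path, "results": []})
        entry["results"].append((candidate, party, votes))

    parts = ['<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 50">']
    for district_id, entry in districts.items():
        winner_party = entry["results"][0][1]  # ties are not handled in this sketch
        fill = PARTY_COLORS.get(winner_party, DEFAULT_COLOR)
        tooltip = "; ".join(f"{c} ({p}): {v}" for c, p, v in entry["results"])
        parts.append(
            f'<path id={quoteattr(district_id)} d={quoteattr(entry["path"])} '
            f'fill={quoteattr(fill)}><title>{escape(tooltip)}</title></path>'
        )
    parts.append("</svg>")
    return "\n".join(parts)
```

Calling `render_map(build_demo_db(), 2004)` returns the SVG text for the two toy districts. In a real product the boundary paths would come from licensed map data rather than hand-written rectangles.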
In its Supreme Court Collection product, CQ Press licensed the use of a Supreme Court data set previously used primarily in statistical software. The format of this data is great for that type of use, but is not easily transportable into web products. CQ Press went through a data mapping and conversion process to load the statistical data into a relational database, which sits behind the product.
Where does the data become rich? CQ Press includes written summaries of cases from several of its book series within the product. But for web presentation of the summary, the site queries the case database for specific voting information for the case, enhancing the summary text (see below). Internally, CQ Press refers to this query result as the case's "box score." This is transparent to the end user, who sees one cohesive web document that is pulled from multiple sources.
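A minimal sketch of the "box score" idea, assuming a hypothetical `case_vote` table and made-up justice names and case IDs (none of this reflects the actual CQ Press schema): the summary text comes from one source, the votes come from a database query, and the two are merged into a single HTML fragment.

```python
import sqlite3
from html import escape


def build_case_db():
    """Create an in-memory database with made-up votes for one case."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE case_vote (case_id TEXT, justice TEXT, vote TEXT)")
    conn.executemany(
        "INSERT INTO case_vote VALUES (?, ?, ?)",
        [
            ("case-001", "Justice A", "majority"),
            ("case-001", "Justice B", "majority"),
            ("case-001", "Justice C", "dissent"),
        ],
    )
    return conn


def box_score(conn, case_id):
    """Return {vote: [justice, ...]} for one case."""
    rows = conn.execute(
        "SELECT vote, justice FROM case_vote WHERE case_id = ? ORDER BY vote, justice",
        (case_id,),
    ).fetchall()
    score = {}
    for vote, justice in rows:
        score.setdefault(vote, []).append(justice)
    return score


def render_case_page(summary_text, score):
    """Combine the written summary and the queried box score into one fragment."""
    parts = [f"<p>{escape(summary_text)}</p>", '<table class="box-score">']
    for vote, justices in score.items():
        names = escape(", ".join(justices))
        parts.append(f"<tr><th>{escape(vote)} ({len(justices)})</th><td>{names}</td></tr>")
    parts.append("</table>")
    return "\n".join(parts)
```

The reader sees one document. The summary and the vote data stay in separate stores and are only joined at display time.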

As with the Elections product, CQ Press also uses the database to offer sophisticated queries into the data set that allow analysis of justice voting histories, alignments on the court, and other information beneficial to the Supreme Court student and scholar.
Editorial Projects in Education, the publisher of Education Week and Teacher Magazine, created an online tool, Education Counts, to run reports on several years of educational statistical data previously only found in its annual state-policy reports, Quality Counts and Technology Counts.
For example, the Technology Counts report includes many tables of indicators (or data points) such as the amount of money spent per pupil in each state or the number of students who have computer access. That data is loaded into the Education Counts system, where it is available for analysis and for download as HTML tables, Excel files, or graphical maps (see below).

The Education Counts product allows users to analyze key data points and then automatically create a color-coded map of the U.S. or a bar chart to graphically represent that data, formerly only found in snapshots that matched printed tables.
These projects shared lessons learned and obstacles that are typical of a large undertaking like this. This type of work presents some interesting challenges beyond the typical processes needed when producing online documents from print sources.
You'll need to create a data model for the content, whether it is a relational database model or an XML schema. The most important part of this step, as in any data modeling exercise, is to define terms and relationships and make sure the content experts understand, agree with, and approve them. Before any technical modeling begins, it is useful to list terms and definitions for all of the data points and relationships in a non-technical format to make sure everyone is on the same page about the meaning behind the data. How you store and relate your data in the system will be critical to what you can do with it in your rich data products. Mary Grace Palumbo, Senior Business Analyst at CQ Press, says, "Know your subject. This is important not only for editorial, but also for the technical staff that needs to come up with a database schema to support the changes in the system. Part of the challenge of the Elections project was being able to understand in detail how the different states run their elections, how that has changed over the years, and how to present it all together on a website."
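One way to carry an agreed list of terms into the technical model is to store the definitions next to the data, so that no data point can be loaded for a term nobody has defined. The sketch below assumes a hypothetical two-table layout with illustrative indicator names; it is not the schema of any product described here.

```python
import sqlite3

# Hypothetical glossary agreed with content experts: (term, definition, unit).
GLOSSARY = [
    ("per_pupil_spending", "Money spent per pupil in a state", "US dollars"),
    ("computer_access", "Number of students who have computer access", "students"),
]

SCHEMA = """
CREATE TABLE indicator (
    indicator_id TEXT PRIMARY KEY,
    definition   TEXT NOT NULL,
    unit         TEXT NOT NULL
);
CREATE TABLE observation (
    indicator_id TEXT NOT NULL REFERENCES indicator(indicator_id),
    state        TEXT NOT NULL,
    year         INTEGER NOT NULL,
    value        REAL,
    PRIMARY KEY (indicator_id, state, year)
);
"""


def create_model():
    """Create the tables and load the agreed glossary."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO indicator VALUES (?, ?, ?)", GLOSSARY)
    conn.commit()
    return conn


def add_observation(conn, indicator_id, state, year, value):
    """Insert one data point; fails if the indicator has no agreed definition."""
    conn.execute(
        "INSERT INTO observation VALUES (?, ?, ?, ?)",
        (indicator_id, state, year, value),
    )
```

With this layout, `add_observation(conn, "undefined_term", "XX", 2005, 1.0)` raises `sqlite3.IntegrityError`, which forces the conversation about what the term means to happen before the data is loaded.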
Conversion processes can present another challenge, as source content may exist in a variety of formats. In the products mentioned above, the content was found in print copies of books with no electronic files, Quark composition files, Access databases, PDFs, statistical software formats, web sites, and other formats. Often you will find varying data sources that you need to bring together in the final "end" repository. You'll need to decide whether the process is a one-time event (which may be necessary when manual cleanup steps are involved) or something reusable (which will be important if you need to update your new repository from the source).
Sure, you have 20 years of data, but everyone thinks it is for the same "thing," so there shouldn't be surprises. But there always are, especially when incorporating data from many years. Different editors approach the data in different ways, which creates inconsistencies over time. Although these differences are perfectly acceptable within individual editions of the printed book, they cause challenges when merging data from different years into an online presentation that must be consistent regardless of the source or timeframe of the data. A printed book is a snapshot of the data; the online world makes it more dynamic. No one sits down and compares books printed 10 years apart, but users can see the data in this way online.
Additionally, the data subjects themselves change over time. In the Voting and Elections Collection, CQ Press needed to account not only for changes to congressional districts over time, but also for the addition of states (Alaska and Hawaii became states in 1959). For most presidential tables, a Democratic vs. Republican view made the most sense, but that organization isn't feasible for data prior to 1828.
Besides advising to know your subject, Palumbo adds, "Know your sources. It wasn't discovered until well into the project that one of our sources only collected results for the candidates who received 5% or more of the total votes cast. We spent a good deal of time crafting a solution for that issue."
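A simple automated check can surface this kind of gap early. The sketch below assumes, hypothetically, that each race record carries a total-votes-cast figure obtained independently of the per-candidate figures; the record layout, names, and tolerance are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class RaceResult:
    race_id: str
    total_votes_cast: int
    candidate_votes: dict[str, int]  # candidate name -> votes reported by the source


def unreported_share(race):
    """Fraction of the total vote not accounted for by the listed candidates."""
    if race.total_votes_cast <= 0:
        raise ValueError(f"race {race.race_id} has no usable total")
    reported = sum(race.candidate_votes.values())
    return (race.total_votes_cast - reported) / race.total_votes_cast


def flag_incomplete(races, tolerance=0.001):
    """Return IDs of races whose candidate votes fall short of the total."""
    return [r.race_id for r in races if unreported_share(r) > tolerance]
```

Running `flag_incomplete` over a sample of each source before conversion begins would show whether a source drops minor candidates, rather than leaving that to be discovered well into the project.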
With the Supreme Court data, there was always an exception to any rule. As one project team member put it, "They can do anything they want. They are the Supreme Court." At some point, you need to make careful decisions based on the 80/20 rule. You are trying to standardize a large set of data, and you need to concentrate on the commonalities in the data and the important areas of difference. But you may find that one small timeframe or data set has some unique inconsistency on one data point. How important is it to build out your database and processes to handle that, as opposed to having some generic "notation" capability? These are difficult decisions to make, but making them well is important to the project's success.
For example, in the CQ Press Elections product, Louisiana presented some challenges because of its unique election laws. Starting in the 1970s, Louisiana abandoned the primary election. Instead, all candidates stand in the general election. If no candidate receives a majority of the vote in the general election, a runoff is held. This process is unique among the states.
In an example not provided in the products above, a medical publisher of drug information struggled over what to do with "grapefruit." Grapefruit is not a drug but has interactions with drugs that are important. Should the data model allow for "food substances" just to support grapefruit or should "grapefruit" be handled as a drug?
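The generic "notation" option raised above might look something like the following sketch: rather than adding columns or tables for each rare exception, a free-text note is attached to the affected rows and displayed as a footnote. The table names, columns, and sample rows are hypothetical.

```python
import sqlite3


def build_db():
    """Create a results table plus a generic note table for rare exceptions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE result (
            result_id INTEGER PRIMARY KEY,
            state TEXT, year INTEGER, candidate TEXT, votes INTEGER
        );
        CREATE TABLE result_note (
            result_id INTEGER NOT NULL REFERENCES result(result_id),
            note_text TEXT NOT NULL
        );
        """
    )
    return conn


def add_result(conn, state, year, candidate, votes, note=None):
    """Insert a result row and, optionally, a free-text note attached to it."""
    cur = conn.execute(
        "INSERT INTO result (state, year, candidate, votes) VALUES (?, ?, ?, ?)",
        (state, year, candidate, votes),
    )
    if note is not None:
        conn.execute("INSERT INTO result_note VALUES (?, ?)", (cur.lastrowid, note))
    return cur.lastrowid


def results_with_notes(conn, state, year):
    """Return (candidate, votes, note_text) rows; note_text is None when absent."""
    return conn.execute(
        """
        SELECT r.candidate, r.votes, n.note_text
        FROM result AS r
        LEFT JOIN result_note AS n ON n.result_id = r.result_id
        WHERE r.state = ? AND r.year = ?
        ORDER BY r.votes DESC
        """,
        (state, year),
    ).fetchall()
```

The tradeoff is that a note can be displayed but not queried in any structured way. If users need to filter or aggregate on the exception, it probably deserves a place in the model proper.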
When you face anomalous data while laying out a print publication, you have more tools at your disposal to handle it. You are creating one version of a table for a very specific purpose. But when you dynamically create the data and allow for different views, the handling of anomalous data needs to be programmatic. There is no editorial or design review process between the user's query and the presentation of the data on the screen.
This is often difficult to understand for the nontechnical content experts in editorial departments who compiled the data. It's not that the dynamically created data looks worse or "can't do" what the print table did. In fact, the opposite is true: the dynamically created data can do much, much more than its print counterparts. But the tradeoff is that the rich data product requires standardized data, whereas over time editors and production staff tailored the look and feel of tables in each print publication.
Although "unlocking" content from printed tables can be an arduous and time-consuming process, especially on the large scale of the products mentioned here, the results are often worth the effort. By going through the process, you can present additional valuable data to the end user and enrich what would be offered if the snapshot of that tabular data were merely mirrored online.
The examples above all used relational databases to store the data in question. The data is queried from the database and rendered for online consumption. One limitation is that the database is queried separately from the XML textual files sometimes associated with it. The products mentioned here were created before the advent of the strong XML repositories now emerging in the market. The data here (granular statistical data) certainly does seem to lend itself to a relational database model, but it would be interesting to experiment with storing the statistical data and the textual data together in an XML repository to see if any efficiency could be gained.