LexisNexis

The LexisNexis® Global Content Systems Group provides content to a wide array of market-facing delivery systems, including print, CD-ROM, Lexis® for Microsoft® Office, and LEXIS.COM®. These services deliver content to more than a million end users. The LexisNexis content collection managed and operated by the Global Content group consists of more than 2.3 billion documents of various sizes, totaling more than 20 terabytes, and new documents are added every day.

The raw text documents are prospectively enhanced by recognizing and resolving embedded citations, performing multiple topical classifications, recognizing entities, and creating statistical summaries, along with other data mining activities. The older documents in the collection require periodic retrospective processing to apply new or modified topical classification rules and to account for changes in the basis of the other data enhancements. Without this periodic retrospective processing, the enhancements would be applied inconsistently across the collection, and that inconsistency would seriously reduce their effectiveness.

The Challenge

The LexisNexis content management system had evolved over a 40-year period into a complex, heterogeneous, distributed environment consisting of proprietary Sun Solaris servers, commodity Linux servers, and proprietary IBM z/OS systems. The systems acting as repository nodes were separate from the systems that performed the data enhancements. As a result, copies of the documents had to be transmitted from the repository systems to the data enhancement system and then transmitted back to the repository after the enhancement process completed. These transfers added processing latency, and the elapsed time to perform a retrospective topical classification or indexing grew to several months. Because of the delay in applying a new classification retrospectively, a researcher using the topical index might not find older documents when the index topic was new or recently modified. This uncertainty about the coverage of the indexing required the researcher to conduct additional searches, especially when the classification covered a new or emerging topic.

The Solution

LexisNexis Global Content decided to consolidate the content management system and the document enhancement and mining systems onto the HPCC Systems® platform. HPCC Systems, the high performance computing cluster technology, is an open source platform that uses commodity servers running the Linux operating system and makes them behave as a single system.

The most obvious application of the HPCC Systems platform in the Content systems roadmap is in the content enrichment space. The HPCC Systems platform has a proven foundation in entity recognition/resolution, clustering, and content analytics. Enrichment must be applied across all the content simultaneously to provide a superior search result. The massively parallel nature of HPCC Systems provides both the processing and storage resources required to fulfill the dual missions of content storage and content enhancement. HPCC Systems was easily integrated with the existing Content Management workflow engine to provide document level locking and other editorial constraints. Another beneficial consideration of incorporating enrichment functionality directly into the content repository was the resource savings from minimizing input/output (I/O) cycles.

The migration of the content repository and data enhancement processing to the HPCC Systems platform involved creating several HPCC Systems “worker” clusters of varying sizes to perform data enrichments and a single HPCC Systems Data Management cluster to house the content. This configuration provides the ability to send document workloads of varying sizes to appropriately sized worker clusters while reserving a substantially sized Data Management cluster for content storage and update promotions. Interactive access is also provided to support search and browse operations.
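The idea of sending workloads to appropriately sized worker clusters can be illustrated with a minimal sketch. The cluster names, the capacities, and the `pick_cluster` function are all hypothetical and chosen only for illustration; the source does not describe how the actual routing is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerCluster:
    name: str           # hypothetical label, not an actual cluster name
    max_documents: int  # hypothetical per-job capacity


# Hypothetical worker clusters, ordered from smallest to largest.
CLUSTERS = [
    WorkerCluster("small", 1_000_000),
    WorkerCluster("medium", 100_000_000),
    WorkerCluster("large", 3_000_000_000),
]


def pick_cluster(document_count: int) -> WorkerCluster:
    """Return the smallest cluster whose capacity covers the workload."""
    if document_count < 0:
        raise ValueError("document_count must be non-negative")
    for cluster in CLUSTERS:
        if document_count <= cluster.max_documents:
            return cluster
    raise ValueError("workload exceeds the capacity of every cluster")


if __name__ == "__main__":
    for size in (5_000, 20_000_000, 2_300_000_000):
        print(f"{size:>13,} documents -> {pick_cluster(size).name}")
```

Choosing the smallest cluster that fits keeps the larger clusters free for full-collection passes, which is one plausible reading of "appropriately sized".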

Updates and re-enrichments to the content repository are managed automatically by the existing LexisNexis Global Content Systems workflow control infrastructure. Updates and subsequent exports to product will be performed multiple times per hour to ensure timely updates to our customers.

A custom search capability, tailored to the needs of the editorial users, has been developed for the content platform. This provides advantages over being tied to customer-centric search capabilities. For example, the custom search capability allows editors to search for specific punctuation patterns in order to find data issues that cannot be found with the existing customer-facing search platforms.
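As an illustration of what a punctuation-pattern search might look like, the sketch below scans a set of documents with a regular expression. It uses Python's standard `re` module and is not the HPCC Systems search implementation; the pattern, the function name `find_punctuation_issues`, and the sample documents are all assumptions made for this example.

```python
import re

# Hypothetical pattern: runs of two or more commas, semicolons, or colons,
# or whitespace directly before one of those marks.
SUSPECT_PUNCTUATION = re.compile(r"[,;:]{2,}|\s+[,;:]")


def find_punctuation_issues(documents):
    """Yield (doc_id, offset, matched_text) for each suspect pattern found."""
    for doc_id, text in documents.items():
        for match in SUSPECT_PUNCTUATION.finditer(text):
            yield doc_id, match.start(), match.group()


if __name__ == "__main__":
    # Made-up sample documents, for illustration only.
    sample = {
        "a": "See Smith v. Jones,, 410 U.S. 113.",
        "b": "Clean text, no issues.",
    }
    for hit in find_punctuation_issues(sample):
        print(hit)
```

A customer-facing search engine typically strips or normalizes punctuation at indexing time, which is why a raw-text scan of this kind is useful to editors.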

The Results

The system as designed achieves the goal of a tightly integrated content management and enrichment system that takes full advantage of the HPCC Systems supercomputing capabilities for both computation and high-speed data access. The possibilities that this platform opens up for data mining and enrichment are very exciting for the company. The enrichment capability, coupled with the immense breadth of content targeted for this platform, has great potential to “wow” the industry.

The elapsed time to perform an enrichment pass over the entire data collection has dropped from between 6 and 8 weeks to less than a day. This change is so significant that we are already extending the degree of enrichment into capabilities that were previously out of reach.

Alternatives Considered

Prior to the evaluation and selection of the HPCC Systems technology, the Global Content team evaluated a MarkLogic® solution. The data enhancement processing would be performed upon a Grid. The cost comparisons were surprising.

Content Platform Cost Comparison

                  MarkLogic/Grid    HPCC Systems
Hardware (HW)     $13,084,032       $7,845,514
Engineering       $1,616,300        $2,683,150
Total             $14,700,332       $10,528,664
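
The totals above can be checked with a short sketch. The figures are taken directly from the comparison; the variable names are illustrative only.

```python
# Hardware (HW) and engineering costs in US dollars, from the comparison above.
costs = {
    "MarkLogic/Grid": {"hardware": 13_084_032, "engineering": 1_616_300},
    "HPCC Systems": {"hardware": 7_845_514, "engineering": 2_683_150},
}

# Total per platform is hardware plus engineering.
totals = {name: sum(parts.values()) for name, parts in costs.items()}

# Difference between the two platform totals.
savings = totals["MarkLogic/Grid"] - totals["HPCC Systems"]

for name, total in totals.items():
    print(f"{name}: ${total:,}")
print(f"Difference: ${savings:,}")
```

The difference comes to $4,171,668, which is consistent with the roughly $4M net savings cited under the cost considerations.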

Assumptions:

  • Includes cost of enrichment platform, including the 400-node Grid leveraged for MarkLogic.
  • Includes costs to develop Media Neutral Content Repository (MNCR) on MarkLogic.
  • Includes Enterprise Architecture requirements in both costs (e.g. REST, WIP, Cross schema).
  • Inventory Management is included only in the HPCC Systems estimate.
  • All estimates based on 3B document capacity (Includes News, other non-strategic sources).
  • Does not include MarkLogic software costs/maintenance.

Cost Considerations:

  • Sunk costs of ~$680k for MNCR (ML).
  • Incremental costs of ~$1.5M to rewrite MNCR (HPCC Systems).
  • HPCC Systems instance is full Disaster Recovery, while the MarkLogic instance is only a 50% Disaster Recovery.
  • HPCC Systems instance is full Certification Environment, while MarkLogic is only 10%.
  • Net $4M business savings utilizing HPCC Systems based on assumptions.

LexisNexis, Lexis and lexis.com are registered trademarks of Reed Elsevier Properties Inc., used under license. Other products and services may be trademarks or registered trademarks of their respective companies.