Automated processing, modeling, and analysis of unstructured text (news documents, web content, journal articles, etc.) is a key task in many data analysis and decision-making applications. As data sizes grow, scalability becomes essential for deep analysis. In many cases, documents are modeled as term or feature vectors, and latent semantic analysis (LSA) is used to model latent, or hidden, relationships between documents and the terms appearing in those documents. LSA supports conceptual organization and analysis of document collections by modeling high-dimensional feature vectors in many fewer dimensions. While past work on the scalability of LSA modeling has focused on the SVD, the goal of our work is to investigate the use of distributed-memory architectures for the entire text analysis process, from data ingestion to semantic modeling and analysis.
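To make the LSA idea concrete, here is a minimal sketch (not ParaText code) that models a tiny term-document matrix in fewer dimensions via a truncated SVD. The toy corpus, term labels, and the rank k=2 are illustrative assumptions.

```python
import numpy as np

# Rows = terms, columns = documents (raw term counts); toy data.
A = np.array([
    [2, 0, 1, 0],   # "graph"
    [1, 1, 0, 0],   # "model"
    [0, 2, 0, 1],   # "text"
    [0, 0, 1, 2],   # "query"
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                          # keep the k strongest latent concepts
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k approximation of A

# Documents can now be compared in k dimensions instead of one
# dimension per term.
doc_coords = np.diag(s[:k]) @ Vt[:k, :]        # shape: (k, n_docs)
```

Real collections are far too sparse and too large for a dense SVD like this; scalable implementations use sparse matrices and iterative or distributed SVD solvers.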
The ParaText project created a set of software components for distributed processing, modeling, and analysis of unstructured text as an integral part of the Titan™ toolkit. These components are chained together into data-parallel pipelines that are replicated across processes on distributed-memory architectures. Individual components can be replaced or rewired to explore different computational strategies and to implement new functionality. Text analysis functionality can be embedded in applications on any platform using the native C++ API, Python, or Java. A command-line MPI executable included with Titan provides a reference implementation showing how to use and embed the text analysis components in an application, and can be used for many serial and parallel analysis tasks. Titan text components have also been embedded in applications ranging from standalone GUI applications to sophisticated client-server web services.
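The chained-component idea can be sketched as follows. This is a hypothetical illustration, not the Titan API: all class and function names here are invented, and a real pipeline would operate on distributed data tables rather than Python lists.

```python
# Each stage transforms the data produced by the previous stage, so
# stages can be replaced or rewired independently of one another.

class TermExtractor:
    """Hypothetical stage: split raw documents into tokens."""
    def run(self, docs):
        return [doc.lower().split() for doc in docs]

class FrequencyMatrix:
    """Hypothetical stage: build a document-term count matrix."""
    def run(self, token_lists):
        vocab = sorted({t for toks in token_lists for t in toks})
        index = {t: i for i, t in enumerate(vocab)}
        rows = []
        for toks in token_lists:
            row = [0] * len(vocab)
            for t in toks:
                row[index[t]] += 1
            rows.append(row)
        return vocab, rows

def run_pipeline(docs, stages):
    # Chain the stages: the output of one is the input of the next.
    data = docs
    for stage in stages:
        data = stage.run(data)
    return data

vocab, matrix = run_pipeline(["big data text", "text model"],
                             [TermExtractor(), FrequencyMatrix()])
```

In a data-parallel deployment, this whole pipeline would be replicated across MPI processes, with each replica working on its own partition of the document collection.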
Parallel LSA is immediately available as part of the Titan Toolkit, which includes source code for all of the components and executables. Contact us for details and assistance on how to get started.
Parallel LSA has been used in applications spanning a broad set of domains, including, but not limited to, the following:
The parallel LSA components were tested on a cluster with 256 compute nodes (dual 3.6 GHz Intel EM64T processors, 6 GB RAM per node), an InfiniBand interconnect, and a Lustre file system (15 GB/second bandwidth), using HTML documents from the 2007 Spock Challenge. On 64 processors: 2,458 documents; 669,940 terms (0.12% matrix density). On 512 processors: 45,945 documents; 4,440,327 terms (0.017% matrix density).