
Enterprise executives understand that timely, accurate knowledge can mean improved business performance. Two technologies have been central in improving the quantitative and qualitative value of the knowledge available to decision makers: business intelligence and knowledge management. Business intelligence has applied the functionality, scalability, and reliability of modern database management systems to build ever-larger data warehouses, and to utilize data mining techniques to extract business advantage from the vast amount of available enterprise data. Knowledge management technologies, while less mature than business intelligence technologies, are now capable of combining today's content management systems and the Web with vastly improved searching and text mining capabilities to derive more value from the explosion of textual information. We believe that these systems will blend over time, borrowing techniques from each other and inspiring new approaches that can analyze data and text together, seamlessly. We call this blended technology BIKM. In this paper, we describe some of the current business problems that require analysis of both text and data, and some of the technical challenges posed by these problems. We describe a particular approach based on an OLAP (on-line analytical processing) model enhanced with text analysis, and describe two tools that we have developed to explore this approach—eClassifier performs text analysis, and Sapient integrates data and text through an OLAP-style interaction model. Finally, we discuss some new research that we are pursuing to enhance this approach.
A critical component for the success of the modern enterprise is its ability to take advantage of all available information. This challenge becomes more difficult with the constantly increasing volume of information, both internal and external to an enterprise. It is further exacerbated because many enterprises are becoming increasingly “knowledge-centric,” and therefore a larger number of employees need access to a greater variety of information to be effective. The explosive growth of the World Wide Web clearly compounds this problem.
Enterprises have been investing in technology in an effort to manage the information glut and to glean knowledge that can be leveraged for a competitive edge. Two technologies in particular have shown good return on investment in some applications and are benefiting from a large concentration of research and development. The technologies are business intelligence (BI) and knowledge management (KM).
Business intelligence technology has coalesced in the last decade around the use of data warehousing and on-line analytical processing (OLAP). Data warehousing is a systematic approach to collecting relevant business data into a single repository, where it is organized and validated so that it can be analyzed and presented in a form that is useful for business decision-making.1 The various sources for the relevant business data are referred to as the operational data stores (ODS). The data are extracted, transformed, and loaded (ETL) from the ODS into a data mart. An important part of this process is data cleansing, in which variations in schemas and data values across the disparate ODS are resolved. In the data mart, the data are modeled as an OLAP cube (a multidimensional model), which supports flexible drill-down and roll-up analyses. Tools from various vendors (e.g., Hyperion, Brio, Cognos) provide the end user with a query and analysis front end to the data mart. Large data warehouses currently hold tens of terabytes of data, whereas smaller, problem-specific data marts are typically in the 10- to 100-gigabyte range.
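To make the cleansing, roll-up, and drill-down steps described above concrete, the following is a minimal sketch in Python. It is not taken from any of the vendor tools mentioned. The rows, the dimension names (region, product, quarter), the `REGION_ALIASES` table, and the `cleanse` and `aggregate` functions are all hypothetical, chosen only to illustrate the idea on a tiny in-memory fact table.

```python
from collections import defaultdict

# Hypothetical rows extracted from two operational data stores.
# Each row is (region, product, quarter, revenue); note the variant region spellings.
RAW_ROWS = [
    ("East", "Laptop", "Q1", 120.0),
    ("EAST", "Laptop", "Q2", 150.0),
    ("E.", "Phone", "Q1", 80.0),
    ("West", "Laptop", "Q1", 200.0),
    ("west", "Phone", "Q2", 60.0),
]

DIMENSIONS = ("region", "product", "quarter")

# Assumed cleansing rule: map variant region spellings to one canonical value.
REGION_ALIASES = {"east": "East", "e.": "East", "west": "West"}


def cleanse(rows):
    """Resolve variations in data values before loading the data mart."""
    return [(REGION_ALIASES[r.lower()], p, q, rev) for r, p, q, rev in rows]


def aggregate(facts, group_by, where=None):
    """Sum revenue over the chosen dimensions, optionally filtered by `where`."""
    where = where or {}
    group_idx = [DIMENSIONS.index(d) for d in group_by]
    totals = defaultdict(float)
    for row in facts:
        if all(row[DIMENSIONS.index(d)] == v for d, v in where.items()):
            totals[tuple(row[i] for i in group_idx)] += row[3]
    return dict(totals)


facts = cleanse(RAW_ROWS)

# Roll-up: collapse product and quarter, keeping only region.
by_region = aggregate(facts, ["region"])

# Drill-down: within the East region, break revenue out by product.
east_by_product = aggregate(facts, ["product"], where={"region": "East"})
```

A production data mart would perform these aggregations in the database or OLAP engine rather than in application code; the sketch only shows how a single fact table supports both coarser and finer views of the same measure.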
Definitions of knowledge management span organizational behavioral science, collaboration, content management, and other technologies. In this paper, we use the term to refer to technologies for the management and analysis of unstructured information, particularly text documents. It is conjectured that there is as much business knowledge to be gleaned from the mass of unstructured information available as there is from classical business data. We believe this to be true and assert that unstructured information will become commonly used to provide deeper insight into, and explanations of, events discovered in the business data. The ability to explain observed events (e.g., trends, anomalies) in the data will clearly have applications in business, market, competitive, customer, and partner intelligence, as well as in many domains such as manufacturing, consumer goods, finance, and life sciences.
The variety of textual information sources is extremely large, including business documents, e-mail, news and press articles, technical journals, patents, conference proceedings, business contracts, government reports, regulatory filings, discussion groups, problem report databases, sales and support notes, and, of course, the Web. Knowledge and content management technologies are used to search, organize, and extract value from all of these information sources and are a focus of significant research and development.2,3 These technologies include clustering, taxonomy building, classification, information extraction, and summarization. An increasing number of applications, such as expertise location,4,5 knowledge portals, customer relationship management (CRM), and bioinformatics, require merging these unstructured information technologies with structured business data analysis.
It is our belief that, over time, techniques from BI and KM will blend. Today's disparate systems will borrow techniques from each other and will, in turn, inspire new techniques that seamlessly span the analysis of both data and text. With this in mind, we describe our contributions in this direction. First, we briefly describe some business problems that motivate this integration and some of the technical challenges that they pose. Then we describe eClassifier, a comprehensive text analysis tool that provides a framework for integrating advanced text analytics. Next, we present an example that motivates our particular approach toward integrating data and text analysis and describe our architecture for a combined data and document warehouse and associated tooling. Finally, we discuss some current research directions in extracting information from documents that can increase the value of a data cube.
Motivation for BIKM
The desire to extend the capabilities of business intelligence applications to include textual information has existed for quite some time. The major inhibitors have included the separation of the data on different data management systems, typically across different organizations, and the immaturity of automated text analysis techniques for deriving business value from large amounts of text. The current focus on information integration in many enterprises is rapidly diminishing the first inhibitor, and advances in machine learning, information retrieval, and statistical natural language processing are eroding the second.
Examples of BIKM problems. To understand the importance of BIKM, it is useful to look at some real business problems and to determine how this technology can provide a return on investment (ROI). In general, ROI can be achieved in one of two ways: (1) through cost reductions and the identification of inefficiencies (improved productivity), and (2) through the identification of revenue opportunities and growth. Here are some typical scenarios in which our customers believe their business analyses would benefit substantially from data and text integration:
1. Understanding sales effectiveness. A telemarketing revenue data cube can help identify products that are most successfully sold over the phone, sales representatives who generate the most sales, and customers who are the most receptive to this sales approach. Unfortunately, the particular sales techniques used by these successful sales representatives in various situations are not captured by quantitative measures in the OLAP cube. However, these sales conversations are now frequently recorded and converted to text. The text of conversations associated with high-revenue sales representatives and high-yield customers can be analyzed by various language processing or pattern detection techniques to find patterns in the use of phrases or phrase sequences.
2. Improving support and warranty analysis. Frequently in business applications, short text descriptions (for example, of customer complaints) are recorded in a database and then encoded by a person into short classification codes. The code fields then become the basis for any business analysis of the set of customer complaints. Variations in the assignment of codes by different people can cause emerging trends or problem situations to be overlooked. The application of modern linguistic and machine-learning techniques (i.e., classification) to the text could provide a more consistent encoding, or at least a validation of the human encoding, as the basis for the business analysis.
3. Relating CRM to profitability. Data cubes for understanding revenues achieved over a set of customers frequently omit the costs associated with individual customers. In some industries these costs can substantially reduce the profit from a customer. The costs can include the number of calls the customer made into the business for problem resolution, complaint handling, or just “hand-holding.” Extracting measures of these costs (e.g., time spent on the phone with the customer) and measures of the customer's loyalty for continued business (e.g., sentiment analysis of the customer interaction) from a customer relationship management (CRM) system and merging these measures into the revenue cube would provide a more complete picture of the profitability derived from a customer.6
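The third scenario above, merging text-derived measures into a revenue cube, can be sketched as follows. This is an illustration under stated assumptions, not the approach implemented in eClassifier or Sapient. The customer names, call records, `COST_PER_MINUTE` rate, and the tiny word lists used as a stand-in for sentiment analysis are all hypothetical, and a real system would replace the word counting with a proper text analysis component.

```python
from collections import defaultdict

# Hypothetical revenue cube cells, already rolled up to the customer level.
REVENUE = {"acme": 1000.0, "globex": 400.0}

# Hypothetical CRM call records: (customer, minutes on the phone, call notes).
CALLS = [
    ("acme", 30, "customer happy with the fix"),
    ("globex", 120, "customer angry about repeated failure"),
    ("globex", 90, "still angry and threatening to cancel"),
]

COST_PER_MINUTE = 2.0  # assumed support cost rate

# Toy lexicons standing in for a real sentiment analysis technique.
POSITIVE = {"happy", "thanks", "great"}
NEGATIVE = {"angry", "cancel", "failure"}


def sentiment(text):
    """Crude score: count of positive words minus count of negative words."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def profitability(revenue, calls):
    """Merge support cost and sentiment measures into per-customer revenue."""
    cost = defaultdict(float)
    mood = defaultdict(int)
    for customer, minutes, notes in calls:
        cost[customer] += minutes * COST_PER_MINUTE
        mood[customer] += sentiment(notes)
    return {
        customer: {
            "revenue": rev,
            "support_cost": cost[customer],
            "profit": rev - cost[customer],
            "sentiment": mood[customer],
        }
        for customer, rev in revenue.items()
    }


report = profitability(REVENUE, CALLS)
```

In this toy data, the customer with the lower revenue also generates most of the support cost and the most negative call notes, which is the kind of picture a revenue-only cube would not show.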