Description
Learn advanced analytical techniques and leverage existing toolkits to make your analytic applications more powerful, precise, and efficient. This book provides the right combination of architecture, design, and implementation information to create analytical systems that go beyond the basics of classification, clustering, and recommendation. Pro Hadoop Data Analytics emphasizes best practices to ensure coherent, efficient development. A complete example system is developed from standard third-party components: toolkits, libraries, visualization and reporting code, and the supporting glue needed to produce a working, extensible end-to-end system. The book also highlights the importance of end-to-end, flexible, configurable, high-performance data pipeline systems with analytical components, as well as appropriate ways to visualize the results. You’ll discover the value of mix-and-match, or hybrid, systems that combine different analytical components in one application; this hybrid approach is prominent in the examples.

What You’ll Learn
- Build big data analytic systems with the Hadoop ecosystem
- Use libraries, toolkits, and algorithms to make development easier and more effective
- Apply metrics to measure the performance and efficiency of components and systems
- Connect to standard relational databases, NoSQL data sources, and more
- Follow case studies with example components to create your own systems

Who This Book Is For
Software engineers, architects, and data scientists with an interest in the design and implementation of big data analytical systems using Hadoop, the Hadoop ecosystem, and other associated technologies.

About the Author
Kerry Koitzsch is a software engineer with an interest in the early history of science, particularly chemistry. He frequently publishes papers and attends conferences on scientific and historical topics, including early chemistry, alchemy, and the sociology of science. He has presented many lectures, talks, and demonstrations on a variety of subjects for the United States Army, the Society for Utopian Studies, the American Association for Artificial Intelligence (AAAI), the Association for Studies in Esotericism (ASE), and others, and has also published several papers and written two historical books. Kerry was educated at Interlochen Arts Academy, MIT, and the San Francisco Conservatory of Music. He served in the United States Army and the United States Army Reserve, and is the recipient of the United States Army Achievement Medal. He has been a software engineer specializing in computer vision, machine learning, and database technologies for 30 years, and currently lives and works in Sunnyvale, California.

[PART I: CONCEPTS]

Chapter 1: Overview: Building Data Analytic Systems with Hadoop
In this chapter we discuss what analytic systems using Hadoop are, why they are important, the data sources that may be used, and the applications that are, and are not, suitable for a distributed system approach using Hadoop.
Subtopics:
1. Introduction: The Need for Distributed Analysis
2. How the Hadoop Ecosystem Implements Big Data Analysis
3. A Survey of the Hadoop Ecosystem
4. Architectures for Building Analytic Systems
5. Summary

Chapter 2: Programming Languages: A Scala and Python Refresher
This chapter is a concise overview of the Scala and Python programming languages and details why these languages are important ingredients of most modern Hadoop analytical systems. It is aimed primarily at Java/C++ programmers who need a quick review of, or introduction to, Scala and Python (a small Scala sample follows the subtopic list).
Subtopics:
1. Motivation: Selecting the Right Language(s) Defines the Application
2. Review of Scala
3. Review of Python
4. Programming Applications and Examples
5. Summary
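
To give a flavor of the refresher material, here is a minimal Scala sketch, invented for this overview rather than taken from the book, showing the concise functional collection style that makes the language attractive for analytics work:

    // Word-frequency count in idiomatic Scala: immutable collections,
    // higher-order functions, and pattern matching.
    object WordCountSketch {
      def main(args: Array[String]): Unit = {
        val lines = Seq(
          "hadoop makes big data analysis practical",
          "big data needs big toolkits")
        val counts = lines
          .flatMap(_.split("\\s+"))   // tokenize each line on whitespace
          .groupBy(identity)          // group identical words together
          .map { case (word, hits) => word -> hits.size }
        // print words by descending frequency
        counts.toSeq.sortBy(-_._2).foreach { case (w, n) => println(s"$w: $n") }
      }
    }

The same dataflow shape (tokenize, group, aggregate) is exactly what distributed frameworks scale up, which is one reason language choice matters for these systems.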

Chapter 3: Necessary Ingredients: Standard Toolkits for Hadoop and Analytics
In this chapter we describe an example system, developed throughout the remainder of the book, that uses standard toolkits from the Hadoop ecosystem and other analytical toolkits in combination with development components such as Maven, OpenCV, Apache Mahout, and others to create a Hadoop-based system appropriate for a variety of applications.
Subtopics:
1. Libraries, Components, and Toolkits: A Survey
2. Numerical and Statistical Libraries: R, Weka, and Others
3. Hadoop Toolkits for Analysis: Mahout and Friends
4. Apache Spark Libraries and Components: H2O, Sparkling Water, and More
5. Examples of Use and System Building
6. Summary

Chapter 4: Relational, NoSQL, and Graph Databases
In this chapter we describe relational databases such as MySQL, NoSQL databases such as Cassandra, and graph databases such as Neo4j; how to integrate them with the Hadoop ecosystem; and how to create customized data sources and sinks using Apache Camel.
Subtopics:
1. Introduction to Databases: Relational, NoSQL, and Graph
2. Relational Data Sources
3. NoSQL Data Sources: Cassandra
4. Graph Databases: Neo4j
5. Integrating Data with the Analytical Engine
6. Summary

Chapter 5: Data Pipelines and How to Construct Them
In this chapter we describe how to construct basic data pipelines using data sources and the Hadoop ecosystem. We provide an end-to-end example of how data sources may be linked and processed using Hadoop and other analytical components, and how this resembles a standard ETL process (a minimal route sketch follows the subtopic list).
Subtopics:
1. The Basic Data Pipeline
2. Data Sources and Sinks
3. Computation and Transformation
4. Visualizing and Reporting the Results
5. Summary
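
As a concrete, minimal illustration of the source-to-sink idea, here is an Apache Camel route in Scala. This is a sketch only: the directory names are invented placeholders, and it is not the book's example pipeline.

    import org.apache.camel.builder.RouteBuilder
    import org.apache.camel.impl.DefaultCamelContext

    // A minimal file-to-file pipeline: files dropped into an inbox
    // directory are logged (a stand-in for a transformation step)
    // and delivered to an outbox directory.
    object PipelineSketch {
      def main(args: Array[String]): Unit = {
        val context = new DefaultCamelContext()
        context.addRoutes(new RouteBuilder {
          override def configure(): Unit = {
            from("file:data/inbox?noop=true") // data source (placeholder path)
              .to("log:pipeline")             // monitoring/transformation stage
              .to("file:data/outbox")         // data sink (placeholder path)
          }
        })
        context.start()
        Thread.sleep(10000) // let the route run briefly for the demo
        context.stop()
      }
    }

Real pipelines swap the file endpoints for message queues, database queries, or HDFS paths; the route structure stays the same.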

Chapter 6: Advanced Search Techniques with Hadoop, Lucene, and Solr
In this chapter we describe the structure and use of the Lucene and Solr third-party search engine components, how to use them with Hadoop, and how to develop advanced search capability customized for an analytical application (see the query sketch after the subtopic list).
Subtopics:
1. Introduction to Customized Search Engines
2. Distributed Search Techniques
3. Basic Examples: A Custom Search Component
4. Extended Examples: Scaling, Tuning, and Customizing the Search Component
5. Summary
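
For a sense of what search code looks like at this level, here is a minimal SolrJ query in Scala. The collection name "articles", the "title" field, and the URL are placeholders invented for illustration; the sketch assumes SolrJ 6.x or later and is not the book's component.

    import org.apache.solr.client.solrj.SolrQuery
    import org.apache.solr.client.solrj.impl.HttpSolrClient

    // Query a (hypothetical) Solr collection and print matching titles.
    object SearchSketch {
      def main(args: Array[String]): Unit = {
        val client =
          new HttpSolrClient.Builder("http://localhost:8983/solr/articles").build()
        try {
          val query = new SolrQuery("title:hadoop") // Lucene query syntax
          query.setRows(10)                         // first ten hits only
          val results = client.query(query).getResults
          for (i <- 0 until results.size)
            println(results.get(i).getFieldValue("title"))
        } finally {
          client.close()
        }
      }
    }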

[PART II: ARCHITECTURES AND ALGORITHMS]

Chapter 7: An Overview of Analytical Techniques and Algorithms
In this chapter we provide an overview of four categories of algorithms (statistical, Bayesian, ontology-driven, and hybrid) that leverage the more basic algorithms found in standard libraries to perform more in-depth and accurate analyses using Hadoop.
Subtopics:
1. Survey of Algorithm Types
2. Statistical/Numerical Techniques
3. Bayesian Techniques
4. Ontology-Driven Algorithms
5. Hybrid Algorithms: Combining Algorithm Types
6. Code Examples
7. Summary

Chapter 8: Rule Engines, System Control, and System Orchestration
In this chapter we describe the Drools rule engine and how it may be used to control and orchestrate Hadoop analysis pipelines. We describe an example rule-based controller that can be used with a variety of data types and applications in combination with the Hadoop ecosystem (a short API sketch follows the subtopic list).
Subtopics:
1. Introduction to Rule Systems: Drools
2. Rule-Based Software System Control
3. System Orchestration with Drools
4. Analytical Engine Example with Rule Control
5. Summary
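
The following Scala fragment sketches the shape of rule-driven control using the Drools KIE API. The session name and the fact class are invented for illustration, the rules themselves (written separately in DRL) are assumed to be available on the classpath via kmodule.xml, and this is not the book's controller.

    import org.kie.api.KieServices

    // A simple fact a rule might match against (hypothetical).
    case class Transaction(id: String, amount: Double)

    object RuleControlSketch {
      def main(args: Array[String]): Unit = {
        val services = KieServices.Factory.get()
        // Load rules found on the classpath; "controlSession" is a
        // placeholder session name defined in kmodule.xml.
        val container = services.getKieClasspathContainer
        val session = container.newKieSession("controlSession")
        try {
          session.insert(Transaction("tx-1", 9999.0)) // assert a fact
          val fired = session.fireAllRules()          // run matching rules
          println(s"rules fired: $fired")
        } finally {
          session.dispose()
        }
      }
    }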

Chapter 9: Putting It All Together: Designing a Complete Analytical System
In this chapter we describe an end-to-end design example using many of the components discussed so far, as well as ‘best practices’ to apply during the requirements acquisition, planning, architecture, development, and test phases of the system development project.
Subtopics:
1. Goals and Requirements for Analytical System Building
2. Architecture
3. Initial Code Framework Example
4. Extended Code Framework Example
5. Summary

[PART III: COMPONENTS AND SYSTEMS]

Chapter 10: Using Library Components for Statistical Analytics and Data Mining
In this chapter we describe four standard statistical analysis packages: R/Weka, MLlib, Mahout, and NumPy. These toolkits are used to develop a data mining example using a Hadoop cluster and a variety of Hadoop ecosystem components to provide a dashboard-based result report (a hand-rolled statistics sketch follows the subtopic list).
Subtopics:
1. A Survey of Data Mining Techniques and Applications
2. R/Weka Example
3. NumPy Extended Example
4. Integration with Hadoop Analytical Components
5. Data Mining Example
6. Summary
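
As a reminder of what such packages compute under the hood, here is a plain-Scala sketch, with made-up sample data and no library dependencies, deriving a mean and a sample standard deviation:

    // Descriptive statistics by hand: mean and sample standard deviation.
    object StatsSketch {
      def mean(xs: Seq[Double]): Double = xs.sum / xs.size

      def sampleStdDev(xs: Seq[Double]): Double = {
        val m = mean(xs)
        // divide by (n - 1) for the unbiased sample variance
        math.sqrt(xs.map(x => math.pow(x - m, 2)).sum / (xs.size - 1))
      }

      def main(args: Array[String]): Unit = {
        val sample = Seq(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)
        println(f"mean = ${mean(sample)}%.2f, stddev = ${sampleStdDev(sample)}%.2f")
      }
    }

Libraries like R and Weka add the harder parts (robust numerics, significance tests, model fitting), which is why the chapter leans on them rather than hand-rolled code.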

Chapter 11: Semantic Web Technologies and Natural Language Processing
In this chapter we describe the use of knowledge information sources such as taxonomies, ontologies, and grammars; why they are useful; and how to integrate them with Hadoop analytical components, as well as with natural language processing components, to add a layer of ease of use to an analytical system.
Subtopics:
1. Introduction to Semantic Web Technologies
2. Semantic Web for Hadoop (Examples)
3. Data Integration with Semantic Web Technologies
4. Code Examples with Data Integration Using Apache Camel
5. Extended Example
6. Summary

Chapter 12: Machine Learning Components with Hadoop
In this chapter we discuss a number of machine learning components, including neural net, genetic algorithm, Markov modeling, and hybrid components, and how they may be used with the Hadoop ecosystem to provide cognitive computing elements to an analytical engine (a toy perceptron sketch follows the subtopic list).
Subtopics:
1. Introduction: The Need for Machine Learning
2. Machine Learning Toolkits and Hadoop
3. Code Examples Using Apache Mahout
4. Extended Code Examples
5. Neural Nets, Genetic Algorithms, and Hybrids
6. Summary
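
To make the learning idea concrete at the smallest possible scale, here is a single perceptron in plain Scala, an invented toy rather than Mahout code, trained on the logical AND function:

    // A single perceptron learning logical AND via the classic update rule.
    object PerceptronSketch {
      def main(args: Array[String]): Unit = {
        val data = Seq(
          (Array(0.0, 0.0), 0.0), (Array(0.0, 1.0), 0.0),
          (Array(1.0, 0.0), 0.0), (Array(1.0, 1.0), 1.0))
        val w = Array(0.0, 0.0) // weights, updated in place
        var bias = 0.0
        val rate = 0.1
        def predict(x: Array[Double]): Double =
          if (w(0) * x(0) + w(1) * x(1) + bias > 0) 1.0 else 0.0
        for (_ <- 1 to 20; (x, label) <- data) {
          val err = label - predict(x) // 0 when correct; +1 or -1 otherwise
          w(0) += rate * err * x(0)
          w(1) += rate * err * x(1)
          bias += rate * err
        }
        data.foreach { case (x, label) =>
          println(s"${x.mkString(",")} -> ${predict(x)} (expected $label)")
        }
      }
    }

Toolkits such as Mahout matter because real models have millions of parameters and examples, not four; a training loop like this is what they distribute across a cluster.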

Chapter 13: Data Visualizers: Seeing and Interacting with the Analysis
In this chapter we discuss how to create data visualization components, connect them with the analytical modules of the system, and provide the user with the ability to interact with the charts, dashboards, and reports (a minimal text-mode sketch follows the subtopic list).
Subtopics:
1. Introduction to Data Visualization: The Need to See Results
2. Visualizers for Simple Data: Some Examples
3. Data Visualizers and Hadoop: Some Examples
4. Visualizers for More Than Two Dimensions (3-D Examples and Extended Plots/Charting)
5. Summary: Future Directions for Data Visualization
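
The humblest visualizer is a text one. Purely as an invented toy, not a component from the book, the following Scala sketch renders category counts as a console bar chart:

    // A tiny text-mode bar chart: one row per category, scaled to a fixed width.
    object BarChartSketch {
      def main(args: Array[String]): Unit = {
        val counts = Seq("clusterA" -> 42, "clusterB" -> 17, "clusterC" -> 29)
        val maxVal = counts.map(_._2).max
        val width = 40 // widest bar, in characters
        counts.foreach { case (label, n) =>
          val bar = "#" * (n * width / maxVal)
          println(f"$label%-10s $bar ($n)")
        }
      }
    }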

[PART IV: CASE STUDIES AND APPLICATIONS]

Chapter 14: A Case Study in Bioinformatics: Analyzing Microscope Slide Data
In this chapter we describe an application to analyze microscope slide data such as might be found in medical examinations of patient samples. We illustrate how a Hadoop system might be used on a small Hadoop cluster to organize, analyze, and correlate bioinformatic data.
Subtopics:
1. Introduction to Bioinformatics
2. Analyzing Microscope Slide Data Automatically
3. Basic Examples
4. Extended Examples
5. Summary

Chapter 15: A Bayesian Analysis Software Component: Identifying Credit Card Fraud
In this chapter we describe a Bayesian analysis plugin component that may be used to analyze credit card transactions in order to identify fraudulent use (the Bayes’ rule sketch after the subtopic list shows the core computation).
Subtopics:
1. Introduction to Bayesian Analysis
2. The Problem of Credit Fraud and Possible Solutions
3. Basic Applications of the Data Models
4. Examples of Fraud Detection
5. Summary
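
The arithmetic at the heart of such a component is Bayes’ rule: P(fraud | evidence) = P(evidence | fraud) * P(fraud) / P(evidence). The plain-Scala sketch below uses made-up probabilities for one hypothetical piece of transaction evidence, simply to show the shape of the computation:

    // Score a transaction feature with Bayes' rule (all numbers invented).
    object BayesSketch {
      def main(args: Array[String]): Unit = {
        val pFraud = 0.001         // prior: 0.1% of transactions are fraudulent
        val pEvidGivenFraud = 0.90 // P(late-night foreign purchase | fraud)
        val pEvidGivenLegit = 0.02 // P(same evidence | legitimate purchase)
        // total probability of seeing the evidence at all
        val pEvidence =
          pEvidGivenFraud * pFraud + pEvidGivenLegit * (1 - pFraud)
        val posterior = pEvidGivenFraud * pFraud / pEvidence
        println(f"P(fraud | evidence) = $posterior%.4f") // ~0.043 here
      }
    }

Even strong evidence lifts the posterior only to a few percent because fraud is rare; this base-rate effect is why practical detectors combine many features.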

Chapter 16: Searching for Oil: Geological Data Analysis with Mahout
In this chapter we describe a system that uses geospatial data, ontologies, and other semantic web information to predict where geological resources, such as oil or bauxite (aluminum ore), might be found.
Subtopics:
1. Introduction to the Geospatial Data Arena
2. Components and Architecture
3. Data Sources for Geospatial Data
4. Basic Examples and Visualizations
5. Extended Examples
6. Summary

Chapter 17: ‘Image as Big Data’ Systems: Some Case Studies
In this chapter we describe the use of ‘images as big data’ and how image data may be used in combination with the Hadoop ecosystem to provide information for a variety of systems.
Subtopics:
1. Introduction to the Image as Big Data Concept
2. Components and Architecture
3. Data Sources for Imagery and How to Use Them
4. The Image as Big Data Pipeline
5. Examples
6. Summary

Chapter 18: A Generic Data Pipeline Analytical System
In this chapter we detail an end-to-end analytical system, using many of the techniques discussed throughout the book, to provide an evaluation system the user may extend and edit to create her own Hadoop data analysis system.
Subtopics:
1. Architecture and Description of the Example System
2. How to Obtain and Run the System
3. Basic Examples
4. Extended Examples
5. How to Extend the System for Custom Applications
6. Summary

Chapter 19: Conclusions and the Future of Big Data Analysis
In this chapter we sum up what we have learned in the previous chapters, discuss some of the developing trends in big data analysis, including ‘incubator’ and ‘young’ projects for data analysis, and speculate on what the future holds for big data analysis and the Hadoop ecosystem (it can only continue to grow).
Subtopics:
1. Conclusions: The Current State of Hadoop Data Analytics