
Data just right : introduction to large-scale data & analytics / Michael Manoochehri.

By: Manoochehri, Michael
Material type: Text
Publication details: Upper Saddle River, NJ : Addison-Wesley, [2014]
Description: xxiii, 215 pages : illustrations, map ; 24 cm
ISBN:
  • 9780321898654 (pbk. : alk. paper)
  • 0321898656 (pbk. : alk. paper)
DDC classification:
  • 005.74 MAN
Contents:
Machine generated contents note: I.Directives in the Big Data Era -- 1.Four Rules for Data Success -- When Data Became a BIG Deal -- Data and the Single Server -- The Big Data Trade-Off -- Build Solutions That Scale (Toward Infinity) -- Build Systems That Can Share Data (On the Internet) -- Build Solutions, Not Infrastructure -- Focus on Unlocking Value from Your Data -- Anatomy of a Big Data Pipeline -- The Ultimate Database -- Summary -- II.Collecting and Sharing a Lot of Data -- 2.Hosting and Sharing Terabytes of Raw Data -- Suffering from Files -- The Challenges of Sharing Lots of Files -- Storage: Infrastructure as a Service -- The Network Is Slow -- Choosing the Right Data Format -- XML: Data, Describe Thyself -- JSON: The Programmer's Choice -- Character Encoding -- File Transformations -- Data in Motion: Data Serialization Formats -- Apache Thrift and Protocol Buffers -- Summary -- 3.Building a NoSQL-Based Web App to Collect Crowd-Sourced Data -- Relational Databases: Command and Control -- The Relational Database ACID Test -- Relational Databases versus the Internet -- CAP Theorem and BASE -- Nonrelational Database Models -- Key-Value Database -- Document Store -- Leaning toward Write Performance: Redis -- Sharding across Many Redis Instances -- Automatic Partitioning with Twemproxy -- Alternatives to Using Redis -- NewSQL: The Return of Codd -- Summary -- 4.Strategies for Dealing with Data Silos -- A Warehouse Full of Jargon -- The Problem in Practice -- Planning for Data Compliance and Security -- Enter the Data Warehouse -- Data Warehousing's Magic Words: Extract, Transform, and Load -- Hadoop: The Elephant in the Warehouse -- Data Silos Can Be Good -- Concentrate on the Data Challenge, Not the Technology -- Empower Employees to Ask Their Own Questions -- Invest in Technology That Bridges Data Silos -- Convergence: The End of the Data Silo -- Will Luhn's Business Intelligence System Become Reality?
-- Summary -- III.Asking Questions about Your Data -- 5.Using Hadoop, Hive, and Shark to Ask Questions about Large Datasets -- What Is a Data Warehouse? -- Apache Hive: Interactive Querying for Hadoop -- Use Cases for Hive -- Hive in Practice -- Using Additional Data Sources with Hive -- Shark: Queries at the Speed of RAM -- Data Warehousing in the Cloud -- Summary -- 6.Building a Data Dashboard with Google BigQuery -- Analytical Databases -- Dremel: Spreading the Wealth -- How Dremel and MapReduce Differ -- BigQuery: Data Analytics as a Service -- BigQuery's Query Language -- Building a Custom Big Data Dashboard -- Authorizing Access to the BigQuery API -- Running a Query and Retrieving the Result -- Caching Query Results -- Adding Visualization -- The Future of Analytical Query Engines -- Summary -- 7.Visualization Strategies for Exploring Large Datasets -- Cautionary Tales: Translating Data into Narrative -- Human Scale versus Machine Scale -- Interactivity -- Building Applications for Data Interactivity -- Interactive Visualizations with R and ggplot2 -- matplotlib: 2-D Charts with Python -- D3.js: Interactive Visualizations for the Web -- Summary -- IV.Building Data Pipelines -- 8.Putting It Together: MapReduce Data Pipelines -- What Is a Data Pipeline?
-- The Right Tool for the Job -- Data Pipelines with Hadoop Streaming -- MapReduce and Data Transformation -- The Simplest Pipeline: stdin to stdout -- A One-Step MapReduce Transformation -- Extracting Relevant Information from Raw NVSS Data: Map Phase -- Counting Births per Month: The Reducer Phase -- Testing the MapReduce Pipeline Locally -- Running Our MapReduce Job on a Hadoop Cluster -- Managing Complexity: Python MapReduce Frameworks for Hadoop -- Rewriting Our Hadoop Streaming Example Using mrjob -- Building a Multistep Pipeline -- Running mrjob Scripts on Elastic MapReduce -- Alternative Python-Based MapReduce Frameworks -- Summary -- 9.Building Data Transformation Workflows with Pig and Cascading -- Large-Scale Data Workflows in Practice -- It's Complicated: Multistep MapReduce Transformations -- Apache Pig: "Ixnay on the Omplexitycay" -- Running Pig Using the Interactive Grunt Shell -- Filtering and Optimizing Data Workflows -- Running a Pig Script in Batch Mode -- Cascading: Building Robust Data-Workflow Applications -- Thinking in Terms of Sources and Sinks -- Building a Cascading Application -- Creating a Cascade: A Simple JOIN Example -- Deploying a Cascading Application on a Hadoop Cluster -- When to Choose Pig versus Cascading -- Summary -- V.Machine Learning for Large Datasets -- 10.Building a Data Classification System with Mahout -- Can Machines Predict the Future?
-- Challenges of Machine Learning -- Bayesian Classification -- Clustering -- Recommendation Engines -- Apache Mahout: Scalable Machine Learning -- Using Mahout to Classify Text -- MLBase: Distributed Machine Learning Framework -- Summary -- VI.Statistical Analysis for Massive Datasets -- 11.Using R with Large Datasets -- Why Statistics Are Sexy -- Limitations of R for Large Datasets -- R Data Frames and Matrices -- Strategies for Dealing with Large Datasets -- Large Matrix Manipulation: bigmemory and biganalytics -- ff: Working with Data Frames Larger than Memory -- biglm: Linear Regression for Large Datasets -- RHadoop: Accessing Apache Hadoop from R -- Summary -- 12.Building Analytics Workflows Using Python and Pandas -- The Snakes Are Loose in the Data Zoo -- Choosing a Language for Statistical Computation -- Extending Existing Code -- Tools and Testing -- Python Libraries for Data Processing -- NumPy -- SciPy: Scientific Computing for Python -- The Pandas Data Analysis Library -- Building More Complex Workflows -- Working with Bad or Missing Records -- iPython: Completing the Scientific Computing Tool Chain -- Parallelizing iPython Using a Cluster -- Summary -- VII.Looking Ahead -- 13.When to Build, When to Buy, When to Outsource -- Overlapping Solutions -- Understanding Your Data Problem -- A Playbook for the Build versus Buy Problem -- What Have You Already Invested In? -- Starting Small -- Planning for Scale -- My Own Private Data Center -- Understand the Costs of Open-Source -- Everything as a Service -- Summary -- 14.The Future: Trends in Data Technology -- Hadoop: The Disruptor and the Disrupted -- Everything in the Cloud -- The Rise and Fall of the Data Scientist -- Convergence: The Ultimate Database -- Convergence of Cultures -- Summary.
Holdings
Item type: Standard Loan
Current library: Thurles Library, Main Collection
Call number: 005.74 MAN
Copy number: 1
Status: Available
Barcode: 39002100654640

Enhanced descriptions from Syndetics:

Making Big Data Work: Real-World Use Cases and Examples, Practical Code, Detailed Solutions

Large-scale data analysis is now vitally important to virtually every business. Mobile and social technologies are generating massive datasets; distributed cloud computing offers the resources to store and analyze them; and professionals have radically new technologies at their command, including NoSQL databases. Until now, however, most books on "Big Data" have been little more than business polemics or product catalogs. Data Just Right is different: It's a completely practical and indispensable guide for every Big Data decision-maker, implementer, and strategist.

Michael Manoochehri, a former Google engineer and data hacker, writes for professionals who need practical solutions that can be implemented with limited resources and time. Drawing on his extensive experience, he helps you focus on building applications, rather than infrastructure, because that's where you can derive the most value.

Manoochehri shows how to address each of today's key Big Data use cases in a cost-effective way by combining technologies in hybrid solutions. You'll find expert approaches to managing massive datasets, visualizing data, building data pipelines and dashboards, choosing tools for statistical analysis, and more. Throughout, the author demonstrates techniques using many of today's leading data analysis tools, including Hadoop, Hive, Shark, R, Apache Pig, Mahout, and Google BigQuery.

Coverage includes

  • Mastering the four guiding principles of Big Data success--and avoiding common pitfalls
  • Emphasizing collaboration and avoiding problems with siloed data
  • Hosting and sharing multi-terabyte datasets efficiently and economically
  • "Building for infinity" to support rapid growth
  • Developing a NoSQL Web app with Redis to collect crowd-sourced data
  • Running distributed queries over massive datasets with Hadoop, Hive, and Shark
  • Building a data dashboard with Google BigQuery
  • Exploring large datasets with advanced visualization
  • Implementing efficient pipelines for transforming immense amounts of data
  • Automating complex processing with Apache Pig and the Cascading Java library
  • Applying machine learning to classify, recommend, and predict incoming information
  • Using R to perform statistical analysis on massive datasets
  • Building highly efficient analytics workflows with Python and Pandas
  • Establishing sensible purchasing strategies: when to build, buy, or outsource
  • Previewing emerging trends and convergences in scalable data technologies and the evolving role of the Data Scientist

Includes bibliographical references and index.

Table of contents provided by Syndetics

  • Foreword (p. xv)
  • Preface (p. xvii)
  • Acknowledgments (p. xxv)
  • About the Author (p. xxvii)
  • I Directives in the Big Data Era (p. 1)
  • 1 Four Rules for Data Success (p. 3)
  • When Data Became a BIG Deal (p. 3)
  • Data and the Single Server (p. 4)
  • The Big Data Trade-Off (p. 5)
  • Build Solutions That Scale (Toward Infinity) (p. 6)
  • Build Systems That Can Share Data (On the Internet) (p. 7)
  • Build Solutions, Not Infrastructure (p. 8)
  • Focus on Unlocking Value from Your Data (p. 8)
  • Anatomy of a Big Data Pipeline (p. 9)
  • The Ultimate Database (p. 10)
  • Summary (p. 10)
  • II Collecting and Sharing a Lot of Data (p. 11)
  • 2 Hosting and Sharing Terabytes of Raw Data (p. 13)
  • Suffering from Files (p. 14)
  • The Challenges of Sharing Lots of Files (p. 14)
  • Storage: Infrastructure as a Service (p. 15)
  • The Network Is Slow (p. 16)
  • Choosing the Right Data Format (p. 16)
  • XML: Data, Describe Thyself (p. 18)
  • JSON: The Programmer's Choice (p. 18)
  • Character Encoding (p. 19)
  • File Transformations (p. 21)
  • Data in Motion: Data Serialization Formats (p. 21)
  • Apache Thrift and Protocol Buffers (p. 22)
  • Summary (p. 23)
  • 3 Building a NoSQL-Based Web App to Collect Crowd-Sourced Data (p. 25)
  • Relational Databases: Command and Control (p. 25)
  • The Relational Database ACID Test (p. 28)
  • Relational Databases versus the Internet (p. 28)
  • CAP Theorem and BASE (p. 30)
  • Nonrelational Database Models (p. 31)
  • Key-Value Database (p. 32)
  • Document Store (p. 33)
  • Leaning toward Write Performance: Redis (p. 35)
  • Sharding across Many Redis Instances (p. 38)
  • Automatic Partitioning with Twemproxy (p. 39)
  • Alternatives to Using Redis (p. 40)
  • NewSQL: The Return of Codd (p. 41)
  • Summary (p. 42)
  • 4 Strategies for Dealing with Data Silos (p. 43)
  • A Warehouse Full of Jargon (p. 43)
  • The Problem in Practice (p. 45)
  • Planning for Data Compliance and Security (p. 46)
  • Enter the Data Warehouse (p. 46)
  • Data Warehousing's Magic Words: Extract, Transform, and Load (p. 48)
  • Hadoop: The Elephant in the Warehouse (p. 48)
  • Data Silos Can Be Good (p. 49)
  • Concentrate on the Data Challenge, Not the Technology (p. 50)
  • Empower Employees to Ask Their Own Questions (p. 50)
  • Invest in Technology That Bridges Data Silos (p. 51)
  • Convergence: The End of the Data Silo (p. 51)
  • Will Luhn's Business Intelligence System Become Reality? (p. 52)
  • Summary (p. 53)
  • III Asking Questions about Your Data (p. 55)
  • 5 Using Hadoop, Hive, and Shark to Ask Questions about Large Datasets (p. 57)
  • What Is a Data Warehouse? (p. 57)
  • Apache Hive: Interactive Querying for Hadoop (p. 60)
  • Use Cases for Hive (p. 60)
  • Hive in Practice (p. 61)
  • Using Additional Data Sources with Hive (p. 65)
  • Shark: Queries at the Speed of RAM (p. 65)
  • Data Warehousing in the Cloud (p. 66)
  • Summary (p. 67)
  • 6 Building a Data Dashboard with Google BigQuery (p. 69)
  • Analytical Databases (p. 69)
  • Dremel: Spreading the Wealth (p. 71)
  • How Dremel and MapReduce Differ (p. 72)
  • BigQuery: Data Analytics as a Service (p. 73)
  • BigQuery's Query Language (p. 74)
  • Building a Custom Big Data Dashboard (p. 75)
  • Authorizing Access to the BigQuery API (p. 76)
  • Running a Query and Retrieving the Result (p. 78)
  • Caching Query Results (p. 79)
  • Adding Visualization (p. 81)
  • The Future of Analytical Query Engines (p. 82)
  • Summary (p. 83)
  • 7 Visualization Strategies for Exploring Large Datasets (p. 85)
  • Cautionary Tales: Translating Data into Narrative (p. 86)
  • Human Scale versus Machine Scale (p. 89)
  • Interactivity (p. 89)
  • Building Applications for Data Interactivity (p. 90)
  • Interactive Visualizations with R and ggplot2 (p. 90)
  • matplotlib: 2-D Charts with Python (p. 92)
  • D3.js: Interactive Visualizations for the Web (p. 92)
  • Summary (p. 96)
  • IV Building Data Pipelines (p. 97)
  • 8 Putting It Together: MapReduce Data Pipelines (p. 99)
  • What Is a Data Pipeline? (p. 99)
  • The Right Tool for the Job (p. 100)
  • Data Pipelines with Hadoop Streaming (p. 101)
  • MapReduce and Data Transformation (p. 101)
  • The Simplest Pipeline: stdin to stdout (p. 102)
  • A One-Step MapReduce Transformation (p. 105)
  • Extracting Relevant Information from Raw NVSS Data: Map Phase (p. 106)
  • Counting Births per Month: The Reducer Phase (p. 107)
  • Testing the MapReduce Pipeline Locally (p. 108)
  • Running Our MapReduce Job on a Hadoop Cluster (p. 109)
  • Managing Complexity: Python MapReduce Frameworks for Hadoop (p. 110)
  • Rewriting Our Hadoop Streaming Example Using mrjob (p. 110)
  • Building a Multistep Pipeline (p. 112)
  • Running mrjob Scripts on Elastic MapReduce (p. 113)
  • Alternative Python-Based MapReduce Frameworks (p. 114)
  • Summary (p. 114)
  • 9 Building Data Transformation Workflows with Pig and Cascading (p. 117)
  • Large-Scale Data Workflows in Practice (p. 118)
  • It's Complicated: Multistep MapReduce Transformations (p. 118)
  • Apache Pig: "Ixnay on the Omplexitycay" (p. 119)
  • Running Pig Using the Interactive Grunt Shell (p. 120)
  • Filtering and Optimizing Data Workflows (p. 121)
  • Running a Pig Script in Batch Mode (p. 122)
  • Cascading: Building Robust Data-Workflow Applications (p. 122)
  • Thinking in Terms of Sources and Sinks (p. 123)
  • Building a Cascading Application (p. 124)
  • Creating a Cascade: A Simple JOIN Example (p. 125)
  • Deploying a Cascading Application on a Hadoop Cluster (p. 127)
  • When to Choose Pig versus Cascading (p. 128)
  • Summary (p. 128)
  • V Machine Learning for Large Datasets (p. 129)
  • 10 Building a Data Classification System with Mahout (p. 131)
  • Can Machines Predict the Future? (p. 132)
  • Challenges of Machine Learning (p. 132)
  • Bayesian Classification (p. 133)
  • Clustering (p. 134)
  • Recommendation Engines (p. 135)
  • Apache Mahout: Scalable Machine Learning (p. 136)
  • Using Mahout to Classify Text (p. 137)
  • MLBase: Distributed Machine Learning Framework (p. 139)
  • Summary (p. 140)
  • VI Statistical Analysis for Massive Datasets (p. 143)
  • 11 Using R with Large Datasets (p. 145)
  • Why Statistics Are Sexy (p. 146)
  • Limitations of R for Large Datasets (p. 147)
  • R Data Frames and Matrices (p. 148)
  • Strategies for Dealing with Large Datasets (p. 149)
  • Large Matrix Manipulation: bigmemory and biganalytics (p. 150)
  • ff: Working with Data Frames Larger than Memory (p. 151)
  • biglm: Linear Regression for Large Datasets (p. 152)
  • RHadoop: Accessing Apache Hadoop from R (p. 154)
  • Summary (p. 155)
  • 12 Building Analytics Workflows Using Python and Pandas (p. 157)
  • The Snakes Are Loose in the Data Zoo (p. 157)
  • Choosing a Language for Statistical Computation (p. 158)
  • Extending Existing Code (p. 159)
  • Tools and Testing (p. 160)
  • Python Libraries for Data Processing (p. 160)
  • NumPy (p. 160)
  • SciPy: Scientific Computing for Python (p. 162)
  • The Pandas Data Analysis Library (p. 163)
  • Building More Complex Workflows (p. 167)
  • Working with Bad or Missing Records (p. 169)
  • iPython: Completing the Scientific Computing Tool Chain (p. 170)
  • Parallelizing iPython Using a Cluster (p. 171)
  • Summary (p. 174)
  • VII Looking Ahead (p. 177)
  • 13 When to Build, When to Buy, When to Outsource (p. 179)
  • Overlapping Solutions (p. 179)
  • Understanding Your Data Problem (p. 181)
  • A Playbook for the Build versus Buy Problem (p. 182)
  • What Have You Already Invested In? (p. 183)
  • Starting Small (p. 183)
  • Planning for Scale (p. 184)
  • My Own Private Data Center (p. 184)
  • Understand the Costs of Open-Source (p. 186)
  • Everything as a Service (p. 187)
  • Summary (p. 187)
  • 14 The Future: Trends in Data Technology (p. 189)
  • Hadoop: The Disruptor and the Disrupted (p. 190)
  • Everything in the Cloud (p. 191)
  • The Rise and Fall of the Data Scientist (p. 193)
  • Convergence: The Ultimate Database (p. 195)
  • Convergence of Cultures (p. 196)
  • Summary (p. 197)
  • Index (p. 199)

Author notes provided by Syndetics

Michael Manoochehri is an entrepreneur, writer, and optimist. With the help of his many years of experience working with enterprise, research, and nonprofit organizations, his goal is to help make scalable data analytics more affordable and accessible. Michael has been a member of Google's Cloud Platform Developer Relations team, focusing on cloud computing and data developer products such as Google BigQuery. In addition, Michael has written for the tech blog ProgrammableWeb.com, has spent time in rural Uganda researching mobile phone use, and holds a master's degree in information management and systems from the University of California, Berkeley's School of Information.
