Data Just Right: Introduction to Large-Scale Data & Analytics
Michael Manoochehri - May 14, 2015

A data pipeline requires the coordination of a collection of different technologies for different parts of a data life cycle.

This article is an excerpt from Data Just Right: Introduction to Large-Scale Data & Analytics, a completely practical guide for every Big Data decision-maker, implementer, and strategist. Michael Manoochehri, a former Google engineer and data hacker, writes for people who need practical solutions that can be implemented with limited resources and time. Drawing on his extensive experience, he helps you focus on building applications rather than infrastructure, because that's where you can derive the most value. Manoochehri shows how to address each of today's key Big Data use cases in a cost-effective way by combining technologies in hybrid solutions.

Build Systems That Can Share Data (on the Internet)

For public data to be useful, it must be accessible. The technological choices made when designing systems to deliver this data depend completely on the intended audience.

Consider the task of a government making public data more accessible to citizens. To make the data as accessible as possible, data files should be hosted on a scalable system that can serve many users at once. Data formats should be chosen that are easy for researchers and reporters to work with. Perhaps an API should be created to allow developers to query the data programmatically. And of course, it is most advantageous to build a web-based dashboard that lets people ask questions about the data without having to do any processing themselves.
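To make that last point concrete, here is a minimal sketch of what such a programmatic query API could look like, assuming Python with Flask and a hypothetical CSV file of published figures; the endpoint path, file name, and field names are illustrative choices rather than anything prescribed by the book.

```python
# A minimal sketch of a programmatic query API for a public dataset.
# Flask, the CSV file name, and the column names are assumptions made
# for illustration only.
import csv

from flask import Flask, jsonify, request

app = Flask(__name__)

def load_rows(path="public_spending.csv"):
    # Hypothetical open-data file: one row per agency per fiscal year.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

@app.route("/api/v1/spending")
def spending():
    # Let developers filter by year instead of downloading and
    # parsing the entire file themselves.
    year = request.args.get("year")
    rows = load_rows()
    if year is not None:
        rows = [row for row in rows if row.get("year") == year]
    return jsonify(rows)

if __name__ == "__main__":
    app.run(port=8000)
```

A web-based dashboard could then be built on top of the same endpoint, so that non-developers get answers without ever touching the raw files.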
In other words, making data truly accessible to a public audience takes more effort than simply uploading a collection of XML files to a privately run server. Unfortunately, this type of "solution" still happens more often than it should. Systems should be designed to share data with the intended audience.

This concept extends to the private sphere as well. In order for organizations to take advantage of the data they have, employees must be able to ask questions of it themselves. In the past, organizations turned to the data warehouse approach in an attempt to merge everything into a single, manageable space. Now, becoming a data-driven organization might mean simply keeping data in whatever silo is the best fit for the use case and building tools that can glue the different systems together. In this case, the focus is more on keeping data where it works best and finding ways to share and process it when the need arises.

Build Solutions, Not Infrastructure

With apologies to true ethnographers everywhere, my observations of the natural world of the wild software developer have uncovered an amazing finding: software developers usually hope to build cool software, and don't want to spend their time installing hard drives or operating systems, or worrying about that malfunctioning power supply in the server rack.

Affordable infrastructure-as-a-service technology (inevitably named using every available spin on the concept of "clouds") has allowed developers to worry less about hardware and instead focus on building web-based applications on platforms that can scale to large numbers of users on demand.

As soon as your business requirements involve purchasing, installing, and administering physical hardware, I would recommend treating this as a sign that you have hit a roadblock. Whatever business or project you are working on, my guess is that if you are interested in solving data challenges, your core competency is not necessarily in building hardware.

There are a growing number of companies that specialize in providing infrastructure as a service. Some provide fully featured virtual servers that run on hardware managed in huge data centers and are accessed over the Internet. Despite new paradigms in the infrastructure-as-a-service industry, the mainframe business, such as that embodied by IBM, is still alive and well. Some companies sell or lease in-house equipment and provide both administration via the Internet and physical maintenance when necessary.

This is not to say that there are no caveats to using cloud-based services. Just like everything featured in this book, there are trade-offs to building on virtualized infrastructure, as well as critical privacy and compliance implications for users. However, it's becoming clear that buying and building applications hosted in the cloud should be considered the rule, not the exception.

Focus on Unlocking Value from Your Data

When working with developers implementing a massive-scale data solution, a common mistake I have noticed is that solution architects will start with the technology first, then work their way backwards to the problem they are trying to solve. There is nothing wrong with exploring various types of technology, but in terms of making investments in a particular strategy, always keep in mind the business question that your data solution is meant to answer.

This compulsion to focus on technology first is the driving motivation for people to completely disregard relational database management systems because of "NoSQL" database hype, or to start worrying about collecting massive amounts of data when the answer to a question can be found by statistical analysis of 10,000 data points.

Time and time again, I've observed that the key to unlocking value from data is to clearly articulate the business questions that you are trying to answer. Sometimes, the answer to a perplexing data question can be found with a small sample of data, using common desktop business productivity tools. Other times, the problem is more political than technical: overcoming the inability of administrators across different departments to break down data silos can be the true challenge.

Collecting massive amounts of data by itself doesn't provide any magic value to your organization. The real value in data comes from understanding pain points in your business, asking practical questions, and using the answers and insights gleaned to support decision making.

Anatomy of a Big Data Pipeline

In practice, a data pipeline requires the coordination of a collection of different technologies for different parts of a data life cycle.

Let's explore a real-world example: the common challenge of collecting and analyzing data from a web-based application that aggregates input from many users. For this type of application to handle data from thousands, or even millions, of users at a time, it must be highly available. Whatever database is used, the primary design goal of the data collection layer is that it can handle input without becoming too slow or unresponsive. In this case, a key-value data store such as MongoDB, Redis, Amazon's DynamoDB, or Google's App Engine Datastore might be the best solution.
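As a rough illustration of what this collection layer might look like, the sketch below appends incoming user events to Redis using the redis-py client; the key scheme and event fields are assumptions made for the example, not a design recommended in the excerpt.

```python
# A rough sketch of a write-optimized collection layer backed by Redis.
# The redis-py client, key scheme, and event fields are illustrative
# assumptions.
import json
import time

import redis

client = redis.Redis(host="localhost", port=6379)

def record_event(user_id, payload):
    # Append one raw event; each write is a single cheap operation,
    # so the layer stays responsive under heavy concurrent input.
    event = {"user": user_id, "ts": time.time(), "data": payload}
    day_key = "events:" + time.strftime("%Y-%m-%d")
    client.rpush(day_key, json.dumps(event))

# Example: one click event from one of many concurrent users.
record_event("user-42", {"page": "/signup", "action": "click"})
```

A periodic job could later drain these per-day lists into the slower but consistent layer described next.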
While this data is constantly streaming in and being updated, it's useful to have a cache, or "source of truth." This cache may be less performant, and perhaps only needs to be updated at intervals, but it should provide consistent data when required. This layer could also be used to provide data snapshots in formats that are interoperable with other data software or visualization systems. These might be flat files in a scalable cloud-based storage service, or a relational database back end. In some cases, developers have built the collection layer and the cache from the same software; in other cases, this layer is a hybrid of relational and non-relational database management systems.

Finally, in an application like this, it's important to provide a mechanism for asking aggregate questions about the data. Software that provides quick, near-real-time analysis of huge amounts of data is often designed very differently from databases meant to collect data from thousands of users over a network.

In between these stages of the data pipeline, data may also need to be transformed. For example, data collected from a web front end may need to be converted into XML files to be interoperable with another piece of software. Or it may need to be transformed into JSON, or into a data serialization format such as Thrift, to make moving the data as efficient as possible. In large-scale data systems, transformations are often too slow to take place on a single machine. Much as with scalable database software, transformations are often best implemented using distributed computing frameworks such as Hadoop.

In the era of Big Data trade-offs, building a data life cycle that can scale to massive amounts of data requires specialized software for different parts of the pipeline.

This excerpt from Data Just Right: Introduction to Large-Scale Data & Analytics is reprinted with permission from the publisher.