Big data is (clearly) a broadly defined and overused term. It’s been used to describe everything from general “information overload” to specific data mining and analytics to large-scale databases. In Internap’s hosting and cloud customer base, we see two main approaches to big data. To make better decisions about the infrastructure you need, you have to understand how these approaches differ and know where your own needs fall.
There is a haystack, go find needles
One class of big data can be thought of as the “needle in a haystack” type. In this scenario, you already have mountains of data, but only a very broad idea of the insights, analytics and interrelationships it might contain. Your goal is to crunch the data and find the relationships that give you insight into it over time. This type of static “big” data requires big backend processing power from technologies such as Hadoop. These applications tend to be mostly batch jobs with sporadic and often unpredictable infrastructure needs.
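To make the batch pattern concrete, here is a minimal sketch of a map-and-reduce style job in plain Python. It is not Hadoop code; the sample records and the function names (`map_pairs`, `reduce_counts`, `most_common_pairs`) are hypothetical and chosen only for illustration. Each record is mapped to partial counts, and the partial counts are then merged to surface the item pairs that appear together most often.

```python
from collections import Counter
from functools import reduce
from itertools import combinations

# Hypothetical sample data: each record lists items seen together in one event.
RECORDS = [
    ["coffee", "bagel"],
    ["coffee", "bagel", "juice"],
    ["tea", "bagel"],
    ["coffee", "juice"],
]


def map_pairs(record):
    """Map step: count each unordered pair of distinct items in one record."""
    return Counter(combinations(sorted(set(record)), 2))


def reduce_counts(left, right):
    """Reduce step: merge two partial pair counts into one."""
    return left + right


def most_common_pairs(records, top_n=2):
    """Run the map step over every record, then reduce to overall totals."""
    partials = map(map_pairs, records)
    totals = reduce(reduce_counts, partials, Counter())
    return totals.most_common(top_n)


if __name__ == "__main__":
    for pair, count in most_common_pairs(RECORDS):
        print(pair, count)
```

In a real deployment the map step would run in parallel across many machines, which is where the sporadic demand for large amounts of compute comes from.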
Massive real-time “big” database
The term “big data” is also used to describe the more mainstream, real-time database applications whose scale problems go well beyond the means of traditional SQL databases. Real-time big data technologies such as MongoDB and Cassandra deliver the scale and performance that modern scale-out applications need. Relational databases are often too limiting for large amounts of unstructured data. NoSQL and key-value databases are better suited to the task, but they require high-performance storage, high IOPS and the ability to rapidly scale in place. These requirements are vastly different from those of the data-crunching, needle-in-a-haystack type of big data, yet the same term is often used to describe both.
The performance question
Performance matters in the first type of big data too, but it means something different than it does in the real-time database scenario. For large data-mining applications, real-time data insertion isn’t as important, because you already have the data. What matters here is how quickly you can extract and process the data, and how fast is fast enough depends on the type of data you are mining and its business application. With that said, the type of infrastructure has a big impact on how long it takes to process your “big data” job. If you can reduce the processing time from three days to two days thanks to a more powerful cloud infrastructure, that can change how you define your business model.
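The three-days-to-two-days example can be worked through with a few lines of arithmetic. The sketch below is illustrative only: the function name `batch_speedup` and the 30-day window are assumptions, and it assumes jobs run back to back with no idle time in between.

```python
def batch_speedup(old_days, new_days, window_days=30):
    """Compare two batch run times over a fixed window (assumed 30 days)."""
    if old_days <= 0 or new_days <= 0:
        raise ValueError("run times must be positive")
    return {
        # Share of the original run time that is saved.
        "time_saved_pct": (old_days - new_days) / old_days * 100,
        # How many times faster the new run is.
        "speedup": old_days / new_days,
        # Complete back-to-back runs that fit in the window.
        "runs_per_window_old": window_days // old_days,
        "runs_per_window_new": window_days // new_days,
    }


if __name__ == "__main__":
    for name, value in batch_speedup(3, 2).items():
        print(f"{name}: {value}")
```

Under these assumptions, going from three days to two saves about a third of the run time, a 1.5x speedup, and raises the number of complete runs in 30 days from 10 to 15.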
For real-time big database applications, I/O becomes critical. For example, mobile advertising technology companies require real-time data insertion and performance in order to capture the right data at the right time and subsequently deliver timely, relevant ads. What really happens when millions of users simultaneously “check in” at their favorite restaurants and then at the movies via a social media mobile app? Capturing this information relies on real-time data insertion, while quickly processing and learning from it relies on compute performance. The ads you see are formulated and delivered based on your real-time location information, behavior patterns and preferences. Dynamic, real-time data requires high I/O storage and superior compute performance in order to provide such targeted ads.
The term “big data” covers both the proverbial needle-in-a-haystack backend processing and modern, real-time database applications. Once you understand the distinct qualities of each type, you can make better decisions about the infrastructure and IaaS (Infrastructure-as-a-Service) models that fit one versus the other. Your organization likely has both types of “big data” challenges. Talk to Internap to find out how we can help you meet the needs of both.