In our previous video, we described big data in terms of volume, velocity, and variety of information and then looked at some use cases of big data.
In this video, we’ll discuss the basic frameworks for big data implementations.
In general, big data implementations must support three major core capabilities:
- Store – First, it must be able to store all the data: the software and infrastructure necessary to capture and persist high volumes of information.
- Process – Next, it must be able to process the stored data, with the compute power to organize, enrich, and analyze it.
- Access – Lastly, a big data framework must provide access to all of your data, so it can be retrieved, searched, integrated, and visualized when and how required.
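The three capabilities above can be sketched as a single, minimal in-memory class. This is an illustration only; the class and method names are invented for this example and don't correspond to any specific product.

```python
# A toy sketch of store / process / access as one in-memory object.
# Names (BigDataStore, records) are illustrative, not from any product.

class BigDataStore:
    def __init__(self):
        self.records = []

    def store(self, record):
        """Capture and persist an incoming record."""
        self.records.append(record)

    def process(self, enrich):
        """Organize/enrich every stored record."""
        self.records = [enrich(r) for r in self.records]

    def access(self, predicate):
        """Retrieve the records matching a search predicate."""
        return [r for r in self.records if predicate(r)]

# Usage: store two events, enrich them, then query.
db = BigDataStore()
db.store({"event": "click", "user": "a"})
db.store({"event": "view", "user": "b"})
db.process(lambda r: {**r, "seen": True})
clicks = db.access(lambda r: r["event"] == "click")
```

A real deployment distributes each of these responsibilities across many machines, but the division of labor is the same.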
Along with these capabilities, a big data architecture integrates them across three major layers of building blocks.
Infrastructure services – The foundation of every big data architecture project is the infrastructure. Innovations over the last 10 years – including infrastructure services APIs, open source configuration management software, and the wide adoption of virtualization – have enabled more efficient deployment of servers, storage, and networking.
Essentially, what used to take months from procurement to configuration to load and deployment now takes minutes. In addition, data management software – which we’ll talk about more in a second – allows projects to be reliably coordinated across multiple commodity servers.
Scaling out commodity servers rather than scaling up expensive, customized, brand-name appliances is obviously a much less expensive proposition per bit of data captured and analyzed.
When considering hosting solutions for big data deployments, multi-tenant public cloud architectures usually involve performance trade-offs at scale. The virtual, shared, and oversubscribed nature of multi-tenant public clouds can lead to noisy-neighbor problems that degrade the performance of your big data workloads.
To alleviate such problems for your big data workloads, a good alternative is to build out a dedicated infrastructure with bare-metal server nodes, for several significant reasons. First, bare-metal servers provide fully dedicated compute resources for your big data workloads, eliminating the noisy-neighbor problem of multi-tenant environments. Second, bare-metal servers can be deployed in a flexible, cloud-like model, meaning they can be provisioned and de-provisioned quickly, depending on demand. And lastly, bare-metal solutions provide fully dedicated storage, meaning that all disks are local and can be configured with SSDs to achieve higher IOPS for the horizontally scalable, distributed data management services, which is what we will discuss next.
Data management services – This layer provides horizontally scalable, distributed data management services on top of the infrastructure services layer. Three types of technologies work together to manage big data in this layer.
First, data stream processing technology enables the filtering and capturing of high-velocity information streams using parallel processing. Second, a distributed file management system, such as the Hadoop Distributed File System (HDFS), handles routine file operations using a flexible array of storage and processing nodes to provide fault tolerance and scalability. And lastly, NoSQL databases trade off strict integrity guarantees for theoretically unlimited scalability while retaining flexibility.
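The stream-filtering idea can be illustrated with a small generator pipeline: events are filtered one at a time as they arrive, without buffering the whole stream. The event fields and the toy feed are invented for this sketch; real systems run many such pipelines in parallel.

```python
# A toy sketch of stream filtering: capture only events of interest
# from a continuous feed, record by record. Fields are hypothetical.

def event_stream():
    # Stand-in for a live feed; in practice this would be a socket,
    # message queue, or log shipper.
    for i in range(10):
        yield {"id": i, "severity": "high" if i % 3 == 0 else "low"}

def filter_stream(stream, predicate):
    # Lazily filter the stream without materializing it in memory.
    for event in stream:
        if predicate(event):
            yield event

captured = list(filter_stream(event_stream(),
                              lambda e: e["severity"] == "high"))
```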
In terms of flexibility, NoSQL databases eliminate the rigid schema of relational databases, allowing you to adapt to evolving data capture and management needs.
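That schema flexibility can be shown with a document-style collection, sketched here using plain Python dictionaries: records in the same collection carry different fields, so data capture can evolve without a migration. The field names are illustrative.

```python
# A document-style collection with no declared schema: records may
# carry different fields, unlike rows in a relational table.
# Field names and values are invented for illustration.

users = []  # one collection, no fixed columns

# An early record captured only a name.
users.append({"id": 1, "name": "Ada"})
# Later records added fields as capture needs evolved.
users.append({"id": 2, "name": "Grace", "email": "grace@example.com"})
users.append({"id": 3, "name": "Alan", "tags": ["ml", "crypto"]})

# Queries simply tolerate missing fields.
with_email = [u for u in users if "email" in u]
```

A relational table would have required adding `email` and `tags` columns up front, or altering the schema later.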
The third layer of the architectural building blocks of big data sits on top of the data management layer: a class of middleware that leverages it to conduct query, analysis, and some transactional processing. Examples include Pig, a data flow language, and Hive, a data warehouse framework, both part of the Hadoop ecosystem.
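The kind of query this middleware layer expresses – for example, a Hive-style GROUP BY – can be sketched in plain Python as a group-and-aggregate over records. The page-view data here is invented for illustration; Hive would express the same intent declaratively and run it across the cluster.

```python
# A group-and-aggregate over records, the kind of operation a Hive
# query or Pig script expresses at the middleware layer.
# The sample data is hypothetical.

from collections import defaultdict

page_views = [
    {"page": "/home", "bytes": 120},
    {"page": "/docs", "bytes": 300},
    {"page": "/home", "bytes": 80},
]

# Equivalent in spirit to:
#   SELECT page, SUM(bytes) FROM page_views GROUP BY page;
totals = defaultdict(int)
for row in page_views:
    totals[row["page"]] += row["bytes"]
```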
In our next and final big data video, Brian Bulkowski at Aerospike will introduce how various data management and processing tools are used in a NoSQL big data deployment.
Watch next: Aerospike discusses NoSQL for Big Data