Big data is the buzzword in the IT industry these days. While traditional data warehousing involves terabytes of human-generated transactional data to record facts, big data involves petabytes of human and machine-generated data to harvest facts. Big data becomes supremely valuable when it is captured, stored, searched, shared, transferred, deeply analyzed and visualized.
The platform that is frequently cited as the enabler for all of these things is Hadoop, the open source project from Apache that has become the major technology movement for big data. Hadoop has emerged as the preferred way to handle massive amounts of not only structured data, but also complex petabytes of semi-structured and unstructured data generated daily by humans and machines.
The major components of Hadoop include Hadoop Distributed File System (HDFS) as well as implementation of MapReduce. HDFS distributes and replicates ﬁles across a cluster of standardized computers/servers. MapReduce parses the data into workable portions across the cluster, so they can be concurrently processed based on a map function configured by the user. Hadoop relies on each compute node to process its own chunk of data allowing for efficient “scaling-out” without degrading performance.
Hadoop’s popularity is largely due to its ability to store, analyze and access large amounts of data, quickly and cost effectively across these clusters of commodity hardware. Some use cases include digital marketing automation, fraud detection and prevention, social network and relationship analysis, predictive modeling for new drugs, retail in-store behavior analysis, mobile device location-based marketing within an almost endless variety of verticals. Although Hadoop is not considered a direct replacement for traditional data warehouses, it enhances enterprise data architectures with potential for deep analytics to attain true value big data.
When building and deploying big data solutions with scale-out architecture, cloud is a natural consideration. The value of a virtualized IaaS solution, like our own AgileCLOUD is clear – configuration options are extensive, provisioning is fast and easy, and the use cases are wide-ranging. When considering hosting solutions for Hadoop deployments, shared public cloud architectures usually have performance trade-offs to reach scale, such as I/O bottlenecks that can arise when MapReduce workloads scale. Moreover, virtualization and shared tenancy can impact CPU and RAM performance. Purchasing larger and larger virtual instances or additional services to reach higher IOPS to compensate for those bottlenecks can get expensive and/or lack the desired results.
Hence the beauty of on demand bare metal cloud solutions for many resource intensive use cases: Disks are local and can be configured with SSDs to achieve higher IOPS. RAM and storage are fully dedicated and server nodes can be provisioned and deprovisioned programmatically depending on demand. Depending on the application and use case, a single bare-metal server can support greater workloads than multiple similarly sized VMs. Under the right circumstances, the use of both virtualized and bare metal server nodes can yield significant cost savings and better performance.