Industry-Standard Benchmarks for Big Data Platforms


A mature data management industry based primarily on relational database management system (RDBMS) technology has been established over the last two decades. However, the emergence of the Big Data phenomenon, characterized by the 3Vs (Volume, Velocity, and Variety) of data and by agile development of data-driven applications, has introduced a new set of challenges, and a variety of technologies has emerged to address them. Benchmarks provide a method for comparing the performance of systems and are often used to evaluate the suitability of those systems for procurement. The advent of new techniques and technologies for Big Data creates the imperative for industry-standard benchmarks for evaluating such systems.

Big Data systems are characterized by their flexibility in processing diverse data genres, such as text, images, video, geo-locations, and sensor data, using a variety of methods. Because of the many sources of Big Data and the many methods for analyzing it, no single benchmark can characterize all use cases. However, a study of several Big Data platform use cases indicates that most workloads are composed of a common set of stages, which capture the variety of data genres and algorithms commonly used to implement most data-intensive end-to-end workloads.
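To make the idea of an end-to-end workload built from common stages concrete, here is a minimal sketch in Python. The stage names (acquire, cleanse, transform, analyze) and the toy data are hypothetical illustrations, not taken from any benchmark specification:

```python
# Illustrative sketch only: an end-to-end workload modeled as a
# pipeline of common stages. Stage names and data are hypothetical.
from typing import Callable, List

Stage = Callable[[list], object]

def make_pipeline(stages: List[Stage]) -> Stage:
    """Compose individual stages into one end-to-end workload."""
    def run(data: list) -> object:
        for stage in stages:
            data = stage(data)
        return data
    return run

# Hypothetical stages over a toy "sensor/log" data genre
acquire   = lambda d: d + ["sensor:42", "log: error at node-3"]
cleanse   = lambda d: [r.strip().lower() for r in d if r]
transform = lambda d: [r.split(":", 1) for r in d]
analyze   = lambda d: {k.strip(): v.strip() for k, v in d}

workload = make_pipeline([acquire, cleanse, transform, analyze])
result = workload([])  # {'sensor': '42', 'log': 'error at node-3'}
```

The point is only structural: different applications plug different data genres and algorithms into the stages, while a benchmark can time the composed pipeline end to end.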

Beginning in late 2011, the Center for Large-scale Data Systems Research (CLDS) at the San Diego Supercomputer Center (SDSC), in collaboration with several industry players, initiated a community activity in Big Data benchmarking. The goal was to define reference benchmarks that capture the essence of Big Data application scenarios and to help characterize and understand hardware and system performance and the price-to-performance ratio of Big Data platforms. Founding members of this benchmarking initiative include Dr. Chaitan Baru (CLDS), Raghunath Nambiar (Cisco), Meikel Poess (Oracle), and Tilmann Rabl (University of Toronto), in addition to Greenplum, a division of EMC.

As a result of these initial activities, a Workshop Series on Big Data Benchmarking (WBDB) was organized, sponsored by the National Science Foundation. These workshops and associated meeting series validated the initial ideas for a Big Data benchmark to include definitions of the data along with a data generation procedure; a representative workload for emerging Big Data applications; and a set of metrics, run rules and full disclosure reports for fair comparisons of technologies and platforms.
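As a sketch of the kind of metrics such a benchmark produces, the snippet below computes a TPC-style throughput figure and its price-to-performance counterpart. The function names, units, and all numbers are hypothetical illustrations, not values from the BigData Top100 specification:

```python
# Illustrative sketch only: TPC-style performance and price/performance
# metrics. Names, units, and figures are hypothetical.

def throughput(queries_completed: int, elapsed_hours: float) -> float:
    """Performance metric: workload queries completed per hour."""
    return queries_completed / elapsed_hours

def price_performance(system_price_usd: float, perf: float) -> float:
    """Price-to-performance: dollars per unit of throughput (lower is better)."""
    return system_price_usd / perf

perf = throughput(queries_completed=300, elapsed_hours=2.0)      # 150.0 queries/hour
ppp = price_performance(system_price_usd=450_000.0, perf=perf)   # 3000.0 $ per query/hour
```

Run rules and full disclosure reports then ensure that two vendors reporting these numbers measured them under comparable conditions.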

A formal specification of this benchmarking suite is underway and will be announced at O’Reilly’s Strata Conference on February 28, 2013, in a session I will conduct with Chaitan Baru. We will be unveiling the current status of our effort in a Big Data Top 100 List, and we encourage you to participate in this community-based endeavor to define an end-to-end, application-layer benchmark for Big Data applications.
