TPCx-BB New Data Analytics and Machine Learning Benchmark


The Transaction Processing Performance Council (TPC) has released a new data analytics and machine learning benchmark that measures the real-world performance of Hadoop-based systems, including MapReduce, Apache Hive, and the Apache Spark Machine Learning Library (MLlib).

Called TPCx-BB and available for download from the TPC site, the benchmark executes queries that retail companies frequently run for customer behavior analytics.

The TPCx-BB (BB stands for “Big Benchmark”) is designed to incorporate complex customer analytical requirements of retailers. Whereas online retailers have historically recorded only completed customer transactions, today deeper insight is needed into consumer behavior, with relatively straightforward shopping basket analysis replaced by detailed behavior modeling. According to the TPC, the benchmark compares various analytics solutions in a real-world scenario, providing performance-vs.-cost tradeoffs.

The benchmark tests various data management primitives (such as selects, joins and filters) and functions. Where necessary, it uses procedural programs written in Java, Scala and Python. For use cases that require machine learning, the benchmark uses Spark MLlib to invoke machine learning algorithms, passing them an input dataset prepared during the data management phase.
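To make the data management primitives concrete, here is a minimal sketch in plain Python of a filter, a join and a select producing the kind of per-customer feature rows that could be handed to a machine learning step. The tables, column names and the `build_features` helper are hypothetical illustrations, not part of the TPCx-BB kit, which runs such work on Hive or Spark rather than in-process.

```python
from collections import defaultdict

# Hypothetical toy tables; the real benchmark generates far larger retail data.
customers = [
    {"customer_id": 1, "segment": "online"},
    {"customer_id": 2, "segment": "store"},
    {"customer_id": 3, "segment": "online"},
]
sales = [
    {"customer_id": 1, "item": "shoes", "amount": 80.0},
    {"customer_id": 1, "item": "socks", "amount": 5.0},
    {"customer_id": 2, "item": "hat", "amount": 20.0},
    {"customer_id": 3, "item": "shoes", "amount": 95.0},
    {"customer_id": 3, "item": "belt", "amount": 0.0},
]


def build_features(customers, sales, segment):
    """Filter, join and aggregate into per-customer feature rows."""
    # Filter: keep only customers in the requested segment.
    wanted = {c["customer_id"] for c in customers if c["segment"] == segment}
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in sales:
        # Filter out zero-value rows, then join on customer_id.
        if row["amount"] > 0 and row["customer_id"] in wanted:
            totals[row["customer_id"]] += row["amount"]
            counts[row["customer_id"]] += 1
    # Select: project only the columns a downstream model would consume.
    return [
        {"customer_id": cid, "total_spend": totals[cid], "num_purchases": counts[cid]}
        for cid in sorted(totals)
    ]


features = build_features(customers, sales, "online")
```

In a real deployment the resulting rows would be written out by the data management step and read by the machine learning library as its input dataset.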

The benchmark exercises the compute, I/O, memory and efficiency of various Hadoop software stacks (Hive, MapReduce, Spark, Tez) and runs tasks resembling applications developed by an end-user with a cluster deployed in a datacenter, providing realistic usage of cluster resources.


Other phases of the benchmark include:

Load: tests how fast raw data can be read from the distributed file system and permuted by applying various optimizations, such as compression and data formats (ORC, text, Parquet).

Power: tests the system using short-running jobs with less demand on cluster resources, and long-running jobs with high demand on resources.

Throughput: tests the efficiency of cluster resources by simulating a mix of short and long-running jobs, executed in parallel.
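The difference between the Power and Throughput phases comes down to how jobs are scheduled: one at a time versus several in parallel. The sketch below illustrates that idea only. The job names, durations, `run_job` stand-in and stream count are all assumptions made for illustration; the actual benchmark driver and its scoring are not reproduced here.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_job(name, seconds):
    """Stand-in for a benchmark query; sleeps instead of doing real work."""
    time.sleep(seconds)
    return name


# Hypothetical job mix: (name, simulated duration in seconds).
JOBS = [("short_q1", 0.01), ("long_q2", 0.05), ("short_q3", 0.01), ("long_q4", 0.05)]


def power_run(jobs):
    """Run jobs one after another and record each job's elapsed time."""
    timings = {}
    for name, seconds in jobs:
        start = time.perf_counter()
        run_job(name, seconds)
        timings[name] = time.perf_counter() - start
    return timings


def throughput_run(jobs, streams):
    """Run the same jobs concurrently; return (finished names, wall time)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=streams) as pool:
        finished = list(pool.map(lambda job: run_job(*job), jobs))
    return finished, time.perf_counter() - start


power_timings = power_run(JOBS)
finished, wall_time = throughput_run(JOBS, streams=4)
```

The sequential run yields one timing per job, while the parallel run yields a single wall-clock time for the whole mix, which is what makes it a measure of how efficiently shared cluster resources are used.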

For the record, according to HPE, the 12-node ProLiant cluster used in the first test run of the benchmark had three master/management nodes and nine worker nodes, running a RHEL 6.x OS and a CDH 5.x Hadoop distribution. It ran a dataset of about 3TB. Comparing current- versus previous-generation ProLiant servers, HPE reported a 27 percent performance gain and a cost reduction of 9 percent.
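The two HPE figures can be combined into a rough price-performance comparison. The sketch below assumes that the 27 percent gain is a higher benchmark score and that the 9 percent reduction applies to the total cost of the same pair of configurations. Neither assumption is confirmed by the figures quoted above, and the function name is a hypothetical one chosen for illustration.

```python
def cost_per_performance_ratio(performance_gain, cost_reduction):
    """New cost per unit of performance, relative to the old system (1.0)."""
    return (1.0 - cost_reduction) / (1.0 + performance_gain)


# Figures quoted from HPE: 27 percent faster, 9 percent cheaper.
ratio = cost_per_performance_ratio(0.27, 0.09)
improvement_pct = (1.0 - ratio) * 100
```

Under those assumptions the newer servers deliver each unit of performance at about 72 percent of the earlier cost, an improvement of roughly 28 percent.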
