TPCx-BB New Data Analytics and Machine Learning Benchmark


A new data analytics and machine learning benchmark has been released by the Transaction Processing Performance Council (TPC) measuring real-world performance of Hadoop-based systems, including MapReduce, Apache Hive, and Apache Spark Machine Learning Library (MLlib).

Called the TPCx-BB benchmark and downloadable at the TPC site, it executes queries frequently performed by companies in the retail industry running customer behavior analytics.

The TPCx-BB (BB stands for “Big Benchmark”) is designed to incorporate complex customer analytical requirements of retailers. Whereas online retailers have historically recorded only completed customer transactions, today deeper insight is needed into consumer behavior, with relatively straightforward shopping basket analysis replaced by detailed behavior modeling. According to the TPC, the benchmark compares various analytics solutions in a real-world scenario, providing performance-vs.-cost tradeoffs.

The benchmark tests various data management primitives – such as selects, joins and filters – and functions. Where necessary, it uses procedural programs written in Java, Scala and Python. For use cases requiring machine learning techniques, the benchmark uses Spark MLlib, invoking machine learning algorithms on an input dataset prepared during the data management phase.
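As a rough illustration of how select, join and filter primitives can prepare an input dataset for a machine learning step, here is a minimal sketch in plain Python. It is not benchmark code: the table names, columns and values are hypothetical, and in the benchmark itself this work would be done by engines such as Hive or Spark rather than in-process Python.

```python
# Hypothetical retail data; names and values are illustrative only.
sales = [
    {"customer_id": 1, "item_id": 10, "quantity": 2},
    {"customer_id": 1, "item_id": 11, "quantity": 1},
    {"customer_id": 2, "item_id": 10, "quantity": 5},
    {"customer_id": 3, "item_id": 12, "quantity": 1},
]
items = {
    10: {"category": "books", "price": 12.0},
    11: {"category": "music", "price": 8.0},
    12: {"category": "books", "price": 20.0},
}


def join_sales_with_items(sales_rows, item_table):
    """Join: attach item attributes to each sale via item_id."""
    return [
        {**row, **item_table[row["item_id"]]}
        for row in sales_rows
        if row["item_id"] in item_table
    ]


def filter_category(rows, category):
    """Filter: keep only rows in the given item category."""
    return [r for r in rows if r["category"] == category]


def spend_per_customer(rows):
    """Select and aggregate: total spend per customer."""
    totals = {}
    for r in rows:
        spend = r["quantity"] * r["price"]
        totals[r["customer_id"]] = totals.get(r["customer_id"], 0.0) + spend
    return totals


joined = join_sales_with_items(sales, items)
book_sales = filter_category(joined, "books")
features = spend_per_customer(book_sales)
print(features)
```

The resulting per-customer totals stand in for the kind of prepared dataset that a clustering or classification algorithm would then consume.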

The benchmark exercises the compute, I/O, memory and efficiency of various Hadoop software stacks (Hive, MapReduce, Spark, Tez) and runs tasks resembling applications developed by an end-user with a cluster deployed in a datacenter, providing realistic usage of cluster resources.

Other phases of the benchmark include:

Load: tests how fast raw data can be read from the distributed file system and permuted by applying various optimizations, such as compression and different data formats (ORC, text, Parquet).

Power: tests the system using short-running jobs with less demand on cluster resources, and long-running jobs with high demand on resources.

Throughput: tests the efficiency of cluster resources by simulating a mix of short and long-running jobs, executed in parallel.
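The difference between the power and throughput phases can be sketched in Python as follows. This is a simplified illustration, not the benchmark driver: `run_query` is a hypothetical stand-in for submitting a job to the cluster, and the sketch assumes that the power phase runs jobs one at a time while the throughput phase runs several identical streams side by side, details the description above does not spell out.

```python
from concurrent.futures import ThreadPoolExecutor


def run_query(query_id):
    """Hypothetical stand-in for submitting one benchmark job to the cluster."""
    return f"query-{query_id}: done"


def power_run(query_ids):
    """Power-style run: execute the jobs one after another in a single stream."""
    return [run_query(q) for q in query_ids]


def throughput_run(query_ids, streams):
    """Throughput-style run: several streams execute concurrently.

    Assumption: every stream runs the same list of jobs.
    """
    with ThreadPoolExecutor(max_workers=streams) as pool:
        futures = [pool.submit(power_run, query_ids) for _ in range(streams)]
        return [f.result() for f in futures]


queries = [1, 2, 3]
power_results = power_run(queries)
throughput_results = throughput_run(queries, streams=2)
print(power_results)
print(len(throughput_results), "concurrent streams completed")
```

In a real run, the interesting measurement is elapsed time per phase; the sketch only shows the serial versus parallel structure.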

For the record, according to HPE, the 12-node ProLiant cluster used in the first test run of the benchmark had three master/management nodes and nine worker nodes running the RHEL 6.x OS and the CDH 5.x Hadoop distribution. It ran a dataset of about 3TB. Comparing current- versus previous-generation ProLiant servers, HPE reported a 27 percent performance gain and a cost reduction of 9 percent.
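To see what those two figures could mean together, the short Python calculation below combines them into a relative cost per unit of performance. It assumes that both percentages are measured against the same previous-generation baseline and that the 9 percent refers to total system cost; HPE's actual price-performance calculation is not given here, so treat the result as a back-of-the-envelope estimate only.

```python
def relative_cost_per_performance(perf_gain, cost_reduction):
    """Return new cost-per-performance as a fraction of the old one.

    Assumption: both inputs are fractional changes against the same baseline.
    """
    new_performance = 1.0 + perf_gain      # e.g. 1.27x the old performance
    new_cost = 1.0 - cost_reduction        # e.g. 0.91x the old cost
    return new_cost / new_performance


ratio = relative_cost_per_performance(0.27, 0.09)
print(f"Relative cost per unit of performance: {ratio:.3f}")
print(f"Improvement: {(1.0 - ratio) * 100:.1f} percent")
```

Under those assumptions, the newer servers would deliver each unit of performance at roughly 72 percent of the earlier cost.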

New App Container Tools from CoreOS and Puppet

The expanding application container and micro-services infrastructure got another boost this week with the introduction of a new set of tools for managing distributed software used to orchestrate micro-services.

CoreOS announced a new open source distributed storage system designed to provide scalable storage to clusters orchestrated by the Kubernetes container management platform.

Puppet, the IT automation specialist based in Portland, Ore., recently released a suite of tools under the codename Project Blueshift that provides modules for running container software from CoreOS, Docker and Mesosphere, along with the Kubernetes cluster manager. This week it released a new set of Docker images, available on the Docker Hub, for running its software.

Blueshift software tools can now be deployed and run on top of Docker. Running within the application container platform makes it easier to scale Puppet.

Puppet also announced a new agent to manage Linux virtual machines running on IBM z Systems and LinuxONE platforms. In addition, it announced new modules for IBM WebSphere application and integration middleware, along with a module supporting Cisco Systems’ line of Nexus switches. The modules are intended to automate IT management while speeding application deployment across hybrid cloud infrastructure.

The IBM WebSphere module is available now, and the new agent, with packages supporting Red Hat Enterprise Linux 6 along with SUSE Linux Enterprise Server 11 and 12, will be available later this summer.

Meanwhile, San Francisco-based CoreOS rolled out a new open source distributed storage effort this week designed to address persistent storage in container clusters. The company said its Torus distributed storage platform aims to deliver scalable storage for container clusters orchestrated by the Kubernetes container manager. A prototype version of Torus is available on GitHub.

CoreOS said Torus aims to solve common storage issues associated with running distributed applications. “While it is possible to connect legacy storage to container infrastructure, the mismatch between these two models convinced us that the new problems of providing storage to container clusters warranted a new solution,” the company noted in a statement announcing the open source storage effort.

Operating on the premise that large clusters of application containers require persistent storage, CoreOS argues that storage for clusters of lightweight virtual machines must be uniformly available across a network as processing shifts among containers.

Torus runs on etcd, the CoreOS distributed key-value store used to store data across a cluster of machines. That storage building block runs in “thousands” of production deployments, CoreOS claims. Building on it allows Torus to zero in on custom persistent storage configurations. Torus is also designed as a building block for delivering different types of storage, including distributed block devices or large object storage.