BigData (Hadoop, Spark & Memcached)

Overview

Apache Hadoop and Spark are gaining prominence in handling Big Data and analytics. Similarly, Memcached is becoming important for large-scale query processing in Web 2.0 environments. These middleware systems are traditionally written with sockets and do not deliver the best performance on datacenters with modern high-performance networks. In this tutorial, we will provide an in-depth overview of the architecture of Hadoop components (HDFS, MapReduce, RPC, HBase, etc.), Spark, and Memcached. We will examine the challenges in re-designing the networking and I/O components of these middleware systems to take advantage of modern RDMA-capable interconnects and protocols (such as InfiniBand, iWARP, RoCE, and RSocket) and modern storage architectures. Using the publicly available software packages in the High-Performance Big Data (HiBD, http://hibd.cse.ohio-state.edu) project, we will provide case studies of the new designs for several Hadoop/Spark/Memcached components and their associated benefits. Through these case studies, we will also examine the interplay between high-performance interconnects, storage systems (HDD and SSD), and multi-core platforms to achieve the best solutions for these components.

Journals (3)

1 N. Islam, X. Lu, M. W. Rahman, J. Jose, and D. K. Panda, A Micro-Benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Special Issue of LNCS on papers from WBDB '12 Workshop, May 2012.
2 D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, Characterizing and benchmarking stand-alone Hadoop MapReduce on modern HPC clusters, The Journal of Supercomputing - Springer, Jun 2016.
3 M. W. Rahman, N. Islam, X. Lu, and D. K. Panda, A Comprehensive Study of MapReduce over Lustre for Intermediate Data Placement and Shuffle Strategies on HPC Clusters, IEEE Transactions on Parallel and Distributed Systems, Jul 2016.

Conferences & Workshops (40)

Ph.D. Dissertations (1)

1 J. Jose, Designing High Performance and Scalable Unified Communication Runtime (UCR) for HPC and Big Data Middleware, Aug 2014

M.S. Thesis (1)

1 A. Bhat, RDMA-based Plugin Design and Profiler for Apache and Enterprise Hadoop Distributed Filesystem, Aug 2015