Big Data and Google's Three Papers I - GFS and MapReduce

Big data is a fairly new concept that came up only several years ago. It emerged along with three papers from Google: Google File System (2003), MapReduce (2004), and BigTable (2006).

Today I want to talk about some of my observations and understanding of the three papers, their impact on the open source big data community, particularly the Hadoop ecosystem, and their positions in the big data area as reflected in the evolution of the Hadoop ecosystem.

Google File System (GFS)

Hadoop Distributed File System (HDFS) is an open-source implementation of GFS, and the foundation of the Hadoop ecosystem. Its fundamental role is not only documented clearly on Hadoop’s official website, but also reflected over the past ten years as big data tools have evolved.

One example is that so many alternatives to Hadoop MapReduce and to BigTable-like NoSQL data stores have come up. For MapReduce, you have Hadoop Pig, Hadoop Hive, Spark, Kafka + Samza, Storm, and other batch/streaming processing frameworks. For NoSQL, you have HBase, AWS DynamoDB, Cassandra, MongoDB, and other document, graph, and key-value data stores. You can see this trend even inside Google: 1) Google released Dataflow as the official replacement for MapReduce, and I bet there are more alternatives to MapReduce within Google that haven’t been announced; 2) Google is currently putting more emphasis on Spanner than on BigTable.

But I haven’t heard of any replacement, or planned replacement, of GFS/HDFS.

HDFS makes three essential assumptions, among others:

  1. it runs on a large number of commodity hardware machines, and is able to replicate files among machines to tolerate and recover from failures
  2. it only handles extremely large files, usually at the GB, TB, or even PB scale
  3. it only supports file appends, not updates
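The third assumption, the append-only contract, can be sketched as a toy model: writes always go to the end of the file, and in-place updates are rejected. This is an illustration of the semantics only, not the actual HDFS API.

```python
# Toy model of an HDFS-style append-only file: data can only be
# added at the end; random-access updates are not supported.
# Illustrative sketch only -- this is not the real HDFS client API.
class AppendOnlyFile:
    def __init__(self):
        self.blocks = []

    def append(self, data: bytes):
        # Appending new data is the only supported write operation.
        self.blocks.append(data)

    def write_at(self, offset: int, data: bytes):
        # In-place updates are disallowed, mirroring the GFS/HDFS contract.
        raise NotImplementedError("append-only files do not support random updates")

f = AppendOnlyFile()
f.append(b"record 1\n")
f.append(b"record 2\n")
print(b"".join(f.blocks))  # b'record 1\nrecord 2\n'
```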

These properties, plus some others, point to two important characteristics that big data cares about:

  1. it is able to persist files and other state with high reliability, availability, and scalability. It minimizes the possibility of losing anything; files and states are always available; and the file system scales horizontally as the size of the files it stores increases
  2. it handles big data! As a tradeoff, it prefers throughput over low latency
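To get a feel for why replication buys so much reliability, here is a back-of-envelope estimate with made-up numbers, assuming node failures are independent (in reality failures can correlate, e.g. a whole rack losing power, which is exactly why HDFS is rack-aware):

```python
# Rough durability estimate for 3-way replication.
# The failure probability is a made-up illustrative number,
# and independence of failures is an assumption.
p_node_failure = 0.01        # hypothetical chance a given node is unavailable
replicas = 3                 # HDFS's default replication factor

p_all_replicas_lost = p_node_failure ** replicas
print(p_all_replicas_lost)   # roughly 1e-06
```

Three replicas turn a 1-in-100 risk into roughly a 1-in-a-million risk, which is why a default replication factor of three is usually considered enough.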

In short, GFS/HDFS has proven to be the most influential component supporting big data. Long live GFS/HDFS!


MapReduce

Frankly, I’m not a big fan of MapReduce.

Google’s MapReduce paper is actually composed of two things: 1) a data processing model named MapReduce; 2) a distributed, large-scale data processing paradigm. The first is just one implementation of the second, and to be honest, I don’t think that implementation is a good one.

1. A data processing model named MapReduce

I first learned map and reduce from Hadoop MapReduce. It is an old idea originating from functional programming, though Google carried it forward and made it well known.

MapReduce can be strictly broken into three phases:

  1. Map
  2. Sort/Shuffle/Merge
  3. Reduce

Map and Reduce are programmable and provided by developers, while Shuffle is built in. Map takes some input (usually a GFS/HDFS file) and breaks it into key-value pairs. Sort/Shuffle/Merge sorts the outputs from all Maps by key, and transports all records with the same key to the same place, guaranteed. Reduce performs further computation on the records sharing a key, and generates the final outcome by storing it in a new GFS/HDFS file.
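The three phases can be sketched in a few lines using the classic word-count example. This is a single-process, in-memory sketch of the model, not a distributed implementation; the function names are mine, chosen to mirror the phase names above.

```python
from itertools import groupby
from operator import itemgetter

# In-memory sketch of the three MapReduce phases (word count).
# Not distributed: real MapReduce runs these across many machines.

def map_phase(lines):
    # Map: break the input into (key, value) pairs -- here (word, 1).
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Sort/Shuffle/Merge: sort by key so all records with the same key
    # become adjacent, then group them together.
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped):
    # Reduce: combine all values for each key into the final result.
    for key, values in grouped:
        yield key, sum(values)

lines = ["the quick brown fox", "the lazy dog"]
result = dict(reduce_phase(shuffle_phase(map_phase(lines))))
print(result)  # {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```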

From a data processing point of view, this design is quite rough, with lots of really obvious practical defects and limitations. For example, it is a batch processing model, and thus not suitable for stream/real-time data processing; it is not good at iterating over data, since chaining up MapReduce jobs is costly, slow, and painful; it is terrible at handling complex business logic; etc.

From a database point of view, MapReduce is basically a SELECT + GROUP BY.
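The equivalence is easy to demonstrate: the word-count job that is MapReduce’s canonical example is a one-line query in SQL. A quick check with Python’s built-in sqlite3 module (the table and data here are invented for illustration):

```python
import sqlite3

# The canonical MapReduce word-count job, expressed as SELECT + GROUP BY.
# Table name and sample data are made up for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany("INSERT INTO words VALUES (?)",
                 [("the",), ("quick",), ("the",), ("lazy",)])

rows = conn.execute(
    "SELECT word, COUNT(*) FROM words GROUP BY word ORDER BY word"
).fetchall()
print(rows)  # [('lazy', 1), ('quick', 1), ('the', 2)]
```

Map corresponds to the SELECT projection, the shuffle corresponds to the GROUP BY, and Reduce corresponds to the aggregate function.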

There’s no need for Google to preach such outdated tricks as panacea.

2. A distributed, large scale data processing paradigm

This part of Google’s paper seems much more meaningful to me. It describes a distributed system paradigm that realizes large-scale parallel computation on top of a huge amount of commodity hardware.
Though MapReduce is less valuable than Google tends to claim, this paradigm empowers MapReduce with a breakthrough capability to process data at an unprecedented scale.

There are three notable elements in this paradigm.

  1. Move computation to data, rather than transporting data to where computation happens. This significantly reduces network I/O and keeps most of the I/O on the local disk or within the same rack.
  2. Put all input, intermediate output, and final output on a large-scale, highly reliable, highly available, and highly scalable file system, a.k.a. GFS/HDFS, letting the file system take care of many of these concerns.
  3. Take advantage of an advanced resource management system. That system is able to automatically manage and monitor all worker machines, assign resources to applications and jobs, recover from failures, and retry tasks.

The first point is actually the only innovative and practical idea Google gave in the MapReduce paper. Since the data is extremely large, moving it is also costly. So, instead of moving data around the cluster to feed different computations, it is much cheaper to move the computation to where the data is located.
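A back-of-envelope comparison makes the point concrete. The numbers below are invented but plausible: shipping a terabyte of input to the computation versus shipping a small job binary to every node holding the data.

```python
# Rough cost comparison behind "move computation to data".
# All figures are made-up illustrative numbers.
data_size_gb = 1024            # 1 TB of input, spread across the cluster
job_binary_mb = 50             # size of the code shipped to each node
nodes = 100                    # worker machines holding the data

move_data_gb = data_size_gb                   # option A: ship all data to the code
move_code_gb = nodes * job_binary_mb / 1024   # option B: ship the code to every node

print(move_data_gb, round(move_code_gb, 1))   # 1024 GB vs about 4.9 GB over the network
```

Even charging the job binary to all one hundred nodes, moving the code costs a few gigabytes of network traffic versus a terabyte for moving the data, a gap that only widens as inputs grow.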

The second thing is, as you have guessed, GFS/HDFS.

Lastly, there’s a resource management system inside Google called Borg. Google had been using it for about a decade, but did not reveal it until 2015. Even then, it was not because Google was generous enough to give it to the world, but because Docker emerged and stripped away Borg’s competitive advantages. Google didn’t even mention Borg, such a profound piece of its data processing system, in the MapReduce paper - shame on Google!

Now you can see that the MapReduce promoted by Google is nothing that significant. It’s an old programming pattern, and its implementation leans heavily on other systems.

That’s also why Yahoo! developed Apache Hadoop YARN, a general-purpose, distributed application management framework that supersedes the classic Apache Hadoop MapReduce framework for processing data in Hadoop clusters.

I will talk about BigTable and its open-source counterpart in another post.