Google‘s BigTable 原理（翻译）

来源：CSDN作者：accesine9602005-08-31出处：pcdog.com

<iframe width="468" scrolling="no" height="15" frameborder="0" allowtransparency="true" hspace="0" vspace="0" marginheight="0" marginwidth="0" src="http://pagead2.googlesyndication.com/pagead/ads?client=ca-pub-1572879403720716&dt=1189502584330&lmt=1188830280&alt_color=F5FAFA&format=468x15_0ads_al_s&output=html&correlator=1189502584329&channel=3984443469&url=http%3A%2F%2Fwww.pcdog.com%2Fedu%2Fdevelop-tools%2F2005%2F08%2Fb067929.html&color_bg=F5FAFA&color_text=000000&color_link=000000&color_url=F5FAFA&color_border=F5FAFA&ref=http%3A%2F%2Fwww.google.cn%2Fsearch%3Faq%3Dt%26oq%3D%26complete%3D1%26hl%3Dzh-CN%26newwindow%3D1%26q%3Djeff%2Bdean%2Bgoogle%26btnG%3DGoogle%2B%25E6%2590%259C%25E7%25B4%25A2%26meta%3Dlr%253Dlang_zh-CN%257Clang_zh-TW&cc=100&ga_vid=1291650157.1189502584&ga_sid=1189502584&ga_hid=375073220&flash=9&u_h=1200&u_w=1920&u_ah=1150&u_aw=1920&u_cd=24&u_tz=480&u_his=1&u_java=true&u_nplug=10&u_nmime=98" name="google_ads_frame"></iframe>

.net 集群数据库硬盘 http

Google's BigTable <nobr onmousemove="kwM(2);" onmouseout="kwL(event, this);" onmouseover="kwE(event,2, this);" oncontextmenu="return false;" target="_blank" onclick="return kwC();" style="border-bottom: 1px dotted rgb(102, 0, 255); text-decoration: underline; color: rgb(102, 0, 255); background-color: transparent;" id="key2">原理</nobr> （翻译）

<iframe width="336" scrolling="no" height="280" frameborder="0" allowtransparency="true" hspace="0" vspace="0" marginheight="0" marginwidth="0" src="http://pagead2.googlesyndication.com/pagead/ads?client=ca-pub-1572879403720716&dt=1189502586596&hl=zh-CN&lmt=1188830280&alternate_ad_url=http%3A%2F%2Fwww.pcdog.com%2Fjs%2F336.htm&prev_fmts=468x15_0ads_al_s&format=336x280_as&output=html&correlator=1189502584329&channel=2957605308&url=http%3A%2F%2Fwww.pcdog.com%2Fedu%2Fdevelop-tools%2F2005%2F08%2Fb067929.html&color_bg=F5FAFA&color_text=000000&color_link=1F3A87&color_url=0000FF&color_border=F5FAFA&ad_type=text_image&ref=http%3A%2F%2Fwww.google.cn%2Fsearch%3Faq%3Dt%26oq%3D%26complete%3D1%26hl%3Dzh-CN%26newwindow%3D1%26q%3Djeff%2Bdean%2Bgoogle%26btnG%3DGoogle%2B%25E6%2590%259C%25E7%25B4%25A2%26meta%3Dlr%253Dlang_zh-CN%257Clang_zh-TW&cc=100&ga_vid=1291650157.1189502584&ga_sid=1189502584&ga_hid=375073220&flash=9&u_h=1200&u_w=1920&u_ah=1150&u_aw=1920&u_cd=24&u_tz=480&u_his=1&u_java=true&u_nplug=10&u_nmime=98" name="google_ads_frame"></iframe>

题记：google 的成功除了一个个出色的创意外，还因为有 Jeff Dean 这样的软件架构天才。

------ 编者

官方的 Google Reader blog 中有对BigTable 的解释。这是Google 内部开发的一个用来处理大数据量的<nobr onmousemove="kwM(5);" onmouseout="kwL(event, this);" onmouseover="kwE(event,5, this);" oncontextmenu="return false;" target="_blank" onclick="return kwC();" style="border-bottom: 0px dotted; text-decoration: underline; color: rgb(102, 0, 255); background-color: transparent;" id="key5">系统</nobr>。这种系统适合处理半结构化的数据比如 RSS 数据源。 以下发言 是 Andrew Hitchcock 在 2005 年10月18号基于： Google 的工程师 Jeff Dean 在华盛顿大学的一次谈话 (Creative Commons License).

首先，BigTable 从 2004 年初就开始研发了，到现在为止已经用了将近8个月。（2005年2月）目前大概有100个左右的<nobr onmousemove="kwM(8);" onmouseout="kwL(event, this);" onmouseover="kwE(event,8, this);" oncontextmenu="return false;" target="_blank" onclick="return kwC();" style="border-bottom: 1px dotted rgb(102, 0, 255); text-decoration: underline; color: rgb(102, 0, 255); background-color: transparent;" id="key8">服务</nobr>使用BigTable，比如： Print,Search History,Maps和 Orkut。根据<nobr onmousemove="kwM(10);" onmouseout="kwL(event, this);" onmouseover="kwE(event,10, this);" oncontextmenu="return false;" target="_blank" onclick="return kwC();" style="border-bottom: 1px dotted rgb(102, 0, 255); text-decoration: underline; color: rgb(102, 0, 255); background-color: transparent;" id="key9">Google</nobr>的一贯做法，内部开发的BigTable是为跑在廉价的PC机上设计的。BigTable 让Google在提供新服务时的运行成本降低，最大限度地利用了计算<nobr onmousemove="kwM(0);" onmouseout="kwL(event, this);" onmouseover="kwE(event,0, this);" oncontextmenu="return false;" target="_blank" onclick="return kwC();" style="border-bottom: 1px dotted rgb(102, 0, 255); text-decoration: underline; color: rgb(102, 0, 255); background-color: transparent;" id="key0">能力</nobr>。BigTable 是建立在 GFS ，Scheduler ，Lock Service 和 MapReduce 之上的。

每个Table都是一个多维的稀疏图 sparse map。Table 由行和列组成，并且每个存储单元 cell 都有一个时间戳。在不同的时间对同一个存储单元cell有多份拷贝，这样就可以记录数据的变动情况。在他的例子中，行是URLs ，列可以定义一个名字，比如：contents。Contents 字段就可以存储文件的数据。或者列名是：”language”，可以存储一个“EN”的语言代码字符串。

为了<nobr onmousemove="kwM(1);" onmouseout="kwL(event, this);" onmouseover="kwE(event,1, this);" oncontextmenu="return false;" target="_blank" onclick="return kwC();" style="border-bottom: 1px dotted rgb(102, 0, 255); text-decoration: underline; color: rgb(102, 0, 255); background-color: transparent;" id="key1">管理</nobr>巨大的Table，把Table根据行分割，这些分割后的数据统称为：Tablets。每个Tablets大概有 100-200 MB，每个机器存储100个左右的 Tablets。底层的架构是：GFS。由于GFS是一种分布式的文件系统，采用Tablets的机制后，可以获得很好的负载均衡。比如：可以把经常响应的表移动到其他空闲机器上，然后快速重建。

Tablets在系统中的存储方式是不可修改的 immutable 的SSTables，一台机器一个日志文件。当系统的内存满后，系统会压缩一些Tablets。由于Jeff在论述这点的时候说的很快，所以我没有时间把听到的都记录下来，因此下面是一个大概的说明：

压缩分为：主要和次要的两部分。次要的压缩仅仅包括几个Tablets，而主要的压缩时关于整个系统的压缩。主压缩有<nobr onmousemove="kwM(7);" onmouseout="kwL(event, this);" onmouseover="kwE(event,7, this);" oncontextmenu="return false;" target="_blank" onclick="return kwC();" style="border-bottom: 1px dotted rgb(102, 0, 255); text-decoration: underline; color: rgb(102, 0, 255); background-color: transparent;" id="key7">回收</nobr>硬盘空间的功能。Tablets的位置实际上是存储在几个特殊的BigTable的存储单元cell中。看起来这是一个三层的系统。

客户端有一个指向METAO的Tablets的指针。如果METAO的Tablets被频繁使用，那个这台机器就会放弃其他的tablets专门支持METAO这个 Tablets。METAO tablets 保持着所有的META1的tablets的记录。这些tablets中包含着查找tablets的实际位置。（老实说<nobr onmousemove="kwM(4);" onmouseout="kwL(event, this);" onmouseover="kwE(event,4, this);" oncontextmenu="return false;" target="_blank" onclick="return kwC();" style="border-bottom: 1px dotted rgb(102, 0, 255); text-decoration: underline; color: rgb(102, 0, 255); background-color: transparent;" id="key4">翻译</nobr>到这里，我也不太明白。）在这个系统中不存在大的瓶颈，因为被频繁调用的数据已经被提前获得并进行了缓存。

现在我们返回到对列的说明：列是类似下面的形式： family:optional_qualifier。在他的例子中，行：www.search-analysis.com 也许有列：”contents:其中包含html页面的代码。 “ anchor:cnn.com/news” 中包含着相对应的url，”anchor:www.search-analysis.com/” 包含着链接的文字部分。列中包含着类型信息。

(翻译到这里我要插一句，以前我看过一个关于万能数据库的文章，当时很激动，就联系了作者，现在回想起来，或许google的 bigtable 才是更好的方案，切不说分布式的特性，就是这种建华的表结构就很有用处。)

注意这里说的是列信息，而不是列类型。列的信息是如下信息，一般是：属性/规则。比如：保存n份数据的拷贝或者保存数据n天长等等。当 tablets 重新建立的时候，就运用上面的规则，剔出不符合条件的记录。由于设计上的原因，列本身的创建是很容易的，但是跟列相关的功能确实非常复杂的，比如上文提到的类型和规则信息等。为了优化读取速度，列的功能被分割然后以组的方式存储在所建索引的机器上。这些被分割后的组作用于列 ,然后被分割成不同的 SSTables。这种方式可以<nobr onmousemove="kwM(3);" onmouseout="kwL(event, this);" onmouseover="kwE(event,3, this);" oncontextmenu="return false;" target="_blank" onclick="return kwC();" style="border-bottom: 1px dotted rgb(102, 0, 255); text-decoration: underline; color: rgb(102, 0, 255); background-color: transparent;" id="key3">提高</nobr>系统的性能，因为小的，频繁读取的列可以被单独存储，和那些大的不经常访问的列隔离开来。

在一台机器上的所有的 tablets 共享一个log，在一个包含1亿的tablets的集群中，这将会导致非常多的文件被打开和写操作。新的log块经常被创建，一般是<chmetcnv unitname="m" sourcevalue="64" hasspace="False" negative="False" numbertype="1" tcsc="0" w:st="on">64M</chmetcnv>大小，这个GFS的块大小相等。当一个机器down掉后，控制机器就会重新发布他的log块到其他机器上继续进行处理。这台机器重建tablets然后询问控制机器处理结构的存储位置，然后直接对重建后的数据进行处理。

这个系统中有很多冗余数据，因此在系统中大量使用了压缩<nobr onmousemove="kwM(6);" onmouseout="kwL(event, this);" onmouseover="kwE(event,6, this);" oncontextmenu="return false;" target="_blank" onclick="return kwC();" style="border-bottom: 1px dotted rgb(102, 0, 255); text-decoration: underline; color: rgb(102, 0, 255); background-color: transparent;" id="key6">技术</nobr>。

Dean 对压缩的部分说的很快，我没有完全记下来，所以我还是说个大概吧：压缩前先寻找相似的行，列，和时间数据。

他们使用不同版本的： BMDiff 和 Zippy 技术。

BMDiff 提供给他们非常快的写速度： 100MB/s – 1000MB/s 。Zippy 是和 LZW 类似的。Zippy 并不像 LZW 或者 gzip 那样压缩比高，但是他处理速度非常快。

Dean 还给了一个关于压缩 web 蜘蛛数据的例子。这个例子的蜘蛛包含 2.1B 的页面，行按照以下的方式命名：“com.cnn.www/index.html:http”.在未压缩前的web page 页面大小是：45.1 TB ，压缩后的大小是：4.2 TB ，只是原来的 9.2%。Links 数据压缩到原来的 13.9% , 链接文本数据压缩到原来的 12.7%。

Google 还有很多没有添加但是已经考虑的功能。

1. 数据操作表达式，这样可以把脚本发送到客户端来提供修改数据的功能。
2. 多行数据的事物支持。
3. 提高大数据存储单元的效率。
4. BigTable 作为服务运行。
好像：每个服务比如： maps 和 search history 历史搜索记录都有他们自己的集群运行 BigTable。
他们还考虑运行一个全局的 BigTable 系统，但这需要比较公平的分割资源和计算时间。