====== Reposted from: http://www.it165.net/admin/html/201506/5699.html ======
Introduction
This situation often comes up at work: a large amount of data sits in HDFS and needs to be imported into HBase. This article uses Spark + HBase to write data from an RDD into HBase, and it does not use the newAPIHadoopRDD interface provided on the official website. Using the approach in this article, importing 70 million records takes about 20 minutes, with 20 cores available.
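For contrast, the route the official documentation describes goes through TableOutputFormat and saveAsNewAPIHadoopDataset. The following is only a minimal sketch of that alternative, not the method used in this article; the table name, input path, and column names are placeholders taken from the example below.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job

val hConf = HBaseConfiguration.create()
hConf.set(TableOutputFormat.OUTPUT_TABLE, "table") // target table (placeholder)
val job = Job.getInstance(hConf)
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
job.setOutputKeyClass(classOf[ImmutableBytesWritable])
job.setOutputValueClass(classOf[Put])

sc.textFile("/path/to/file").map(_.split(","))
  .map { y =>
    val p = new Put(Bytes.toBytes(y(0)))
    p.add(Bytes.toBytes("family"), Bytes.toBytes("qualifier"), Bytes.toBytes(y(1)))
    (new ImmutableBytesWritable(Bytes.toBytes(y(0))), p)
  }
  .saveAsNewAPIHadoopDataset(job.getConfiguration)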
This article uses Spark version 1.3.0 and HBase version 0.98.1.
The HBase table structure is: table name table, column family family, column qualifier qualifier.
The code is as follows:
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

val readFile = sc.textFile("/path/to/file").map(x => x.split(","))
val tableName = "table"
readFile.foreachPartition {
  x => {
    // Create the configuration and table handle inside the partition, so they
    // are instantiated on the executor rather than serialized from the driver.
    val myConf = HBaseConfiguration.create()
    myConf.set("hbase.zookeeper.quorum", "web102,web101,web100")
    myConf.set("hbase.zookeeper.property.clientPort", "2181")
    myConf.set("hbase.defaults.for.version.skip", "true")
    val myTable = new HTable(myConf, TableName.valueOf(tableName))
    myTable.setAutoFlush(false, false)          // key point 1
    myTable.setWriteBufferSize(3 * 1024 * 1024) // key point 2
    x.foreach { y =>
      println(y(0) + ":::" + y(1))
      val p = new Put(Bytes.toBytes(y(0)))
      p.add("family".getBytes, "qualifier".getBytes, Bytes.toBytes(y(1)))
      myTable.put(p)
    }
    myTable.flushCommits() // key point 3
  }
}
This program uses the RDD's foreachPartition function. There are three key points in this program.
Key point 1: turn off auto-flush. If it is left on, the client commits every single Put to HBase individually.
Key point 2: set the client write buffer size. Once the buffered data exceeds this value, the HBase client commits it automatically. You can experiment with the size here; for large data volumes it is usually set to around 5 MB, and this article sets it to 3 MB.
Key point 3: call flushCommits() after each partition is finished. If it is not called, any data still sitting in the buffer (i.e. less than the value set above) will never be committed, causing data loss. A more defensive variant is sketched below.
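As a hardening step not in the original article, the flush and cleanup can be placed in a finally block, so the buffer is pushed and the table released even if a write throws; myTable and x are the names from the code above.

try {
  x.foreach { y =>
    val p = new Put(Bytes.toBytes(y(0)))
    p.add("family".getBytes, "qualifier".getBytes, Bytes.toBytes(y(1)))
    myTable.put(p)
  }
} finally {
  myTable.flushCommits() // key point 3: push whatever remains in the buffer
  myTable.close()        // releases the table handle (close() also flushes)
}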
Note: In addition, if you want to increase the speed at which Spark writes data to HBase, you can increase the number of cores available to Spark, as sketched below.
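A minimal sketch of the two usual knobs, assuming a standalone cluster; the app name and the figure of 40 partitions are placeholders, not from the article. Note that more cores only help if the RDD has at least that many partitions, since foreachPartition opens one HBase writer per partition.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("HdfsToHBase")    // placeholder app name
  .set("spark.cores.max", "20") // cap on total cores for the app (standalone mode)
val sc = new SparkContext(conf)

// Give the job enough partitions to keep all cores busy:
val readFile = sc.textFile("/path/to/file", 40).map(x => x.split(","))
// or repartition an existing RDD: readFile.repartition(40)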