Find the neutral neutral and the combination IT for the given number

2023-01-21   ES  

====== Reposted from: http://www.it165.net/admin/html/201506/5699.html ======

Introduction

At work we often need to import a large amount of data from HDFS into HBase. This article uses Spark + HBase to write the contents of an RDD into HBase, without using the newAPIHadoopRDD interface described on the official website. With the approach shown here, importing 70 million records into HBase takes about 20 minutes. Spark had 20 cores available for this run.
This article uses Spark 1.3.0 and HBase 0.98.1.
The HBase table structure is: table name table, column family family, column qualifier qualifier.
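
To make the table layout concrete, here is a minimal sketch of how one comma-separated input line could map onto a cell in that table. The `CellSpec` type and the `LineToCell.parse` helper are hypothetical names introduced only for illustration, and the sketch assumes each line has the form `rowKey,value`. It uses plain Scala and does not touch the HBase API.

```scala
// Hypothetical illustration type: the coordinates and value of one HBase cell.
// This is not part of the HBase client API.
final case class CellSpec(rowKey: String, family: String, qualifier: String, value: String)

object LineToCell {
  // Assumed input format: "rowKey,value".
  // The split limit of 2 keeps any further commas inside the value.
  // Lines without a comma, or with an empty row key, yield None
  // instead of throwing an index error.
  def parse(line: String): Option[CellSpec] = line.split(",", 2) match {
    case Array(k, v) if k.nonEmpty => Some(CellSpec(k, "family", "qualifier", v))
    case _                         => None
  }

  def main(args: Array[String]): Unit = {
    println(parse("row1,hello"))
    println(parse("no-comma"))
  }
}
```

Returning an `Option` is a design choice of this sketch: malformed lines can be filtered out before any write is attempted, rather than failing a whole partition.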

The code is as follows:

 import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
 import org.apache.hadoop.hbase.client.{HTable, Put}
 import org.apache.hadoop.hbase.util.Bytes

 // sc is the SparkContext (predefined in spark-shell)
 val readFile = sc.textFile("/path/to/file").map(x => x.split(","))
 val tableName = "table"
 readFile.foreachPartition { x =>
   // One configuration and one table handle per partition
   val myConf = HBaseConfiguration.create()
   myConf.set("hbase.zookeeper.quorum", "web102,web101,web100")
   myConf.set("hbase.zookeeper.property.clientPort", "2181")
   myConf.set("hbase.defaults.for.version.skip", "true")
   val myTable = new HTable(myConf, TableName.valueOf(tableName))
   myTable.setAutoFlush(false, false) // Key point 1
   myTable.setWriteBufferSize(3 * 1024 * 1024) // Key point 2
   x.foreach { y =>
     println(y(0) + ":::" + y(1))
     // y(0) is the row key, y(1) is the cell value
     val p = new Put(Bytes.toBytes(y(0)))
     p.add("family".getBytes, "qualifier".getBytes, Bytes.toBytes(y(1)))
     myTable.put(p)
   }
   myTable.flushCommits() // Key point 3
 }

This program uses the RDD's foreachPartition function. There are three key points in it.
Key point 1: Turn off automatic flushing. If it is left on, every single record is submitted to HBase on its own.
Key point 2: Set the write buffer size. Once the buffered data exceeds this value, it is submitted to HBase automatically. The size is worth experimenting with: for large data volumes it is generally set to 5 MB, and this article sets it to 3 MB.
Key point 3: Call flushCommits() after each partition has been processed. If it is not called, whatever is left in the buffer below the size set above is never submitted, and that data is lost.
Note: To further increase the speed at which Spark writes data to HBase, you can increase the number of cores available to Spark.
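
The effect of the three key points can be seen in a toy model. The sketch below is not the HBase client: `ToyBufferedWriter` and `WriteBufferDemo` are hypothetical names, the 1000-byte limit and the 10-byte records are arbitrary, and "sending" a batch only increments counters. It shows how disabling auto-flush cuts the number of round trips, and how skipping the final flush leaves the tail of the data unsent.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy stand-in for a client-side write buffer (an assumption for illustration,
// not the HBase implementation): records are buffered and sent in batches.
final class ToyBufferedWriter(bufferLimitBytes: Int, autoFlush: Boolean) {
  private val buffer = ArrayBuffer.empty[String]
  private var bufferedBytes = 0
  var sentRecords = 0 // records that reached the pretend server
  var roundTrips = 0  // number of batches sent

  def put(record: String): Unit = {
    buffer += record
    bufferedBytes += record.getBytes("UTF-8").length
    // With autoFlush on, every record is sent on its own (key point 1);
    // otherwise a batch goes out once the buffer limit is exceeded (key point 2).
    if (autoFlush || bufferedBytes > bufferLimitBytes) flush()
  }

  // Sends whatever is currently buffered (key point 3 when called at the end).
  def flush(): Unit = if (buffer.nonEmpty) {
    sentRecords += buffer.size
    roundTrips += 1
    buffer.clear()
    bufferedBytes = 0
  }
}

object WriteBufferDemo {
  // Writes n records of 10 bytes each and returns (records sent, round trips).
  def run(n: Int, autoFlush: Boolean, finalFlush: Boolean): (Int, Int) = {
    val w = new ToyBufferedWriter(bufferLimitBytes = 1000, autoFlush = autoFlush)
    (0 until n).foreach(i => w.put(f"row$i%07d"))
    if (finalFlush) w.flush()
    (w.sentRecords, w.roundTrips)
  }

  def main(args: Array[String]): Unit = {
    println(run(250, autoFlush = true, finalFlush = false))  // (250,250)
    println(run(250, autoFlush = false, finalFlush = false)) // (202,2)
    println(run(250, autoFlush = false, finalFlush = true))  // (250,3)
  }
}
```

In the second run, 48 records stay in the buffer and never go out, which is the data loss that key point 3 warns about.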
