Using Java in Spark to read Hive data across clusters through HiveServer2 JDBC
1, Environment preparation
The JDBC connection URL; the HiveServer2 port is usually 10000
The JDBC user name
The JDBC password
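The three items above can be collected in one place before they are handed to Spark. The following is a minimal sketch, not part of the original write-up: the class name HiveJdbcSettings and its methods are hypothetical, and ip, db_name, username and passwd are the same placeholders used in the code in section 2.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (an assumption for illustration): gathers the
// HiveServer2 JDBC settings and turns them into an option map.
final class HiveJdbcSettings {
    private final String host;
    private final int port;
    private final String database;
    private final String user;
    private final String password;

    HiveJdbcSettings(String host, int port, String database, String user, String password) {
        this.host = host;
        this.port = port;
        this.database = database;
        this.user = user;
        this.password = password;
    }

    // Builds a URL of the form jdbc:hive2://host:port/database.
    String url() {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    // The keys match the ones passed to .option(...) in section 2.
    Map<String, String> toOptions(String table) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("url", url());
        options.put("dbtable", table);
        options.put("user", user);
        options.put("password", password);
        options.put("driver", "org.apache.hive.jdbc.HiveDriver");
        return options;
    }
}
```

A map built this way could be passed to sparkSession.read().format("jdbc").options(map) in place of the individual .option(...) calls, which keeps the connection details out of the reading logic.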
2, Hands-on code
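import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.jdbc.JdbcDialects;

// The class name is illustrative; the original snippet showed only the main method
public class SparkReadHiveByJava {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Spark-Read-Hive-by-Java")
                .setMaster("local[*]");
        SparkSession sparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate();
        // Register the custom dialect before reading, so that it handles jdbc:hive2 URLs
        HiveSqlDialect hiveSqlDialect = new HiveSqlDialect();
        JdbcDialects.registerDialect(hiveSqlDialect);
        Dataset<Row> rowDataset = sparkSession.read()
                .format("jdbc")
                .option("url", "jdbc:hive2://ip:10000/db_name")
                .option("dbtable", "table_name")
                .option("user", "username")
                .option("password", "passwd")
                .option("driver", "org.apache.hive.jdbc.HiveDriver")
                .load()
                // Backticks are needed because the column name itself contains a dot
                .filter("`table_name.ds`='20210112'");
        rowDataset.show();
    }
}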
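The filter above refers to the column as table_name.ds, which suggests that the column names coming back through HiveServer2 JDBC carry the table name as a prefix. If that prefix gets in the way of later processing, the names can be shortened. The sketch below is an illustration under that assumption, not the author's code: ColumnNames.stripTablePrefix is a hypothetical helper that works on plain strings only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (an assumption for illustration, not a Spark API).
final class ColumnNames {
    private ColumnNames() {}

    // Removes a leading "<table>." from each column name;
    // names without that prefix are returned unchanged.
    static String[] stripTablePrefix(String[] columns, String table) {
        String prefix = table + ".";
        List<String> result = new ArrayList<>();
        for (String column : columns) {
            result.add(column.startsWith(prefix)
                    ? column.substring(prefix.length())
                    : column);
        }
        return result.toArray(new String[0]);
    }
}
```

With a real dataset, the input would be rowDataset.columns() and the result could be passed to rowDataset.toDF(...), after which the partition column could be referenced simply as ds.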
Two points need explaining here: why the following two lines are added, and how the HiveSqlDialect class is implemented.
    HiveSqlDialect hiveSqlDialect = new HiveSqlDialect();
    JdbcDialects.registerDialect(hiveSqlDialect);
HiveSqlDialect is implemented as follows:
import org.apache.spark.sql.jdbc.JdbcDialect;

public class HiveSqlDialect extends JdbcDialect {
    // Apply this dialect only to HiveServer2 JDBC URLs
    @Override
    public boolean canHandle(String url) {
        return url.startsWith("jdbc:hive2");
    }

    // Return the column name without double quotes instead of wrapping it in them
    @Override
    public String quoteIdentifier(String colName) {
        return colName.replace("\"", "");
    }
}
Without the dialect class above, I could not get the correct data, and for a while I thought the code I had written was at fault. After checking the Spark JDBC source code, the problem became clear: by default, every column name is wrapped in double quotes (""), and with Hive this means the expected data is not returned, so the extra double quotes have to be removed. The default implementation in the Spark source is:
/**
* Quotes the identifier. This is used to put quotes around the identifier in case the column
* name is a reserved keyword, or in case it contains characters that require quotes (e.g. space).
*/
def quoteIdentifier(colName: String): String = {
s""""$colName""""
}
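The effect of the default quoting can be reproduced without Spark. The sketch below is a simplification under stated assumptions: defaultQuote mirrors the Scala method above, hiveQuote mirrors the HiveSqlDialect override, and buildSelect is a hypothetical stand-in for the way a column list ends up in a query, not Spark's actual query builder.

```java
import java.util.StringJoiner;

// Spark-free illustration; all names here are hypothetical.
final class QuoteDemo {
    private QuoteDemo() {}

    // Same behaviour as the default quoteIdentifier shown above.
    static String defaultQuote(String colName) {
        return "\"" + colName + "\"";
    }

    // Same behaviour as HiveSqlDialect.quoteIdentifier.
    static String hiveQuote(String colName) {
        return colName.replace("\"", "");
    }

    // Joins the (quoted or unquoted) column names into a SELECT statement.
    static String buildSelect(String table, boolean useHiveDialect, String... columns) {
        StringJoiner list = new StringJoiner(",");
        for (String column : columns) {
            list.add(useHiveDialect ? hiveQuote(column) : defaultQuote(column));
        }
        return "SELECT " + list + " FROM " + table;
    }
}
```

Calling buildSelect("table_name", false, "table_name.id", "table_name.ds") yields SELECT "table_name.id","table_name.ds" FROM table_name, while passing true yields SELECT table_name.id,table_name.ds FROM table_name. Hive by default reads a double-quoted token as a string literal rather than as a column reference, which would explain why the first form does not return the table's data.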
3, Viewing the back-end logs
3.1 Log in to Cloudera Manager, find the HiveServer2 Web UI address for the IP used in the URL above, and open it.
3.2 Under Active Sessions, an active connection from the local IP address is clearly visible, and under Open Queries there is a SQL query issued by the corresponding account.
3.3 The history clearly shows two query statements. One of them is "SELECT * FROM table_name WHERE 1=0", a query pattern that is characteristic of the Spark JDBC reader. The Spark source code shows that whenever a table is read through Spark JDBC, a query of the form "SELECT * FROM table_name WHERE 1=0" is issued before the actual SQL runs. Its purpose is to check the table and obtain its schema.
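The probe query described in 3.3 can be illustrated with plain JDBC. This is a sketch of the idea rather than Spark's implementation: SchemaProbe and its methods are hypothetical names, and the Connection is assumed to have been opened elsewhere with the URL, user name and password from section 1. Because the WHERE 1=0 predicate matches no rows, only the result-set metadata is of interest.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (an assumption for illustration, not Spark code).
final class SchemaProbe {
    private SchemaProbe() {}

    // Builds a statement that returns no rows but still exposes the columns.
    static String query(String table) {
        return "SELECT * FROM " + table + " WHERE 1=0";
    }

    // Runs the probe on an existing connection and collects the column names
    // from the metadata of the (empty) result set.
    static List<String> columnNames(Connection connection, String table) throws SQLException {
        List<String> names = new ArrayList<>();
        try (Statement statement = connection.createStatement();
             ResultSet resultSet = statement.executeQuery(query(table))) {
            ResultSetMetaData metaData = resultSet.getMetaData();
            for (int i = 1; i <= metaData.getColumnCount(); i++) {
                names.add(metaData.getColumnName(i));
            }
        }
        return names;
    }
}
```

Seeing a statement of this shape in the HiveServer2 history is therefore a sign that the schema lookup ran, separately from the query that actually fetches the rows.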
4, The result is as follows; the data is displayed successfully