c language -io operation (1) wangs7

2023-01-23   ES  

This post uses Java to have Spark read Hive data across clusters through HiveServer2 JDBC.

1, Environment preparation

JDBC connection URL (the HiveServer2 port is usually 10000)

JDBC user name

JDBC password
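
As a minimal sketch, the three pieces of information above can be kept in one place. The class and method names below (HiveJdbcSettings, buildUrl, credentials) are hypothetical and only for illustration; the host "ip" and database "db_name" are the same placeholders used in the code later in this post.

```java
import java.util.Properties;

// Hypothetical helper, not part of the original code: gathers the
// JDBC URL, user name and password listed above.
final class HiveJdbcSettings {

    // HiveServer2 usually listens on port 10000.
    static final int DEFAULT_PORT = 10000;

    private HiveJdbcSettings() {
    }

    // Builds a URL of the form jdbc:hive2://host:port/database.
    static String buildUrl(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    // Packs the user name and password into java.util.Properties.
    static Properties credentials(String user, String password) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        return props;
    }

    public static void main(String[] args) {
        // Prints jdbc:hive2://ip:10000/db_name
        System.out.println(buildUrl("ip", DEFAULT_PORT, "db_name"));
    }
}
```

The resulting URL string is what goes into the "url" option of the Spark JDBC reader, and the user name and password go into the "user" and "password" options.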

2, Code in practice

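import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.jdbc.JdbcDialects;

// The class name is illustrative; the original post shows only the main method.
public class SparkReadHiveByJava {

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setAppName("Spark-Read-Hive-by-Java")
                .setMaster("local[*]");
        SparkSession sparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate();

        // Register the custom dialect so that column names are not wrapped in double quotes
        HiveSqlDialect hiveSqlDialect = new HiveSqlDialect();
        JdbcDialects.registerDialect(hiveSqlDialect);

        // Read the table through HiveServer2 JDBC, then filter on the ds column
        Dataset<Row> rowDataset = sparkSession.read()
                .format("jdbc")
                .option("url", "jdbc:hive2://ip:10000/db_name")
                .option("dbtable", "table_name")
                .option("user", "username")
                .option("password", "passwd")
                .option("driver", "org.apache.hive.jdbc.HiveDriver")
                .load()
                .filter("`table_name.ds`='20210112'");
        rowDataset.show();

    }
}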

Two things need explaining here: why the following two lines are added, and how the HiveSqlDialect class is implemented.

HiveSqlDialect hiveSqlDialect = new HiveSqlDialect();
JdbcDialects.registerDialect(hiveSqlDialect);

HiveSqlDialect is implemented as follows:

import org.apache.spark.sql.jdbc.JdbcDialect;

public class HiveSqlDialect extends JdbcDialect {

    // Apply this dialect only to HiveServer2 JDBC URLs
    @Override
    public boolean canHandle(String url){
        return url.startsWith("jdbc:hive2");
    }

    // Return the column name without double quotes instead of wrapping it in them
    @Override
    public String quoteIdentifier(String colName) {
        return colName.replace("\"","");
    }

}

Without the dialect class above, the correct data could not be obtained, and for a while I suspected the code I had written. Checking the Spark JDBC source code reveals the problem: by default, column names are wrapped in double quotes (""), which causes no data to be returned, so these extra double quotes need to be removed.

  /**
   * Quotes the identifier. This is used to put quotes around the identifier in case the column
   * name is a reserved keyword, or in case it contains characters that require quotes (e.g. space).
   */
  def quoteIdentifier(colName: String): String = {
    s""""$colName""""
  }
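
The effect of the two quoting rules can be sketched in plain Java. This is a simplified illustration, not Spark's actual query-building code: the class QuoteIdentifierDemo and the helper selectColumn are hypothetical, and the column "ds" and table "table_name" are the placeholders from the example above. HiveQL generally treats double-quoted text as a string literal rather than a column name, which would explain why the quoted form does not return the expected data.

```java
import java.util.function.UnaryOperator;

// Simplified illustration only; this is not Spark's real SQL generation.
final class QuoteIdentifierDemo {

    // Mirrors the default quoteIdentifier: wrap the name in double quotes.
    static final UnaryOperator<String> DEFAULT_QUOTING =
            colName -> "\"" + colName + "\"";

    // Mirrors HiveSqlDialect.quoteIdentifier: return the name without double quotes.
    static final UnaryOperator<String> HIVE_QUOTING =
            colName -> colName.replace("\"", "");

    private QuoteIdentifierDemo() {
    }

    // Hypothetical helper: builds a one-column SELECT with the given quoting rule.
    static String selectColumn(String column, String table, UnaryOperator<String> quoting) {
        return "SELECT " + quoting.apply(column) + " FROM " + table;
    }

    public static void main(String[] args) {
        // Prints: SELECT "ds" FROM table_name
        System.out.println(selectColumn("ds", "table_name", DEFAULT_QUOTING));
        // Prints: SELECT ds FROM table_name
        System.out.println(selectColumn("ds", "table_name", HIVE_QUOTING));
    }
}
```

With the default rule the column reaches HiveServer2 inside double quotes; with the HiveSqlDialect rule it is sent as a bare column name.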

3, Viewing the backend logs

3.1 Log in to Cloudera Manager, find the address of the HiveServer2 Web UI that corresponds to the IP in the URL above, and open it.

3.2 In Active Sessions you can clearly see a connection from the local IP address, and in Open Queries a SQL query issued by the corresponding account.

3.3 In the query history you can clearly see two statements. One of them is "SELECT * FROM table_name WHERE 1=0", which is the recognizable signature of a Spark JDBC read. Reading the source code shows that whenever a table is accessed through Spark JDBC, a query of this form is issued before the actual SQL runs. Its purpose is to check the table and obtain its schema.

4, Result: the data is read and displayed successfully
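
The shape of that schema probe can be sketched as follows. This is only an illustration of the statement's form; the real query is generated inside Spark's JDBC data source, and the class SchemaProbeDemo and method schemaProbeQuery are hypothetical names.

```java
// Simplified illustration; not Spark's own code.
final class SchemaProbeDemo {

    private SchemaProbeDemo() {
    }

    // The condition 1=0 is never true, so the query returns no rows;
    // only the column metadata of the result is of interest.
    static String schemaProbeQuery(String table) {
        return "SELECT * FROM " + table + " WHERE 1=0";
    }

    public static void main(String[] args) {
        // Prints: SELECT * FROM table_name WHERE 1=0
        System.out.println(schemaProbeQuery("table_name"));
    }
}
```

Because the probe returns no rows, it is cheap to run, which is why it shows up in the history as a separate statement ahead of the query that actually reads the data.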
