Reference blog and documents
https://blog.csdn.net/genus_yang/article/details/88053170
https://github.com/apache/griffin/blob/master/griffin-doc/deploy/deploy-guide.md
http://griffin.apache.org/docs/quickstart-cn.html
Griffin 0.4.0 Installation
Installation steps
Installation dependencies
Unpack the Griffin compressed package
Create the Griffin user in MySQL
Griffin dependency table creation
Hadoop and Hive
Livy configuration
Elasticsearch configuration
Griffin configuration file modification
application.properties configuration
quartz.properties configuration
sparkProperties.json configuration
env_batch.json configuration
service/pom.xml file configuration (particularly important)
Compile and package with Maven
Copy the JAR packages
Change the default port of the Spark Master Web UI
Start the Hive metastore service
Start Griffin
Verify successful startup
Installation steps
Apache Griffin is an open source data quality solution for big data that supports both batch and streaming modes. It provides a unified process to measure data quality from different angles, helping you build trusted data assets and improving confidence in your business.
Installation dependencies
Apache Hadoop: batch data source; stores metric data
Apache Hive: Hive metastore
Apache Spark: computes batch and real-time metrics
Apache Livy: provides the RESTful API through which the service calls Apache Spark
MySQL: service metadata
Elasticsearch: stores metric data
Maven: project management and build tool, used here to package the Griffin project into JAR files, which is how Griffin is then run. For a production environment you can build the JARs elsewhere and copy them onto the platform, because Maven downloads many components during the build and therefore needs access to the external network.
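Before building, it is worth confirming that Maven and a JDK are on the PATH (a quick sanity check; the version numbers printed will depend on your installation):
[hadoop@master ~]$ mvn -v          # prints the Maven and Java versions
[hadoop@master ~]$ java -version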
Related links:
Hadoop installation: https://blog.csdn.net/genus_yang/article/details/87917853
Hive installation: https://blog.csdn.net/genus_yang/article/details/879387966
Spark installation: https://blog.csdn.net/genus_yang/article/details/88018392
Livy installation: https://blog.csdn.net/genus_yang/article/details/8802779999
MySQL installation: https://blog.csdn.net/genus_yang/article/details/87939556
Elasticsearch installation: https://blog.csdn.net/genus_yang/article/details/88051980
Maven download link:
http://maven.apache.org/download.cgi
Connecting a virtual machine to the external network via NAT: https://blog.csdn.net/qq_40612124/article/details/7908427666
Unpack the Griffin compressed package
[hadoop@master ~]$ unzip griffin-0.4.0-source-release.zip
Create the Griffin user in MySQL
[root@master ~]# mysql -u root -p123
mysql> create user 'griffin' identified by '123';
mysql> grant all privileges on *.* to 'griffin'@'%' with grant option;
mysql> grant all privileges on *.* to griffin@localhost identified by '123';
mysql> flush privileges;
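As a quick sanity check (assuming the host and password used above), verify that the new account can log in:
[hadoop@master ~]$ mysql -h master -u griffin -p123 -e "select current_user();"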
Griffin dependency table creation
Griffin uses the Quartz scheduler to schedule tasks, so the Quartz tables need to be created in MySQL:
[root@master ~]# mysql -h master -u griffin -p123 -e "create database quartz"
[root@master ~]# mysql -h master -u griffin -p123 quartz < /home/hadoop/griffin-0.4.0/service/src/main/resources/Init_quartz_mysql_innodb.sql
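Optionally, confirm the tables were created; Init_quartz_mysql_innodb.sql creates tables with the QRTZ_ prefix (matching org.quartz.jobStore.tablePrefix further below):
[root@master ~]# mysql -h master -u griffin -p123 quartz -e "show tables;"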
Hadoop and Hive
# Create the /home/spark_conf directory
[hadoop@master ~]$ hadoop fs -mkdir -p /home/spark_conf
# Upload the Hive configuration file hive-site.xml
[hadoop@master ~]$ hadoop fs -put /home/hadoop/hive-3.1.1/conf/hive-site.xml /home/spark_conf/
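Verify the upload:
[hadoop@master ~]$ hadoop fs -ls /home/spark_conf/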
Livy configuration
Update the livy.conf configuration file under livy/conf:
[hadoop@master ~]$ cd livy-0.5.0/conf/
[hadoop@master conf]$ vi livy.conf
livy.server.host = 169.254.1.100
livy.server.port = 8998
livy.spark.master = yarn
#livy.spark.deploy-mode = client
livy.spark.deploy-mode = cluster
livy.repl.enable-hive-context = true
For a comparison of yarn-cluster and yarn-client modes, see: https://blog.csdn.net/zxr717110454/article/details/806365699
Start Livy:
[hadoop@master ~]$ livy-server start    # also accepts: stop / status
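To check that Livy is up, query its REST API (using the host and port configured above); a fresh server returns an empty batch list along the lines of:
[hadoop@master ~]$ curl http://169.254.1.100:8998/batches
{"from":0,"total":0,"sessions":[]}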
Elasticsearch configuration
Start ES (it may be a bit slow); run the same command on every node of the ES cluster:
[hadoop@master ~]$ ./elasticsearch-6.6.1/bin/elasticsearch
Create Griffin index in ES
[hadoop@master ~]$ curl -H "Content-Type: application/json" -XPUT http://master:9200/griffin -d '
{
    "aliases": {},
    "mappings": {
        "accuracy": {
            "properties": {
                "name": {
                    "fields": {
                        "keyword": {
                            "ignore_above": 256,
                            "type": "keyword"
                        }
                    },
                    "type": "text"
                },
                "tmst": {
                    "type": "date"
                }
            }
        }
    },
    "settings": {
        "index": {
            "number_of_replicas": "2",
            "number_of_shards": "5"
        }
    }
}
'
On success it returns:
{"acknowledged":true,"shards_acknowledged":true,"index":"griffin"}
If you omit the -H "Content-Type: application/json" header, the request fails with:
{"error":"Content-Type header [application/x-www-form-urlencoded] is not supported","status":406}
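You can also confirm the index exists and inspect its shard and replica settings:
[hadoop@master ~]$ curl "http://master:9200/_cat/indices?v"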
Griffin configuration file modification
The Griffin source directory contains four modules: griffin-doc, measure, service, and ui. griffin-doc holds the Griffin documentation; measure interacts with Spark to run the statistics tasks; service uses Spring Boot to provide the RESTful API that the ui module needs to save statistics tasks and display their results.
Once the source tree is in place, the configuration files need to be modified.
Enter the configuration file directory:
[hadoop@master resources]$ cd /home/hadoop/griffin-0.4.0/service/src/main/resources
There are many parameters; the ones that need modification (hosts, ports, passwords, and the like) should be adjusted for your own environment.
application.properties configuration
[hadoop@master resources]$ vi application.properties
#Apache Griffin application name
spring.application.name=griffin_service
#Mysql database configuration information
spring.datasource.url=jdbc:mysql://169.254.1.100:3306/quartz?useSSL=false
spring.datasource.username=griffin
spring.datasource.password=123
spring.jpa.generate-ddl=true
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.jpa.show-sql=true
#hive metastore configuration information
hive.metastore.uris=thrift://master:9083
hive.metastore.dbname=default
hive.hmshandler.retry.attempts=15
hive.hmshandler.retry.interval=2000ms
#Hive cache time
cache.evict.hive.fixedRate.in.milliseconds=900000
#kafka schema registry, configure it on demand
kafka.schema.registry.url=http://master:8081
#Update job instance state at regular intervals
jobInstance.fixedDelay.in.milliseconds=60000
#Expired time of job instance which is 7 days that is 604800000 milliseconds.Time unit only supports milliseconds
jobInstance.expired.milliseconds=604800000
#schedule predicate job every 5 minutes and repeat 12 times at most
#interval time unit s:second m:minute h:hour d:day,only support these four units
predicate.job.interval=5m
predicate.job.repeat.count=12
#external properties directory location
external.config.location=
#external BATCH or STREAMING env
external.env.location=
#login strategy (“default” or “ldap”)
login.strategy=default
#LDAP settings, configured when the login strategy is ldap
ldap.url=ldap://hostname:port
ldap.email=@example.com
ldap.searchBase=DC=org,DC=example
ldap.searchPattern=(sAMAccountName={0})
#hdfs default name
fs.defaultFS=
#elasticsearch configuration
elasticsearch.host=master
elasticsearch.port=9200
elasticsearch.scheme=http
#elasticsearch.user = user
#elasticsearch.password = password
#Livy configuration
livy.uri=http://master:8998/batches
#yarn url configuration
yarn.uri=http://master:8088
#griffin event listener
internal.event.listeners=GriffinJobEventHook
quartz.properties configuration
[hadoop@master resources]$ vi quartz.properties
org.quartz.scheduler.instanceName=spring-boot-quartz
org.quartz.scheduler.instanceId=AUTO
org.quartz.threadPool.threadCount=5
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
#If you use postgresql as your database,set this property value to org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
#If you use mysql as your database,set this property value to org.quartz.impl.jdbcjobstore.StdJDBCDelegate
#If you use h2 as your database, it's ok to set this property value to StdJDBCDelegate, PostgreSQLDelegate or others
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.useProperties=true
org.quartz.jobStore.misfireThreshold=60000
org.quartz.jobStore.tablePrefix=QRTZ_
org.quartz.jobStore.isClustered=true
org.quartz.jobStore.clusterCheckinInterval=20000
sparkProperties.json configuration
[hadoop@master resources]$ vi sparkProperties.json
{
    "file": "hdfs:///griffin/griffin-measure.jar",
    "className": "org.apache.griffin.measure.Application",
    "name": "griffin",
    "queue": "default",
    "numExecutors": 2,
    "executorCores": 1,
    "driverMemory": "1g",
    "executorMemory": "1g",
    "conf": {
        "spark.yarn.dist.files": "hdfs:///home/spark_conf/hive-site.xml"
    },
    "files": []
}
The defaults can be left as-is:
hdfs:///griffin/griffin-measure.jar: the HDFS location to which the measure JAR is uploaded (see "Copy the JAR packages" below)
hdfs:///home/spark_conf/hive-site.xml: the location of the Hive configuration file uploaded above
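For reference, this file is essentially the request body that Griffin posts to the Livy batches endpoint configured in application.properties (livy.uri). Purely as an illustration of the mechanism, not a required installation step, a manual submission would look like:
[hadoop@master resources]$ curl -X POST -H "Content-Type: application/json" -d @sparkProperties.json http://master:8998/batches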
env_batch.json configuration
[hadoop@master resources]$ vi env/env_batch.json
{
    "spark": {
        "log.level": "WARN"
    },
    "sinks": [
        {
            "type": "CONSOLE",
            "config": {
                "max.log.lines": 10
            }
        },
        {
            "type": "HDFS",
            "config": {
                "path": "hdfs:///griffin/persist",
                "max.persist.lines": 10000,
                "max.lines.per.file": 10000
            }
        },
        {
            "type": "ELASTICSEARCH",
            "config": {
                "method": "post",
                "api": "http://master:9200/griffin/accuracy",
                "connection.timeout": "1m",
                "retry": 10
            }
        }
    ],
    "griffin.checkpoint": []
}
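Once a measure job has run at least once, the metrics written by the ELASTICSEARCH sink can be inspected directly (assuming the griffin index created earlier):
[hadoop@master ~]$ curl "http://master:9200/griffin/accuracy/_search?pretty&size=3"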
service/pom.xml file configuration (particularly important)
Edit service/pom.xml and, around line 113, uncomment the MySQL JDBC dependency:
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>${mysql.java.version}</version>
</dependency>
Otherwise, starting Griffin will report an error:
nested exception is java.lang.IllegalStateException: Cannot load driver class: com.mysql.jdbc.Driver
Compile and package with Maven
[hadoop@master ~]$ cd griffin-0.4.0/
[hadoop@master griffin-0.4.0]$ mvn -Dmaven.test.skip=true clean install
After the command finishes, you will find service-0.4.0.jar and measure-0.4.0.jar in the target directories of the service and measure modules, respectively.
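You can confirm both artifacts exist before moving on:
[hadoop@master griffin-0.4.0]$ ls -lh service/target/service-0.4.0.jar measure/target/measure-0.4.0.jar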
Copy the JAR packages
Rename measure-0.4.0.jar so that it matches the file name referenced in sparkProperties.json:
[hadoop@master ~]$ cd griffin-0.4.0/measure/target/
[hadoop@master target]$ mv measure-0.4.0.jar griffin-measure.jar
Create the /griffin directory in HDFS:
[hadoop@master target]$ hadoop fs -mkdir /griffin/
Upload the renamed griffin-measure.jar:
[hadoop@master target]$ hadoop fs -put griffin-measure.jar /griffin/
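Verify the JAR landed in HDFS:
[hadoop@master target]$ hadoop fs -ls /griffin/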
Copy service-0.4.0.jar to the home directory:
[hadoop@master ~]$ cp /home/hadoop/griffin-0.4.0/service/target/service-0.4.0.jar .
Change the default port of the Spark Master Web UI
Griffin's Spring Boot service starts on port 8080 by default, which conflicts with the Spark Master Web UI's default port, so change the Spark port to avoid the conflict.
[hadoop@master ~]$ cd spark-2.4.0/sbin/
[hadoop@master sbin]$ vi start-master.sh
Go to line 61:
if [ “$SPARK_MASTER_WEBUI_PORT” = “” ]; then
SPARK_MASTER_WEBUI_PORT=8087
fi
and change the port to another number (8087 is used here).
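After restarting the Spark master, you can confirm which ports are listening (netstat is one option; ss works as well):
[hadoop@master ~]$ netstat -tln | grep -E '8080|8087'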
Start the Hive metastore service
[hadoop@master ~]$ cd hive-3.1.1/
[hadoop@master hive-3.1.1]$ bin/hive --service metastore &
Otherwise, the Hive databases will not be found when creating a measure, and Griffin will report an error.
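The metastore listens on port 9083 (matching hive.metastore.uris configured above), so a quick way to check that it started:
[hadoop@master ~]$ netstat -tln | grep 9083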
Start Griffin
Run service-0.4.0.jar to start the Griffin management service:
[hadoop@master ~]$ nohup java -jar service-0.4.0.jar > service.out 2>&1 &
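Watch the log until Spring Boot reports that the application has started (the exact wording varies by version):
[hadoop@master ~]$ tail -f service.out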
Verify successful startup
Visit the UI at 169.254.1.100:8080; if the Griffin interface loads, the installation was successful!
———————
Author: yangjab
Source: CSDN
Original: https://blog.csdn.net/genus_yang/article/details/88053170
Copyright statement: This is the blogger's original article; please include a link to the original post when reprinting.