Compared with Kylin 3.x, Kylin 4.0 introduces a brand-new Spark build engine and Parquet storage, which makes it possible to deploy Kylin without a Hadoop environment. Compared with deploying Kylin 3.x on top of AWS EMR, deploying Kylin 4.0 directly on AWS EC2 instances has the following advantages:
1. Lower cost. AWS EC2 nodes are cheaper than AWS EMR nodes.
2. More flexibility. On EC2 nodes, users are free to choose which services and components to install.
3. Hadoop-free. The Hadoop ecosystem is heavyweight and requires real effort to maintain; removing it brings the deployment closer to cloud native.

After adding support for building and querying in Spark Standalone mode, we tried a Hadoop-free deployment of Kylin 4.0 on AWS EC2 instances and successfully built a Cube and ran queries against it.

Environment Preparation

  • Provision AWS EC2 Linux instances as needed
  • Create an Amazon RDS for MySQL instance to serve as the metastore for both Kylin and Hive
  • Use S3 as Kylin's storage

Component Versions

The versions listed here are the ones we used in our tests. Other versions can be substituted as long as the components remain compatible with each other.

  • JDK 1.8
  • Hive 2.3.9
  • Zookeeper 3.4.13
  • Kylin 4.0 for Spark 3
  • Spark 3.1.1
  • Hadoop 3.2.0 (no Hadoop services need to be started)

Installation Steps

1 Configure Environment Variables

  • Set the environment variables and make them take effect

    vim /etc/profile
      
    # Append the following to the end of the profile file
    export JAVA_HOME=/usr/local/java/jdk1.8.0_291
    export JRE_HOME=${JAVA_HOME}/jre
    export HADOOP_HOME=/etc/hadoop/hadoop-3.2.0
    export HIVE_HOME=/etc/hadoop/hive
    export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
    export PATH=$HIVE_HOME/bin:$HIVE_HOME/conf:${HADOOP_HOME}/bin:${JAVA_HOME}/bin:$PATH
      
    # After saving the file, run the following command to apply the changes
    source /etc/profile
    

2 Install JDK 1.8

  • Download JDK 1.8 to the prepared EC2 instance and extract it to /usr/local/java:

    mkdir /usr/local/java
    tar -xvf java-1.8.0-openjdk.tar -C /usr/local/java
    
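    The directory name under /usr/local/java must match the JAVA_HOME set in /etc/profile (jdk1.8.0_291 in this walkthrough); a quick sanity check:

    # Confirm the extracted directory name and that the JVM runs
    ls /usr/local/java
    $JAVA_HOME/bin/java -version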

3 Configure Hadoop

  • Download and extract Hadoop

    wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
    mkdir /etc/hadoop
    tar -xvf hadoop-3.2.0.tar.gz -C /etc/hadoop
    
  • Copy the jars required to access S3 into the Hadoop classpath; otherwise ClassNotFound-style errors may occur later

    cd /etc/hadoop
    cp hadoop-3.2.0/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar hadoop-3.2.0/share/hadoop/common/lib/
    cp hadoop-3.2.0/share/hadoop/tools/lib/hadoop-aws-3.2.0.jar hadoop-3.2.0/share/hadoop/common/lib/
    
  • Edit core-site.xml to configure the AWS credentials and the S3 endpoint. An example is shown below

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!--
      Licensed under the Apache License, Version 2.0 (the "License");
      you may not use this file except in compliance with the License.
      You may obtain a copy of the License at
      
        http://www.apache.org/licenses/LICENSE-2.0
      
      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License. See accompanying LICENSE file.
    -->
      
    <!-- Put site-specific property overrides in this file. -->
      
    <configuration>
      <property>
        <name>fs.s3a.access.key</name>
        <value>SESSION-ACCESS-KEY</value>
      </property>
      <property>
        <name>fs.s3a.secret.key</name>
        <value>SESSION-SECRET-KEY</value>
      </property> 
      <property>
        <name>fs.s3a.endpoint</name>
        <value>s3.$REGION.amazonaws.com</value>
      </property>
    </configuration> 
    
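    With the credentials and endpoint configured, S3 access can be verified right away. A minimal check, using the placeholder bucket name bucket (replace with your own):

    # A ClassNotFoundException here usually means the hadoop-aws / aws-java-sdk-bundle
    # jars were not copied into the common classpath in the previous step
    $HADOOP_HOME/bin/hadoop fs -ls s3a://bucket/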

4 Install Hive

  • Download and extract Hive

    wget https://downloads.apache.org/hive/hive-2.3.9/apache-hive-2.3.9-bin.tar.gz
    tar -xvf apache-hive-2.3.9-bin.tar.gz -C /etc/hadoop
    mv /etc/hadoop/apache-hive-2.3.9-bin /etc/hadoop/hive
    
  • Edit the Hive configuration file with vim ${HIVE_HOME}/conf/hive-site.xml. Start the Amazon RDS for MySQL database beforehand and obtain its connection URI, username, and password.

    Note: configure the VPC and security groups correctly so that the EC2 instance can reach the database.

    An example hive-site.xml looks like this:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!--
       Licensed to the Apache Software Foundation (ASF) under one or more
       contributor license agreements.  See the NOTICE file distributed with
       this work for additional information regarding copyright ownership.
       The ASF licenses this file to You under the Apache License, Version 2.0
       (the "License"); you may not use this file except in compliance with
       the License.  You may obtain a copy of the License at
      
           http://www.apache.org/licenses/LICENSE-2.0
      
       Unless required by applicable law or agreed to in writing, software
       distributed under the License is distributed on an "AS IS" BASIS,
       WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
       See the License for the specific language governing permissions and
       limitations under the License.
    --><configuration>
      <!-- Hive Execution Parameters -->
      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>password</value>
        <description>password to use against metastore database</description>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://host-name:3306/hive?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>admin</value>
        <description>Username to use against metastore database</description>
      </property>
      <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
        <description>
          Enforce metastore schema version consistency.
          True: Verify that version information stored in metastore matches with one from Hive jars.  Also disable automatic
                schema migration attempt. Users are required to manually migrate schema after Hive upgrade which ensures
                proper metastore schema migration. (Default)
          False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
        </description>
      </property>
    </configuration>
    
  • Initialize the Hive metastore

    # Download the MySQL JDBC driver jar and place it under $HIVE_HOME/lib
    cp mysql-connector-java-5.1.47.jar $HIVE_HOME/lib
    $HIVE_HOME/bin/schematool -dbType mysql -initSchema
    mkdir $HIVE_HOME/logs
    nohup $HIVE_HOME/bin/hive --service metastore >> $HIVE_HOME/logs/hivemetastorelog.log 2>&1 &
    

    Note: if the following error appears during this step:

    java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
    

    This is caused by a mismatch between the guava version bundled with Hive 2 and the one bundled with Hadoop 3. Replace the guava jar under $HIVE_HOME/lib with the guava jar from $HADOOP_HOME/share/hadoop/common/lib/.

  • To avoid jar conflicts later on, remove some Spark- and Scala-related jars from the Hive classpath

    rm $HIVE_HOME/lib/spark-* $HIVE_HOME/spark_jar
    rm $HIVE_HOME/lib/jackson-module-scala_2.11-2.6.5.jar
    

    Note: only the conflicting jars we encountered during testing are listed here. If you run into similar conflicts, inspect the classpath to determine which jars clash and remove them. When the same jar appears in two conflicting versions, we recommend keeping the version found on the Spark classpath.
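
    Before moving on, the metastore connection to RDS can be double-checked. A small verification, assuming the hive-site.xml and schema initialization above:

    # Prints the database URL and the metastore schema version stored in MySQL;
    # a connection failure here points to VPC / security group or credential issues
    $HIVE_HOME/bin/schematool -dbType mysql -info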

5 Deploy Spark Standalone

  • Download and extract Spark 3.1.1

    wget http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
    tar -xvf spark-3.1.1-bin-hadoop3.2.tgz -C /etc/hadoop
    mv /etc/hadoop/spark-3.1.1-bin-hadoop3.2 /etc/hadoop/spark 
    export SPARK_HOME=/etc/hadoop/spark 
    
  • Copy the jars required to access S3, together with the MySQL JDBC driver

    cp $HADOOP_HOME/share/hadoop/tools/lib/hadoop-aws-3.2.0.jar $SPARK_HOME/jars
    cp $HADOOP_HOME/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar $SPARK_HOME/jars
    cp mysql-connector-java-5.1.47.jar $SPARK_HOME/jars
    
  • Copy the Hive configuration file

    cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf
    
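    At this point Spark should be able to reach the Hive metastore. An optional check, assuming the metastore service from step 4 is still running:

    # Runs a local Spark SQL session against the MySQL-backed metastore;
    # it should at least list the "default" database
    $SPARK_HOME/bin/spark-sql -e "show databases;"
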
  • Start the Spark master and a worker

    $SPARK_HOME/sbin/start-master.sh
    $SPARK_HOME/sbin/start-worker.sh spark://hostname:7077
    
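    Before pointing Kylin at this master, a quick smoke test confirms the standalone cluster can actually run a job (hostname is the same placeholder used above):

    # The bundled SparkPi example should finish with "Pi is roughly 3.14..." in the output
    $SPARK_HOME/bin/spark-submit \
      --master spark://hostname:7077 \
      --class org.apache.spark.examples.SparkPi \
      $SPARK_HOME/examples/jars/spark-examples_2.12-3.1.1.jar 100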

6 Deploy a ZooKeeper Pseudo-Cluster

  • Download and extract the ZooKeeper package

    wget http://archive.apache.org/dist/zookeeper/zookeeper-3.4.13/zookeeper-3.4.13.tar.gz
    tar -xvf zookeeper-3.4.13.tar.gz -C /etc/hadoop
    mv /etc/hadoop/zookeeper-3.4.13 /etc/hadoop/zookeeper
    
  • Prepare the ZooKeeper configuration files for a three-node pseudo-cluster

    cp /etc/hadoop/zookeeper/conf/zoo_sample.cfg /etc/hadoop/zookeeper/conf/zoo1.cfg
    cp /etc/hadoop/zookeeper/conf/zoo_sample.cfg /etc/hadoop/zookeeper/conf/zoo2.cfg
    cp /etc/hadoop/zookeeper/conf/zoo_sample.cfg /etc/hadoop/zookeeper/conf/zoo3.cfg
    
  • Edit each of the three configuration files and add the following content (the values shown are for zoo1.cfg; zoo2.cfg and zoo3.cfg need their own dataDir, dataLogDir, and clientPort, as shown after this snippet):

    server.1=localhost:2287:3387
    server.2=localhost:2288:3388
    server.3=localhost:2289:3389
    dataDir=/tmp/zookeeper/zk1/data
    dataLogDir=/tmp/zookeeper/zk1/log
    clientPort=2181
    
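    In zoo2.cfg and zoo3.cfg the server.N lines stay identical, but dataDir, dataLogDir, and clientPort must be unique per instance, for example (the ports here are only a suggestion):

    # zoo2.cfg
    dataDir=/tmp/zookeeper/zk2/data
    dataLogDir=/tmp/zookeeper/zk2/log
    clientPort=2182

    # zoo3.cfg
    dataDir=/tmp/zookeeper/zk3/data
    dataLogDir=/tmp/zookeeper/zk3/log
    clientPort=2183
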
  • Create the required directories and myid files

    mkdir -p /tmp/zookeeper/zk1/data /tmp/zookeeper/zk1/log
    mkdir -p /tmp/zookeeper/zk2/data /tmp/zookeeper/zk2/log
    mkdir -p /tmp/zookeeper/zk3/data /tmp/zookeeper/zk3/log
    # Each myid file holds the instance id matching the server.N entries above
    echo 1 > /tmp/zookeeper/zk1/data/myid
    echo 2 > /tmp/zookeeper/zk2/data/myid
    echo 3 > /tmp/zookeeper/zk3/data/myid
    
  • Start the ZooKeeper cluster

    /etc/hadoop/zookeeper/bin/zkServer.sh start /etc/hadoop/zookeeper/conf/zoo1.cfg
    /etc/hadoop/zookeeper/bin/zkServer.sh start /etc/hadoop/zookeeper/conf/zoo2.cfg
    /etc/hadoop/zookeeper/bin/zkServer.sh start /etc/hadoop/zookeeper/conf/zoo3.cfg
    
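    Whether the pseudo-cluster formed correctly can be checked per instance; one node should report itself as leader and the other two as followers:

    /etc/hadoop/zookeeper/bin/zkServer.sh status /etc/hadoop/zookeeper/conf/zoo1.cfg
    /etc/hadoop/zookeeper/bin/zkServer.sh status /etc/hadoop/zookeeper/conf/zoo2.cfg
    /etc/hadoop/zookeeper/bin/zkServer.sh status /etc/hadoop/zookeeper/conf/zoo3.cfg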

7 Start Kylin

  • Download and extract the Kylin 4.0 binary package

    wget https://mirror-hk.koddos.net/apache/kylin/apache-kylin-4.0.0/apache-kylin-4.0.0-bin.tar.gz
    tar -xvf apache-kylin-4.0.0-bin.tar.gz -C /etc/hadoop
    export KYLIN_HOME=/etc/hadoop/apache-kylin-4.0.0-bin
    mkdir $KYLIN_HOME/ext
    cp mysql-connector-java-5.1.47.jar $KYLIN_HOME/ext
    
  • Edit the configuration file with vim $KYLIN_HOME/conf/kylin.properties

    kylin.metadata.url=kylin_metadata@jdbc,url=jdbc:mysql://hostname:3306/kylin,username=root,password=password,maxActive=10,maxIdle=10
    kylin.env.zookeeper-connect-string=hostname
    kylin.engine.spark-conf.spark.master=spark://hostname:7077
    kylin.engine.spark-conf.spark.submit.deployMode=client
    kylin.env.hdfs-working-dir=s3://bucket/kylin
    kylin.engine.spark-conf.spark.eventLog.dir=s3://bucket/kylin/spark-history
    kylin.engine.spark-conf.spark.history.fs.logDirectory=s3://bucket/kylin/spark-history
    kylin.query.spark-conf.spark.master=spark://hostname:7077
    
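    Note: unlike the Hive JDBC URL earlier, the metadata URL above does not include createDatabaseIfNotExist, so make sure a database named kylin exists in RDS before starting. A minimal sketch, assuming the mysql client is installed and the same RDS admin account is used:

    mysql -h hostname -u root -p -e "CREATE DATABASE IF NOT EXISTS kylin DEFAULT CHARACTER SET utf8;"
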
  • Run $KYLIN_HOME/bin/kylin.sh start

  • Kylin may fail to start with ClassNotFound-style errors. They can be resolved as follows before restarting Kylin:

    # Download commons-collections-3.2.2.jar
    cp commons-collections-3.2.2.jar $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib/
    # Download commons-configuration-1.3.jar
    cp commons-configuration-1.3.jar $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib/
    cp $HADOOP_HOME/share/hadoop/common/lib/aws-java-sdk-bundle-1.11.375.jar $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib/
    cp $HADOOP_HOME/share/hadoop/common/lib/hadoop-aws-3.2.0.jar $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib/
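
    If the startup succeeds, the Kylin web UI should respond on the default port 7070 (default credentials ADMIN / KYLIN):

    curl -I http://localhost:7070/kylin/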