Monday, February 21, 2011

Store Hive Metadata into RDBMS

The Metastore stores all the information about the tables, their partitions, the schemas, the columns and their types, the table locations etc. This information can be queried or modified using a thrift interface and as a result it can be called from clients in different programming languages. As this information needs to be served fast to the compiler, they have chosen to store this information on a traditional RDBMS.

By default, hive use an embedded Derby database to store metadata information. The metastore is the "glue" between Hive and HDFS. It tells Hive where your data files live in HDFS, what type of data they contain, what tables they belong to, etc.

The Metastore is an application that runs on an RDBMS and uses an open source ORM layer called DataNucleus, to convert object representations into a relational schema and vice versa. They chose this approach as opposed to storing this information in hdfs as they need the Metastore to be very low latency. The DataNucleus layer allows them to plugin many different RDBMS technologies.

Facebook uses MySQL to store metadata information.

Store Hive metadata into MySQL:

We need to perform following steps to setup a metastore in a MySQL server.  

      1.   Install mysql-server.
            sudo apt-get install mysql-server
      2.   Use su command. you will see password promt. Type your root
            userpassword.
      3.   If your password correct, you'll see like this; root@yourdesktopname#.
      4.   Now, Start mysql server.
            /etc/init.d/mysql start.
      5.   When mysql server started, type mysql -u root mysql.
      6.   Next, I'm going to create a new MySQL user for hadoop/hive.
            mysql> CREATE USER 'hadoop'@'localhost' IDENTIFIED BY 'hadoop';

            mysql> GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'localhost' WITH
            GRANT OPTION;

            mysql> exit;
      7.   su – userName  // In my case userName is hadoop. So, switch to that user.
      8.   mysql -h localhost -u userName -p.
      9.   Now, we need to change the hive configuration so it can use MySQL:
     10.  Go to hive/conf directory,  then open hive-default.xml and perform
            following changes.
           <!-- In local metastore setup, each Hive Client will open a connection to
           the datastore and make SQL queries against it.-->
           <property>
           <name>hive.metastore.local</name>
           <value>true</value>
           </property>

           <property>
           <name>javax.jdo.option.ConnectionURL</name>
           <value>
            jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
           </value>
           </property>

           <property>
           <name>javax.jdo.option.ConnectionDriverName</name>
           <value>com.mysql.jdbc.Driver</value>
           </property>

           <property>
           <name>javax.jdo.option.ConnectionUserName</name>
           <value>Username</value><!-- In my case UserName is hadoop-->
           </property>

           <property>
           <name>javax.jdo.option.ConnectionPassword</name>
           <value>UserPassword</value><!-- In my case password is hadoop-->
           </property>

      11. Hive needs to have the MySQL jdbc drivers, so we need to download and
      12. If all settings are done correctly, we can do this:-
            copy it to hive/lib folder.  
            impetus@ubuntu:~$ cd HIVE_HOME

            impetus@ubuntu:~/hive$ bin/hive
            Hive history file=/tmp/ankit 
            /hive_job_log_ankit_201102211937_456962737.txt

            hive>
   
            hive>create table temp(info INT);
            OK
            Time taken: 4.745 seconds

            hive> show tables;
            OK
            temp
            Time taken: 0.137 seconds
            hive>

      13. Again, start MySQL server.

            impetus@ubuntu:~$mysql -h localhost -u hadoop -p
            Enter password: 
            Welcome to the MySQL monitor.  Commands end with ; or \g.
            Your MySQL connection id is 84
            Server version: 5.1.41-3ubuntu12.8 (Ubuntu) 

            Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
            mysql>

      14. Use following command to view all existing databases.

            mysql> show databases;
            +--------------------+
            | Database           |
            +--------------------+
            | information_schema | 
            | hive               |
            | mysql              |
            +--------------------+
            3 rows in set (0.05 sec)

            mysql> use hive;   //Select database.
            Reading table information for completion of table and column names
            You can turn off this feature to get a quicker startup with -A
            Database changed
            mysql>

      15. Use following command to view all exiting tables.
   
            mysql> show tables;
           +--------------------+
            | Tables_in_hive     |
           +--------------------+
            | BUCKETING_COLS     |
            | COLUMNS            |
            | DBS                |
            | NUCLEUS_TABLES     |
            | PARTITIONS         |
            | PARTITION_KEYS     |
            | PARTITION_KEY_VALS |
            | PARTITION_PARAMS   |
            | SDS                |
            | SD_PARAMS          |
            | SEQUENCE_TABLE     |
            | SERDES             |
            | SERDE_PARAMS       |
            | SORT_COLS          |
            | TABLE_PARAMS       |
            | TBLS               |
            +--------------------+
            16   rows in set (0.00 sec)

            mysql>

      16. Use following command to view the metadata information of hive table
            columns.
  
            mysql> select * from COLUMNS;

            +-------+---------+-------------+-----------+-------------+
            |SD_ID|COMMENT|COLUMN_NAME|TYPE_NAME|INTEGER_IDX|
            +-------+---------+-------------+-----------+-------------+
            |   1 | NULL    | info        | int       |           0 |
            +-------+---------+-------------+-----------+-------------+
            1 rows in set (0.00 sec)
            mysql>
            //Where info is the column of temp table.

      17. Similarly, you can View the information of partiotions, Bucketing,
            tables, etc.