Friday, April 29, 2011

Build Hive Source code

We need to perform the following steps to build the Hive source code (a condensed shell sketch follows the list).
1. Download the Hive source code.
2. Go to the hive source directory and run 'ant clean'.
3. Remove the contents of ~/.ant/cache/hadoop/core/sources.
4. Copy the following files to ~/.ant/cache/hadoop/core/sources:
            hadoop-0.17.2.1.tar.gz
            hadoop-0.17.2.1.tar.gz.asc
            hadoop-0.18.3.tar.gz
            hadoop-0.18.3.tar.gz.asc
            hadoop-0.19.0.tar.gz
            hadoop-0.19.0.tar.gz.asc
            hadoop-0.20.0.tar.gz
            hadoop-0.20.0.tar.gz.asc
5. If you do not care about support for Hadoop 0.17.2.1, 0.18, etc., you can disable support for these versions (and skip the downloads) by removing the references to them from shims/build.xml and shims/ivy.xml.
6. Go back to the hive source directory and run 'ant tar'.
7. Go to the build directory under the hive source tree, where you will find hive-*.tar.gz.
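
Putting the steps together, the whole build looks roughly like this (a condensed sketch; the checkout URL and the tarball locations are assumptions for illustration):

            $ svn checkout http://svn.apache.org/repos/asf/hive/trunk hive    # step 1 (or unpack a source release)
            $ cd hive
            $ ant clean                                        # step 2
            $ rm -rf ~/.ant/cache/hadoop/core/sources/*        # step 3
            $ cp ~/downloads/hadoop-0.*.tar.gz* ~/.ant/cache/hadoop/core/sources/   # step 4
            $ ant tar                                          # step 6
            $ ls build/hive-*.tar.gz                           # step 7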

Monday, February 21, 2011

Store Hive Metadata into RDBMS

The Metastore stores all the information about the tables: their partitions, schemas, columns and their types, table locations, and so on. This information can be queried or modified through a Thrift interface, so it can be called from clients written in different programming languages. Because this information needs to be served to the compiler quickly, the Hive developers chose to store it in a traditional RDBMS.

By default, Hive uses an embedded Derby database to store metadata. The metastore is the "glue" between Hive and HDFS: it tells Hive where your data files live in HDFS, what type of data they contain, what tables they belong to, and so on.

The Metastore is an application that runs on an RDBMS and uses an open-source ORM layer called DataNucleus to convert object representations into a relational schema and vice versa. This approach was chosen over storing the information in HDFS because the Metastore needs to be very low latency. The DataNucleus layer also makes it possible to plug in many different RDBMS technologies.

Facebook uses MySQL to store metadata information.
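
For example, everything the Hive CLI prints for the statement below (column types, the table's HDFS location, its input/output formats) is served out of the metastore; temp here is just an illustrative table name:

            hive> DESCRIBE EXTENDED temp;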

Store Hive metadata into MySQL:

We need to perform the following steps to set up the metastore in a MySQL server.

      1.   Install mysql-server.
            sudo apt-get install mysql-server
      2.   Run the su command. At the password prompt, type your root
            user's password.
      3.   If the password is correct, the prompt changes to something like root@yourdesktopname#.
      4.   Now start the MySQL server.
            /etc/init.d/mysql start
      5.   Once the MySQL server has started, type mysql -u root mysql.
      6.   Next, I'm going to create a new MySQL user for hadoop/hive.
            mysql> CREATE USER 'hadoop'@'localhost' IDENTIFIED BY 'hadoop';

            mysql> GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'localhost' WITH
            GRANT OPTION;

            mysql> exit;
      7.   su - userName  // In my case userName is hadoop, so switch to that user.
      8.   mysql -h localhost -u userName -p
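            Before wiring this account into Hive, you can sanity-check its privileges
            from the mysql prompt you just opened (the output will vary):

            mysql> SHOW GRANTS FOR 'hadoop'@'localhost';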
      9.   Now we need to change the Hive configuration so it can use MySQL.
     10.  Go to the hive/conf directory, then open hive-default.xml and make the
            following changes.
           <!-- In a local metastore setup, each Hive client opens a connection to
           the datastore and runs SQL queries against it directly. -->
           <property>
             <name>hive.metastore.local</name>
             <value>true</value>
           </property>

           <property>
             <name>javax.jdo.option.ConnectionURL</name>
             <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
           </property>

           <property>
             <name>javax.jdo.option.ConnectionDriverName</name>
             <value>com.mysql.jdbc.Driver</value>
           </property>

           <property>
             <name>javax.jdo.option.ConnectionUserName</name>
             <value>UserName</value><!-- In my case the user name is hadoop -->
           </property>

           <property>
             <name>javax.jdo.option.ConnectionPassword</name>
             <value>UserPassword</value><!-- In my case the password is hadoop -->
           </property>

      11. Hive needs the MySQL JDBC driver, so download the mysql-connector-java
            jar and copy it to the hive/lib folder (see the sketch after the CLI
            session below).
      12. If all the settings are correct, we can do this:
            impetus@ubuntu:~$ cd hive

            impetus@ubuntu:~/hive$ bin/hive
            Hive history file=/tmp/ankit/hive_job_log_ankit_201102211937_456962737.txt

            hive> create table temp(info INT);
            OK
            Time taken: 4.745 seconds

            hive> show tables;
            OK
            temp
            Time taken: 0.137 seconds
            hive>
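
            For step 11, the driver jar just needs to land in hive/lib. Something
            like the following, assuming a mysql-connector-java release has already
            been downloaded (the version number here is only an example):

            impetus@ubuntu:~$ cp mysql-connector-java-5.1.15-bin.jar hive/lib/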

      13. Now connect to the MySQL server again.

            impetus@ubuntu:~$mysql -h localhost -u hadoop -p
            Enter password: 
            Welcome to the MySQL monitor.  Commands end with ; or \g.
            Your MySQL connection id is 84
            Server version: 5.1.41-3ubuntu12.8 (Ubuntu) 

            Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
            mysql>

      14. Use the following command to view all existing databases.

            mysql> show databases;
            +--------------------+
            | Database           |
            +--------------------+
            | information_schema | 
            | hive               |
            | mysql              |
            +--------------------+
            3 rows in set (0.05 sec)

            mysql> use hive;   -- select the hive database
            Reading table information for completion of table and column names
            You can turn off this feature to get a quicker startup with -A
            Database changed
            mysql>

      15. Use the following command to view all existing tables.
   
            mysql> show tables;
            +--------------------+
            | Tables_in_hive     |
            +--------------------+
            | BUCKETING_COLS     |
            | COLUMNS            |
            | DBS                |
            | NUCLEUS_TABLES     |
            | PARTITIONS         |
            | PARTITION_KEYS     |
            | PARTITION_KEY_VALS |
            | PARTITION_PARAMS   |
            | SDS                |
            | SD_PARAMS          |
            | SEQUENCE_TABLE     |
            | SERDES             |
            | SERDE_PARAMS       |
            | SORT_COLS          |
            | TABLE_PARAMS       |
            | TBLS               |
            +--------------------+
            16 rows in set (0.00 sec)

            mysql>

      16. Use the following command to view the metadata for a Hive table's
            columns.
  
            mysql> select * from COLUMNS;

            +-------+---------+-------------+-----------+-------------+
            | SD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX |
            +-------+---------+-------------+-----------+-------------+
            |     1 | NULL    | info        | int       |           0 |
            +-------+---------+-------------+-----------+-------------+
            1 row in set (0.00 sec)
            mysql>
            -- Here info is the column of the temp table created earlier.

      17. Similarly, you can view the metadata for partitions, bucketing columns,
            tables, and so on; a sample join query is sketched below.
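
            For instance, to see which table each column belongs to, TBLS and
            COLUMNS can be joined through their shared SD_ID (a sketch against the
            schema above; exact column names can vary across Hive versions):

            mysql> SELECT t.TBL_NAME, c.COLUMN_NAME, c.TYPE_NAME
                -> FROM TBLS t JOIN COLUMNS c ON t.SD_ID = c.SD_ID;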

Sunday, November 21, 2010

Hive Basic

Hive Introduction:-

Hive is a data warehouse infrastructure built on top of Hadoop that provides tools for easy data summarization, ad hoc querying, and analysis of large datasets stored in Hadoop files.



The size of the data sets being collected and analyzed in industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop is a popular open-source map-reduce implementation used at companies like Yahoo and Facebook to store and process extremely large data sets on commodity hardware. However, the map-reduce programming model is very low level and requires developers to write custom programs that are hard to maintain and reuse. In this blog, we present Hive, an open-source data warehousing solution built on top of Hadoop. Hive supports queries expressed in a SQL-like declarative language, HiveQL, which are compiled into map-reduce jobs executed using Hadoop.


Hive Installation:-

1. Requirements:-
  • Java 1.6
  • Hadoop 0.17.x to 0.20.x

2. Download a stable release tarball of Hive (hive-x.y.z.tar.gz).

3. You need to unpack the tarball.
$ tar -xzvf hive-x.y.z.tar.gz

4. Set the environment variable HIVE_HOME to point to the installation directory

5. Hive uses Hadoop, which means:
  • you must have hadoop in your path OR
  • export HADOOP_HOME=<hadoop-install-dir>
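
For example, in ~/.bashrc (the paths are illustrative):

export HIVE_HOME=/home/hadoop/hive
export HADOOP_HOME=/home/hadoop/hadoop
export PATH=$HIVE_HOME/bin:$HADOOP_HOME/bin:$PATH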

6.  In addition, you must create /tmp and /user/hive/warehouse and set them chmod g+w in HDFS before a table can be created in Hive.

$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

Hive Command Line Interface:-

$ $HIVE_HOME/bin/hive

DDL Operation:-

Create Hive Table:-
1.     hive> Create table info (name INT, add STRING);

2.     hive> Create table data (name INT, add STRING) PARTITIONED BY (date STRING);
(Creates a table called data with two columns and a partition column called date. The partition column is a virtual column: it is not part of the data itself but is derived from the partition that a particular dataset is loaded into. Loaded partitions can be listed as shown below.)
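
Once datasets have been loaded into specific partitions (see the LOAD DATA examples below), the partitions can be listed:

hive> SHOW PARTITIONS data;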

Show table:-

1.     hive> Show tables;

2.     hive> Show tables '.*s';
(Lists all the tables that end with 's'.)

Describe table:-
1.     hive> Describe info;
(Shows the list of all columns.)

Alter table:-
1.     hive> Alter table info ADD COLUMNS (PNO INT);

2.     hive> Alter table info RENAME TO information;

Drop table:-
1.     hive> drop table data;

DML Operation:-

Load data:-
1.     hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE info;
(Loads a file that contains two columns separated by ctrl-A into the info table. 'LOCAL' signifies that the input file is on the local file system; if 'LOCAL' is omitted, Hive looks for the file in HDFS.)

2.     hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE data PARTITION (date='20-11-2010');

3.     hive> LOAD DATA INPATH 'hdfs://localhost:9000/user/hive/warehouse/kv2.txt' OVERWRITE INTO TABLE data PARTITION (date='21-11-2010');
(The above command loads data from an HDFS file/directory into the table; note that loading from HDFS moves the file/directory rather than copying it. You can verify the move as shown below.)
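
To verify the move, list the table's partition directory afterwards; the source file will have disappeared from its original location (the path below assumes the default warehouse location):

$ $HADOOP_HOME/bin/hadoop fs -ls /user/hive/warehouse/data/date=21-11-2010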

SQL Operation:-

Select:-
1.     hive> Select * from info;

2.     hive> SELECT a.name FROM data a WHERE (a.date='20-11-2010');
(This queries the partitioned data table, since date is its partition column.)


Count:-
1.     hive> SELECT COUNT(*) FROM info;

Creating custom UDFs:-

1.     Create a new class that extends UDF, with one or more methods named evaluate.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Lower extends UDF {
    // Hive calls evaluate() once per row; returning null propagates SQL NULL.
    public Text evaluate(final Text s) {
        if (s == null) { return null; }
        return new Text(s.toString().toLowerCase());
    }
}
(Note that there is already a built-in function for this; it is just an easy example.)

2.     Now compile the class and package it into a jar (a sketch follows).
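
A typical compile-and-package sketch, assuming the UDF above is saved as Lower.java (the exact hive-exec and hadoop-core jar names depend on your installed versions):

$ javac -cp $HIVE_HOME/lib/hive-exec-0.6.0.jar:$HADOOP_HOME/hadoop-0.20.2-core.jar Lower.java
$ jar cf my_jar.jar Lower.class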

3.     Deploy the jar for your User Defined Function:
            hive> add jar my_jar.jar;
            Added my_jar.jar to class path

4.     Once Hive is started up with your jar in the classpath, the final step is to register your function:
            hive> create temporary function my_lower as 'Lower';

5.     Now you can start using it:
            hive> select my_lower(title), sum(freq) from titles group by my_lower(title);