This tutorial assumes that you have successfully installed Hadoop & SQOOP and both are up and running.
Let's start with a simple example of importing data from MySQL to HDFS. Sqoop can import data with a minimal set of instructions, so we will begin there and then move on to more advanced options.
We have a MySQL table like this:
CREATE TABLE IF NOT EXISTS `employees` (
`id` float NOT NULL AUTO_INCREMENT,
`name` varchar(255) NOT NULL,
`department` varchar(255) NOT NULL,
`salary` float NOT NULL,
`created_at` datetime NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Or a Postgres table like this:
CREATE TABLE employees (
id integer NOT NULL,
name character varying(255),
department character varying,
salary numeric,
created_at timestamp without time zone
);
Insert some rows into it:
INSERT INTO `employees` VALUES
(1, 'Vineet Sethi', 'HR', 25000, '2013-02-25 00:00:00'),
(2, 'Monika Sethi', 'Finance', 15000, '2014-02-21 00:00:00'),
(3, 'Aleen Sethi', 'Admin', 25700.5, '2013-04-02 00:00:00'),
(4, 'Rahul Sethi', 'Finance', 20000, '2013-05-07 00:00:00');
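If you created the Postgres table instead, the same rows can be inserted without the MySQL backticks; a minimal sketch using the column names defined above:
INSERT INTO employees (id, name, department, salary, created_at) VALUES
(1, 'Vineet Sethi', 'HR', 25000, '2013-02-25 00:00:00'),
(2, 'Monika Sethi', 'Finance', 15000, '2014-02-21 00:00:00'),
(3, 'Aleen Sethi', 'Admin', 25700.5, '2013-04-02 00:00:00'),
(4, 'Rahul Sethi', 'Finance', 20000, '2013-05-07 00:00:00');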
You can use Sqoop with any JDBC-compliant database. For MySQL, download the MySQL JDBC driver connector from http://dev.mysql.com/downloads/connector/j/, untar it, and move mysql-connector-java-5.1.26-bin.jar into your SQOOP_HOME/lib. For Postgres, download the Postgres JDBC driver connector from http://jdbc.postgresql.org/download.html and move postgresql-9.2-1003.jdbc4.jar (in my case) into your SQOOP_HOME/lib, so that it is available to Sqoop and the MapReduce job.
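For example, after downloading the MySQL connector tarball, the extract-and-copy step might look like this (a sketch; adjust the file names, version and SQOOP_HOME to your setup):
#tar -xzf mysql-connector-java-5.1.26.tar.gz
#cp mysql-connector-java-5.1.26/mysql-connector-java-5.1.26-bin.jar $SQOOP_HOME/lib/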
That's it! We are ready to import the data to HDFS. Run Sqoop with minimal instructions to import data from MySQL/Postgres to HDFS like this:
Minimal import command
For MySQL
#sqoop import --connect jdbc:mysql://<HOST>/<DATABASE> --table employees --username <DB USERNAME> -P
For Postgres
#sqoop import --connect jdbc:postgresql://<HOST>/<DATABASE> --table employees --username <DB USERNAME> -P
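For example, with the host ubuntu, the database mr_data and the user root (the values used in the rest of this tutorial), the MySQL command would be:
#sqoop import --connect jdbc:mysql://ubuntu/mr_data --table employees --username root -P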
This command will trigger a MapReduce job that imports the data from the MySQL table to HDFS.
--connect takes a JDBC connection string; here we have used the MySQL one. Don't forget to replace <HOST> and <DATABASE> with yours.
--table is the table to import. It can also be a view.
--username is the database username. Don't forget to replace it with yours.
-P prompts for the password on the console rather than passing it as an argument to the command; it is preferred over the --password argument.
By default, Sqoop will import a table named employees to a directory named employees inside your home directory in HDFS. For example, if your username is hduser, then the import tool will write to /user/hduser/employees/(files). Since we are using a very small dataset, we can reduce the parallelism, and hence the number of output files generated, by minimizing the number of mappers for this job. You can specify the number of map tasks (parallel processes) used to perform the import with the -m or --num-mappers argument; it controls the degree of parallelism to employ. So let's modify the above command a bit (for cleaner output) by specifying the -m option, plus --delete-target-dir in case the /user/hduser/employees dir already exists.
#sqoop import --connect jdbc:mysql://ubuntu/<DATABASE> --table employees --username <DB USERNAME> -P -m 1 --delete-target-dir
After the job is run, the dir /user/hduser/employees looks like this
#hadoop fs -ls /user/hduser/employees/
Found 2 items
-rw------- 3 hduser supergroup 0 2013-08-28 04:56 /user/hduser/employees/_SUCCESS
-rw------- 3 hduser supergroup 211 2013-08-28 04:55 /user/hduser/employees/part-m-00000
and looking at the contents of /user/hduser/employees/part-m-00000
#hadoop fs -cat /user/hduser/employees/part-m-00000
1.0,Vineet Sethi,HR,25000.0,2013-02-25 00:00:00.0
2.0,Monika Sethi,Finance,15000.0,2014-02-21 00:00:00.0
3.0,Aleen Sethi,Admin,25700.5,2013-04-02 00:00:00.0
4.0,Rahul Sethi,Finance,20000.0,2013-05-07 00:00:00.0
You can see that, by default, Sqoop creates a delimited text file with ',' as the field separator character and newline (\n) as the end-of-line character.
Changing the delimiters
If we want to delimit the text file with '$' as the field separator and '|' as the end-of-line character, pass the options like this:
#sqoop import --connect jdbc:mysql://ubuntu/mr_data --table employees --username root -P -m 1 --delete-target-dir --fields-terminated-by \$ --lines-terminated-by \|
You might have to escape special shell characters with a backslash '\'. The output of the above command would look like this:
1.0$Vineet Sethi$HR$25000.0$2013-02-25 00:00:00.0|2.0$Monika Sethi$Finance$15000.0$2014-02-21 00:00:00.0|3.0$Aleen Sethi$Admin$25700.5$2013-04-02 00:00:00.0|4.0$Rahul Sethi$Finance$20000.0$2013-05-07 00:00:00.0|
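If the data itself can contain the delimiter character, Sqoop can also enclose or escape field values. A sketch using the --enclosed-by and --escaped-by options (not run against the sample data above, so treat it as illustrative):
#sqoop import --connect jdbc:mysql://ubuntu/mr_data --table employees --username root -P -m 1 --delete-target-dir --fields-terminated-by , --escaped-by \\ --enclosed-by '\"'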
Column selection & ordering
By default, all the table columns are imported in their natural order. You can limit the columns to import and change their ordering with the --columns option. So, if we just want to include the name, department & salary fields while skipping id & created_at:
#sqoop import --connect jdbc:mysql://ubuntu/mr_data --table employees --username root -P -m 1 --delete-target-dir --columns "name,department,salary"
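With only those three columns included, the resulting part file should look something like this:
Vineet Sethi,HR,25000.0
Monika Sethi,Finance,15000.0
Aleen Sethi,Admin,25700.5
Rahul Sethi,Finance,20000.0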
Output file formats
Delimited text is the default file format. The other file formats into which the data can be imported are the SequenceFile format and the Avro format. You can use --as-sequencefile or --as-avrodatafile respectively to change the output file format. SequenceFiles are a binary format that stores individual records in custom record-specific data types and is appropriate for storing binary data. MapReduce programs read and write this format faster, giving a performance boost. Avro data files are a compact, efficient binary format that provides interoperability with applications written in other programming languages.
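For example, to import the same table as Avro data files instead of delimited text:
#sqoop import --connect jdbc:mysql://ubuntu/mr_data --table employees --username root -P -m 1 --delete-target-dir --as-avrodatafile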
Incremental Imports
Sqoop provides incremental imports so that only the rows which are new are appended to the already imported set of rows. So, let's say you add 4 more rows to your table:
INSERT INTO `employees` (`id`, `name`, `department`, `salary`, `created_at`) VALUES
(5, 'Fabio Russo', 'HR', 25000, '2013-02-25 00:00:00'),
(6, 'Nick Villaume', 'Finance', 15000, '2014-02-21 00:00:00'),
(7, 'Mary Villaume', 'Admin', 25700.5, '2013-04-02 00:00:00'),
(8, 'John Bat', 'Finance', 20000, '2013-05-07 00:00:00');
Now you want to import these rows incrementally. Run the command with the options --check-column <column name>, --incremental <append or lastmodified> and --last-value <value>:
#sqoop import --connect jdbc:mysql://ubuntu/mr_data --table employees --username root -P -m 1 --check-column "id" --incremental "append" --last-value 4
This command will select the rows whose id > 4 and append them to the previously imported result. So, if we look at the /user/hduser/employees dir:
Found 3 items
-rw------- 3 hduser supergroup 0 2013-08-30 04:31 /user/hduser/employees/_SUCCESS
-rw------- 3 hduser supergroup 107 2013-08-30 04:31 /user/hduser/employees/part-m-00000
-rw------- 3 hduser supergroup 210 2013-08-30 04:53 /user/hduser/employees/part-m-00001
You can see a new file, part-m-00001, has been added alongside the previously imported records; its content is:
5.0,Fabio Russo,HR,25000.0,2013-02-25 00:00:00.0
6.0,Nick Villaume,Finance,15000.0,2014-02-21 00:00:00.0
7.0,Mary Villaume,Admin,25700.5,2013-04-02 00:00:00.0
8.0,John Bat,Finance,20000.0,2013-05-07 00:00:00.0
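If you run incremental imports regularly, Sqoop's saved jobs can remember the last imported value for you, so you don't have to pass --last-value by hand every time. A sketch (the job name employees_incr is just an example, and Sqoop may still prompt for the database password when the job runs):
#sqoop job --create employees_incr -- import --connect jdbc:mysql://ubuntu/mr_data --table employees --username root -m 1 --check-column "id" --incremental "append" --last-value 0
#sqoop job --exec employees_incr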
Of course, Sqoop is highly customizable, and we have seen how to tune it further according to our needs. You can see the remaining options by typing
# sqoop import --help
and that will give you many more options to use with the command.
I hope this tutorial is a good starting point and helps you. To reach out, you can contact me at developer.vineet@gmail.com or @vineetsethi25 on Twitter. You can also check out this content on my blog at http://www.technologywithvineet.com