Problem:
The steps that I take:
create table using the following HQL commands:
CREATE TABLE 10projects(......)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
The table is created and raw data is loaded into the table, but the data reads like:
select * from 10projects:
"""e565fb42185c6e9f22806ad9d5ac8a77"""
"""2e17c8c91cb58132d8103a9aa8797e80"""
"""45e7ddbdd7023f1eb65a6cc028d7414f"""
360009001332 40.841691
-73.875457 Bronx
NY 10460
urban New York City Dept Of Ed
Bronx f
f f
f f
f Mr.
f f
Literacy Literacy & Language
Books highest poverty
Grades 9-12 NULL
NULL NULL
NULL 280.02
341.49 0
308.0 1
f f
completed 2007-03-08
2007-03-08 2007-03-08
2003-12-31
The double quotes, such as """e565fb42185c6e9f22806ad9d5ac8a77""", is not expected to appear.
I expected results to be:
e565fb42185c6e9f22806ad9d5ac8a77 2e17c8c91cb58132d8103a9aa8797e80
45e7ddbdd7023f1eb65a6cc028d7414f
360009001332 40.841691
-73.875457 Bronx
NY 10460
urban New York City Dept Of Ed
Bronx f
f f
f f
f Mr.
f f
Literacy Literacy & Language
Books highest poverty
Grades 9-12 NULL
NULL NULL
NULL 280.02
341.49 0
308.0 1
f f
completed 2007-03-08
2007-03-08 2007-03-08
2003-12-31
SOLUTION:
1. Using CSV SerDe:
Step 1: download
JAR HERE
Step 2: Add jar path to HIVE class path when lunching HIVE.
$ hive --auxpath /path/to/hive-examples.jar
Step 3: Create table using the serde as row format.
CREATE TABLE 10projects2 (projectid STRING, teacher_acctid STRING)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
stored as textfile;
Results: '''' double quotes from triple quotes """ are eliminated, but " one quote still remain.
SOURCE: Hadoop Guide pp.451.
2. Using sed -i linux streaming editor.
SOURCE:
Linux sed
The s
Command:
Syntax: s / regexp / replacement /flags
The s command can be followed by zero or more of the following flags:
g
Apply the replacement to all matches to the regexp, not just the first.
For this case:
sed -i 's/"""//g' opendata_essays.log