A shell script runs a Python script, connects to Hive, and writes the submitted data into a table

Usage

1.cd /opt/zy
Run the commands in this directory as root.
2.
Query in SAP with:
Tcode: ZMMR0005
Purchase Org: *
PO Creating: 2017/3/1 (start date) 2017/6/31 (end date)
Vendor: 1000341
plant: *

A query like this returns every record whose ship date falls within 20170301-20170631, regardless of which month the arrival date lands in.

Export the data table from SAP and save it as a txt file delimited by "\t".
Upload the exported file to the /opt/zy directory with the rz command.
3. Run the command. Note that the argument must strictly follow the XXXXXXXXtoYYYYYYYY format, meaning startdate to enddate.
example:
[root@slave1 zy]# bash try2.sh 20170301to20170632
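Since the scripts split on the literal "to" without further checks, a stricter parse can catch malformed arguments early. A minimal sketch (the function name parse_range is my own, not part of the scripts below):

```python
from datetime import datetime

def parse_range(arg):
    # Split "XXXXXXXXtoYYYYYYYY" into start and end, as the scripts do with split("to")
    start, end = arg.split("to")
    # Reject halves that are not real YYYYMMDD dates
    for part in (start, end):
        datetime.strptime(part, "%Y%m%d")
    return start, end

print(parse_range("20170301to20170630"))
```

Under validation like this, the example argument 20170301to20170632 would be rejected (June has no day 32); the scripts themselves do not validate.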
4. Query the analysis result in Hue:
SELECT * from saplifttime WHERE querypocredatestart='XXXXXXXX' [and querypocredateend='YYYYYYYY'];
and run the command.
5. To inspect the raw data, query the pcg.sap table:
SELECT * from sap WHERE querypocredatestart='20170301';

Screenshot of the run:
(result screenshot)

Technical implementation

A shell script calls a Python script.

Shell script: try2.sh

#!/bin/sh
#echo $1
daterange=$1   # keep the argument in a variable for the substring extraction below
python3 /opt/zy/runtask.py $1   # run the Python script
startdate=${daterange:0:8}   # extract the query start date
#echo $startdate
enddate=${daterange:10:8}   # extract the query end date (8 chars after "to")
#echo $enddate
sed -i '1,3d' /opt/zy/$1.txt   # delete the first three lines, which are blank
sed 's/.\{1\}//' $1.txt > $1regular.txt   # strip the first character of each line, because the first column is empty
hdfs dfs -put -f /opt/zy/$1regular.txt /user/hive/pcg-data/zhouyi6_files   # upload the local file to the Hadoop cluster
hive -e "LOAD DATA INPATH '/user/hive/pcg-data/zhouyi6_files/$1regular.txt' INTO TABLE pcg.sap partition(querypocredatestart='$startdate',querypocredateend='$enddate')"   # load the file's data into the table
rm $1.txt   # remove the original local file, keeping only the reformatted copy

Notes:
1. Without -i, sed does not modify the file itself, so the second sed writes its result to a new file with the regular suffix.
2. In sed -i, -i edits the file in place instead of printing the result to the terminal.
3. hdfs dfs -put -f
The -f option will overwrite the destination if it already exists.
4. Running this script assumes the pcg.sap table has already been created; the DDL is:

CREATE TABLE SAP(`PO Cre Date` string,
`Vendor` string, 
`WW Partner` string, 
`Name of Vendor` string,
`PO Cre by` string, 
`Purch Doc Type` string,
`Purch Order` string,
`PO Item` string,
`Deletion Indicator in PO Item` string, 
`Request Shipment Day` string,
`Material` string,
`Short Text` string, 
`Plant` string, 
`Issuing Stor location` string,
`Receive Stor loaction` string, 
`PO item change date` string, 
`Delivery Priority` string,
`PO Qty` string,
`Total GR Qty` string,
`Still to be delivered` string,
`Delivery Note` string,
`Delivery Note Type (ASN or DN)` string, 
`Delivery Note item` string,
`Delivery Note qty` string, 
`Delivery Note Creation Date` string,
`Delivery Note ACK Date` string, 
`Incoterm` string, 
`Part Battery Indicator` string,
`BOL/AWBill` string, 
`Purchase order type` string, 
`Gr Date` string) 
partitioned by (`queryPoCreDateStart` string,`queryPoCreDateEnd` string)
row format delimited fields terminated by "\t" stored as textfile

Python script: runtask.py

import pandas as pd
import sys
data = pd.read_csv(sys.argv[1] + ".txt", sep="\t")
#print(data.columns)
data['Delivery Note Creation Date'] = pd.to_datetime(data['Delivery Note Creation Date'], format='%d.%m.%Y')
data['Gr Date'] = pd.to_datetime(data['Gr Date'], format='%d.%m.%Y')
data = data.drop(data[data['Delivery Note Creation Date'].isnull()].index.tolist())  # drop rows where this column is null
data = data.drop(data[data['Gr Date'].isnull()].index.tolist())  # drop rows where this column is null
data['delta'] = (data['Gr Date'] - data['Delivery Note Creation Date']).apply(lambda x: x.days)  # transit time difference
print(data['delta'].describe())
#sql_content="insert into table saplifttime values(%,%s,%s,%s,%s,%s,%s,%s,%s,%s)"%\
import hdfs
from impala.dbapi import connect
filename = sys.argv[1] + ".txt"
hdfspath = '/user/hive/pcg-data/zhouyi6_files'
client = hdfs.Client("http://10.100.208.222:50070")  # 50070
# 8888 is the port I use when logging in to the web UI
#print(client.status("/user/zhouyi", strict=True))  # inspect path info
#print(client.list("/user/zhouyi"))  # list the files in the directory
#client.upload(hdfs_path=hdfspath, local_path="/opt/zy/"+filename, overwrite=True)
# overwrite=True means delete any uploaded files if an error occurs during the upload.
conn = connect(host='10.100.208.222', port=21050, database='pcg')
cur = conn.cursor()
stdate, edate = sys.argv[1].split("to")
#print(sys.argv[1])
cnt = str(data['delta'].describe()[0])
mean = str(data['delta'].describe()[1])
std = str(data['delta'].describe()[2])
mini = str(data['delta'].describe()[3])
twentyfive = str(data['delta'].describe()[4])
fifty = str(data['delta'].describe()[5])
seventyfive = str(data['delta'].describe()[6])
maxm = str(data['delta'].describe()[7])
args = [stdate, edate, cnt, mean, std, mini, twentyfive, fifty, seventyfive, maxm]
print(args)

# a working SQL example
#sql_content="insert into table saplifttime values("+str(5555)+",'20200607','22','4.2','9.88','1','2','5','10','9999999999999')"
sql_content = "insert into table saplifttime values(?,?,?,?,?,?,?,?,?,?)"
cur.execute(sql_content, args)  # insert the computed result into pcg.saplifttime
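The eight separate describe() calls in the script above recompute the same statistics each time. A sketch of computing them once, on a toy series standing in for data['delta'] (values invented):

```python
import pandas as pd

s = pd.Series([3, 0, 5, 2])   # toy stand-in for data['delta']
desc = s.describe()           # count, mean, std, min, 25%, 50%, 75%, max, in that order
args_stats = [str(v) for v in desc.tolist()]
print(args_stats)             # eight strings, beginning with the count
```

desc.tolist() preserves describe()'s ordering, so the list lines up with the cnt..maxm columns of saplifttime.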

Notes:
1. cur.execute assumes the pcg.saplifttime table has already been created; the DDL is:

CREATE TABLE SAPLifttime(querypocredatestart STRING, querypocredateend STRING, cnt STRING,
mean STRING, std STRING, minimum STRING, 25percent STRING, 50percent STRING,
75percent STRING, maxmum STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" STORED AS Textfile

2. Calculation logic:
Step 1: treat the "Delivery Note Creation Date" field as the ship date; drop any row where it is empty
Step 2: treat the "Gr Date" field as the arrival date; drop any row where it is empty
Step 3: transit time = Gr Date - Delivery Note Creation Date
Step 4: compute cnt, mean, std, minimum, 25%, 50%, 75%, maxmum over the transit times
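The four steps can be sketched on a couple of toy rows in the DD.MM.YYYY format of the SAP export (the dates are invented for illustration):

```python
import pandas as pd

data = pd.DataFrame({
    "Delivery Note Creation Date": ["01.03.2017", "05.03.2017", None],
    "Gr Date": ["04.03.2017", "05.03.2017", "10.03.2017"],
})
data["Delivery Note Creation Date"] = pd.to_datetime(data["Delivery Note Creation Date"], format="%d.%m.%Y")
data["Gr Date"] = pd.to_datetime(data["Gr Date"], format="%d.%m.%Y")
# Steps 1-2: drop rows where either date is missing
# (dropna is an equivalent shortcut for the drop-by-index calls in runtask.py)
data = data.dropna(subset=["Delivery Note Creation Date", "Gr Date"])
# Step 3: transit time in days
data["delta"] = (data["Gr Date"] - data["Delivery Note Creation Date"]).dt.days
# Step 4: summary statistics
print(data["delta"].describe())
```

The third toy row is dropped because its ship date is missing, leaving transit times of 3 and 0 days.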

Pitfalls:
1. All my table columns are STRING. For the values placeholders I first tried %s and %d, but they never matched the format of the corresponding Python values. Switching to ? placeholders fixed it.
2. Writing cur.execute(sql, args) this way is much clearer: there is no need to concatenate a very long SQL string, which is very easy to get wrong.
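The qmark pattern from pitfall 1 can be demonstrated without a live Impala connection. sqlite3 below stands in only because it also accepts ? placeholders; the table and values are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE saplifttime (querypocredatestart TEXT, cnt TEXT)")
args = ["20170301", "55"]
# The driver substitutes each ? with the matching element of args;
# no manual string concatenation or quoting is needed
cur.execute("INSERT INTO saplifttime VALUES (?, ?)", args)
cur.execute("SELECT * FROM saplifttime")
rows = cur.fetchall()
print(rows)
```

Whether a given driver accepts qmark, %s, or both depends on its DB-API paramstyle; the author reports ? working with impyla here.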
