Tick Data Research

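I hear about tick data all the time and have used it in backtests, but I had never actually processed it myself. Tick data is said to be full of pitfalls, so I decided to look into it. The first step is to build bars from known-good tick data, to understand the details. The second is to receive ticks through CTP myself and see what pitfalls CTP has.

The clean tick data used here comes from Wind.

The data exported from Wind looks normal: two records per second. As we know, what the exchange sends us is not true tick-by-tick data but snapshots, that is, one slice every 500 milliseconds. All market data software is in fact built on this snapshot data.

Tick data has other fields too, such as ask and bid, but the most important are last_price and volume. last_price is easy to understand: it is the traded price at the moment of the snapshot. As for volume, let's look at its curve:

So the volume in tick data is cumulative traded volume, and the trading day starts when the night session opens at 21:00. For products without a night session, it starts at 09:00 the next morning.

So how do we turn ticks into minute data, that is, ticks into bars?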

# encoding=utf-8
import pandas as pd
from matplotlib import pyplot as plt
from matplotlib.dates import date2num
# matplotlib.finance has been removed from matplotlib; the mplfinance package
# still ships the old candlestick_ohlc function under original_flavor.
import mplfinance.original_flavor as mpf

# Tick data indexed by timestamp, with a 'last' (last price) column.
tick_df = pd.read_hdf('rb_tick.h5')


class mBar(object):
    """A one-minute OHLC bar."""
    def __init__(self):
        self.open = None
        self.close = None
        self.high = None
        self.low = None
        self.datetime = None


def minute_of(dt):
    """Truncate a timestamp to the start of its minute."""
    return dt.replace(second=0, microsecond=0)


bar = None
m_bar_list = list()

for datetime, last in tick_df[['last']].iterrows():
    new_minute_flag = False

    if bar is None:  # first pass through the loop
        bar = mBar()
        new_minute_flag = True
    elif minute_of(bar.datetime) != minute_of(datetime):
        # Compare whole timestamps rather than only .minute, so ticks that
        # are an exact number of hours apart do not land in the same bar.
        bar.datetime = minute_of(bar.datetime)  # stamp the bar with its start time
        m_bar_list.append(bar)
        # start a new one-minute bar
        bar = mBar()
        new_minute_flag = True

    if new_minute_flag:
        bar.open, bar.high, bar.low = last['last'], last['last'], last['last']
    else:
        bar.high, bar.low = max(bar.high, last['last']), min(bar.low, last['last'])

    bar.close = last['last']
    bar.datetime = datetime

# The loop only appends a bar when the next minute begins, so flush the last one.
if bar is not None:
    bar.datetime = minute_of(bar.datetime)
    m_bar_list.append(bar)

pk_df = pd.DataFrame(
    [[b.datetime, b.open, b.high, b.low, b.close] for b in m_bar_list],
    columns=['datetime', 'open', 'high', 'low', 'close'])

# Scale date numbers to minutes so neighbouring candles do not overlap.
pk_df['datetime'] = pk_df['datetime'].apply(lambda x: date2num(x) * 1440)
fig, ax = plt.subplots(facecolor=(0, 0.3, 0.5), figsize=(12, 8))

# Rising bars in red, falling bars in green, candle width 0.7.
# as_matrix() was removed from pandas; to_numpy() replaces it.
mpf.candlestick_ohlc(ax, pk_df.iloc[:100].to_numpy(), width=0.7, colorup='r', colordown='green')
plt.grid(True)

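Let's compare the bars we generated with the bars Wind provides.

From Wind:

I checked, and they are identical. One thing to watch, though, is the timestamp of a minute bar. In the program above, a bar is stamped with its start time: the 14:31 bar covers 14:31:00 through 14:31:59. Some software stamps a bar with its end time instead.

In principle, with that settled, we can focus on how to obtain high-quality tick data. In practice our ticks arrive in real time, so their quality is mostly determined by two factors. One is how fast we handle the tick callback: if the response and processing are slow, there will obviously be serious problems. The other is the real-time load on the CTP front server: if the server is under pressure, data is easily lost.

I then spent a few days receiving data through the CTP interface wrapped by vnpy. Broadly, perhaps because the network is good and the machine sits in Lujiazui, Shanghai's financial center, there was no packet loss. On inspection, the feed stayed at two snapshots per second.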
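One common way to keep the callback fast is to let it do nothing but hand the tick to a queue, and do the slow work on another thread. The sketch below uses only the standard library; `on_tick` and `worker` are hypothetical names and the loop merely simulates a gateway, so this is not the vnpy or CTP API.

```python
import queue
import threading

tick_queue = queue.Queue()
processed = []


def on_tick(tick):
    """Hypothetical callback: do no work here, only hand the tick off."""
    tick_queue.put(tick)


def worker():
    """Drain the queue on a separate thread; slow work belongs here."""
    while True:
        tick = tick_queue.get()
        if tick is None:  # sentinel: stop the worker
            break
        processed.append(tick)  # placeholder for bar building, storage, etc.


thread = threading.Thread(target=worker, daemon=True)
thread.start()

for i in range(5):  # simulate the gateway invoking the callback
    on_tick({"seq": i, "last": 3763.0 + i})

tick_queue.put(None)
thread.join()
```

With this split, a slow database write or bar calculation delays only the worker, and the queue preserves arrival order.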

2019-03-15 14:59:49.500000,3763.0,3709870,3764.0,0.0,0.0,0.0,0.0,40,0,0,0,0,3763.0,0.0,0.0,0.0,0.0,371,0,0,0,0
2019-03-15 14:59:50.500000,3764.0,3710314,3764.0,0.0,0.0,0.0,0.0,58,0,0,0,0,3763.0,0.0,0.0,0.0,0.0,284,0,0,0,0
2019-03-15 14:59:51.500000,3763.0,3711114,3765.0,0.0,0.0,0.0,0.0,738,0,0,0,0,3764.0,0.0,0.0,0.0,0.0,3,0,0,0,0
2019-03-15 14:59:52,3764.0,3711880,3763.0,0.0,0.0,0.0,0.0,1,0,0,0,0,3762.0,0.0,0.0,0.0,0.0,50,0,0,0,0
2019-03-15 14:59:52.500000,3764.0,3712004,3764.0,0.0,0.0,0.0,0.0,12,0,0,0,0,3763.0,0.0,0.0,0.0,0.0,2,0,0,0,0
2019-03-15 14:59:53.500000,3762.0,3712846,3763.0,0.0,0.0,0.0,0.0,20,0,0,0,0,3762.0,0.0,0.0,0.0,0.0,6,0,0,0,0
2019-03-15 14:59:54.500000,3764.0,3713058,3765.0,0.0,0.0,0.0,0.0,739,0,0,0,0,3764.0,0.0,0.0,0.0,0.0,55,0,0,0,0
2019-03-15 14:59:55.500000,3761.0,3713670,3762.0,0.0,0.0,0.0,0.0,2,0,0,0,0,3761.0,0.0,0.0,0.0,0.0,356,0,0,0,0
2019-03-15 14:59:56.500000,3763.0,3713982,3763.0,0.0,0.0,0.0,0.0,55,0,0,0,0,3762.0,0.0,0.0,0.0,0.0,94,0,0,0,0
2019-03-15 14:59:57.500000,3761.0,3714472,3762.0,0.0,0.0,0.0,0.0,4,0,0,0,0,3761.0,0.0,0.0,0.0,0.0,327,0,0,0,0
2019-03-15 14:59:58.500000,3763.0,3714668,3763.0,0.0,0.0,0.0,0.0,6,0,0,0,0,3762.0,0.0,0.0,0.0,0.0,5,0,0,0,0
2019-03-15 14:59:59,3762.0,3714990,3763.0,0.0,0.0,0.0,0.0,1,0,0,0,0,3762.0,0.0,0.0,0.0,0.0,1,0,0,0,0

However, some odd things showed up along the way, for example:

2019-03-15 14:59:59,3762.0,3714990,3763.0,0.0,0.0,0.0,0.0,1,0,0,0,0,3762.0,0.0,0.0,0.0,0.0,1,0,0,0,0
2019-03-15 15:00:00.500000,3763.0,3715090,3764.0,0.0,0.0,0.0,0.0,140,0,0,0,0,3763.0,0.0,0.0,0.0,0.0,12,0,0,0,0
2019-03-15 15:00:01.500000,3763.0,3715090,3764.0,0.0,0.0,0.0,0.0,140,0,0,0,0,3763.0,0.0,0.0,0.0,0.0,12,0,0,0,0
2019-03-15 15:18:12,3763.0,3715090,3764.0,0.0,0.0,0.0,0.0,140,0,0,0,0,3763.0,0.0,0.0,0.0,0.0,12,0,0,0,0
2019-03-15 18:11:30.500000,3763.0,0,1.7976931348623157e+308,0.0,0.0,0.0,0.0,0,0,0,0,0,1.7976931348623157e+308,0.0,0.0,0.0,0.0,0,0,0,0,0
2019-03-15 19:48:09,3763.0,0,1.7976931348623157e+308,0.0,0.0,0.0,0.0,0,0,0,0,0,1.7976931348623157e+308,0.0,0.0,0.0,0.0,0,0,0,0,0
2019-03-15 20:59:01,3758.0,4626,3759.0,0.0,0.0,0.0,0.0,129,0,0,0,0,3758.0,0.0,0.0,0.0,0.0,32,0,0,0,0
2019-03-15 21:00:02,3760.0,8248,3760.0,0.0,0.0,0.0,0.0,255,0,0,0,0,3759.0,0.0,0.0,0.0,0.0,601,0,0,0,0

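Note that ticks also appear outside trading hours, at 15:18, 18:11, 19:48 and so on, and there is one at 20:59:01. In some of these rows the price fields hold 1.7976931348623157e+308, the largest finite double, rather than a real price.

These records are odd and must be dropped in live trading. Usually we can set a start and an end time for accepting CTP data, but a record like the 20:59 one is hard to separate that way, so we probably also need to check whether the first tick of a session falls within the expected time.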
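A filter along these lines could look like the sketch below. Everything in it is an assumption for illustration: the session table is a guess at the rebar trading hours, the one-second grace period exists only to keep the closing snapshot, and the arguments `ask1` and `bid1` assume which columns hold the best quotes.

```python
import sys
from datetime import datetime, time, timedelta

# Assumed trading sessions (exchange local time); adjust per product.
SESSIONS = [
    (time(9, 0), time(10, 15)),
    (time(10, 30), time(11, 30)),
    (time(13, 30), time(15, 0)),
    (time(21, 0), time(23, 0)),
]
# Assumed tolerance so the last snapshot just after the close is kept.
GRACE = timedelta(seconds=1)


def in_session(ts: datetime) -> bool:
    """True if ts lies in a session, allowing GRACE after the session end."""
    day = ts.date()
    for start, end in SESSIONS:
        if datetime.combine(day, start) <= ts < datetime.combine(day, end) + GRACE:
            return True
    return False


def is_valid_tick(ts: datetime, last: float, ask1: float, bid1: float) -> bool:
    """Reject ticks outside sessions or carrying the max-double placeholder."""
    if not in_session(ts):
        return False
    return all(price < sys.float_info.max for price in (last, ask1, bid1))


# Timestamps taken from the records shown above.
kept = in_session(datetime(2019, 3, 15, 15, 0, 0, 500000))
dropped = in_session(datetime(2019, 3, 15, 20, 59, 1))
```

With these settings the 15:00:00.5 snapshot is kept, while the 15:18, 18:11, 19:48 and 20:59:01 records are rejected. Whether the 20:59 record should really be discarded depends on how the strategy treats pre-open data.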

 

 
