The Winding Road to Traversing Millions of Redis Keys

Background

A colleague came to me out of the blue wanting the size of every Redis key in production — that is, the memory footprint of every key (the data stored in our production Redis is fairly uniform, with none of the more complex structures such as sets) — and asked me to help sort it out. Hearing this, I took a deep breath at my screen and put on an expression that said: hmm, this is a bit involved, give me a moment to collect my thoughts.

Inside, though, I was thinking: this hardly seems difficult — grab the laptop, bang out a few lines of code, and done. Hahaha.

Version 1: Redis, Carefree and Confident

import redis
import time


rip = "192.168.10.205"
rport = "6379"


def wrapper_time(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        r = func(*args, **kwargs)
        end = time.time()
        print("finish use time {0} second".format(end-start))
        return r
    return wrapper


@wrapper_time
def t_redis():
    pool = redis.ConnectionPool(host=rip, port=rport, socket_connect_timeout=3)
    r = redis.Redis(connection_pool=pool)
    total_count = 0
    for key in r.scan_iter(count=5000):
        total_count += 1
        keyMemUsage = int(r.memory_usage(key))
        print(key, keyMemUsage)

    print("total  count  {0}".format(total_count))


if __name__ == '__main__':
    t_redis()

Happiness really is that simple — Python is like the point-and-read pen from the old ads, just point at whatever you don't know. I was quietly pleased: sometimes a problem really is this easy. A local run looked perfectly fine, with output as follows:

('mylist', 421132)
('key_9fe99777-97df-4bb8-b2e9-c1a7f525d5d7', 122)
('a', 48)
('_kombu.binding.celeryev', 907)
('_kombu.binding.celery.pidbox', 276)
('myset:__rand_int__', 100)
('key_e76a94ac-cd9f-438c-9459-ca4711260451', 122)
('key_66afcfba-fccd-4bf4-8dc2-fb0d99ba532d', 122)
('_kombu.binding.celery', 235)
('key:__rand_int__', 65)
('counter:__rand_int__', 64)
('key_31a63d80-a029-400f-8634-4c64fa679b2d', 122)
('key_c3cea6f9-2e29-4bce-a31b-d58c36fb597d', 122)
total  count  13
finish use time 0.0105438232422 second

When I handed this code over, my colleague pointed out that the key count is not 13 — across all the instances there are well over a billion keys in total. What???

However rattled I was, I kept my confident smile, took the code back, and went off to optimize.

First, a local test: take the code above, drop the r.memory_usage call, and just iterate over all the keys to see how long that alone takes. By this point the test Redis had grown to roughly 5.2 million keys.

import redis
import time


rip = "192.168.10.205"
rport = "6379"


def wrapper_time(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        r = func(*args, **kwargs)
        end = time.time()
        print("finish use time {0} second".format(end-start))
        return r
    return wrapper


@wrapper_time
def t_redis():
    pool = redis.ConnectionPool(host=rip, port=rport, socket_connect_timeout=3)
    r = redis.Redis(connection_pool=pool)
    total_count = 0
    for key in r.scan_iter(count=5000):
        total_count += 1
        # keyMemUsage = int(r.memory_usage(key))
        # print(key, keyMemUsage)

    print("total  count  {0}".format(total_count))


if __name__ == '__main__':
    t_redis()

The terminal output:

total  count  5271392
finish use time 82.052093029 second

Walking all of the keys — roughly 5.2 million of them — took about 82 seconds.

Next, run the first version as written over these 5.2 million keys and see how long a full pass takes.

Right after it started...

...after I had skimmed the Two Sessions highlights and read the latest tech articles...

...and after hearing "Your Smile Is So Beautiful" for the umpteenth time...

...I have no idea how much time had passed when I finally killed the script, which was still happily printing away. Where is the problem? Where exactly is the problem?

In these few lines of code there are only two places that could plausibly hurt performance: scan_iter and r.memory_usage.

# scan_iter

    def scan_iter(self, match=None, count=None):
        """
        Make an iterator using the SCAN command so that the client doesn't
        need to remember the cursor position.

        ``match`` allows for filtering the keys by pattern

        ``count`` allows for hint the minimum number of returns
        """
        cursor = '0'
        while cursor != 0:
            cursor, data = self.scan(cursor=cursor, match=match, count=count)
            for item in data:
                yield item
                
    def scan(self, cursor=0, match=None, count=None):
        """
        Incrementally return lists of key names. Also return a cursor
        indicating the scan position.

        ``match`` allows for filtering the keys by pattern

        ``count`` allows for hint the minimum number of returns
        """
        pieces = [cursor]
        if match is not None:
            pieces.extend([b'MATCH', match])
        if count is not None:
            pieces.extend([b'COUNT', count])
        return self.execute_command('SCAN', *pieces)

After reading these two functions and checking the SCAN command itself, I could more or less rule out SCAN as a serious performance problem, so I moved on to memory_usage:

    def memory_usage(self, key, samples=None):
        """
        Return the total memory usage for key, its value and associated
        administrative overheads.

        For nested data structures, ``samples`` is the number of elements to
        sample. If left unspecified, the server's default is 5. Use 0 to sample
        all elements.
        """
        args = []
        if isinstance(samples, int):
            args.extend([b'SAMPLES', samples])
        return self.execute_command('MEMORY USAGE', key, *args)

Reading this, it's clear that every size lookup sends one command to the Redis server and then waits for the reply — and that is where all the time goes.
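
To put a rough number on that, here is a back-of-envelope sketch; the round-trip time is an assumed figure for illustration, not something measured in this environment.

# Estimate the pure network cost of issuing MEMORY USAGE once per key.
# The RTT here is an assumption (~0.2 ms per command), not a measured value.
keys_total = 5271392
rtt_seconds = 0.0002
print(keys_total * rtt_seconds)   # roughly 1054 seconds spent just waiting on the wire

Even with an optimistic round trip, the one-command-per-key pattern alone costs well over a quarter of an hour of pure waiting, which is why batching commands is the obvious lever to pull.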

The first fix that comes to mind is a pipeline, so let's revise the code.

Version 2: Redis, a Sudden Awakening

import redis
import time


rip = "192.168.10.205"
rport = "6379"


def wrapper_time(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        r = func(*args, **kwargs)
        end = time.time()
        print("finish use time {0} second".format(end-start))
        return r
    return wrapper


@wrapper_time
def t_redis(pipe_size=1000):
    pool = redis.ConnectionPool(host=rip, port=rport, socket_connect_timeout=3)
    r = redis.Redis(connection_pool=pool)
    pipe = r.pipeline()
    pipe_count = 0
    total_count = 0
    keys = []
    for key in r.scan_iter(count=5000):
        pipe_count += 1
        total_count += 1
        pipe.memory_usage(key)
        keys.append(key)
        if pipe_count < pipe_size:
            continue
        # a full batch is queued; send it to the server in one round trip
        result = pipe.execute()
        pipe_count = 0
        for i, v in enumerate(result):
            keyMemUsage = int(v)
            # print(keys[i], keyMemUsage)
        keys = []
    if keys:
        # flush the last, partial batch
        result = pipe.execute()
        pipe_count = 0
        for i, v in enumerate(result):
            keyMemUsage = int(v)

    print("total  count  {0}".format(total_count))


if __name__ == '__main__':
    t_redis()

After waiting quite a while, the output was:

total  count  5271392
finish use time 254.994492054 second

Walking the five-million-plus keys this way took about 254 seconds, versus the roughly 82 seconds needed just to iterate over them. It felt like there was still something left to squeeze out: scan_iter is fast, and memory_usage is the slow part, so the next idea was to decouple them — keep scanning at full speed and hand the keys to a separate worker thread that runs the MEMORY USAGE pipelines:

import redis
import time
import Queue
import threading


rip = "192.168.10.205"
rport = "6379"


def wrapper_time(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        r = func(*args, **kwargs)
        end = time.time()
        print("finish use time {0} second".format(end-start))
        return r
    return wrapper


@wrapper_time
def t_redis(pipe_size=1000):
    pool = redis.ConnectionPool(host=rip, port=rport, socket_connect_timeout=3)
    r = redis.Redis(connection_pool=pool)
    pipe_count = 0
    total_count = 0
    keys = []
    start_time = time.time()
    workQueue = Queue.Queue()

    def work_execute():
        work_count = 0
        while True:
            try:
                keys = workQueue.get(timeout=3)
            except Exception as e:
                print("get exeption  {0}".format(e))
                continue
            if keys is None or not keys:
                end_time = time.time()
                print("exist {0}  count {1}".format(end_time - start_time, work_count))
                return

            pipe = r.pipeline()
            for k in keys:
                pipe.memory_usage(k)
            try:
                result = pipe.execute()
            except Exception as e:
                print("get execute error {0}".format(e))
                return
            for key_usage in result:
                work_count += 1
                keyMemUsage = int(key_usage)

    t = threading.Thread(target=work_execute)
    t.start()

    for key in r.scan_iter(count=5000):
        pipe_count += 1
        total_count += 1
        keys.append(key)
        if pipe_count < pipe_size:
            continue
        # hand a full batch of keys to the consumer thread
        workQueue.put([k for k in keys])
        pipe_count = 0
        keys = []
    if keys:
        workQueue.put([k for k in keys])
    workQueue.put("")
    print("total  count  {0}".format(total_count))


if __name__ == '__main__':
    t_redis()

Quietly waiting for the output:

total  count  5271392
finish use time 90.5585868359 second
exist 243.682749033  count 5271392

Judging from the timing, scan_iter held its usual pace and finished in roughly 80 seconds, and the memory_usage work was handed off to a single consumer thread — yet the end-to-end result was only a dozen or so seconds faster than before, since that lone consumer, running its pipelined batches back to back, is still the bottleneck. Slightly awkward.

Let's keep tweaking:

import redis
import time
import Queue
import threading


rip = "192.168.10.205"
rport = "6379"


def wrapper_time(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        r = func(*args, **kwargs)
        end = time.time()
        print("finish use time {0} second".format(end-start))
        return r
    return wrapper


@wrapper_time
def t_redis(pipe_size=1000):
    pool = redis.ConnectionPool(host=rip, port=rport, socket_connect_timeout=3)
    r = redis.Redis(connection_pool=pool)
    pipe_count = 0
    total_count = 0
    keys = []
    start_time = time.time()
    workQueue = Queue.Queue()
    numThreads = 5
    threads = []

    def work_execute():
        work_count = 0
        while True:
            try:
                keys = workQueue.get(timeout=3)
            except Exception as e:
                print("get exeption  {0}".format(e))
                continue
            if keys is None or not keys:
                end_time = time.time()
                print("exist {0}  count {1}".format(end_time - start_time, work_count))
                return

            pipe = r.pipeline()
            for k in keys:
                pipe.memory_usage(k)
            try:
                result = pipe.execute()
            except Exception as e:
                print("get execute error {0}".format(e))
                return
            for key_usage in result:
                work_count += 1
                keyMemUsage = int(key_usage)

    for i in range(numThreads):
        t = threading.Thread(target=work_execute)
        t.start()
        threads.append(t)

    for key in r.scan_iter(count=5000):
        pipe_count += 1
        total_count += 1
        if pipe_count < pipe_size:
            keys.append(key)
            continue
        else:
            keys.append(key)
            workQueue.put([k for k in keys])
            pipe_count = 0
            keys = []

    if keys:
        workQueue.put([k for k in keys])

    RUNNING = True
    while RUNNING:
        RUNNING = False
        for t in threads:
            if t.isAlive():
                workQueue.put("")
                RUNNING = True
        time.sleep(0.5)

    print("total  count  {0}".format(total_count))


if __name__ == '__main__':
    t_redis()

The output:

exist 218.543583155  count 1058000
exist 218.544660091  count 1049000
exist 218.566720009  count 1054392
exist 218.714774132  count 1056000
exist 218.821619987  count 1054000
total  count  5271392
finish use time 218.969571114 second

About 218 seconds — an improvement of twenty-odd seconds. A modest gain, but it still felt pretty good.

Looking back over the road so far, everything followed the conventional performance playbook step by step, but in this case the dominant cost is really network IO. That thought left me deep in contemplation.

Version 3: Redis, Down a Quieter Path

Let's try attacking the problem with Python's coroutine support. Here I'd recommend aioredis, an asynchronous Redis library built for Python 3. Onward.

import asyncio
import aioredis
import time


def t_redis():
    async def go():
        redis = await aioredis.create_redis_pool(
            'redis://192.168.10.205')
        cursor = '0'
        work_count = 0

        async def scan_iter(count=5000):
            nonlocal work_count
            cursor = "0"
            while cursor != 0:
                cursor, data = await redis.scan(cursor=cursor, count=count)
                if len(data):
                    work_count += len(data)
        await scan_iter()
        print("total count key {0}".format(work_count))
        redis.close()
        await redis.wait_closed()


    start = time.time()
    asyncio.run(go())
    end = time.time()
    print("finish use time {0} second".format(end-start))


if __name__ == '__main__':
    t_redis()

The output:

total count key 5271392
finish use time 28.971981048583984 second

Compared with the roughly 80 seconds that version one needed just to walk all the keys, this is a big step forward. So let's bring memory_usage back in and see whether the gain carries over.

And then — somewhat to my bewilderment — the documentation showed no support for the MEMORY USAGE command.
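
That said, the command itself can still be sent through aioredis's generic execute, building the same arguments that redis-py's memory_usage helper assembles above (a small sketch assuming aioredis 1.x; the helper name is my own):

async def memory_usage(redis, key, samples=None):
    # Hand-build the MEMORY USAGE call that redis-py would normally send.
    args = ["USAGE", key]
    if samples is not None:
        args.extend(["SAMPLES", samples])
    return await redis.execute(b"MEMORY", *args)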

I wasn't about to give up — not after coming this far. After comparing the code in the redis library that assembles the MEMORY USAGE call against how aioredis's execute turns a command into bytes, I modified the code as follows:

import asyncio
import aioredis
import time


def t_redis():
    async def go():
        redis = await aioredis.create_redis_pool(
            'redis://192.168.10.205')
        cursor = '0'
        work_count = 0

        async def scan_iter(count=5000):
            nonlocal work_count
            cursor = "0"
            while cursor != 0:
                cursor, data = await redis.scan(cursor=cursor, count=count)
                if len(data):
                    work_count += len(data)
                    for k in data:
                        r = await redis.execute(b"MEMORY", *["USAGE", k])
                        # print(k, r)

        await scan_iter()
        print("total count key {0}".format(work_count))
        redis.close()
        await redis.wait_closed()

    start = time.time()
    asyncio.run(go())
    end = time.time()
    print("finish use time {0} second".format(end-start))


if __name__ == '__main__':
    t_redis()

I ran it — and never got to see it finish.

Most likely execute spawns far too many events — one await per key — so throughput collapses. To improve from here, one could batch via a pipeline, or start a subprocess with its own event loop dedicated to this work. Since aioredis has no built-in memory usage support, and a first look suggested its pipeline is itself just a coroutine wrapper, I didn't pursue that direction and switched to a multi-threaded approach instead:

import asyncio
import aioredis
import time
import queue
import threading


work_queue = queue.Queue()
start_time = time.time()


def worker():
    async def go():
        redis = await aioredis.create_redis('redis://192.168.10.205')
        work_count = 0
        while True:
            try:
                key = work_queue.get(timeout=3)
            except Exception as e:
                print("get exeption  {0}".format(e))
                continue
            if key is None or not key:
                end_time = time.time()
                print("exist {0}  count {1}".format(end_time - start_time, work_count))
                return

            r = await redis.execute(b"MEMORY", *["USAGE", key])
            work_count += 1
            print(r)
    asyncio.run(go())


t = threading.Thread(target=worker)
t.start()


def t_redis():
    async def go():
        redis = await aioredis.create_redis('redis://192.168.10.205')
        cursor = '0'
        work_count = 0

        async def scan_iter(count=5000):
            nonlocal work_count
            cursor = "0"
            pipe_count = 0
            while cursor != 0:
                cursor, data = await redis.scan(cursor=cursor, count=count)
                if len(data):
                    work_count += len(data)
                    for k in data:
                        work_queue.put(k)
                    pipe_count += 1
                    print(work_count)

        await scan_iter()
        print("total count key {0}".format(work_count))
        work_queue.put("")
        redis.close()
        await redis.wait_closed()

    start = time.time()
    asyncio.run(go())
    end = time.time()
    print("finish use time {0} second".format(end-start))


if __name__ == '__main__':
    t_redis()

Well, I clearly hadn't learned my lesson: this code, too, was doomed to be a long wait. The remaining option would be to go back and look properly at wrapping MEMORY USAGE in an aioredis pipeline, because even multiple threads, each with its own loop, cannot keep up with this many one-key memory queries.
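
One middle ground I never got around to trying would be to batch the awaits with asyncio.gather, so that a whole SCAN batch of MEMORY USAGE commands is in flight at once instead of one command per await (a sketch only, assuming aioredis 1.x and the same redis object as in the listings above):

import asyncio

async def memory_usage_batch(redis, keys):
    # Fire off MEMORY USAGE for every key in the batch, then await the replies
    # together, so the round trips overlap instead of being serialized.
    cmds = [redis.execute(b"MEMORY", *["USAGE", k]) for k in keys]
    return await asyncio.gather(*cmds)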

Is this entangled love-hate affair between Redis and me really going to end here?!

Version 4: Redis, One Last Push

After the first three acts of exploration I set my stubbornness aside; with the limited time available, the following was the best closing move I could come up with to defuse this awkward problem.

import os
import time
from multiprocessing import Pool, Manager
import asyncio

import aioredis
import redis


start_time = time.time()


def consumer(queue):
    pipe_count = 0
    work_count = 0
    pipe_size = 1000
    pool = redis.ConnectionPool(host="192.168.10.205", port="6379", socket_connect_timeout=3)
    r = redis.Redis(connection_pool=pool)
    pipe = r.pipeline()
    print('Run task (%s)...' % (os.getpid()))
    while True:
        try:
            keys = queue.get(timeout=3)
        except Exception as e:
            print(" queue get error  {0}".format(e))
            time.sleep(3)
            if queue.qsize() == 0:
                end_time = time.time()
                print("exist {0}  count {1}  ".format(end_time - start_time, work_count))
                return
            continue

        if keys is None or not keys:
            end_time = time.time()
            print("exist {0}  count {1}".format(end_time - start_time, work_count))
            return

        store_keys = []
        for key in keys:
            pipe_count += 1
            key_decode = key.decode("utf-8")
            store_keys.append(key_decode)
            pipe.memory_usage(key_decode)
            if pipe_count < pipe_size:
                if key != keys[-1]:
                    continue
            try:
                result = pipe.execute()
            except Exception as e:
                print("get execute error {0}".format(e))
                return

            pipe_count = 0
            for i, key_usage in enumerate(result):
                work_count += 1
                keyMemUsage = int(key_usage)
                # print(store_keys[i], keyMemUsage)
            store_keys = []


def producer(queue):
    async def go():
        redis = await aioredis.create_redis('redis://192.168.10.205')
        work_count = 0

        async def scan_iter(count=5000):
            nonlocal work_count
            cursor = "0"
            while cursor != 0:
                cursor, data = await redis.scan(cursor=cursor, count=count)
                if len(data):
                    work_count += len(data)
                    queue.put(data)
                    # print(work_count)

        await scan_iter()
        print("total count key {0}".format(work_count))
        redis.close()
        await redis.wait_closed()

    start = time.time()
    asyncio.run(go())
    end = time.time()
    print("finish use time {0} second".format(end-start))
    while queue.qsize():
        print(queue.qsize())
        time.sleep(1)
    print("produce end")


if __name__ == '__main__':
    queue = Manager().Queue()
    print('Parent process %s.' % os.getpid())
    p = Pool(6)
    num_worker = 5
    for i in range(num_worker):
        p.apply_async(consumer, args=(queue,))
    p.apply_async(producer, args=(queue, ))
    print('Waiting for all subprocesses done...')
    p.close()
    p.join()
    print('All subprocesses done.')

The task is done with a process pool: aioredis fetches all the keys (the producer), and a pool of pipeline-running processes fetches each key's size (the consumers).

Parent process 53200.
Waiting for all subprocesses done...
Run task (53204)...
Run task (53203)...
Run task (53202)...
Run task (53205)...
Run task (53206)...
total count key 5271392
finish use time 84.6968948841095 second
150
138
120
101
86
68
53
37
20
4
produce end
 queue get error  
 queue get error  
 queue get error  
 queue get error  
 queue get error  
exist 99.95775818824768  count 1060055  
exist 100.00523614883423  count 961136  
exist 100.05303382873535  count 1075066  
exist 100.10038113594055  count 1065064  
exist 100.10428404808044  count 1110071  
All subprocesses done.

A complete run now takes about 100 seconds. Since the machine that will actually run the script has more than 6 cores and enough memory to spare, let's stop here and go with this compromise.

Summary

This post is only a quick record of the process; there is plenty left to dig into. Pressed for time, I never explored further how aioredis's MEMORY USAGE might be driven through a pipeline, although the aio approach does iterate over all the keys noticeably faster. Everything here stays within the Python stack; it would also be interesting to implement the same thing in golang and see how it performs. My knowledge is limited, so corrections and criticism are welcome.
