Parallel Processing of Massive Data: An Experiment

Original

2018-08-27 19:42

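A while ago I came across a problem: design an algorithm that finds the 10 smallest numbers in a data stream of 1 billion numbers.
I came up with two approaches. The first uses a max-heap to find the 10 smallest numbers. The second improves on the first by using multiple threads to compute in parallel: the work is split into several tasks, and their results are then merged.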

Experiment 1

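Idea: A max-heap stores the final result, and Random is used to keep generating random numbers. The max-heap is initialized with a capacity of 10 and every element set to Integer.MAX_VALUE. Then, each time a random number is generated, it is compared with the top of the heap. If it is smaller than the top element, the top element is popped and the random number is inserted into the heap (which re-heapifies automatically), and the loop continues.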

Java code

import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Random;

public class Test1 {
	static Random random = new Random();
	public PriorityQueue<Integer> queue;   // max-heap

	public Test1() {
		// Implement the max-heap with a PriorityQueue and a reversed comparator
		queue = new PriorityQueue<Integer>(10, new Comparator<Integer>() {
			@Override
			public int compare(Integer num1, Integer num2) {
				// Integer.compare avoids the overflow that -(num1 - num2) can cause
				return Integer.compare(num2, num1);
			}
		});

		// Initialize the max-heap with 10 sentinel values
		for (int i = 10; i > 0; i--)
			queue.add(Integer.MAX_VALUE);
	}

	public static void main(String[] args) {
		Test1 solution = new Test1();
		long st = System.currentTimeMillis();
		for (int i = 0; i < 1000000000; i++) {
			int num = random.nextInt(Integer.MAX_VALUE);

			// Compare the random number with the top of the heap
			if (solution.queue.peek() > num) {
				solution.queue.poll();
				solution.queue.add(num);
			}
		}

		long et = System.currentTimeMillis();

		System.out.println(solution.queue.toString());
		System.out.println(et - st + "ms");
	}
}
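
To check the heap logic without waiting for a billion random numbers, the same idea can be run on a small fixed array. The following is only a minimal sketch, not the code used in the experiment: the class name `TopKDemo` and the helper `smallestK` are made up for illustration, and, unlike Test1, it fills the heap with the first k values instead of pre-filling it with Integer.MAX_VALUE.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class TopKDemo {
    // Hypothetical helper: returns the k smallest values of data in ascending order.
    static List<Integer> smallestK(int[] data, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Max-heap: the largest of the current candidates sits at the top.
        PriorityQueue<Integer> heap = new PriorityQueue<>(k, Collections.reverseOrder());
        for (int num : data) {
            if (heap.size() < k) {
                heap.add(num);              // still room: accept unconditionally
            } else if (num < heap.peek()) {
                heap.poll();                // evict the largest candidate
                heap.add(num);
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        int[] data = {42, 7, 19, 3, 88, 3, 56, 1, 64, 25};
        System.out.println(smallestK(data, 3)); // prints [1, 3, 3]
    }
}
```

Because the heap never holds more than k elements, each incoming number costs at most O(log k), which is why a single pass over the stream is enough.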

Experiment results:

Experiment 2

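Idea: Because the algorithm is CPU-bound, the number of threads to start is chosen according to the number of CPUs on the machine. Each thread's work is implemented as a Callable (so that it can return a result), and each thread processes an equal share of the total workload and returns its 10 smallest numbers. The final step depends on every partial result, so it cannot begin until all threads have returned. A latch (CountDownLatch) is used to wait until every thread has finished its computation. The final step then gathers the results returned by all threads and finds the overall 10 smallest numbers. Because the number of threads is tied to the number of CPUs and each thread returns only its 10 smallest numbers, this step involves little computation: simply collect all the returned results into one list, sort it, and take the 10 smallest.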

Task class:

import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;

public class SortTask implements Callable<PriorityQueue<Integer>> {

	private final int times;
	private final Random random = new Random();   // each task has its own random number generator
	private final PriorityQueue<Integer> heap;    // max-heap
	private final CountDownLatch latch;   // all tasks share the same latch

	public SortTask(int times, CountDownLatch latch) {
		this.times = times;
		this.latch = latch;

		heap = new PriorityQueue<Integer>(10, new Comparator<Integer>() {
			@Override
			public int compare(Integer num1, Integer num2) {
				// Integer.compare avoids the overflow that -(num1 - num2) can cause
				return Integer.compare(num2, num1);
			}
		});

		for (int i = 10; i > 0; i--)
			heap.add(Integer.MAX_VALUE);
	}

	@Override
	public PriorityQueue<Integer> call() throws Exception {
		try {
			for (int i = 0; i < times; i++) {
				int num = random.nextInt(Integer.MAX_VALUE);
				if (this.heap.peek() > num) {
					this.heap.poll();
					this.heap.add(num);
				}
			}
		} finally {
			// Task finished: decrement the latch count, even if an exception was thrown
			this.latch.countDown();
		}

		return this.heap;
	}

}

Computation class

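import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BigDataTest {

	public BigDataTest() {

	}

	public List<Integer> doSortBigData(int cpuNumber, int times)
			throws InterruptedException {
		if (cpuNumber < 1)
			return null;

		// All tasks share one latch; each task decrements it by 1 when it finishes
		CountDownLatch latch = new CountDownLatch(cpuNumber);

		// The set of tasks
		List<SortTask> tasks = new ArrayList<SortTask>();
		for (int i = 0; i < cpuNumber; i++) {
			tasks.add(new SortTask(times, latch));
		}

		// Hand the tasks to a thread pool for execution
		ExecutorService threadsPool = Executors.newCachedThreadPool();
		List<Future<PriorityQueue<Integer>>> results = threadsPool.invokeAll(tasks);

		// Block until all tasks have finished. invokeAll already waits for
		// completion, so this returns immediately; it is kept to show the latch.
		latch.await();

		// Gather the results returned by all tasks into one ArrayList
		List<Integer> sortResult = new ArrayList<Integer>();
		for (int i = 0; i < results.size(); i++) {
			try {
				sortResult.addAll(results.get(i).get());
			} catch (ExecutionException e) {
				e.printStackTrace();
			}
		}

		threadsPool.shutdown();
		Collections.sort(sortResult);   // sort ascending

		return sortResult.subList(0, 10);
	}
}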
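
Since `ExecutorService.invokeAll` returns only after every submitted task has completed, the split-and-merge pattern above can also be written without a latch. The following is a rough sketch under stated assumptions rather than the code measured in this post: the class `ParallelTopK`, its `smallestK` method, and the use of an in-memory array in place of the random stream are all chosen purely for illustration, and `threads` is assumed to be at least 1.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelTopK {
    // Hypothetical helper: k smallest values of data, computed by `threads` workers.
    static List<Integer> smallestK(int[] data, int k, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = (data.length + threads - 1) / threads; // ceiling division
            List<Callable<List<Integer>>> tasks = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int from = Math.min(t * chunk, data.length);
                final int to = Math.min(from + chunk, data.length);
                tasks.add(() -> {
                    // Each worker keeps a private max-heap of at most k candidates.
                    PriorityQueue<Integer> heap =
                            new PriorityQueue<>(Collections.reverseOrder());
                    for (int i = from; i < to; i++) {
                        heap.add(data[i]);
                        if (heap.size() > k) {
                            heap.poll(); // drop the largest candidate
                        }
                    }
                    return new ArrayList<>(heap);
                });
            }
            // invokeAll blocks until all tasks are done, so no latch is needed here.
            List<Integer> merged = new ArrayList<>();
            for (Future<List<Integer>> future : pool.invokeAll(tasks)) {
                merged.addAll(future.get());
            }
            Collections.sort(merged);
            return new ArrayList<>(merged.subList(0, Math.min(k, merged.size())));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        int[] data = {42, 7, 19, 3, 88, 3, 56, 1, 64, 25, 11, 0};
        System.out.println(smallestK(data, 4, 3)); // prints [0, 1, 3, 3]
    }
}
```

The merge step sorts at most `threads * k` values, so its cost is negligible next to the scan, which matches the reasoning in the idea paragraph.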

Main class

import java.util.List;

public class TestMain {

	private static final int CPU_NUMBER = 4;   // number of CPUs
	private static final int NUMBER_COUNT = 1000000000;

	public static void main(String[] args) {
		List<Integer> result = null;

		long startTime = System.currentTimeMillis();
		try {
			// Start CPU_NUMBER threads; each one processes NUMBER_COUNT / CPU_NUMBER numbers
			result = new BigDataTest().doSortBigData(TestMain.CPU_NUMBER,
					TestMain.NUMBER_COUNT / TestMain.CPU_NUMBER);
		} catch (InterruptedException e) {
			e.printStackTrace();
		}
		long endTime = System.currentTimeMillis();

		// println(Object) prints "null" instead of throwing if the run was interrupted
		System.out.println(result);
		System.out.println((endTime - startTime) + "ms");
	}

}

Results:

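As the results show, the multithreaded parallel version is considerably more efficient, and the more CPUs are available, the higher the efficiency. Since the work is CPU-bound, the gain is limited by the number of cores the machine actually has.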
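
The main class hard-codes 4 CPUs and divides the workload with integer division, which silently drops the remainder when the count is not evenly divisible. One possible refinement, sketched here with a hypothetical `WorkSplit.split` helper that is not part of the original code, is to query the core count at runtime and spread the remainder across the workers.

```java
class WorkSplit {
    // Hypothetical helper: divide `total` items among `workers` (assumed >= 1),
    // giving one extra item to each of the first (total % workers) workers.
    static int[] split(int total, int workers) {
        int[] counts = new int[workers];
        int base = total / workers;
        int extra = total % workers;
        for (int i = 0; i < workers; i++) {
            counts[i] = base + (i < extra ? 1 : 0);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Number of processors available to the JVM on this machine
        int cores = Runtime.getRuntime().availableProcessors();
        int[] counts = split(1_000_000_000, cores);
        System.out.println(cores + " workers, first worker handles "
                + counts[0] + " numbers");
    }
}
```

Each entry of the returned array could then be passed as the `times` argument of one SortTask, so the per-thread counts always add up to the full workload.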
