Actor VS Thread VS Coroutine

Original article: http://www.cnblogs.com/netfocus/p/3365166.html


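Let's start with the famous C10K problem. Dan Kegel raised it online (http://www.kegel.com/c10k.html): modern hardware should be able to let a single machine serve 10,000 concurrent clients. He then surveyed the techniques for building large-scale concurrent servers, which boil down to two approaches: one thread per client with blocking I/O, or many clients per thread with nonblocking I/O or asynchronous I/O. Asynchronous I/O is still not well supported on Linux, so nonblocking I/O is what people generally use, and most implementations rely on epoll() in edge-triggered mode (the traditional select() has serious performance problems). This leads to the thread-versus-event debate: the former handles concurrency entirely with threads, the latter with an event-driven design. Real systems are of course often hybrids that use events for network I/O and threads for transactions. Because of the limitations of current operating systems (Linux in particular) and programming languages (Java, C, C++ and so on), threads cannot support massively concurrent transactions. On a typical machine, keeping performance acceptable means limiting the thread count to a few hundred (threads on Linux have the property that, past a certain number, system performance drops off steeply; see the SEDA paper). That is why many high-performance web servers today are event-driven, for example nginx, Tornado, and node.js. "Event-driven" has almost become a synonym for high concurrency, and for a while it has been all the rage.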
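To make the "many clients, one thread, nonblocking I/O" approach concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not how nginx or Tornado are written: the function name `echo_upper` is made up, `socket.socketpair()` stands in for real network connections so the example needs no port, and the standard `selectors` module is used, which picks epoll on Linux but in level-triggered rather than edge-triggered mode.

```python
import selectors
import socket


def echo_upper(num_clients=3):
    """Serve several connections from ONE thread, driven by readiness events."""
    sel = selectors.DefaultSelector()  # epoll on Linux, kqueue on BSD/macOS
    clients = []
    for i in range(num_clients):
        # socketpair() stands in for an accepted TCP connection (assumption).
        server_end, client_end = socket.socketpair()
        server_end.setblocking(False)
        sel.register(server_end, selectors.EVENT_READ, data=i)
        client_end.sendall(f"hello from client {i}".encode())
        clients.append(client_end)

    pending = num_clients
    while pending:
        # Block until at least one connection is readable, then handle each one.
        for key, _mask in sel.select(timeout=1.0):
            conn = key.fileobj
            data = conn.recv(1024)
            conn.sendall(data.upper())  # the "request handler"
            sel.unregister(conn)
            conn.close()
            pending -= 1
    sel.close()

    replies = []
    for client in clients:
        replies.append(client.recv(1024).decode())
        client.close()
    return replies


print(echo_upper())
```

No thread is ever parked waiting on a single client. The loop only touches connections that the kernel reports as ready, which is the property that lets one thread scale to many connections.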


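In fact, the debate between threads and events, or between synchronous and asynchronous, has been running in academia for decades. In 1978, to settle the argument, Lauer and Needham wrote a paper ("On the Duality of Operating System Structures") showing that sequential processes (the thread model) and message passing (the event model) are equivalent, and that with a suitable implementation the two should perform equally well. That is in theory, of course. In response to the popularity of event-driven designs, UC Berkeley published a paper in 2003 called "Why Events Are a Bad Idea (for high-concurrency servers)". It argued that events offer no functional advantage over threads, yet are far more cumbersome to program with and particularly error-prone. The problems with threads, it said, come down to current implementations. The first is that threads consume too many resources: each one is allocated several MB of stack the moment it is created, which severely limits how many threads a typical machine can support. The fix is an automatically growing stack: allocate a small one at creation and extend it on demand. The second is that thread context switches are too expensive. On Linux a process and a thread are really the same thing, and the only difference is whether they share an address space. The fix is a lightweight thread implementation in which user-level threads cooperatively share the system's threads. That makes switching cheap and lets each thread keep a small stack. The authors built a prototype using coroutines and nonblocking I/O (poll() plus a thread pool) and showed that its performance was no worse than an event-driven design.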
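The idea of cooperative, lightweight threads can be sketched with Python generators. This is an illustrative toy, not the prototype from the paper: `worker` and `run_round_robin` are hypothetical names, and a real system would switch tasks when I/O would block rather than at a bare `yield`.

```python
from collections import deque


def worker(name, steps, log):
    """A lightweight 'thread': runs a little, then yields control voluntarily."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # cooperative switch point; no OS context switch is involved


def run_round_robin(tasks):
    """Multiplex many generator tasks onto the single calling thread."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # resume the task until its next yield
        except StopIteration:
            continue            # task finished; drop it
        ready.append(task)      # still alive; go to the back of the queue


log = []
run_round_robin([worker("a", 2, log), worker("b", 3, log)])
print(log)  # ['a:0', 'b:0', 'a:1', 'b:1', 'b:2']
```

Each task's saved state is just a generator frame, which is far smaller than a multi-megabyte thread stack, and a switch is an ordinary function return.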


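So does that mean threads are fine as long as they are implemented well? Not quite. In 2006 UC Berkeley, again, published a paper called "The Problem with Threads". Threads don't work either, and here is why. Today's programming model is fundamentally based on sequential execution. Sequential execution is deterministic, which makes correctness easy to ensure, and people tend to think in a single-threaded way as well. The thread model forcibly adds concurrency and nondeterminism on top of single-threaded, sequential execution, and that makes program correctness very hard to guarantee. Threads synchronize through shared memory, and it is very hard to build a mathematical model of concurrent threads plus shared memory. There is a great deal of nondeterminism in it, and nondeterminism is a huge enemy of programming. Drawing on experience from one of his projects, the author argues that guaranteeing the correctness of a multithreaded program is close to impossible. First, many very simple patterns, once used with multiple threads, demand attention to many subtle details, otherwise they lead to deadlocks or race conditions. Second, because of the limits of human thinking, even if you adopt every mechanism for pruning nondeterminism, such as monitors, transactional memory, and promises/futures, it is still hard to cover every case. In the author's project they had computer science experts and the brightest graduate students, and they followed a full software engineering process: design reviews, code reviews, regression tests, and automated code coverage metrics. They believed they had eliminated most problems, yet after the system had been running for four years a deadlock appeared. The author says that many multithreaded programs actually contain concurrency bugs that simply do not show up because the hardware is not parallel enough. As hardware parallelism keeps increasing, many programs that used to run fine may well start to fail. My own experience is the same: I am not afraid of NPEs or core dumps. What I fear most are race conditions and deadlocks, because they are non-deterministic and often very hard to reproduce.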
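A small Python sketch of the kind of subtle detail the paper is talking about. The `Counter` class and `hammer` helper are hypothetical names for illustration. `unsafe_increment` is a read-modify-write on shared memory, so updates can be lost when threads interleave. Whether that is actually observed depends on the interpreter version and timing, which is exactly the hard-to-reproduce behavior described above. The locked variant is always correct.

```python
import threading


class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def unsafe_increment(self):
        # Read, add, write back: another thread may run between these steps,
        # so an update can be lost (a race condition).
        self.value += 1

    def safe_increment(self):
        # The lock makes the read-modify-write a single critical section.
        with self._lock:
            self.value += 1


def hammer(increment, num_threads=4, per_thread=10_000):
    """Call `increment` concurrently from several threads and wait for them."""
    def loop():
        for _ in range(per_thread):
            increment()

    threads = [threading.Thread(target=loop) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


safe = Counter()
hammer(safe.safe_increment)
print(safe.value)  # 40000

racy = Counter()
hammer(racy.unsafe_increment)
print(racy.value)  # at most 40000; may be lower on some runs
```

The lock fixes this one counter, but correctness now depends on every caller remembering to use it, and acquiring two such locks in inconsistent order is how deadlocks like the one in the paper arise.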


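So if threads plus shared memory don't work, what model can help us solve the problem of concurrent computation? The research community has developed several models, and new programming languages are increasingly adopting them. The most important one is the Actor model. Its main idea is to use concurrent entities, called actors, that synchronize by sending messages to one another. As the saying goes, "Don't communicate by sharing memory, share memory by communicating". The Actor model is equivalent in power to the shared-memory thread mechanism. In practice the Actor model is usually implemented on top of low-level mechanisms such as threads, locks, and buffers, so it is a higher-level mechanism. The Actor model is a mathematical model with theoretical backing. Another, similar mathematical model is CSP (Communicating Sequential Processes). The best-known early languages implementing these theories are Erlang and occam. Erlang in particular, the so-called "Ericsson Language", was designed for large-scale concurrent programs in telecom systems, and it later became a fairly popular language.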
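Here is a rough sketch of an actor in Python, built on a thread and a queue just as the paragraph above says actors usually are underneath. The names `CounterActor` and `actor_demo`, and the message strings, are invented for this example. It does not try to model Erlang features such as supervision or selective receive.

```python
import queue
import threading


class CounterActor:
    """An actor: private state plus a mailbox, processed one message at a time."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0  # only the actor's own thread ever touches this
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message, reply_to=None):
        # The only way to interact with the actor is to post a message.
        self._mailbox.put((message, reply_to))

    def _run(self):
        while True:
            message, reply_to = self._mailbox.get()
            if message == "increment":
                self._count += 1
            elif message == "get":
                reply_to.put(self._count)  # reply by message, not shared state
            elif message == "stop":
                return

    def join(self):
        self._thread.join()


def actor_demo(num_senders=4, per_sender=1000):
    actor = CounterActor()

    def sender():
        for _ in range(per_sender):
            actor.send("increment")

    senders = [threading.Thread(target=sender) for _ in range(num_senders)]
    for t in senders:
        t.start()
    for t in senders:
        t.join()

    reply = queue.Queue()
    actor.send("get", reply_to=reply)
    total = reply.get(timeout=5)
    actor.send("stop")
    actor.join()
    return total


print(actor_demo())  # 4000
```

The callers never take a lock. The locking still exists inside `queue.Queue`, but it is written once in the library instead of being scattered through application code.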


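Go also provides a message-passing mechanism similar to Actor/CSP. Go's concurrent entity is called a goroutine. It resembles a coroutine, but you do not have to schedule it yourself. The runtime schedules goroutines onto system threads, with many goroutines sharing one thread. If one goroutine is about to block, the runtime automatically moves the other goroutines onto other threads.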




Some term definitions: Processes, threads, green threads, protothreads, fibers, coroutines: what's the difference?
Process: OS-managed (possibly) truly concurrent, at least in the presence of suitable hardware support. Exist within their own address space.
Thread: OS-managed, within the same address space as the parent and all its other threads. Possibly truly concurrent, and multi-tasking is pre-emptive.
Green Thread: These are user-space projections of the same concept as threads, but are not OS-managed. Probably not truly concurrent, except in the sense that there may be multiple worker threads or processes giving them CPU time concurrently, so probably best to consider this as interleaved or multiplexed.
Protothreads: I couldn't really tease a definition out of these. I think they are interleaved and program-managed, but don't take my word for it. My sense was that they are essentially an application-specific implementation of the same kind of "green threads" model, with appropriate modification for the application domain.
Fibers: OS-managed. Exactly threads, except co-operatively multitasking, and hence not truly concurrent.
Coroutines: Exactly fibers, except not OS-managed. Coroutines are computer program components that generalize subroutines to allow multiple entry points for suspending and resuming execution at certain locations. Coroutines are well-suited for implementing more familiar program components such as cooperative tasks, iterators, infinite lists and pipes.
Continuation: An abstract representation of the control state of a computer program. A continuation reifies the program control state, i.e. the continuation is a data structure that represents the computational process at a given point in the process' execution; the created data structure can be accessed by the programming language, instead of being hidden in the runtime environment. Continuations are useful for encoding other control mechanisms in programming languages such as exceptions, generators, coroutines, and so on.
The "current continuation" or "continuation of the computation step" is the continuation that, from the perspective of running code, would be derived from the current point in a program's execution. The term continuations can also be used to refer to first-class continuations, which are constructs that give a programming language the ability to save the execution state at any point and return to that point at a later point in the program. (yield keyword in some languages, such as C# or Python)
Goroutines: They claim to be unlike anything else, but they seem to be exactly green threads, as in, process-managed in a single address space and multiplexed onto system threads. Perhaps somebody with more knowledge of Go can cut through the marketing material.
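The coroutine definition above mentions iterators, infinite lists, and pipes. A short Python sketch of those three uses follows, with the caveat that Python generators are a restricted form of coroutine, since they can only yield back to their caller. The function names are made up for the example.

```python
from itertools import islice


def naturals():
    """An infinite list: each value is produced only when it is asked for."""
    n = 0
    while True:
        yield n  # suspend here; resume from this exact point on the next request
        n += 1


def squares(source):
    """A pipe stage: pulls from an upstream generator and transforms each item."""
    for x in source:
        yield x * x


def running_total():
    """Multiple entry points: every send() re-enters at the yield with a value."""
    total = 0
    while True:
        value = yield total
        total += value


first_squares = list(islice(squares(naturals()), 5))
print(first_squares)  # [0, 1, 4, 9, 16]

acc = running_total()
next(acc)             # run to the first yield so the coroutine can accept values
print(acc.send(5))    # 5
print(acc.send(7))    # 12
```

Between calls, each generator's local variables and its position in the code are kept in the generator object, which is a small, concrete instance of saved execution state.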
