The resultChan of client-go's watch interface closes automatically

Problem description

When working with client-go, you sometimes need the watch interface. A typical use looks like this:

namespacesWatch, err := clientSet.CoreV1().Namespaces().Watch(metav1.ListOptions{})
if err != nil {
    klog.Errorf("create watch error, error is %s, program exit!", err.Error())
    panic(err)
}
for {
    e, ok := <-namespacesWatch.ResultChan()
    if !ok || e.Object == nil {
        // At this point the channel has usually been closed; ok will be false.
    } else {
        // normal handling logic
    }
}

The watch's resultChan closes periodically. I don't know whether this period is configurable; in a GitHub issue I saw someone whose channel closed after 5 minutes, while on my cluster it takes roughly 40 minutes. Digging into why would take more work, so here I'll just describe the problem I ran into and how I solved it. The problem, in short: resultChan closes on its own at regular intervals. The relevant issue on the client-go project is https://github.com/kubernetes/client-go/issues/623, where a maintainer replied: "No, the server will close watch connections regularly. Re-establishing a watch at the last-received resourceVersion is a normal part of maintaining a watch as a client. There are helpers to do this for you in https://github.com/kubernetes/client-go/tree/master/tools/watch". So there is an official solution; look in the watch folder under client-go's tools directory.

Why the resultChan closes automatically

Take a look at part of the code in watch.go:

// Interface can be implemented by anything that knows how to watch and report changes.
type Interface interface {
	// Stops watching. Will close the channel returned by ResultChan(). Releases
	// any resources used by the watch.
	Stop()

	// Returns a chan which will receive all the events. If an error occurs
	// or Stop() is called, this channel will be closed, in which case the
	// watch should be completely cleaned up.  !!! Note: it says explicitly that the channel closes automatically when an error occurs or Stop() is called.
	ResultChan() <-chan Event
}

Next, let's see which errors and which situations end up calling Stop(). Following the object returned by the Watch method eventually leads to streamwatcher.go, which is the actual watch implementation. Look at the struct fields: the result channel, a mutex, and a stopped flag recording whether receiving has finished. In NewStreamWatcher, note the go sw.receive(): a goroutine is already receiving data before the object is even returned. Then look at receive() and the comments below: whenever decoding hits an error it returns, and before returning the deferred sw.Stop() and the other cleanup run. That matches the interface comment above.

// StreamWatcher turns any stream for which you can write a Decoder interface
// into a watch.Interface.
type StreamWatcher struct {
	sync.Mutex
	source   Decoder
	reporter Reporter
	result   chan Event
	stopped  bool
}

// NewStreamWatcher creates a StreamWatcher from the given decoder.
func NewStreamWatcher(d Decoder, r Reporter) *StreamWatcher {
	sw := &StreamWatcher{
		source:   d,
		reporter: r,
		// It's easy for a consumer to add buffering via an extra
		// goroutine/channel, but impossible for them to remove it,
		// so nonbuffered is better.
		result: make(chan Event),
	}
	go sw.receive() // !!! Note: the receive goroutine starts as soon as the object is created
	return sw
}

// ResultChan implements Interface.
func (sw *StreamWatcher) ResultChan() <-chan Event {
	return sw.result
}

// Stop implements Interface.
func (sw *StreamWatcher) Stop() {
	// Call Close() exactly once by locking and setting a flag.
	sw.Lock()
	defer sw.Unlock()
	if !sw.stopped {
		sw.stopped = true
		sw.source.Close()
	}
}

// stopping returns true if Stop() was called previously.
func (sw *StreamWatcher) stopping() bool {
	sw.Lock()
	defer sw.Unlock()
	return sw.stopped
}

// receive reads result from the decoder in a loop and sends down the result channel.
func (sw *StreamWatcher) receive() {
	defer close(sw.result)
	defer sw.Stop() // Note: Stop() is called before this method returns
	defer utilruntime.HandleCrash()
	for { // loop forever, keep receiving
		action, obj, err := sw.source.Decode()
		if err != nil { // error handling: any error leads to a return
			// Ignore expected error.
			if sw.stopping() {
				return
			}
			switch err {
			case io.EOF:
				// watch closed normally
			case io.ErrUnexpectedEOF:
				klog.V(1).Infof("Unexpected EOF during watch stream event decoding: %v", err)
			default:
				if net.IsProbableEOF(err) {
					klog.V(5).Infof("Unable to decode an event from the watch stream: %v", err)
				} else {
					sw.result <- Event{
						Type:   Error,
						Object: sw.reporter.AsObject(fmt.Errorf("unable to decode an event from the watch stream: %v", err)),
					}
				}
			}
			return // any error means we stop receiving!!!
		}
		sw.result <- Event{
			Type:   action,
			Object: obj,
		} // no error: push the event into result
	}
}

One thing I learned from this code: when you need to keep processing data coming out of an object, open a channel and start feeding it from a goroutine before the constructor even returns; the consumer then only has to read from the object's channel to get data. There is no need to start receive() by hand, which is convenient and safe, because everything except Stop() and ResultChan() is handled internally and external callers cannot interfere with the normal logic.
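To make the pattern concrete outside of client-go, here is a minimal sketch in plain Go. Everything in it (lineWatcher, newLineWatcher, the string source) is made up for illustration and is not part of client-go; it only mirrors the shape of StreamWatcher: the constructor starts the receiving goroutine, the consumer just reads ResultChan(), and the channel is closed when receiving ends.

package main

import (
	"bufio"
	"fmt"
	"strings"
)

// lineWatcher is a toy analogue of StreamWatcher: it owns a result channel
// and a goroutine that fills it, started inside the constructor.
type lineWatcher struct {
	source *bufio.Scanner
	result chan string
	stop   chan struct{}
}

func newLineWatcher(r *strings.Reader) *lineWatcher {
	lw := &lineWatcher{
		source: bufio.NewScanner(r),
		result: make(chan string), // unbuffered, like StreamWatcher's result chan
		stop:   make(chan struct{}),
	}
	go lw.receive() // receiving starts before the constructor returns
	return lw
}

func (lw *lineWatcher) ResultChan() <-chan string { return lw.result }

// Stop asks the receive goroutine to give up; calling it more than once would
// panic, so a real implementation guards it with a mutex like StreamWatcher does.
func (lw *lineWatcher) Stop() { close(lw.stop) }

func (lw *lineWatcher) receive() {
	defer close(lw.result) // consumers see a closed channel when we are done
	for lw.source.Scan() {
		select {
		case lw.result <- lw.source.Text():
		case <-lw.stop:
			return
		}
	}
}

func main() {
	lw := newLineWatcher(strings.NewReader("a\nb\nc\n"))
	for line := range lw.ResultChan() { // just range over the channel
		fmt.Println(line)
	}
	lw.Stop()
}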

Solution

First of all, this problem does have a solution:

No, the server will close watch connections regularly. Re-establishing a watch at the last-received resourceVersion is a normal part of maintaining a watch as a client. There are helpers to do this for you in https://github.com/kubernetes/client-go/tree/master/tools/watch

That is the maintainer's reply.

First, my own crude solution, shown below. Once I learned that the watch's resultChan closes automatically, my first thought was: if it closes, just create it again. So I wrote these two nested for loops: when the channel is detected as closed, break out of the inner loop and re-create the watch. Crude, but it worked at the time and even shipped to production. (Not very professional of me; I should have guessed there would be an official solution.)

	for {
		klog.Info("start watch")
		config, err := rest.InClusterConfig()
		clientSet, err := kubernetes.NewForConfig(config)
		namespacesWatch, err := clientSet.CoreV1().Namespaces().Watch(metav1.ListOptions{})
		if err != nil {
			klog.Errorf("create watch error, error is %s, program exit!", err.Error())
			panic(err)
		}
		loopier:
		for {
			select {
			case e, ok := <-namespacesWatch.ResultChan():
				if !ok {
					// the channel has been closed
					klog.Warning("!!!!!namespacesWatch chan has been close!!!!")
					klog.Info("clean chan over!")
					time.Sleep(time.Second * 5)
					break loopier
				}
				if e.Object != nil {
					// business logic
				}
			}
		}
	}

Below is the official solution. In the client-go project there is a tools directory with a watch folder containing retrywatcher.go. I've copied out the entire file; a detailed walkthrough follows, but have a read first.

// resourceVersionGetter is an interface used to get resource version from events.
// We can't reuse an interface from meta otherwise it would be a cyclic dependency and we need just this one method
type resourceVersionGetter interface {
	GetResourceVersion() string
}

// RetryWatcher will make sure that in case the underlying watcher is closed (e.g. due to API timeout or etcd timeout)
// it will get restarted from the last point without the consumer even knowing about it.
// RetryWatcher does that by inspecting events and keeping track of resourceVersion.
// Especially useful when using watch.UntilWithoutRetry where premature termination is causing issues and flakes.
// Please note that this is not resilient to etcd cache not having the resource version anymore - you would need to
// use Informers for that.
type RetryWatcher struct {
	lastResourceVersion string
	watcherClient       cache.Watcher
	resultChan          chan watch.Event
	stopChan            chan struct{}
	doneChan            chan struct{}
	minRestartDelay     time.Duration
}

// NewRetryWatcher creates a new RetryWatcher.
// It will make sure that watches gets restarted in case of recoverable errors.
// The initialResourceVersion will be given to watch method when first called.
func NewRetryWatcher(initialResourceVersion string, watcherClient cache.Watcher) (*RetryWatcher, error) {
	return newRetryWatcher(initialResourceVersion, watcherClient, 1*time.Second)
}

func newRetryWatcher(initialResourceVersion string, watcherClient cache.Watcher, minRestartDelay time.Duration) (*RetryWatcher, error) {
	switch initialResourceVersion {
	case "", "0":
		// TODO: revisit this if we ever get WATCH v2 where it means start "now"
		//       without doing the synthetic list of objects at the beginning (see #74022)
		return nil, fmt.Errorf("initial RV %q is not supported due to issues with underlying WATCH", initialResourceVersion)
	default:
		break
	}

	rw := &RetryWatcher{
		lastResourceVersion: initialResourceVersion,
		watcherClient:       watcherClient,
		stopChan:            make(chan struct{}),
		doneChan:            make(chan struct{}),
		resultChan:          make(chan watch.Event, 0),
		minRestartDelay:     minRestartDelay,
	}

	go rw.receive()
	return rw, nil
}

func (rw *RetryWatcher) send(event watch.Event) bool {
	// Writing to an unbuffered channel is blocking operation
	// and we need to check if stop wasn't requested while doing so.
	select {
	case rw.resultChan <- event:
		return true
	case <-rw.stopChan:
		return false
	}
}

// doReceive returns true when it is done, false otherwise.
// If it is not done the second return value holds the time to wait before calling it again.
func (rw *RetryWatcher) doReceive() (bool, time.Duration) {
	watcher, err := rw.watcherClient.Watch(metav1.ListOptions{
		ResourceVersion: rw.lastResourceVersion,
	})
	// We are very unlikely to hit EOF here since we are just establishing the call,
	// but it may happen that the apiserver is just shutting down (e.g. being restarted)
	// This is consistent with how it is handled for informers
	switch err {
	case nil:
		break

	case io.EOF:
		// watch closed normally
		return false, 0

	case io.ErrUnexpectedEOF:
		klog.V(1).Infof("Watch closed with unexpected EOF: %v", err)
		return false, 0

	default:
		msg := "Watch failed: %v"
		if net.IsProbableEOF(err) {
			klog.V(5).Infof(msg, err)
			// Retry
			return false, 0
		}

		klog.Errorf(msg, err)
		// Retry
		return false, 0
	}

	if watcher == nil {
		klog.Error("Watch returned nil watcher")
		// Retry
		return false, 0
	}

	ch := watcher.ResultChan()
	defer watcher.Stop()

	for {
		select {
		case <-rw.stopChan:
			klog.V(4).Info("Stopping RetryWatcher.")
			return true, 0
		case event, ok := <-ch:
			if !ok {
				klog.V(4).Infof("Failed to get event! Re-creating the watcher. Last RV: %s", rw.lastResourceVersion)
				return false, 0
			}

			// We need to inspect the event and get ResourceVersion out of it
			switch event.Type {
			case watch.Added, watch.Modified, watch.Deleted, watch.Bookmark:
				metaObject, ok := event.Object.(resourceVersionGetter)
				if !ok {
					_ = rw.send(watch.Event{
						Type:   watch.Error,
						Object: &apierrors.NewInternalError(errors.New("retryWatcher: doesn't support resourceVersion")).ErrStatus,
					})
					// We have to abort here because this might cause lastResourceVersion inconsistency by skipping a potential RV with valid data!
					return true, 0
				}

				resourceVersion := metaObject.GetResourceVersion()
				if resourceVersion == "" {
					_ = rw.send(watch.Event{
						Type:   watch.Error,
						Object: &apierrors.NewInternalError(fmt.Errorf("retryWatcher: object %#v doesn't support resourceVersion", event.Object)).ErrStatus,
					})
					// We have to abort here because this might cause lastResourceVersion inconsistency by skipping a potential RV with valid data!
					return true, 0
				}

				// All is fine; send the event and update lastResourceVersion
				ok = rw.send(event)
				if !ok {
					return true, 0
				}
				rw.lastResourceVersion = resourceVersion

				continue

			case watch.Error:
				// This round trip allows us to handle unstructured status
				errObject := apierrors.FromObject(event.Object)
				statusErr, ok := errObject.(*apierrors.StatusError)
				if !ok {
					klog.Error(spew.Sprintf("Received an error which is not *metav1.Status but %#+v", event.Object))
					// Retry unknown errors
					return false, 0
				}

				status := statusErr.ErrStatus

				statusDelay := time.Duration(0)
				if status.Details != nil {
					statusDelay = time.Duration(status.Details.RetryAfterSeconds) * time.Second
				}

				switch status.Code {
				case http.StatusGone:
					// Never retry RV too old errors
					_ = rw.send(event)
					return true, 0

				case http.StatusGatewayTimeout, http.StatusInternalServerError:
					// Retry
					return false, statusDelay

				default:
					// We retry by default. RetryWatcher is meant to proceed unless it is certain
					// that it can't. If we are not certain, we proceed with retry and leave it
					// up to the user to timeout if needed.

					// Log here so we have a record of hitting the unexpected error
					// and we can whitelist some error codes if we missed any that are expected.
					klog.V(5).Info(spew.Sprintf("Retrying after unexpected error: %#+v", event.Object))

					// Retry
					return false, statusDelay
				}

			default:
				klog.Errorf("Failed to recognize Event type %q", event.Type)
				_ = rw.send(watch.Event{
					Type:   watch.Error,
					Object: &apierrors.NewInternalError(fmt.Errorf("retryWatcher failed to recognize Event type %q", event.Type)).ErrStatus,
				})
				// We are unable to restart the watch and have to stop the loop or this might cause lastResourceVersion inconsistency by skipping a potential RV with valid data!
				return true, 0
			}
		}
	}
}

// receive reads the result from a watcher, restarting it if necessary.
func (rw *RetryWatcher) receive() {
	defer close(rw.doneChan)
	defer close(rw.resultChan)

	klog.V(4).Info("Starting RetryWatcher.")
	defer klog.V(4).Info("Stopping RetryWatcher.")

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	go func() {
		select {
		case <-rw.stopChan:
			cancel()
			return
		case <-ctx.Done():
			return
		}
	}()

	// We use non sliding until so we don't introduce delays on happy path when WATCH call
	// timeouts or gets closed and we need to reestablish it while also avoiding hot loops.
	wait.NonSlidingUntilWithContext(ctx, func(ctx context.Context) {
		done, retryAfter := rw.doReceive()
		if done {
			cancel()
			return
		}

		time.Sleep(retryAfter)

		klog.V(4).Infof("Restarting RetryWatcher at RV=%q", rw.lastResourceVersion)
	}, rw.minRestartDelay)
}

// ResultChan implements Interface.
func (rw *RetryWatcher) ResultChan() <-chan watch.Event {
	return rw.resultChan
}

// Stop implements Interface.
func (rw *RetryWatcher) Stop() {
	close(rw.stopChan)
}

// Done allows the caller to be notified when Retry watcher stops.
func (rw *RetryWatcher) Done() <-chan struct{} {
	return rw.doneChan
}

Look at the RetryWatcher fields: compared with StreamWatcher it adds stopChan and doneChan, plus minRestartDelay (a configurable delay before restarting):

type RetryWatcher struct {
	lastResourceVersion string
	watcherClient       cache.Watcher
	resultChan          chan watch.Event
	stopChan            chan struct{}
	doneChan            chan struct{}
	minRestartDelay     time.Duration
}

Now the constructors. NewRetryWatcher is the public entry point, but the interesting part is the internal newRetryWatcher, which sets minRestartDelay to 1 second. Read through newRetryWatcher: it rejects the initial resource versions "" and "0", builds the RetryWatcher, and ends by calling go rw.receive(), so let's look at receive() next.

func NewRetryWatcher(initialResourceVersion string, watcherClient cache.Watcher) (*RetryWatcher, error) {
	return newRetryWatcher(initialResourceVersion, watcherClient, 1*time.Second)
}

func newRetryWatcher(initialResourceVersion string, watcherClient cache.Watcher, minRestartDelay time.Duration) (*RetryWatcher, error) {
	switch initialResourceVersion {
	case "", "0":
		// TODO: revisit this if we ever get WATCH v2 where it means start "now"
		//       without doing the synthetic list of objects at the beginning (see #74022)
		return nil, fmt.Errorf("initial RV %q is not supported due to issues with underlying WATCH", initialResourceVersion)
	default:
		break
	}

	rw := &RetryWatcher{
		lastResourceVersion: initialResourceVersion,
		watcherClient:       watcherClient,
		stopChan:            make(chan struct{}),
		doneChan:            make(chan struct{}),
		resultChan:          make(chan watch.Event, 0),
		minRestartDelay:     minRestartDelay,
	}

	go rw.receive()
	return rw, nil
}

Look at the line ctx, cancel := context.WithCancel(context.Background()). If you're not familiar with it, read up on Go's context package: WithCancel takes a parent Context and returns a child Context plus a cancel function that cancels it. The context package exists to simplify passing request-scoped data, cancellation signals and deadlines among the goroutines handling a single request; when you need to shut down a chain of related goroutines from a child goroutine, you just call the cancel function.

Next is wait.NonSlidingUntilWithContext. Its comment (quoted after the snippet below) says that as long as the context is not done, it keeps calling the given function every period. What does the anonymous function do? It calls doReceive(). doReceive() in turn just uses the Watch method: if Watch fails it returns, otherwise it enters a for loop whose select checks whether the RetryWatcher has been stopped and whether there is data on ch; if ch has been closed it returns false, 0. Whenever doReceive returns, receive() decides what happens next: if the first return value is true it calls cancel() and really exits, without re-creating the watch; only when it returns false does control go back to NonSlidingUntilWithContext, which calls the anonymous function again and keeps watching. So my crude solution above really was too rough. The main highlight here is the clever use of wait.NonSlidingUntilWithContext (worth learning; a small standalone sketch follows the snippet below), and the other lesson is that a retry mechanism must state clearly which cases should be retried and which should not.

func (rw *RetryWatcher) receive() {
	defer close(rw.doneChan)
	defer close(rw.resultChan)

	klog.V(4).Info("Starting RetryWatcher.")
	defer klog.V(4).Info("Stopping RetryWatcher.")

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	go func() {
		select {
		case <-rw.stopChan:
			cancel()
			return
		case <-ctx.Done():
			return
		}
	}()

	// We use non sliding until so we don't introduce delays on happy path when WATCH call
	// timeouts or gets closed and we need to reestablish it while also avoiding hot loops.
	wait.NonSlidingUntilWithContext(ctx, func(ctx context.Context) {
		done, retryAfter := rw.doReceive()
		if done {
			cancel()
			return
		}

		time.Sleep(retryAfter)

		klog.V(4).Infof("Restarting RetryWatcher at RV=%q", rw.lastResourceVersion)
	}, rw.minRestartDelay)
}
// NonSlidingUntilWithContext loops until context is done, running f every
// period. (In other words: unless the context is done, f keeps being called in a loop.)
//
// NonSlidingUntilWithContext is syntactic sugar on top of JitterUntilWithContext
// with zero jitter factor, with sliding = false (meaning the timer for period
// starts at the same time as the function starts).
func NonSlidingUntilWithContext(ctx context.Context, f func(context.Context), period time.Duration) {
	JitterUntilWithContext(ctx, f, period, 0.0, false)
}
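
To get a feel for wait.NonSlidingUntilWithContext on its own, here is a small, self-contained sketch (my own example, not from retrywatcher.go; only the k8s.io/apimachinery/pkg/util/wait import comes from Kubernetes):

package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	count := 0
	// The anonymous function is called immediately and then roughly once per
	// second, until the context is cancelled.
	wait.NonSlidingUntilWithContext(ctx, func(ctx context.Context) {
		count++
		fmt.Println("tick", count)
		if count == 3 {
			cancel() // cancelling the context stops the loop, just like in receive()
		}
	}, 1*time.Second)

	fmt.Println("loop exited because the context is done")
}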

The doReceive function:

func (rw *RetryWatcher) doReceive() (bool, time.Duration) {
	watcher, err := rw.watcherClient.Watch(metav1.ListOptions{
		ResourceVersion: rw.lastResourceVersion,
	}) // open the watch
	// We are very unlikely to hit EOF here since we are just establishing the call,
	// but it may happen that the apiserver is just shutting down (e.g. being restarted)
	// This is consistent with how it is handled for informers
	switch err {
	case nil:
		break

	case io.EOF:
		// watch closed normally
		return false, 0

	case io.ErrUnexpectedEOF:
		klog.V(1).Infof("Watch closed with unexpected EOF: %v", err)
		return false, 0

	default:
		msg := "Watch failed: %v"
		if net.IsProbableEOF(err) {
			klog.V(5).Infof(msg, err)
			// Retry
			return false, 0
		}

		klog.Errorf(msg, err)
		// Retry
		return false, 0
	}

	if watcher == nil {
		klog.Error("Watch returned nil watcher")
		// Retry
		return false, 0
	}

	ch := watcher.ResultChan()
	defer watcher.Stop()
	// ########### This part is important! Really important!
	for {
		select {
		case <-rw.stopChan: // check whether we have been stopped
			klog.V(4).Info("Stopping RetryWatcher.")
			return true, 0
		case event, ok := <-ch: // take an event from the channel
			if !ok { // is the channel still open? If it has been closed, return
				klog.V(4).Infof("Failed to get event! Re-creating the watcher. Last RV: %s", rw.lastResourceVersion)
				return false, 0
			}

			// We need to inspect the event and get ResourceVersion out of it
			switch event.Type { // logic for a successfully received event
			case watch.Added, watch.Modified, watch.Deleted, watch.Bookmark:
				metaObject, ok := event.Object.(resourceVersionGetter)
				if !ok {
					_ = rw.send(watch.Event{
						Type:   watch.Error,
						Object: &apierrors.NewInternalError(errors.New("retryWatcher: doesn't support resourceVersion")).ErrStatus,
					})
					// We have to abort here because this might cause lastResourceVersion inconsistency by skipping a potential RV with valid data!
					return true, 0
				}

				resourceVersion := metaObject.GetResourceVersion()
				if resourceVersion == "" {
					_ = rw.send(watch.Event{
						Type:   watch.Error,
						Object: &apierrors.NewInternalError(fmt.Errorf("retryWatcher: object %#v doesn't support resourceVersion", event.Object)).ErrStatus,
					})
					// We have to abort here because this might cause lastResourceVersion inconsistency by skipping a potential RV with valid data!
					return true, 0
				}

				// All is fine; send the event and update lastResourceVersion
				ok = rw.send(event)
				if !ok {
					return true, 0
				}
				rw.lastResourceVersion = resourceVersion

				continue

			case watch.Error:
				// This round trip allows us to handle unstructured status
				errObject := apierrors.FromObject(event.Object)
				statusErr, ok := errObject.(*apierrors.StatusError)
				if !ok {
					klog.Error(spew.Sprintf("Received an error which is not *metav1.Status but %#+v", event.Object))
					// Retry unknown errors
					return false, 0
				}

				status := statusErr.ErrStatus

				statusDelay := time.Duration(0)
				if status.Details != nil {
					statusDelay = time.Duration(status.Details.RetryAfterSeconds) * time.Second
				}

				switch status.Code {
				case http.StatusGone:
					// Never retry RV too old errors
					_ = rw.send(event)
					return true, 0

				case http.StatusGatewayTimeout, http.StatusInternalServerError:
					// Retry
					return false, statusDelay

				default:
					// We retry by default. RetryWatcher is meant to proceed unless it is certain
					// that it can't. If we are not certain, we proceed with retry and leave it
					// up to the user to timeout if needed.

					// Log here so we have a record of hitting the unexpected error
					// and we can whitelist some error codes if we missed any that are expected.
					klog.V(5).Info(spew.Sprintf("Retrying after unexpected error: %#+v", event.Object))

					// Retry
					return false, statusDelay
				}

			default:
				klog.Errorf("Failed to recognize Event type %q", event.Type)
				_ = rw.send(watch.Event{
					Type:   watch.Error,
					Object: &apierrors.NewInternalError(fmt.Errorf("retryWatcher failed to recognize Event type %q", event.Type)).ErrStatus,
				})
				// We are unable to restart the watch and have to stop the loop or this might cause lastResourceVersion inconsistency by skipping a potential RV with valid data!
				return true, 0
			}
		}
	}
}
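
Finally, here is a hedged sketch of how RetryWatcher could be wired into the namespace-watching scenario from the beginning of the post. It assumes the same pre-context client-go version used in the snippets above; the List call that provides the initial resourceVersion and the cache.ListWatch wrapper are my own additions for illustration, not something prescribed by the article:

package main

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
	watchtools "k8s.io/client-go/tools/watch"
	"k8s.io/klog"
)

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientSet, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// A List call gives us a resourceVersion to start the RetryWatcher from
	// ("" and "0" are rejected by NewRetryWatcher).
	list, err := clientSet.CoreV1().Namespaces().List(metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// cache.ListWatch satisfies cache.Watcher; RetryWatcher will call its
	// WatchFunc again with the last-received resourceVersion whenever the
	// underlying watch is closed.
	lw := &cache.ListWatch{
		WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
			return clientSet.CoreV1().Namespaces().Watch(options)
		},
	}

	rw, err := watchtools.NewRetryWatcher(list.ResourceVersion, lw)
	if err != nil {
		panic(err)
	}
	defer rw.Stop()

	for e := range rw.ResultChan() {
		// business logic; the channel is only closed when rw.Stop() is called
		// or the watcher gives up (e.g. resourceVersion too old).
		klog.Infof("got event: %s", e.Type)
	}
}

Compared with the manual re-create loop above, RetryWatcher resumes from the last-received resourceVersion, so events that occur between the close and the re-creation are not silently dropped.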