vm kernel parameters: the min_free_kbytes memory watermark and lowmem_reserve_ratio reserved memory

Note: this analysis is based on kernel 3.10.0-693.el7, i.e. CentOS 7.4.

1. Zone memory watermarks

Each node of system memory is divided into several zones, and each zone's memory has a set of watermarks. When memory usage crosses one of these thresholds, a corresponding action is triggered, such as reclaiming memory directly, or waking kswapd to reclaim memory in the background. We can inspect the min, low and high watermarks of every zone via /proc/zoneinfo.

[root@centos7 ~]# cat /proc/zoneinfo | grep -E "Node|min|low|high "
Node 0, zone      DMA
        min      92
        low      115
        high     138
Node 0, zone    DMA32
        min      10839
        low      13548
        high     16258
Node 1, zone    DMA32
        min      5706
        low      7132
        high     8559
Node 1, zone   Normal
        min      5890
        low      7362
        high     8835

These watermarks are controlled by the /proc/sys/vm/min_free_kbytes parameter.

2. Zone reserved memory

When allocating memory, the system may fall back across zones. For example, if we want an order-6 block from the Normal zone, but Normal is under memory pressure and cannot supply a block of that size, the allocation falls down into the DMA32 zone. Once or twice does no harm, but if every allocation behaves this way, DMA32 will eventually be exhausted, and applications that later need memory from that zone will fail to allocate, which is especially serious for programs that can only use a specific zone. Cross-zone allocation therefore has to be limited: when deciding whether a fallback allocation into a zone is allowed, the kernel consults that zone's lowmem_reserve value, which is controlled by /proc/sys/vm/lowmem_reserve_ratio.

3. How the kernel initializes the zone watermarks

During initialization the kernel calls init_per_zone_wmark_min to set up the watermarks of every zone, and at the same time to set each zone's lowmem_reserve values.

/*
 * For small machines we want it small (128k min).  For large machines
 * we want it large (64MB max).  But it is not linear, because network
 * bandwidth does not increase linearly with machine size.  We use
 *
 * 	min_free_kbytes = 4 * sqrt(lowmem_kbytes), for better accuracy:
 *	min_free_kbytes = sqrt(lowmem_kbytes * 16)
 */
int __meminit init_per_zone_wmark_min(void)
{
	unsigned long lowmem_kbytes;
	int new_min_free_kbytes;
    //total free pages of the system beyond each zone's high watermark
	lowmem_kbytes = nr_free_buffer_pages() * (PAGE_SIZE >> 10);
    //compute new_min_free_kbytes with the formula above
	new_min_free_kbytes = int_sqrt(lowmem_kbytes * 16);

	if (new_min_free_kbytes > user_min_free_kbytes) {
		min_free_kbytes = new_min_free_kbytes;
        //floor of 128 kB
		if (min_free_kbytes < 128)
			min_free_kbytes = 128;
        //ceiling of 65536 kB (64 MB); this only applies at initialization — values outside the range can still be set via the proc interface
		if (min_free_kbytes > 65536)
			min_free_kbytes = 65536;
	} else {
		pr_warn("min_free_kbytes is not updated to %d because user defined value %d is preferred\n",
				new_min_free_kbytes, user_min_free_kbytes);
	}
    //set each zone's min/low/high watermarks
	setup_per_zone_wmarks();
	refresh_zone_stat_thresholds();
    //set each zone's memory reserve against the other zones
	setup_per_zone_lowmem_reserve();
	setup_per_zone_inactive_ratio();
	return 0;
}

So at initialization min_free_kbytes is clamped to the range 128 kB to 64 MB, but a value set through the proc interface is not subject to this limit:

int min_free_kbytes_sysctl_handler(ctl_table *table, int write, 
	void __user *buffer, size_t *length, loff_t *ppos)
{
	int rc;

	rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
	if (rc)
		return rc;

	if (write) {
		user_min_free_kbytes = min_free_kbytes;
        //apply the user-supplied value directly to the min/low/high watermarks
		setup_per_zone_wmarks();
	}
	return 0;
}

Next, setup_per_zone_wmarks computes the min, low and high watermarks of every zone. Since multiple zones have to be considered, min_free_kbytes is distributed among them in proportion to their size.

void setup_per_zone_wmarks(void)
{
	mutex_lock(&zonelists_mutex);
	__setup_per_zone_wmarks();
	mutex_unlock(&zonelists_mutex);
}

static void __setup_per_zone_wmarks(void)
{
	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
	unsigned long lowmem_pages = 0;
	struct zone *zone;
	unsigned long flags;

	//sum up the memory of all non-ZONE_HIGHMEM zones
	for_each_zone(zone) {
		if (!is_highmem(zone))
			lowmem_pages += zone->managed_pages;
	}
    //set the min/low/high watermarks for each zone
	for_each_zone(zone) {
		u64 tmp;

		spin_lock_irqsave(&zone->lock, flags);
        //the next two statements distribute pages_min in proportion to this zone's share of total memory,
        //because only the min watermarks of all zones added together equal the real min_free_kbytes
		tmp = (u64)pages_min * zone->managed_pages;
		do_div(tmp, lowmem_pages);
        //64-bit machines have no highmem zone, so we ignore this case
		if (is_highmem(zone)) {
			...
		} else {
			//set the min watermark
			zone->watermark[WMARK_MIN] = tmp;
		}
        //low watermark = 5/4 of the min watermark
		zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + (tmp >> 2);
        //high watermark = 3/2 of the min watermark
		zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1);

		__mod_zone_page_state(zone, NR_ALLOC_BATCH,
				      high_wmark_pages(zone) -
				      low_wmark_pages(zone) -
				      zone_page_state(zone, NR_ALLOC_BATCH));

		setup_zone_migrate_reserve(zone);
		spin_unlock_irqrestore(&zone->lock, flags);
	}

	//update totalreserve_pages
	calculate_totalreserve_pages();
}

In summary (with 4 KB pages, min_free_kbytes/4 is the watermark budget in pages):

  • watermark[WMARK_MIN] = min_free_kbytes/4 * zone.pages/zone.allpages
  • watermark[WMARK_LOW] = 5/4 * watermark[WMARK_MIN]
  • watermark[WMARK_HIGH] = 3/2 * watermark[WMARK_MIN]

After the watermarks are set, totalreserve_pages is updated. This value estimates the memory the system needs to keep free for normal operation, and it is consulted during overcommit accounting to decide whether an allocation request should be permitted.

However, at this point the zones' lowmem_reserve values have not been set yet, so we postpone that analysis and first look at how lowmem_reserve is computed; after all, totalreserve_pages is recalculated once lowmem_reserve has been set.

So let's step into setup_per_zone_lowmem_reserve and see how the lowmem_reserve values are derived:

/*
 * setup_per_zone_lowmem_reserve - called whenever
 *	sysctl_lower_zone_reserve_ratio changes.  Ensures that each zone
 *	has a correct pages reserved value, so an adequate number of
 *	pages are left in the zone after a successful __alloc_pages().
 */
static void setup_per_zone_lowmem_reserve(void)
{
	struct pglist_data *pgdat;
	enum zone_type j, idx;
    //iterate over every node
	for_each_online_pgdat(pgdat) {
        //iterate over every zone; assume the system has the three zone types ZONE_DMA, ZONE_DMA32 and ZONE_NORMAL
		for (j = 0; j < MAX_NR_ZONES; j++) {
			struct zone *zone = pgdat->node_zones + j;
			unsigned long managed_pages = zone->managed_pages;
            //j=0: zone[DMA].lowmem_reserve[DMA] = 0
            //j=1: zone[DMA32].lowmem_reserve[DMA32] = 0
            //j=2: zone[NORMAL].lowmem_reserve[NORMAL] = 0
            //i.e. a zone never reserves memory against allocations of its own class
			zone->lowmem_reserve[j] = 0;

			idx = j;
			while (idx) {
				struct zone *lower_zone;

				idx--;

				if (sysctl_lowmem_reserve_ratio[idx] < 1)
					sysctl_lowmem_reserve_ratio[idx] = 1;

				lower_zone = pgdat->node_zones + idx;
                //j=0: the loop is not entered
                //j=1, idx=0: zone[DMA].lowmem_reserve[DMA32] = zone[DMA32].pages/sysctl_lowmem_reserve_ratio[DMA]
                //j=2, idx=1: zone[DMA32].lowmem_reserve[NORMAL] = zone[NORMAL].pages/sysctl_lowmem_reserve_ratio[DMA32]
                //     idx=0: zone[DMA].lowmem_reserve[NORMAL] = zone[NORMAL+DMA32].pages/sysctl_lowmem_reserve_ratio[DMA]
				lower_zone->lowmem_reserve[j] = managed_pages /
					sysctl_lowmem_reserve_ratio[idx];
				managed_pages += lower_zone->managed_pages;
			}
		}
	}
    //update totalreserve_pages again
	calculate_totalreserve_pages();
}

Since memory allocation can only fall back downward, never upward (a request that needs DMA memory cannot be satisfied from DMA32 or NORMAL, and likewise a request that needs DMA32 memory cannot be satisfied from NORMAL), there is no zone[DMA32].lowmem_reserve[DMA], nor zone[NORMAL].lowmem_reserve[DMA] or zone[NORMAL].lowmem_reserve[DMA32].

Putting this together, the values are computed as follows:

zone[DMA].lowmem_reserve[DMA] = 0
zone[DMA].lowmem_reserve[DMA32] = zone[DMA32].pages/sysctl_lowmem_reserve_ratio[DMA]
zone[DMA].lowmem_reserve[NORMAL] = zone[NORMAL+DMA32].pages/sysctl_lowmem_reserve_ratio[DMA] 

zone[DMA32].lowmem_reserve[DMA32] = 0
zone[DMA32].lowmem_reserve[NORMAL] = zone[NORMAL].pages/sysctl_lowmem_reserve_ratio[DMA32]

zone[NORMAL].lowmem_reserve[NORMAL] = 0

In short, the goal is to prevent cross-zone fallback allocations from exhausting the memory of the lower zones.

Once lowmem_reserve is set, totalreserve_pages is updated again:

/*
 * calculate_totalreserve_pages - called when sysctl_lower_zone_reserve_ratio
 *	or min_free_kbytes changes.
 */
static void calculate_totalreserve_pages(void)
{
	struct pglist_data *pgdat;
	unsigned long reserve_pages = 0;
	enum zone_type i, j;
    //iterate over every node
	for_each_online_pgdat(pgdat) {
        //iterate over every zone
		for (i = 0; i < MAX_NR_ZONES; i++) {
			struct zone *zone = pgdat->node_zones + i;
			unsigned long max = 0;

			/* Find valid and maximum lowmem_reserve in the zone */
            //find the largest reserve this zone holds against any higher zone class
			for (j = i; j < MAX_NR_ZONES; j++) {
				if (zone->lowmem_reserve[j] > max)
					max = zone->lowmem_reserve[j];
			}

			/* we treat the high watermark as reserved pages. */
            //each zone's high watermark plus its largest reserve is treated as memory kept back for system operation
			max += high_wmark_pages(zone);

			if (max > zone->managed_pages)
				max = zone->managed_pages;
			reserve_pages += max;
			/*
			 * Lowmem reserves are not available to
			 * GFP_HIGHUSER page cache allocations and
			 * kswapd tries to balance zones to their high
			 * watermark.  As a result, neither should be
			 * regarded as dirtyable memory, to prevent a
			 * situation where reclaim has to clean pages
			 * in order to balance the zones.
			 */
			zone->dirty_balance_reserve = max;
		}
	}
	dirty_balance_reserve = reserve_pages;
    //totalreserve_pages is consulted during overcommit accounting
    //as the minimum guarantee for normal system operation
	totalreserve_pages = reserve_pages;
}

4. Verification on a running system

Having analyzed the code, we now understand the min, low and high watermarks and each zone's lowmem_reserve. Let's verify whether the values on a real system match our analysis.

First the min/low/high watermarks and min_free_kbytes, taking the min value as the example:

[root@centos7 ~]# echo 90112 > /proc/sys/vm/min_free_kbytes
[root@centos7 ~]# cat /proc/sys/vm/min_free_kbytes
90112
[root@centos7 ~]# cat /proc/zoneinfo | grep -E "Node|managed|min"
Node 0, zone      DMA
        min      92
        managed  3977
Node 0, zone    DMA32
        min      10844
        managed  467178
Node 1, zone    DMA32
        min      5703
        managed  245716
Node 1, zone   Normal
        min      5887
        managed  253642

From these numbers:

Node[0].DMA.min = 90112/4*3977/(3977+467178+245716+253642) = 92
Node[0].DMA32.min = 90112/4*467178/(3977+467178+245716+253642) = 10844
Node[1].DMA32.min = 90112/4*245716/(3977+467178+245716+253642) = 5703
Node[1].Normal.min = 90112/4*253642/(3977+467178+245716+253642) = 5887

This matches our analysis.

Next, lowmem_reserve:

[root@centos7 ~]# cat /proc/zoneinfo | grep -E "Node|managed|protection"
Node 0, zone      DMA
        managed  3977
        protection: (0, 1824, 1824, 1824)
Node 0, zone    DMA32
        managed  467178
        protection: (0, 0, 0, 0)
Node 1, zone    DMA32
        managed  245716
        protection: (0, 0, 990, 990)
Node 1, zone   Normal
        managed  253642
        protection: (0, 0, 0, 0)
[root@centos7 ~]# cat /proc/sys/vm/lowmem_reserve_ratio 
256	256	32

The computation here is per node. Since the system has no MOVABLE zone, we skip that zone's parameters. We then have:

Node[0].zone[DMA].lowmem_reserve[DMA] = 0
Node[0].zone[DMA].lowmem_reserve[DMA32] = Node[0].zone[DMA32].pages/sysctl_lowmem_reserve_ratio[DMA] = 467178/256 = 1824
Node[0].zone[DMA].lowmem_reserve[NORMAL] = Node[0].zone[NORMAL+DMA32].pages/sysctl_lowmem_reserve_ratio[DMA] = 467178/256 = 1824

Node[0].zone[DMA32].lowmem_reserve[DMA] = 0
Node[0].zone[DMA32].lowmem_reserve[DMA32] = 0
Node[0].zone[DMA32].lowmem_reserve[NORMAL] = Node[0].zone[NORMAL].pages/sysctl_lowmem_reserve_ratio[DMA32] = 0/256 = 0

Node[1].zone[DMA32].lowmem_reserve[DMA] = 0
Node[1].zone[DMA32].lowmem_reserve[DMA32] = 0
Node[1].zone[DMA32].lowmem_reserve[NORMAL] = Node[1].zone[NORMAL].pages/sysctl_lowmem_reserve_ratio[DMA32] = 253642/256 = 990

Again, this matches our analysis.
