The system call that runs an executable file

sys_execve(): the entry point of the execve system call

/*
 * sys_execve() executes a new program.
 */
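long sys_execve(const char __user *name,  // path of the file to execute (the string lives in user space)
		const char __user *const __user *argv, // argument vector handed to the system call (in user space)
		const char __user *const __user *envp, struct pt_regs *regs) // regs describes the kernel stack at system call entry (for details see the system call chapter of 《情景分析》)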
{
	long error;
	char *filename;

	filename = getname(name); // copy the file name from user space into kernel space
	error = PTR_ERR(filename); 
	if (IS_ERR(filename))
		return error;
	error = do_execve(filename, argv, envp, regs);

#ifdef CONFIG_X86_32
	if (error == 0) {
		/* Make sure we don't return using sysenter.. */
                set_thread_flag(TIF_IRET);
        }
#endif

	putname(filename);
	return error;
}
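
A side note on the IS_ERR()/PTR_ERR() checks in sys_execve(): getname() returns either a valid pointer or a negative errno packed into the pointer value. The following is a minimal user-space sketch of that idiom, not the kernel's own code. The names err_ptr, ptr_err and is_err are made up for the illustration, and MAX_ERRNO is assumed to be 4095 as in the kernel headers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095	/* assumption: same limit as the kernel uses */

/* Pack a negative errno value into a pointer. */
static void *err_ptr(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)(intptr_t)error;
}

/* Recover the errno value from such a pointer. */
static long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* The top MAX_ERRNO addresses are never valid pointers, so they mean "error". */
static int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}
```

With these helpers, a caller can write the same pattern as sys_execve(): test is_err(p) first, and only then hand ptr_err(p) back as the return value.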

First, note the __user annotation: it marks the pointer that follows as pointing to an address in user space (for a detailed explanation see 《深入Linux內核框架》, P27).

On the parameters of sys_execve: Not only the register set with the arguments and the name of the executable file (filename) but also pointers to the arguments and the environment of the program are passed as in system programming. The notation is slightly clumsy because argv and envp are arrays of pointers, and both the pointer to the array itself as well as all pointers in the array are located in the userspace portion of the virtual address space. Recall from the Introduction that some precautions are required when userspace memory is accessed from the kernel, and that the __user annotations allow automated tools to check if everything is handled properly.
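
Seen from user space, the name, argv and envp parameters are simply what a program passes to execve(2). The sketch below is only an illustration of that calling side; run_program is a hypothetical helper, not something from the kernel or libc. It forks, calls execve() in the child and reports the child's exit status.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run `path` with the given argv/envp arrays
 * (both NULL-terminated arrays of pointers, as sys_execve expects)
 * and return the exit status, or -1 on failure.
 */
static int run_program(const char *path, char *const argv[], char *const envp[])
{
	pid_t pid;
	int status;

	assert(path != NULL);
	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		execve(path, argv, envp);	/* enters the kernel at sys_execve() */
		_exit(127);			/* only reached if execve() failed */
	}
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, calling it with path "/bin/sh" and an argv of { "sh", "-c", "exit 7", NULL } should yield 7, assuming /bin/sh exists on the system.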

Next, getname() copies the name of the file to be executed from user space into kernel space. It ends up calling the following function:

static char *getname_flags(const char __user * filename, int flags) 
{
	char *tmp, *result;

	result = ERR_PTR(-ENOMEM);
	tmp = __getname();  // allocate a PATH_MAX-sized kernel buffer from the names_cachep slab cache (one page on x86), because the file name could be very long (hu xi ming, Page 306)
	if (tmp)  {
		int retval = do_getname(filename, tmp);

		result = tmp;
		if (retval < 0) {
			if (retval != -ENOENT || !(flags & LOOKUP_EMPTY)) {
				__putname(tmp);
				result = ERR_PTR(retval);
			}
		}
	}
	audit_getname(result);
	return result;
}
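
Note the call to __getname() in this function: it allocates a buffer of one physical page for the file name. A path can be very long, so holding it in a local variable would place it on the kernel stack, which is clearly inappropriate because the kernel stack has only about 7KB of space.

do_getname() is then called to copy filename from user space into the kernel buffer that was just allocated: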

static int do_getname(const char __user *filename, char *page)
{
	int retval;
	unsigned long len = PATH_MAX;

	if (!segment_eq(get_fs(), KERNEL_DS)) {  // the address limit is not KERNEL_DS, so filename must be a user-space pointer
		if ((unsigned long) filename >= TASK_SIZE) // filename >= TASK_SIZE lies outside user space: illegal access
			return -EFAULT;
		if (TASK_SIZE - (unsigned long) filename < PATH_MAX)
			len = TASK_SIZE - (unsigned long) filename;    // clamp len so the copy cannot run past the end of user space
	}

	retval = strncpy_from_user(page, filename, len);  // copy filename from user space into the kernel buffer
	if (retval > 0) {
		if (retval < len)
			return 0;
		return -ENAMETOOLONG; 
	} else if (!retval)
		retval = -ENOENT;
	return retval;
}
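
On the TASK_SIZE length check in do_getname() (the lines underlined in red in the original post), my first reading went as follows. When a new process is created, copy_mm copies the parent's page directory and page tables to the child. Conceptually the parent's writable pages are copied too, while read-only pages need not be. Because of copy-on-write (COW), however, copy_mm does not actually copy any pages; a free page is allocated for the child only when the child really needs it, specifically when it first writes to it. The pages that back the user-space stack are writable, so they are not copied in copy_mm, and the child gets fresh, clean pages when it first uses its own stack. The child's first action is execve(argv), and user-space arguments are passed on the stack, so when filename is pushed onto the child's stack the child is given a clean stack page and *filename is pushed onto it. This is the first use of the child's stack, so the stack is empty, and the check can therefore derive the length of filename.

(There is a problem with this, though. The library wraps the execve system call in several different functions, most of them take more than one argument, and filename or pathname is usually the first. Given the order in which arguments are pushed, it should not be the first thing on the stack, so the region from the filename pointer up to TASK_SIZE would hold more than just filename. Does the library handle this somehow?)

That reading is flawed. The check does not compute the length of filename at all: it only caps len at the number of bytes between filename and TASK_SIZE, so that strncpy_from_user() never reads past the end of the user address space. The actual length is found from the terminating NUL during the copy. For a more accurate explanation see the blog post 《一個簡單的進程創建的例子》.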
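
The length clamp in do_getname() can be tried out in user space with a small sketch. The constants below are assumptions chosen for illustration (a 3GB TASK_SIZE as on a default x86-32 split, and a PATH_MAX of 4096), and copy_len_for is a made-up name.

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_TASK_SIZE 0xC0000000UL	/* assumed: 3GB user space on x86-32 */
#define SKETCH_PATH_MAX  4096L		/* assumed: PATH_MAX on Linux */

/*
 * Return how many bytes may be scanned starting at user address addr,
 * or -EFAULT if addr is not a user-space address at all.
 */
static long copy_len_for(unsigned long addr)
{
	long len = SKETCH_PATH_MAX;

	if (addr >= SKETCH_TASK_SIZE)
		return -EFAULT;
	if (SKETCH_TASK_SIZE - addr < (unsigned long)SKETCH_PATH_MAX)
		len = (long)(SKETCH_TASK_SIZE - addr);	/* stop at the user/kernel boundary */
	assert(len > 0 && len <= SKETCH_PATH_MAX);
	return len;
}
```

An address far below the boundary gets the full PATH_MAX, while an address 256 bytes below TASK_SIZE gets only 256, regardless of what the string there contains.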

The function that performs the copy ultimately expands to:

/*
 * Copy a null terminated string from userspace.
 */

#define __do_strncpy_from_user(dst, src, count, res)			   \
do {									   \
	int __d0, __d1, __d2;						   \
	might_fault();							   \
	__asm__ __volatile__(						   \
		"	testl %1,%1\n"					   \
		"	jz 2f\n"					   \
		"0:	lodsb\n"					   \
		"	stosb\n"					   \
		"	testb %%al,%%al\n"				   \
		"	jz 1f\n"					   \
		"	decl %1\n"					   \
		"	jnz 0b\n"					   \
		"1:	subl %1,%0\n"					   \
		"2:\n"							   \
		".section .fixup,\"ax\"\n"				   \
		"3:	movl %5,%0\n"					   \
		"	jmp 2b\n"					   \
		".previous\n"						   \
		_ASM_EXTABLE(0b,3b)					   \
		/* outputs: res in EDX, count in ECX, __d0 in EAX, __d1 in ESI, __d2 in EDI */ \
		: "=&d"(res), "=&c"(count), "=&a" (__d0), "=&S" (__d1),	   \
		  "=&D" (__d2)						   \
		/* inputs: "0".."4" reuse the register of the output with that number, so EDX and ECX start as count, ESI as src, EDI as dst */ \
		: "i"(-EFAULT), "0"(count), "1"(count), "3"(src), "4"(dst) \
		: "memory");						   \
} while (0)

This macro does the copy efficiently; for a detailed explanation see 《情景分析》, P250.
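
For readers who do not want to trace the assembly, here is a rough plain C rendering of the same loop, together with the way do_getname() interprets its result. It is a sketch under simplifying assumptions: the fault fixup (the -EFAULT path) is left out, and strncpy_counted and getname_status are hypothetical names.

```c
#include <assert.h>
#include <errno.h>

/* Plain C version of the lodsb/stosb loop; fault handling is omitted. */
static long strncpy_counted(char *dst, const char *src, long count)
{
	long left = count;		/* plays the role of ECX */

	assert(count >= 0);
	while (left > 0) {
		char c = *src++;	/* lodsb */
		*dst++ = c;		/* stosb */
		if (c == '\0')
			break;		/* jz 1f */
		left--;			/* decl %1 ; jnz 0b */
	}
	return count - left;		/* subl %1,%0 */
}

/* How do_getname() maps that result onto a status code. */
static int getname_status(long retval, long len)
{
	if (retval > 0)
		return retval < len ? 0 : -ENAMETOOLONG;
	if (retval == 0)
		return -ENOENT;		/* empty string */
	return (int)retval;		/* e.g. -EFAULT from the fixup path */
}
```

The return value is the string length without the NUL when the string fits, and exactly count when no NUL was seen within count bytes, which is why do_getname() treats retval == len as "name too long".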


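Now the name and path of the executable to be run have finally been copied into kernel space. Back in sys_execve, the next step is the call do_execve(filename, argv, envp, regs). In this kernel version do_execve() is a thin wrapper that packs argv and envp into struct user_arg_ptr and passes everything on to do_execve_common():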

/*
 * sys_execve() executes a new program.
 */
static int do_execve_common(const char *filename,
				struct user_arg_ptr argv,
				struct user_arg_ptr envp,
				struct pt_regs *regs)
{
	struct linux_binprm *bprm; // a very important structure; it is listed further below so the meaning of each member can be looked up
                                   // This structure is used to hold the arguments that are used when loading binaries.
	struct file *file;
	struct files_struct *displaced;
	bool clear_in_exec;
	int retval;

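	retval = unshare_files(&displaced);
/*
 *   The comment in the source reads:
 *    Helper to unshare the files of the current task.
 *    We don't want to expose copy_files internals to
 *    the exec layer of the kernel.
 *   Note that unsharing only gives the process its own copy of the file
 *   descriptor table (in do_fork, copy_files likewise copies just the
 *   files_struct and does not recursively copy anything deeper);
 *   the files themselves are not copied.
 */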
	if (retval)
		goto out_ret;

	retval = -ENOMEM;
	bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
	if (!bprm)
		goto out_files;

	retval = prepare_bprm_creds(bprm); //Prepare credentials and lock ->cred_guard_mutex.
	if (retval)
		goto out_free;

	retval = check_unsafe_exec(bprm);
/*
 * determine how safe it is to execute the proposed program
 * - the caller must hold ->cred_guard_mutex to protect against
 *   PTRACE_ATTACH
 */

	if (retval < 0)
		goto out_free;
	clear_in_exec = retval;
	current->in_execve = 1;/* Tell the LSMs that the process is doing an execve */
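	file = open_exec(filename); // open the executable; this belongs to the file system layer, though the open flags set inside are worth a look. Returns the struct file of the executable.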
	retval = PTR_ERR(file);
	if (IS_ERR(file))
		goto out_unmark;

	sched_exec();
/*
 * When a new process is started with the exec system call, a good opportunity
 * for the scheduler to move the task across CPUs arises. Naturally, it has not
 * been running yet, so there cannot be any negative effects on the CPU cache
 * by moving the task to another CPU.
 * 《深入Linux內核框架》 P125
 */
	bprm->file = file;
	bprm->filename = filename;
	bprm->interp = filename;

	retval = bprm_mm_init(bprm);
/*
 * Create a new mm_struct and populate it with a temporary stack
 * vm_area_struct.  We don't have enough context at this point to set the stack
 * flags, permissions, and offset, so we use temporary values.  We'll update
 * them later in setup_arg_pages().
 */
	if (retval)
		goto out_file;

	bprm->argc = count(argv, MAX_ARG_STRINGS);
	if ((retval = bprm->argc) < 0)
		goto out;

	bprm->envc = count(envp, MAX_ARG_STRINGS);
	if ((retval = bprm->envc) < 0)
		goto out;

	retval = prepare_binprm(bprm);
/*
 * Fill the binprm structure from the inode.
 * Check permissions, then read the first 128 (BINPRM_BUF_SIZE) bytes.
 * Whatever the executable format, the first 128 bytes of the file are
 * expected to hold the information needed to identify it and describe its properties.
 */
	if (retval < 0)
		goto out;

	retval = copy_strings_kernel(1, &bprm->filename, bprm); // push the file name itself (a kernel-space string) into the new stack area first
	if (retval < 0)
		goto out;

	bprm->exec = bprm->p;  // remember where the file name was placed
	retval = copy_strings(bprm->envc, envp, bprm);
	if (retval < 0)
		goto out;

	retval = copy_strings(bprm->argc, argv, bprm); // these three consecutive copy calls collect the arguments and environment needed to run the program into bprm
	if (retval < 0)
		goto out;

	retval = search_binary_handler(bprm,regs);  // the heart of the whole function
	if (retval < 0)
		goto out;

	/* execve succeeded */
	current->fs->in_exec = 0;
	current->in_execve = 0;
	acct_update_integrals(current);
	free_bprm(bprm);
	if (displaced)
		put_files_struct(displaced);
	return retval;

out:
	if (bprm->mm) {
		acct_arg_size(bprm, 0);
		mmput(bprm->mm);
	}

out_file:
	if (bprm->file) {
		allow_write_access(bprm->file);
		fput(bprm->file);
	}

out_unmark:
	if (clear_in_exec)
		current->fs->in_exec = 0;
	current->in_execve = 0;

out_free:
	free_bprm(bprm);

out_files:
	if (displaced)
		reset_files_struct(displaced);
out_ret:
	return retval;
}
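
The three consecutive copy calls in do_execve_common() fill the argument area from the top down, with bprm->p tracking the lowest used address. A minimal user-space sketch of that idea follows; it assumes a small fixed buffer instead of the kernel's stack pages, and arg_area_sketch and push_strings_sketch are invented names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define AREA_SIZE 256	/* assumed toy size; the kernel uses whole pages */

struct arg_area_sketch {
	char buf[AREA_SIZE];
	size_t p;	/* offset of the lowest used byte, like bprm->p */
};

/* Copy n strings below the current p, last string first. */
static int push_strings_sketch(struct arg_area_sketch *a, int n,
			       const char *const strs[])
{
	assert(a->p <= AREA_SIZE);
	while (n-- > 0) {
		size_t len = strlen(strs[n]) + 1;	/* include the NUL */

		if (len > a->p)
			return -E2BIG;			/* area exhausted */
		a->p -= len;
		memcpy(a->buf + a->p, strs[n], len);
	}
	return 0;
}
```

Starting with p = AREA_SIZE and pushing the file name, then the environment, then the arguments leaves argv[0] at the lowest address, which mirrors the order of the calls in do_execve_common().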

The important structure:

/*
 * This structure is used to hold the arguments that are used when loading binaries.
 */
struct linux_binprm {
	char buf[BINPRM_BUF_SIZE];
#ifdef CONFIG_MMU
	struct vm_area_struct *vma;
	unsigned long vma_pages;
#else
# define MAX_ARG_PAGES	32
	struct page *page[MAX_ARG_PAGES];
#endif
	struct mm_struct *mm;
	unsigned long p; /* current top of mem */
	unsigned int
		cred_prepared:1,/* true if creds already prepared (multiple
				 * preps happen for interpreters) */
		cap_effective:1;/* true if has elevated effective capabilities,
				 * false if not; except for init which inherits
				 * its parent's caps anyway */
#ifdef __alpha__
	unsigned int taso:1;
#endif
	unsigned int recursion_depth;
	struct file * file;
	struct cred *cred;	/* new credentials */
	int unsafe;		/* how unsafe this exec is (mask of LSM_UNSAFE_*) */
	unsigned int per_clear;	/* bits to clear in current->personality */
	int argc, envc;
	const char * filename;	/* Name of binary as seen by procps */
	const char * interp;	/* Name of the binary really executed. Most
				   of the time same as filename, but could be
				   different for binfmt_{misc,script} */
	unsigned interp_flags;
	unsigned interp_data;
	unsigned long loader, exec;
};

Back in do_execve, let us look at the core of this function, search_binary_handler(). An overview of it can be found in 《情景分析》, P311.

  
/*
 * cycle the list of binary formats handler, until one recognizes the image
 */
int search_binary_handler(struct linux_binprm *bprm,struct pt_regs *regs)
{
	unsigned int depth = bprm->recursion_depth;
	int try,retval;
	struct linux_binfmt *fmt;

	retval = security_bprm_check(bprm);
	if (retval)
		return retval;

	retval = audit_bprm(bprm);
	if (retval)
		return retval;

	retval = -ENOENT;
	for (try=0; try<2; try++) {
		read_lock(&binfmt_lock);
		list_for_each_entry(fmt, &formats, lh) {
			int (*fn)(struct linux_binprm *, struct pt_regs *) = fmt->load_binary;
			if (!fn)
				continue;
			if (!try_module_get(fmt->module))
				continue;
			read_unlock(&binfmt_lock);
			retval = fn(bprm, regs);
			/*
			 * Restore the depth counter to its starting value
			 * in this call, so we don't have to rely on every
			 * load_binary function to restore it on return.
			 */
			bprm->recursion_depth = depth;
			if (retval >= 0) {
				if (depth == 0)
					tracehook_report_exec(fmt, bprm, regs);
				put_binfmt(fmt);
				allow_write_access(bprm->file);
				if (bprm->file)
					fput(bprm->file);
				bprm->file = NULL;
				current->did_exec = 1;
				proc_exec_connector(current);
				return retval;
			}
			read_lock(&binfmt_lock);
			put_binfmt(fmt);
			if (retval != -ENOEXEC || bprm->mm == NULL)
				break;
			if (!bprm->file) {
				read_unlock(&binfmt_lock);
				return retval;
			}
		}
		read_unlock(&binfmt_lock);
		if (retval != -ENOEXEC || bprm->mm == NULL) {
			break;
#ifdef CONFIG_MODULES
		} else {
#define printable(c) (((c)=='\t') || ((c)=='\n') || (0x20<=(c) && (c)<=0x7e))
			if (printable(bprm->buf[0]) &&
			    printable(bprm->buf[1]) &&
			    printable(bprm->buf[2]) &&
			    printable(bprm->buf[3]))
				break; /* -ENOEXEC */
			request_module("binfmt-%04x", *(unsigned short *)(&bprm->buf[2]));
#endif
		}
	}
	return retval;
}

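The core of this function is a pair of nested loops. The inner loop walks every member of the formats list and lets each one try its load_binary() function to see whether it recognizes the file; if one does, it loads the target file and sets it running. If the inner loop ends without finding a member that can run the file, and the kernel supports loadable modules, request_module() is called to look in the file system for a module that can handle this file. If there is one, the module is loaded and the inner loop runs once more to look for a suitable member. If that still finds nothing, an error is returned.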
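
The inner loop can be illustrated with a small user-space sketch. It is not the kernel code: locking, module reference counting and the request_module() retry are left out, and fmt_sketch, the two toy handlers and search_handler_sketch are hypothetical names. Only the control flow is kept, that is, try each handler until one stops returning -ENOEXEC.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct linux_binfmt. */
struct fmt_sketch {
	const char *name;
	int (*load_binary)(const unsigned char *buf, size_t len);
};

static int load_script_sketch(const unsigned char *buf, size_t len)
{
	if (len < 2 || buf[0] != '#' || buf[1] != '!')
		return -ENOEXEC;	/* not mine, let the next handler try */
	return 0;
}

static int load_elf_sketch(const unsigned char *buf, size_t len)
{
	if (len < 4 || memcmp(buf, "\177ELF", 4) != 0)
		return -ENOEXEC;
	return 0;
}

static const struct fmt_sketch formats_sketch[] = {
	{ "script", load_script_sketch },
	{ "elf", load_elf_sketch },
};

/* Walk the list until a handler accepts the image or fails hard. */
static int search_handler_sketch(const unsigned char *buf, size_t len,
				 const char **who)
{
	int retval = -ENOENT;
	size_t i;

	assert(buf != NULL && who != NULL);
	for (i = 0; i < sizeof(formats_sketch) / sizeof(formats_sketch[0]); i++) {
		retval = formats_sketch[i].load_binary(buf, len);
		if (retval >= 0) {
			*who = formats_sketch[i].name;	/* recognized */
			return retval;
		}
		if (retval != -ENOEXEC)
			break;				/* real error: stop trying */
	}
	return retval;
}
```

A buffer starting with "#!" is claimed by the script handler, one starting with the ELF magic by the ELF handler, and anything else falls through with -ENOEXEC, which is the point where the real kernel would consider loading a module.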

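This involves format-specific handling for the different kinds of executable and cannot be covered in detail here; see the loading and starting of a.out object files in 《情景分析》 and of ELF object files in 《深入Linux內核框架》. Whatever the type of executable, though, the handler basically does the following: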

(1) It releases all resources used by the old process.
(2) It maps the application into virtual address space. The following segments must be taken into account (the variables specified are elements of the task structure and are set to the correct values by the binary format handler):
    (a) The text segment contains the executable code of the program. start_code and end_code specify the area in address space where the segment resides.
    (b) The pre-initialized data (variables supplied with a specific value at compilation time) are located between start_data and end_data and are mapped from the corresponding segment of the executable file.
    (c) The heap used for dynamic memory allocation is placed in virtual address space; start_brk and brk specify its boundaries.
    (d) The position of the stack is defined by start_stack; the stack grows downward automatically on nearly all machines. The only exception is currently PA-Risc. The inverse direction of stack growth must be noted by the architecture by setting the configuration symbol STACK_GROWSUP.
    (e) The program arguments and the environment are mapped into the virtual address space and are located between arg_start and arg_end and env_start and env_end, respectively.

(《深入Linux內核框架》, P81)

Back in do_execve_common(), the cleanup after search_binary_handler is:

	/* execve succeeded */
	current->fs->in_exec = 0;
	current->in_execve = 0;
	acct_update_integrals(current);  // update mm integral fields in task_struct; these are accounting fields that track memory usage over CPU time
	free_bprm(bprm);
	if (displaced)
		put_files_struct(displaced);
	return retval;

This completes the execve process!

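In summary, the main work done is:

(1) Copy the name of the executable from user space to kernel space:    filename = getname(name);
(2) Open the executable:    file = open_exec(filename);
(3) Initialize the linux_binprm structure, which holds all the information related to the binary while it is being loaded:    retval = bprm_mm_init(bprm);
(4) Collect the arguments and environment variables needed to run the program into bprm: the three consecutive copy_strings() calls
(5) The core of the function is search_binary_handler, which loads the executable.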

