5-Web Servers

Please indicate the source: http://blog.csdn.net/gaoxiangnumber1

Welcome to my github: https://github.com/gaoxiangnumber1

5.1 Web Servers Come in All Shapes and Sizes

A web server processes HTTP requests and serves responses.

5.1.1 Web Server Implementations

The web server implements the HTTP protocol, manages web resources, and provides web server administrative capabilities. It also shares responsibility for managing TCP connections with the operating system. The operating system manages the hardware details of the underlying computer system and provides TCP/IP network support, file systems to hold web resources, and process management to control current computing activities.

5.1.2 General-Purpose Software Web Servers

5.1.3 Web Server Appliances

5.1.4 Embedded Web Servers

5.2 A Minimal Perl Web Server

5.3 What Real Web Servers Do

Commercial web servers perform several common tasks, as shown in Figure 5-3:

1. Set up connection: accept a client connection, or close if the client is unwanted.
2. Receive request: read an HTTP request message from the network.
3. Process request: interpret the request message and take action.
4. Access resource: access the resource specified in the message.
5. Construct response: create the HTTP response message with the right headers.
6. Send response: send the response back to the client.
7. Log transaction: place notes about the completed transaction in a log file.

5.4 Step 1: Accepting Client Connections

If a client already has a persistent connection open to the server, it can use that connection to send its request. Otherwise, the client needs to open a new connection to the server.

5.4.1 Handling New Connections

When a client requests a TCP connection to the web server, the web server establishes the connection and determines which client is on the other side of the connection, extracting the IP address from the TCP connection. Once a new connection is established and accepted, the server adds the new connection to its list of existing web server connections and prepares to watch for data on the connection.
The web server is free to reject and immediately close any connection. Some web servers close connections because the client IP address or hostname is unauthorized or is a known malicious client. Other identification techniques can also be used.
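The accept-or-reject decision can be sketched as a check of the client address against a list of blocked networks. This is only an illustration: the function name is_client_allowed and the example networks are hypothetical and do not come from any particular web server.

```python
import ipaddress

# Hypothetical blocklist; a real server would load this from configuration.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.7/32"),
]

def is_client_allowed(client_ip, blocked=BLOCKED_NETWORKS):
    """Return False if client_ip falls inside any blocked network."""
    address = ipaddress.ip_address(client_ip)
    return not any(address in network for network in blocked)
```

A server could call this with the address returned by accept() and close the connection immediately when the result is False.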

5.4.2 Client Hostname Identification

Most web servers can be configured to convert client IP addresses into client hostnames by using “reverse DNS”. Web servers can use the client hostname for detailed access control and logging. Be warned that hostname lookups can take a long time, slowing down web transactions. Many high-capacity web servers either disable hostname resolution or enable it only for particular content.
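One common way to soften the cost of reverse DNS is to cache answers per IP address. The sketch below assumes that approach; make_hostname_lookup is a hypothetical helper, and the resolver parameter exists only so the lookup can be replaced in tests. By default it uses socket.gethostbyaddr from the Python standard library.

```python
import socket

def make_hostname_lookup(resolver=socket.gethostbyaddr):
    """Return a lookup function that caches reverse-DNS answers per IP."""
    cache = {}

    def lookup(client_ip):
        if client_ip not in cache:
            try:
                hostname, _aliases, _addresses = resolver(client_ip)
            except OSError:
                hostname = None  # no reverse record, or the lookup failed
            cache[client_ip] = hostname
        return cache[client_ip]

    return lookup
```

The cache here never expires, which is a simplification; a long-running server would want to bound its size and age out entries.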

5.4.3 Determining the Client User Through ident

Some web servers support the IETF ident protocol(RFC 1413). The ident protocol lets servers find out what username initiated an HTTP connection. This information is useful for web server logging: the second field of the Common Log Format contains the ident username of each HTTP request.
If a client supports the ident protocol, the client listens on TCP port 113 for ident requests. Figure 5-4 shows how the ident protocol works.

In Figure 5-4a, the client opens an HTTP connection. The server then opens its own connection back to the client’s identd server port(113), sends a request asking for the username corresponding to the new connection(specified by client and server port numbers), and retrieves from the client the response containing the username.
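The exchange is a single line of text in each direction. The sketch below formats the query and extracts the username from a reply, following the message layout in RFC 1413. The function names are hypothetical, and the code does not open the connection to port 113 itself.

```python
def build_ident_query(client_port, server_port):
    """Format the query the web server sends to the client's identd.

    RFC 1413 puts the port on the identd host first, which here is the
    HTTP client's port, followed by the web server's port.
    """
    return f"{client_port}, {server_port}\r\n".encode("ascii")

def parse_ident_reply(reply):
    """Return the username from a USERID reply, or None for ERROR or junk."""
    fields = reply.decode("ascii", errors="replace").strip().split(":", 3)
    if len(fields) == 4 and fields[1].strip() == "USERID":
        # Stripping whitespace around the username is a simplification
        # made for this sketch.
        return fields[3].strip()
    return None
```

For example, a reply of "6193, 80 : USERID : UNIX : stjohns" yields the username stjohns, while an ERROR reply yields None.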

5.5 Step 2: Receiving Request Messages

As the data arrives on connections, the web server reads out the data from the network connection and parses out the pieces of the request message(Figure 5-5).

When parsing the request message, the web server:
1. Parses the request line looking for the request method, the specified resource identifier(URI), and the version number[3], each separated by a single space, and ending with a carriage-return line-feed(CRLF) sequence[4].
2. Reads the message headers, each ending in CRLF.
3. Detects the end-of-headers blank line, ending in CRLF(if present).
4. Reads the request body, if any(length specified by the Content-Length header).
[3] HTTP/0.9 does not support version numbers. Some web servers accept a request line with no version number and interpret the message as an HTTP/0.9 request.
[4] Many web servers support LF or CRLF as end-of-line sequences because clients may send LF as the end-of-line terminator.
When parsing request messages, web servers receive input data from the network. The network connection can stall at any point. The web server needs to read data from the network and temporarily store the partial message data in memory until it receives enough data to parse it and make sense of it.
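The parsing steps, together with the need to wait for more data, can be sketched as a function that is called again each time new bytes arrive. This is a minimal illustration under stated assumptions: parse_request is a hypothetical name, only CRLF line endings are accepted, HTTP/0.9 requests are not handled, and repeated headers overwrite one another.

```python
def parse_request(buffer):
    """Parse one HTTP request from the bytes received so far.

    Returns (method, uri, version, headers, body), or None when the buffer
    does not yet hold a complete message and more data must be read.
    """
    head, separator, rest = buffer.partition(b"\r\n\r\n")
    if not separator:
        return None  # the end-of-headers blank line has not arrived yet

    lines = head.decode("iso-8859-1").split("\r\n")
    parts = lines[0].split(" ")
    if len(parts) != 3:
        raise ValueError("malformed request line")
    method, uri, version = parts

    headers = {}
    for line in lines[1:]:
        name, colon, value = line.partition(":")
        if not colon:
            raise ValueError("malformed header line")
        headers[name.strip().lower()] = value.strip()

    length = int(headers.get("content-length", "0"))
    if len(rest) < length:
        return None  # the body is still incomplete
    return method, uri, version, headers, rest[:length]
```

A server would append each chunk read from the connection to a per-connection buffer and call the function until it stops returning None.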

5.5.1 Internal Representations of Messages

Some web servers store the request messages in internal data structures that make the message easy to manipulate. E.g., the data structure might contain pointers and lengths of each piece of the request message, and the headers might be stored in a fast lookup table so the specific values of particular headers can be accessed quickly (Figure 5-6).

5.5.2 Connection Input/Output Processing Architectures

Web servers constantly watch for new web requests because requests can arrive at any time. Figure 5-7 illustrates the architectures described below.

Single-threaded web servers(Figure 5-7a)

Single-threaded web servers process one request at a time until completion. When the transaction is complete, the next connection is processed.
Pros: simple.
Cons: during processing, all other connections are ignored, which creates performance problems.

Multiprocess and multithreaded web servers(Figure 5-7b)

Multiprocess and multithreaded web servers dedicate multiple processes or threads to process requests simultaneously. The threads/processes may be created on demand or in advance. Many multithreaded web servers put a limit on the maximum number of threads/processes.
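A bounded pool of worker threads created in advance can be sketched with the standard library's ThreadPoolExecutor. The names handle_connection and serve and the fixed reply are hypothetical; the point is only that max_workers caps the number of requests processed at once.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

REPLY = b"HTTP/1.0 200 OK\r\nContent-Length: 0\r\n\r\n"

def handle_connection(conn):
    """Read one request and send a fixed reply, then close the connection."""
    with conn:
        conn.recv(4096)      # a real server would parse the request here
        conn.sendall(REPLY)

def serve(listener, max_workers=8):
    """Accept connections forever, handing each to a bounded thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while True:
            conn, _address = listener.accept()
            pool.submit(handle_connection, conn)
```

When all workers are busy, newly accepted connections wait in the executor's queue until a thread becomes free.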

Multiplexed I/O servers(Figure 5-7c)

In a multiplexed architecture, all the connections are simultaneously watched for activity. When a connection changes state(e.g., when data becomes available or an error condition occurs), a small amount of processing is performed on the connection; when that processing is complete, the connection is returned to the open connection list for the next change in state. Work is done on a connection only when there is something to be done; threads and processes are not tied up waiting on idle connections.
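The watch-then-do-a-little-work loop can be sketched with Python's selectors module. This is a simplified illustration, not a complete server: accepting new connections is omitted, the reply is fixed, and the names poll_once and handle_readable are hypothetical.

```python
import selectors
import socket

RESPONSE = b"HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok"

def handle_readable(sel, conn, buffers):
    """Do a small amount of work on one connection that has data ready."""
    data = conn.recv(4096)
    if data:
        buffers[conn] = buffers.get(conn, b"") + data
        if b"\r\n\r\n" not in buffers[conn]:
            return  # request incomplete; wait for the next state change
        conn.sendall(RESPONSE)
    # Either the reply was sent or the peer closed: drop the connection.
    sel.unregister(conn)
    conn.close()
    buffers.pop(conn, None)

def poll_once(sel, buffers, timeout=1.0):
    """Wait for activity on any watched connection, then service each one."""
    events = sel.select(timeout)
    for key, _mask in events:
        handle_readable(sel, key.fileobj, buffers)
    return len(events)
```

A server would register each accepted connection with sel.register(conn, selectors.EVENT_READ) and call poll_once in a loop, so no thread sits blocked on an idle connection.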

Multiplexed multithreaded web servers(Figure 5-7d)

Some systems combine multithreading and multiplexing to take advantage of multiple CPUs. Multiple threads(often one per physical processor) each watch the open connections(or a subset of the open connections) and perform a small amount of work on each connection.

5.6 Step 3: Processing Requests

Once the web server has received a request, it can process the request using the method, resource, headers, and optional body.
Some methods(e.g., POST) require entity body data in the request message. Other methods(e.g., OPTIONS) allow a request body but don’t require one. A few methods(e.g., GET) forbid entity body data in request messages.

5.7 Step 4: Mapping and Accessing Resources

Web servers are resource servers. They deliver pre-created content, such as HTML pages or JPEG images, and dynamic content from resource-generating applications running on the servers.
Before the web server can deliver content to the client, it needs to identify the source of the content, by mapping the URI from the request message to the content or content generator on the web server.

5.7.1 Docroots

The simplest form of resource mapping uses the request URI to name a file in the web server’s filesystem. A folder in the web server filesystem that is reserved for web content is called the document root. The web server takes the URI from the request message and appends it to the document root.

Figure 5-8: a request for /specials/saw-blade.gif. The web server has document root /usr/local/httpd/files, so it returns the file /usr/local/httpd/files/specials/saw-blade.gif.
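The mapping in Figure 5-8 can be sketched as a join of the document root and the URI path. The function name map_uri_to_file is hypothetical. The check that the result stays inside the document root is an addition for safety, since a path containing ".." segments could otherwise name files outside it; symbolic links are not considered in this sketch.

```python
import posixpath
from urllib.parse import unquote, urlsplit

def map_uri_to_file(docroot, request_uri):
    """Append the URI path to the docroot, refusing paths that escape it."""
    docroot = posixpath.normpath(docroot)
    uri_path = unquote(urlsplit(request_uri).path)
    candidate = posixpath.normpath(
        posixpath.join(docroot, uri_path.lstrip("/"))
    )
    if candidate != docroot and not candidate.startswith(docroot + "/"):
        raise ValueError("request path escapes the document root")
    return candidate
```

With the document root /usr/local/httpd/files, the URI /specials/saw-blade.gif maps to /usr/local/httpd/files/specials/saw-blade.gif, while /../../etc/passwd is rejected.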

5.7.1.1 Virtually hosted docroots

Virtually hosted web servers host multiple web sites on the same web server, giving each site its own distinct document root on the server. A virtually hosted web server identifies the correct document root to use from the IP address or hostname in the URI or the Host header.

Figure 5-9. The server hosts two sites: www.joes-hardware.com and www.marys-antiques.com. The server can distinguish the web sites using the HTTP Host header, or from distinct IP addresses.
1. When request A arrives, the server fetches the file for /docs/joe/index.html.
2. When request B arrives, the server fetches the file for /docs/mary/index.html.
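Selecting a docroot from the Host header amounts to a table lookup. The sketch below uses the two sites from Figure 5-9; the docroot paths /docs/joe and /docs/mary are inferred from the file paths above, and select_docroot is a hypothetical name. It ignores IPv6 address literals in the Host header.

```python
# Site table for the two virtual hosts in the example.
DOCROOTS = {
    "www.joes-hardware.com": "/docs/joe",
    "www.marys-antiques.com": "/docs/mary",
}

def select_docroot(host_header, docroots=DOCROOTS, default=None):
    """Pick the docroot for a Host header value, ignoring case and port."""
    hostname = host_header.strip().lower().partition(":")[0]
    return docroots.get(hostname, default)
```

A request for /index.html with Host: www.joes-hardware.com then resolves under /docs/joe, and the same path with Host: www.marys-antiques.com resolves under /docs/mary.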

5.7.1.2 User home directory docroots

Another use of docroots gives people private web sites on a web server. A convention maps URIs whose paths begin with a slash and tilde(/~) followed by a username to a private document root for that user. The private docroot is often the folder called public_html inside that user’s home directory, but it can be configured differently(Figure 5-10).
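The /~username convention can be sketched as a pattern match on the URI path. The function name map_user_uri, the /home base directory, and the set of characters allowed in usernames are assumptions made for illustration; the remainder of the path would still need a traversal check before use.

```python
import re

# Username must not start with a dot, so "/~../" cannot match.
USER_URI = re.compile(r"^/~([A-Za-z0-9_][A-Za-z0-9_.-]*)(/.*)?$")

def map_user_uri(uri_path, home_base="/home", subdir="public_html"):
    """Map /~user/rest to <home_base>/user/<subdir>/rest, else return None."""
    match = USER_URI.match(uri_path)
    if match is None:
        return None
    username = match.group(1)
    remainder = match.group(2) or "/"
    return f"{home_base}/{username}/{subdir}{remainder}"
```

Under these assumptions, /~bob/index.html maps to /home/bob/public_html/index.html.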

5.7.2 Directory Listings

A web server can receive requests for directory URLs, where the path resolves to a directory, not a file. Web servers can be configured to take different actions when a client requests a directory URL:
1. Return an error.
2. Return a special, default, “index file” instead of the directory.
3. Scan the directory, and return an HTML page containing the contents.
Most web servers look for a file named index.html or index.htm inside a directory to represent that directory. If a user requests a URL for a directory and the directory contains a file named index.html(or index.htm), the server will return the contents of that file.
If no default index file is present when a user requests a directory URI, and if directory indexes are not disabled, many web servers automatically return an HTML file listing the files in that directory, and the sizes and modification dates of each file, including URI links to each file.
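The three possible actions for a directory URL can be sketched in one function. The name handle_directory, the tuple return values, and the use of 403 as the error are choices made for this illustration, not the behavior of a specific server.

```python
import html
import os
import time
from urllib.parse import quote

INDEX_NAMES = ("index.html", "index.htm")

def handle_directory(dir_path, allow_listing=True):
    """Return ("file", path), ("listing", html_text), or ("error", 403)."""
    # Prefer a default index file if the directory contains one.
    for name in INDEX_NAMES:
        candidate = os.path.join(dir_path, name)
        if os.path.isfile(candidate):
            return ("file", candidate)
    if not allow_listing:
        return ("error", 403)
    # Otherwise generate a listing with a link, size, and date per entry.
    rows = []
    for name in sorted(os.listdir(dir_path)):
        info = os.stat(os.path.join(dir_path, name))
        modified = time.strftime("%Y-%m-%d %H:%M", time.gmtime(info.st_mtime))
        rows.append(
            f'<li><a href="{quote(name)}">{html.escape(name)}</a> '
            f"{info.st_size} bytes, modified {modified} UTC</li>"
        )
    return ("listing", "<ul>\n" + "\n".join(rows) + "\n</ul>")
```

File names are escaped before being placed in the HTML so that unusual names cannot inject markup into the generated page.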

5.7.3 Dynamic Content Resource Mapping

Web servers can map URIs to dynamic resources, that is, to programs that generate content on demand(Figure 5-11). A class of web servers called application servers connect web servers to backend applications. The web server needs to be able to tell when a resource is a dynamic resource, where the dynamic content generator program is located, and how to run the program. Most web servers provide basic mechanisms to identify and map dynamic resources.
CGI is an interface for executing server-side applications. Modern application servers have more powerful and efficient server-side dynamic content support, including Microsoft’s Active Server Pages and Java servlets.

5.7.4 Server-Side Includes(SSI)

Many web servers provide support for server-side includes. If a resource is flagged as containing server-side includes, the server processes the resource contents before sending them to the client.
The contents are scanned for certain special patterns(often contained inside special HTML comments), which can be variable names or embedded scripts. The special patterns are replaced with the values of variables or the output of executable scripts. This is an easy way to create dynamic content.
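The scan-and-replace step can be sketched with a regular expression. The sketch handles only one directive, written in the echo style used by Apache's server-side includes; expand_includes is a hypothetical name, and the "(none)" placeholder for unknown variables is a choice made for this example.

```python
import re

# Matches directives such as: <!--#echo var="USER" -->
ECHO_DIRECTIVE = re.compile(r'<!--#echo\s+var="([^"]+)"\s*-->')

def expand_includes(content, variables):
    """Replace each echo directive with the value of the named variable."""
    def substitute(match):
        return str(variables.get(match.group(1), "(none)"))
    return ECHO_DIRECTIVE.sub(substitute, content)
```

Ordinary HTML comments do not match the pattern and pass through unchanged.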

5.7.5 Access Controls

Web servers can assign access controls to particular resources. When a request arrives for an access-controlled resource, the web server can control access based on the IP address of the client, or it can issue a password challenge to get access to the resource.

5.8 Step 5: Building Responses

Once the web server has identified the resource, it performs the action described in the request method and returns the response message. The response message contains a response status code, response headers, and a response body if one was generated.

5.8.1 Response Entities

If the transaction generated a response body, the content is sent back with the response message. If there was a body, the response message usually contains:
1. A Content-Type header, describing the MIME type of the response body
2. A Content-Length header, describing the size of the response body
3. The actual message body content
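A response carrying these three parts can be assembled as follows. This is a minimal sketch: build_response is a hypothetical helper, and a real server would add further headers such as Date and Server.

```python
def build_response(status, reason, body, content_type):
    """Assemble a response with Content-Type, Content-Length, and body."""
    head = (
        f"HTTP/1.1 {status} {reason}\r\n"
        f"Content-Type: {content_type}\r\n"
        f"Content-Length: {len(body)}\r\n"  # size in bytes, not characters
        "\r\n"
    )
    return head.encode("ascii") + body
```

The body is passed as bytes so that Content-Length reflects the number of bytes actually sent.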

5.8.2 MIME Typing

The web server is responsible for determining the MIME type of the response body. There are many ways to configure servers to associate MIME types with resources.

mime.types

The web server can use the extension of the filename to indicate MIME type. The web server scans a file containing MIME types for each extension to compute the MIME type for each resource. This extension-based type association is the most common; it is illustrated in Figure 5-12.
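Python's standard mimetypes module performs this kind of extension lookup and can stand in for a server's mime.types table in a sketch. The wrapper name mime_type_for and the application/octet-stream fallback are choices made for this example.

```python
import mimetypes

def mime_type_for(filename, default="application/octet-stream"):
    """Look up a MIME type from the filename extension."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or default
```

For example, saw-blade.gif maps to image/gif and index.html maps to text/html.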

Magic typing

The Apache web server can scan the contents of each resource and pattern-match the content against a table of known patterns(called the magic file) to determine the MIME type for each file. This can be slow, but it is convenient, especially if the files are named without standard extensions.
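The idea behind magic typing can be sketched by matching the first bytes of a file against a few well-known signatures. The table below is a tiny hand-written example, not Apache's magic file or its format, and sniff_mime_type is a hypothetical name.

```python
# A few well-known leading byte signatures and their MIME types.
MAGIC_TABLE = [
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
]

def sniff_mime_type(content, default="application/octet-stream"):
    """Return the MIME type whose signature matches the start of content."""
    for signature, mime_type in MAGIC_TABLE:
        if content.startswith(signature):
            return mime_type
    return default
```

Only the first few bytes of each file need to be read for a table like this, although a full pattern table with many entries is still slower than an extension lookup.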

Explicit typing

Web servers can be configured to force particular files or directory contents to have a MIME type, regardless of the file extension or contents.

Type negotiation

Some web servers can be configured to store a resource in multiple document formats. In this case, the web server can be configured to determine the “best” format to use(and the associated MIME type) by a negotiation process with the user(Chapter 17).

5.8.3 Redirection

A web server can redirect the browser to go elsewhere to perform the request. A redirection response is indicated by a 3XX return code. The Location response header contains a URI for the new or preferred location of the content. Redirects are useful for:
1. Permanently moved resources: A resource might have been moved to a new location, or renamed, giving it a new URL. The web server can tell the client that the resource has been renamed, and the client can update any bookmarks, etc. before fetching the resource from its new location. The status code 301 Moved Permanently is used for this kind of redirect.
2. Temporarily moved resources: If a resource is temporarily moved or renamed, the server may want to redirect the client to the new location. But, because the renaming is temporary, the server wants the client to come back with the old URL in the future and not to update any bookmarks. The status codes 303 See Other and 307 Temporary Redirect are used for this kind of redirect.
3. URL augmentation: Servers use redirects to rewrite URLs, often to embed context. When the request arrives, the server generates a new URL containing embedded state information and redirects the user to this new URL.(These extended, state-augmented URLs are also called “fat URLs.”) The client follows the redirect, reissuing the request, but now including the full, state-augmented URL. This is a useful way of maintaining state across transactions. The status codes 303 See Other and 307 Temporary Redirect are used for this kind of redirect.
4. Load balancing: If an overloaded server gets a request, the server can redirect the client to a less heavily loaded server. The status codes 303 See Other and 307 Temporary Redirect are used for this kind of redirect.
5. Server affinity: Web servers may have local information for certain users; a server can redirect the client to a server that contains information about the client. The status codes 303 See Other and 307 Temporary Redirect are used for this kind of redirect.
6. Canonicalizing directory names: When a client requests a URI for a directory name without a trailing slash, most web servers redirect the client to a URI with the slash added, so that relative links work correctly.
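A redirect is an ordinary response with a 3XX status and a Location header. The sketch below builds one and applies it to the trailing-slash case; the function names are hypothetical, and the choice of 301 for directory canonicalization is an assumption, since servers differ in which 3XX code they use here.

```python
def build_redirect(status, reason, location):
    """Build a body-less redirect response pointing at location."""
    return (
        f"HTTP/1.1 {status} {reason}\r\n"
        f"Location: {location}\r\n"
        "Content-Length: 0\r\n"
        "\r\n"
    ).encode("ascii")

def canonicalize_directory(uri_path, is_directory):
    """Redirect a directory URI that lacks a trailing slash, else None."""
    if is_directory and not uri_path.endswith("/"):
        return build_redirect(301, "Moved Permanently", uri_path + "/")
    return None
```

A request for /specials, where specials is a directory, would be answered with a redirect to /specials/.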

5.9 Step 6: Sending Responses

The server needs to keep track of the connection state and handle persistent connections with care. For non-persistent connections, the server is expected to close its side of the connection when the entire message is sent.
For persistent connections, the connection may stay open, in which case the server needs to be extra cautious to compute the Content-Length header correctly, or the client will have no way of knowing when a response ends(Chapter 4).

5.10 Step 7: Logging

Finally, when a transaction is complete, the web server writes an entry to a log file describing the transaction performed. Most web servers provide several configurable forms of logging. Logging is covered in Chapter 21.
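As one example of a log entry, the sketch below formats a line in the Common Log Format: client host, ident username, authenticated username, timestamp, request line, status code, and response size. The function name format_common_log is hypothetical, and the timestamp passed in is assumed to be timezone-aware.

```python
from datetime import datetime, timedelta, timezone

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

def format_common_log(host, ident, user, when, request_line, status, size):
    """Format one Common Log Format entry; missing fields become "-"."""
    stamp = (
        f"{when.day:02d}/{MONTHS[when.month - 1]}/{when.year}:"
        + when.strftime("%H:%M:%S %z")
    )
    return (
        f'{host} {ident or "-"} {user or "-"} [{stamp}] '
        f'"{request_line}" {status} {size}'
    )
```

The second field is the ident username, which is written as a hyphen when no ident lookup was performed.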

5.11 For More Information