A Rough Look at CGI

First, take a look at Wikipedia's introduction to CGI: http://zh.wikipedia.org/wiki/%E9%80%9A%E7%94%A8%E7%BD%91%E5%85%B3%E6%8E%A5%E5%8F%A3

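When developing Web applications we rarely touch low-level details such as CGI. But if you want to thoroughly understand the Request-Response cycle, or write your own application server, you need to understand CGI in detail. Dynamic web technologies in many languages build on the ideas of CGI and extend them, for example Python's WSGI and Perl's PSGI.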

There is a very good article introducing CGI: http://www.jdon.com/idea/cgi.htm

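On its own, a basic HTTP Server can only handle static requests, that is, return files that already exist on disk. But what if the data a request asks for has to be fetched from a database? (We call this a dynamic request.)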

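Suppose we write a Web application without any HTTP server at all, using raw Socket programming. Our program parses the user's request (the HTTP protocol) itself. For a static file it directly calls the method that serves static files; for a dynamic request it calls the method that handles that request. Throughout the process we use no protocol other than HTTP, so the handling is direct and the flow is clear. However, whenever the system needs a new feature we have to add or modify the corresponding method, which makes it hard to extend and maintain. Building a website this way also demands a lot from the programmer, who must be proficient not only in Socket programming and HTTP but also in HTML.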

The pseudocode below illustrates this flow:

while True:
    conn = server.accept()                  # accept a connection
    req = conn.read()                       # read the user's request
    headers = parse_http(req)               # parse the HTTP request
    if is_static(headers["url"]):           # static request
        res_str = do_static(headers["url"])

    elif is_dynamic(headers["url"]):        # dynamic request
        res_str = do_dynamic(req)

    res_str = end_handler(res_str)          # post-process the response string
    conn.write(res_str)                     # send the result to the user
    conn.close()                            # close the connection
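The parsing and dispatch steps in the pseudocode can be sketched in runnable form. This is only an illustration under stated assumptions: `parse_request_line`, `is_dynamic`, and `dispatch` are hypothetical names, and the rule that anything under `/cgi-bin/` is a dynamic request is an assumption, not something the pseudocode defines.

```python
def parse_request_line(raw: bytes):
    """Split the first line of an HTTP request into (method, path, query, version)."""
    first_line = raw.split(b"\r\n", 1)[0].decode("iso-8859-1")
    method, target, version = first_line.split(" ", 2)
    path, _, query = target.partition("?")
    return method, path, query, version


def is_dynamic(path: str) -> bool:
    # Assumed rule for this sketch: anything under /cgi-bin/ is dynamic.
    return path.startswith("/cgi-bin/")


def dispatch(raw: bytes) -> str:
    """Decide which handler a request would go to (returns a label only)."""
    _method, path, query, _version = parse_request_line(raw)
    if is_dynamic(path):
        return f"dynamic:{path}?{query}"
    return f"static:{path}"


if __name__ == "__main__":
    print(dispatch(b"GET /cgi-bin/hello.py?name=cgi HTTP/1.0\r\n\r\n"))
    print(dispatch(b"GET /index.html HTTP/1.0\r\n\r\n"))
```

A real server would call a file-serving function or a script runner at the two return points instead of returning a label.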

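With the approach above we would have to repeat these steps for every Web system we write, and all the dynamic request handlers would have to be written in the same language. So it is feasible but not practical. In fact, what you end up writing this way is an application server.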

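The idea stays the same, but we extract the repeated steps, i.e. everything except the dynamic handling, into an HTTP Server. To develop a website we then only need to write the static files to be served over HTTP and the scripts that handle dynamic requests. We call these CGI scripts, and they can be written in any language. Most high-performance HTTP Servers are written in C/C++; to invoke a CGI script, such a server can use the Unix exec family of functions (execve and its variants). Their signatures and a usage example follow.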

Function:  exec...
Purpose:   load and run another program, replacing the current process image
Usage:     int execl(const char *pathname, const char *arg0, ..., (char *)NULL);
           int execle(const char *pathname, const char *arg0, ..., (char *)NULL,
                      char *const envp[]);
           int execlp(const char *file, const char *arg0, ..., (char *)NULL);
           int execlpe(const char *file, const char *arg0, ..., (char *)NULL,
                       char *const envp[]);   /* DOS/Windows C runtimes only, not POSIX */
           int execv(const char *pathname, char *const argv[]);
           int execve(const char *pathname, char *const argv[], char *const envp[]);
           int execvp(const char *file, char *const argv[]);
           int execvpe(const char *file, char *const argv[], char *const envp[]);   /* GNU extension */
Example:

/* execv example */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int i;

    printf("Command line arguments:\n");
    for (i = 0; i < argc; i++)
        printf("[%2d] : %s\n", i, argv[i]);

    printf("About to exec child with arg1 arg2 ...\n");
    fflush(stdout);              /* flush before the process image is replaced */
    execv("CHILD.EXE", argv);    /* returns only if the exec fails */

    perror("exec error");

    exit(1);
}
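The same fork-then-exec pattern can be tried from Python, whose `os.fork`, `os.execve`, and `os.waitpid` wrap the corresponding system calls. The sketch below is an illustration only: `run_child` is a hypothetical helper, `EXIT_CODE` is a made-up variable used to show that the child sees the environment passed to `execve`, and the code assumes a Unix-like system.

```python
import os
import sys


def run_child(argv, env):
    """fork(), then execve() argv[0] in the child; return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: execve() replaces this process image and returns only on failure.
        try:
            os.execve(argv[0], argv, env)
        except OSError:
            os._exit(127)  # conventional "could not execute" status
    # Parent: wait for the child and decode its exit status.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


if __name__ == "__main__":
    child_code = "import os, sys; sys.exit(int(os.environ['EXIT_CODE']))"
    env = dict(os.environ, EXIT_CODE="7")  # EXIT_CODE is a made-up variable
    print(run_child([sys.executable, "-c", child_code], env))
```

The parent keeps running after `fork()`, while the child becomes the new program. This is why a server can hand one request to a script and go on accepting others.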

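But when we invoke a CGI script, the script needs some of the request's parameters. CGI has an important concept for this: environment variables. The environment variables defined by CGI/1.1 are listed below.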

Environment variables

   Environment variables are used to pass data about the request from
   the server to the script. They are accessed by the script in a system
   defined manner. In all cases, a missing environment variable is
   equivalent to a zero-length (NULL) value, and vice versa. The
   representation of the characters in the environment variables is
   system defined.

   Case is not significant in the names, in that there cannot be two
   different variable whose names differ in case only. Here they are
   shown using a canonical representation of capitals plus underscore
   ("_"). The actual representation of the names is system defined; for
   a particular system the representation may be defined differently to
   this.

   The variables are:

      AUTH_TYPE
      CONTENT_LENGTH
      CONTENT_TYPE
      GATEWAY_INTERFACE
      HTTP_*
      PATH_INFO
      PATH_TRANSLATED
      QUERY_STRING
      REMOTE_ADDR
      REMOTE_HOST
      REMOTE_IDENT
      REMOTE_USER
      REQUEST_METHOD
      SCRIPT_NAME
      SERVER_NAME
      SERVER_PORT
      SERVER_PROTOCOL
      SERVER_SOFTWARE

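Their meanings should be clear from the names; they look a bit like HTTP headers. By now the CGI processing flow should be clear. To summarize: when the Web server receives a CGI request, it sets environment variables for the CGI program and runs the CGI script. The script reads the variables it is interested in from the environment (for example the query string, QUERY_STRING), processes the request, and writes the response. For how to set and read environment variables, see the article on Unix environment variables in detail (详解 Unix环境变量).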
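From the script's side, the flow summarized above might look like the following minimal sketch. It assumes the server has set REQUEST_METHOD and QUERY_STRING as CGI/1.1 describes; `handle` and the `name` query parameter are hypothetical, chosen only for illustration.

```python
import os
import sys
from urllib.parse import parse_qs


def handle(environ) -> str:
    """Build a CGI response (header, blank line, body) from CGI variables."""
    method = environ.get("REQUEST_METHOD", "GET")
    params = parse_qs(environ.get("QUERY_STRING", ""))
    name = params.get("name", ["world"])[0]  # "name" is a made-up parameter
    body = f"Hello, {name}! (method={method})\n"
    # A CGI response starts with header lines, then a blank line, then the body.
    return "Content-Type: text/plain\r\n\r\n" + body


if __name__ == "__main__":
    # When run as a CGI script, the variables arrive in the process environment.
    sys.stdout.write(handle(os.environ))
```

Taking the environment as a parameter instead of reading `os.environ` inside the function keeps the handler easy to try out without a web server.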

Here is a piece of source code from an HTTP Server that handles CGI requests:

#include    <stdio.h>
#include    <stdlib.h>
#include    <string.h>
#include    <unistd.h>
#include    <sys/types.h>
#include    <sys/socket.h>
#include    <sys/stat.h>

/* helpers defined elsewhere in the server (not listed here);
   prototypes are inferred from how they are called below */
int  make_server_socket(int portnum);
int  not_exist(char *f);
int  isadir(char *f);
int  ends_in_cgi(char *f);
void cannot_do(int fd);
void do_404(char *item, int fd);
void do_ls(char *dir, int fd);
void do_exec(char *prog, int fd);
void do_cat(char *f, int fd);

void read_til_crnl(FILE *fp);
void process_rq(char *rq, int fd);

int main(int ac, char *av[])
{
    int     sock, fd;
    FILE    *fpin;
    char    request[BUFSIZ];

    if ( ac == 1 ){
        fprintf(stderr,"usage: ws portnum\n");
        exit(1);
    }
    sock = make_server_socket( atoi(av[1]) );
    if ( sock == -1 ) exit(2);

    /* main loop here */

    while(1){
        /* take a call and buffer it */
        fd = accept( sock, NULL, NULL );
        fpin = fdopen(fd, "r" );

        /* read request */
        fgets(request,BUFSIZ,fpin);
        printf("got a call: request = %s", request);
        read_til_crnl(fpin);

        /* do what client asks */
        process_rq(request, fd);

        fclose(fpin);
    }
}

/* ------------------------------------------------------ *
   read_til_crnl(FILE *)
   skip over all request info until a CRNL is seen
   ------------------------------------------------------ */

void read_til_crnl(FILE *fp)
{
    char    buf[BUFSIZ];
    while( fgets(buf,BUFSIZ,fp) != NULL && strcmp(buf,"\r\n") != 0 )
        ;
}

/* ------------------------------------------------------ *
   process_rq( char *rq, int fd )
   do what the request asks for and write reply to fd 
   handles request in a new process
   rq is HTTP command:  GET /foo/bar.html HTTP/1.0
   ------------------------------------------------------ */

void process_rq( char *rq, int fd )
{
    char    cmd[BUFSIZ], arg[BUFSIZ];

    /* create a new process and return if not the child */
    if ( fork() != 0 )
        return;

    strcpy(arg, "./");        /* precede args with ./ */
    if ( sscanf(rq, "%s%s", cmd, arg+2) != 2 )
        exit(1);              /* child: malformed request line */

    if ( strcmp(cmd,"GET") != 0 )
        cannot_do(fd);
    else if ( not_exist( arg ) )
        do_404(arg, fd );
    else if ( isadir( arg ) )
        do_ls( arg, fd );
    else if ( ends_in_cgi( arg ) )
        do_exec( arg, fd );
    else
        do_cat( arg, fd );

    exit(0);                  /* child must never fall back into the accept loop */
}

The above shows only part of the HTTP Server code. The code that handles a CGI request is listed below:

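void do_exec( char *prog, int fd )
{
    FILE    *fp ;

    fp = fdopen(fd,"w");
    header(fp, NULL);               /* helper (not listed): writes the reply header */
    fflush(fp);
    dup2(fd, 1);                    /* stdout -> client socket */
    dup2(fd, 2);                    /* stderr -> client socket */
    close(fd);
    execl(prog,prog,(char *)NULL);  /* returns only if the exec fails */
    perror(prog);
    exit(1);
}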

This uses Unix I/O redirection: the script's standard output (System.out.print() in Java, print in Python) is connected directly to fd, so whatever the script prints is exactly what the user receives.
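The redirection can be observed in isolation with a short sketch. This is an assumption-laden illustration, not the server's code: a `socket.socketpair()` stands in for the client connection, `exec_cgi` and `demo` are hypothetical names, and the "script" is an inline Python one-liner that prints QUERY_STRING. It needs a Unix-like system.

```python
import os
import socket
import sys


def exec_cgi(conn, argv, extra_env):
    """Fork; in the child, point stdout/stderr at the socket and exec the script."""
    pid = os.fork()
    if pid == 0:
        fd = conn.fileno()
        os.dup2(fd, 1)  # stdout -> client socket
        os.dup2(fd, 2)  # stderr -> client socket
        try:
            os.execve(argv[0], argv, dict(os.environ, **extra_env))
        except OSError:
            os._exit(127)
    os.waitpid(pid, 0)  # parent waits so the demo output is complete


def demo() -> bytes:
    """Run a one-line 'script' and return what the 'client' end receives."""
    server_side, client_side = socket.socketpair()
    script = "import os; print('query=' + os.environ['QUERY_STRING'])"
    exec_cgi(server_side, [sys.executable, "-c", script], {"QUERY_STRING": "name=cgi"})
    server_side.close()
    chunks = []
    while True:
        chunk = client_side.recv(4096)
        if not chunk:  # empty read means every writer has closed
            break
        chunks.append(chunk)
    client_side.close()
    return b"".join(chunks)


if __name__ == "__main__":
    print(demo())
```

The script itself knows nothing about sockets; it only prints. The server arranges, before the exec, that file descriptor 1 is the connection.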

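Of course, CGI is the most basic technique and also the least efficient: every CGI request costs one fork(), and the CGI script has to live on the same machine as the Web Server. Many technologies have since appeared to replace it, such as FastCGI and SCGI, and individual languages have extended CGI into their own specifications, such as Python's WSGI. Each language also has its own solutions for deploying a Web system behind an HTTP Server. The most common is to extend the HTTP Server with a dedicated handler module; Python's mod_python, for example, essentially embeds a Python interpreter in Apache.

A very popular architecture today puts an HTTP server in front as a proxy. It accepts user requests, answers requests for static files directly, and forwards dynamic requests to an application server; the application server returns its result to the HTTP server, which then returns it to the user. This is the Server/Gateway model. In Python a typical combination is Nginx + Gunicorn: Nginx is the proxy server and Gunicorn is the WSGI server. Nginx forwards dynamic requests to Gunicorn, and Gunicorn wraps each one into a request that conforms to the WSGI specification and then calls the corresponding app. Because the WSGI server is written in Python, it can call the app's method directly, with no fork() per request. (The processing resembles CGI, but the specification is WSGI, which is specific to Python and better suited to it.) For the details of the processing, see Python's wsgiref module. WSGI is similar to Java's Servlet, and Gunicorn is similar to Tomcat.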
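To make the contrast with CGI concrete, here is a minimal sketch of a WSGI application called in-process, using only the standard library's wsgiref. `app` and `call_app` are hypothetical names; the `environ` dictionary plays the role that environment variables play in CGI.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """A minimal WSGI application: environ carries the CGI-style variables."""
    body = f"query={environ.get('QUERY_STRING', '')}\n".encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]


def call_app(query_string):
    """Call the app the way a WSGI server would: a plain function call, no fork."""
    environ = {"QUERY_STRING": query_string}
    setup_testing_defaults(environ)  # fills in REQUEST_METHOD, SERVER_NAME, wsgi.* keys
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


if __name__ == "__main__":
    print(call_app("name=wsgi"))
```

To serve the same `app` over HTTP for experimentation, `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` is enough; a production setup would put a WSGI server such as Gunicorn behind Nginx as described above.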

If you want to understand how CGI is processed, I recommend reading the source of Python's CGI module directly; it is very easy to follow.

For the CGI/1.1 specification (draft), see http://tools.ietf.org/html/draft-robinson-www-interface-00