HDU OJ Problem 4018: Parsing URL

HDU OJ problem 4018, Parsing URL.

Parsing URL

Problem Description

In computing, a Uniform Resource Locator or Universal Resource Locator (URL) is a character string that specifies where a known resource is available on the Internet and the mechanism for retrieving it.
The syntax of a typical URL is:
scheme://domain:port/path?query_string#fragment_id
In this problem, the scheme, domain is required by all URL and other components are optional. That is, for example, the following are all correct urls:
http://dict.bing.com.cn/#%E5%B0%8F%E6%95%B0%E7%82%B9
http://www.mariowiki.com/Mushroom
https://mail.google.com/mail/?shva=1#inbox
http://en.wikipedia.org/wiki/Bowser_(character)
ftp://fs.fudan.edu.cn/
telnet://bbs.fudan.edu.cn/
http://mail.bashu.cn:8080/BsOnline/
Your task is to find the domain for all given URLs.

Input

There are multiple test cases in this problem. The first line of input contains a single integer denoting the number of test cases. For each of test case, there is only one line contains a valid URL.

Output

For each test case, you should output the domain of the given URL.

Sample Input

3
http://dict.bing.com.cn/#%E5%B0%8F%E6%95%B0%E7%82%B9
http://www.mariowiki.com/Mushroom
https://mail.google.com/mail/?shva=1#inbox

Sample Output

Case #1: dict.bing.com.cn
Case #2: www.mariowiki.com
Case #3: mail.google.com

Source

The 36th ACM/ICPC Asia Regional Shanghai Site —— Warmup

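Approach: this is simple string parsing with no real difficulty. The one thing to watch is that the port number must not be printed. Java's regular expressions handle this easily.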

import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main
{
    public static void main(String args[])
    {
        Scanner cin = new Scanner(System.in);
        int n;
        String URL;
        Matcher matcher;
        // Group 1: scheme plus "://". Group 2: the domain, which ends at the
        // port (':'), path ('/'), query ('?'), fragment ('#'), or end of line.
        Pattern pattern = Pattern.compile("([A-Za-z]+://)([^:/?#]+).*");

        n = cin.nextInt();
        URL = cin.nextLine();   // consume the rest of the first line
        for ( int i = 1 ; i <= n ; i ++ )
        {
            URL = cin.nextLine();
            matcher = pattern.matcher(URL);
            if ( matcher.matches() )
                System.out.println("Case #" + i + ": " + matcher.group(2) );
        }
    }
}

C works too, if you prefer. In principle, C could use the GNU regular expression functions declared in <regex.h>.

C + GNU regular expressions

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <regex.h>

typedef int COUNT;

#define MAX_LENGTH 1000

int main (void)
{
    COUNT i;
    int n;
    char url[MAX_LENGTH];
    regmatch_t pmatch[3];
    regex_t match_regex;

    /* Group 1: scheme plus "://". Group 2: the domain, which ends at the
       port, path, query, fragment, or end of string. */
    if ( regcomp( &match_regex, "([A-Za-z]+://)([^:/?#]+)", REG_EXTENDED ) != 0 )
        return EXIT_FAILURE;

    scanf( "%d", &n );
    for ( i = 1 ; i <= n ; i ++ )
    {
        scanf( "%999s", url );
        /* Skip the line if it does not look like a URL. */
        if ( regexec( &match_regex, url, 3, pmatch, 0 ) != 0 )
            continue;
        url[pmatch[2].rm_eo] = '\0';
        printf( "Case #%d: %s\n", i, &(url[pmatch[2].rm_so]) );
    }

    regfree( &match_regex );
    return EXIT_SUCCESS;
}

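However, HDU OJ runs on a Windows server and its gcc is the MinGW build, which does not support GNU regular expressions. A C solution therefore has to parse the string by hand. The C code is as follows: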

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

typedef int COUNT;

#define MAX_LENGTH 1000

int main (void)
{
    COUNT i, j;
    int n;
    bool starturl;                      /* true once the "//" has been passed */
    char url[MAX_LENGTH];
    char outputurl[MAX_LENGTH + 32];    /* room for the "Case #i: " prefix */
    int len;
    scanf( "%d", &n );
    for ( i = 1 ; i <= n ; i ++ )
    {
        starturl = false;
        scanf( "%999s", url );
        sprintf (outputurl, "Case #%d: ", i );
        len = strlen( outputurl );
        for ( j = 0 ; url[j] != '\0' ; j ++ )
        {
            if ( !starturl )
            {
                /* The first '/' belongs to "://"; skip the second one too. */
                if ( url[j] == '/' )
                {
                    j ++;
                    starturl = true;
                }
            }
            else
            {
                /* The domain ends at the port, path, query, or fragment. */
                if ( url[j] == ':' 
                        || url[j] == '/'
                        || url[j] == '?'
                        || url[j] == '#' )
                    break;
                outputurl[len++] = url[j];
            }
        }
        outputurl[len] = '\0';
        puts( outputurl );
    }
    return EXIT_SUCCESS;
}
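
The hand-written scan can also be expressed with two standard C library calls, which are available under MinGW as well. The following is only a minimal sketch, not the code from the original post: the helper name extract_domain and its return convention are assumptions made for illustration, and it assumes the domain ends at the first ':', '/', '?' or '#' after "://".

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of the original solution): copies the
 * domain of `url` into `out`, which has room for `cap` bytes.
 * Returns 0 on success, or -1 if `url` contains no "://" or the domain
 * (plus its terminating '\0') does not fit into `out`. */
static int extract_domain(const char *url, char *out, size_t cap)
{
    const char *start = strstr(url, "://");
    size_t len;

    if (start == NULL)
        return -1;
    start += 3;                     /* skip past "://" */

    /* Length of the run that contains none of the terminators:
     * ':' (port), '/' (path), '?' (query), '#' (fragment).
     * If none occurs, this is the length of the rest of the string. */
    len = strcspn(start, ":/?#");
    if (len + 1 > cap)
        return -1;

    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}
```

Inside the main loop, the inner character loop would then reduce to a single call such as extract_domain(url, domain, sizeof domain), followed by printf("Case #%d: %s\n", i, domain), where domain is a second char buffer.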