Building a Dataset (Part 1): Batch-Downloading Images from Google Images

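Background: for a project, I wanted to build a dataset of photos of flat, reflective materials, such as wood-grain tabletops and door panels, flat tiled floors and walls, and reflective metal surfaces. I could not find an existing dataset that met these needs, so I built my own.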

Method: first, here is the link to the original article: https://www.pyimagesearch.com/2017/12/04/how-to-create-a-deep-learning-dataset-using-google-images/

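  • Search Google Images for the pictures you want to download. This only works in Google Chrome. Readers in mainland China can reach Google through the GHelper extension; search online for how to install it.
  • The steps are as follows:
  • 1. Search Google Images for "C Ronaldo".
  • 2. Scroll the page until every image you want to download has loaded.
  • 3. Press "Ctrl + Shift + J" to open the JavaScript console.
  • 4. Paste the code below into the console. A download prompt for a file named 'urls.txt' then appears; choose a suitable location and save it.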
  • /**
     * simulate a right-click event so we can grab the image URL using the
     * context menu alleviating the need to navigate to another page
     *
     * attributed to @jmiserez: http://pyimg.co/9qe7y
     *
     * @param   {object}  element  DOM Element
     *
     * @return  {void}
     */
    function simulateRightClick( element ) {
        var event1 = new MouseEvent( 'mousedown', {
            bubbles: true,
            cancelable: false,
            view: window,
            button: 2,
            buttons: 2,
            clientX: element.getBoundingClientRect().x,
            clientY: element.getBoundingClientRect().y
        } );
        element.dispatchEvent( event1 );
        var event2 = new MouseEvent( 'mouseup', {
            bubbles: true,
            cancelable: false,
            view: window,
            button: 2,
            buttons: 0,
            clientX: element.getBoundingClientRect().x,
            clientY: element.getBoundingClientRect().y
        } );
        element.dispatchEvent( event2 );
        var event3 = new MouseEvent( 'contextmenu', {
            bubbles: true,
            cancelable: false,
            view: window,
            button: 2,
            buttons: 0,
            clientX: element.getBoundingClientRect().x,
            clientY: element.getBoundingClientRect().y
        } );
        element.dispatchEvent( event3 );
    }
    
    /**
     * grabs a URL Parameter from a query string because Google Images
     * stores the full image URL in a query parameter
     *
     * @param   {string}  queryString  The Query String
     * @param   {string}  key          The key to grab a value for
     *
     * @return  {string}               value
     */
    function getURLParam( queryString, key ) {
        var vars = queryString.replace( /^\?/, '' ).split( '&' );
        for ( let i = 0; i < vars.length; i++ ) {
            let pair = vars[ i ].split( '=' );
            if ( pair[0] == key ) {
                return pair[1];
            }
        }
        return false;
    }
    
    
    
    /**
     * Generate and automatically download a txt file from the URL contents
     *
     * @param   {string}  contents  The contents to download
     *
     * @return  {void}
     */
    function createDownload( contents ) {
        var hiddenElement = document.createElement( 'a' );
        hiddenElement.href = 'data:attachment/text,' + encodeURI( contents );
        hiddenElement.target = '_blank';
        hiddenElement.download = 'urls.txt';
        hiddenElement.click();
    }
    
    /**
     * grab all URLs via a Promise that resolves once all URLs have been
     * acquired
     *
     * @return  {object}  Promise object
     */
    function grabUrls() {
        var urls = [];
        return new Promise( function( resolve, reject ) {
            var count = document.querySelectorAll(
                '.isv-r a:first-of-type' ).length,
                index = 0;
            Array.prototype.forEach.call( document.querySelectorAll(
                '.isv-r a:first-of-type' ), function( element ) {
                // using the right click menu Google will generate the
                // full-size URL; won't work in Internet Explorer
                // (http://pyimg.co/byukr)
                simulateRightClick( element.querySelector( ':scope img' ) );
                // Wait for it to appear on the <a> element
                var interval = setInterval( function() {
                    if ( element.href.trim() !== '' ) {
                        clearInterval( interval );
                        // extract the full-size version of the image
                        let googleUrl = element.href.replace( /.*(\?)/, '$1' ),
                            fullImageUrl = decodeURIComponent(
                                getURLParam( googleUrl, 'imgurl' ) );
                        if ( fullImageUrl !== 'false' ) {
                            urls.push( fullImageUrl );
                        }
                        // sometimes the URL returns a "false" string and
                        // we still want to count those so our Promise
                        // resolves
                        index++;
                        if ( index == ( count - 1 ) ) {
                            resolve( urls );
                        }
                    }
                }, 10 );
            } );
        } );
    }
    
    /**
     * Call the main function to grab the URLs and initiate the download
     */
    grabUrls().then( function( urls ) {
        urls = urls.join( '\n' );
        createDownload( urls );
    } );
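The script's getURLParam and decodeURIComponent steps pull the full-size image address out of the `imgurl` query parameter of each Google link. As a rough illustration of that one step, here is a minimal Python sketch using only the standard library. The function name `extract_image_url` and the sample link are made up for this example and are not part of the original script.

```python
from urllib.parse import urlsplit, parse_qs


def extract_image_url(google_href):
    """Return the decoded `imgurl` query parameter, or None if it is absent."""
    query = urlsplit(google_href).query
    # parse_qs splits on '&' and '=' and percent-decodes each value
    values = parse_qs(query).get("imgurl")
    return values[0] if values else None


# Hypothetical link, shaped like the ones the console script reads
href = ("https://www.google.com/imgres"
        "?imgurl=https%3A%2F%2Fexample.com%2Fa.jpg"
        "&imgrefurl=https%3A%2F%2Fexample.com%2F")
print(extract_image_url(href))  # https://example.com/a.jpg
```

Links without an `imgurl` parameter yield None here, which plays the same role as the "false" value the JavaScript version skips.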
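  • Downloading the images with Python
  • Create a new file named "download_images.py", install the packages it needs in a virtual environment (the script imports imutils, requests, and cv2), and copy the code below into "download_images.py".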
  • # import the necessary packages
    from imutils import paths
    import argparse
    import requests
    import cv2
    import os
    # construct the argument parse and parse the arguments
    ap = argparse.ArgumentParser()
    ap.add_argument("-u", "--urls", required=True,
        help="path to file containing image URLs")    # path to the urls.txt file saved earlier
    ap.add_argument("-o", "--output", required=True,
        help="path to output directory of images")    # directory where the images are saved
    args = vars(ap.parse_args())
    # grab the list of URLs from the input file, then initialize the
    # total number of images downloaded thus far
    rows = open(args["urls"]).read().strip().split("\n")
    total = 0
    # loop the URLs
    for url in rows:
        try:
            # try to download the image
            r = requests.get(url, timeout=60)
            # save the image to disk
            p = os.path.sep.join([args["output"], "{}.jpg".format(
                str(total).zfill(8))])
            f = open(p, "wb")
            f.write(r.content)
            f.close()
            # update the counter
            print("[INFO] downloaded: {}".format(p))
            total += 1
        # handle if any exceptions are thrown during the download process
        except:
            print("[INFO] error downloading {}...skipping".format(p))
    # loop over the image paths we just downloaded
    for imagePath in paths.list_images(args["output"]):
        # initialize if the image should be deleted or not
        delete = False
        # try to load the image
        try:
            image = cv2.imread(imagePath)
            # if the image is `None` then we could not properly load it
            # from disk, so delete it
            if image is None:
                delete = True
        # if OpenCV cannot load the image then the image is likely
        # corrupt so we should delete it
        except:
            print("Except")
            delete = True
        # check to see if the image should be deleted
        if delete:
            print("[INFO] deleting {}".format(imagePath))
            os.remove(imagePath)
  • Run the script, passing the path to urls.txt and the output directory:
    $ python download_images.py --urls urls.txt --output images/santa
    [INFO] downloaded: images/santa/00000000.jpg
    [INFO] downloaded: images/santa/00000001.jpg
    [INFO] downloaded: images/santa/00000002.jpg
    [INFO] downloaded: images/santa/00000003.jpg
    ...
    [INFO] downloaded: images/santa/00000519.jpg
    [INFO] error downloading images/santa/00000519.jpg...skipping
    [INFO] downloaded: images/santa/00000520.jpg
    ...
    [INFO] deleting images/santa/00000211.jpg
    [INFO] deleting images/santa/00000199.jpg
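The cleanup pass in download_images.py relies on OpenCV failing to load a file before deleting it. A cheaper pre-check, sketched below, looks only at the first few bytes of the downloaded content. This is a different technique from the cv2.imread check, not a replacement for it: it assumes only JPEG, PNG, and GIF files matter, and it cannot detect a truncated or otherwise corrupt image. The helper name `sniff_image_type` is hypothetical.

```python
# Hypothetical helper, not part of the original download_images.py.
# Maps well-known file signatures (magic bytes) to a format name.
IMAGE_SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_type(data):
    """Return 'jpeg', 'png', or 'gif' if data starts with a known signature, else None."""
    for signature, kind in IMAGE_SIGNATURES.items():
        if data.startswith(signature):
            return kind
    return None
```

One way to use it would be to call `sniff_image_type(r.content)` before writing the file and skip the URL when it returns None, for instance when a server answers with an HTML error page instead of an image.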
Original post: https://www.cnblogs.com/LuckBelongsToStrugglingMan/p/12900993.html