Jun 27, 2018

Web Scraping in Practice: Full-Page Screenshots

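Today I was wondering: will my static blog become useless someday? Probably not, since I can always render it locally! But suppose there really were nowhere left to read our posts: taking a screenshot in advance, saving a PDF, or making some other backup would still be worthwhile. This post shows how to take screenshots of web pages with Python, and it is super simple!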

Environment and dependencies

python3
selenium library
PhantomJS

The `selenium` library

Install it with pip3:

pip3 install selenium

Selenium automates browsers. That’s it! What you do with that power is entirely up to you. Primarily, it is for automating web applications for testing purposes, but is certainly not limited to just that. Boring web-based administration tasks can (and should!) be automated as well.

Selenium has the support of some of the largest browser vendors who have taken (or are taking) steps to make Selenium a native part of their browser. It is also the core technology in countless other browser automation tools, APIs and frameworks.

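Put simply, selenium is a tool that drives a browser or web rendering engine, which makes automated testing convenient. Its most notable feature is that it can simulate what a user does in a browser, such as clicking, scrolling, and dragging. It can also parse page content, which is why it is often used to scrape dynamic pages.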

selenium can create browser instances for Chrome, Firefox, Safari, and PhantomJS. I will write up the details when I have time.

`PhantomJS`

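I am not very familiar with this one, but I know it is a headless, scriptable web browser, which should be similar to running Chrome with the --headless flag (which runs Chrome without a UI). Why use PhantomJS here? Because it makes taking full-page screenshots with the selenium library easier. Let's install PhantomJS first.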

Go to http://phantomjs.org/download.html and download the package for your platform.

After extracting it you will see a bin directory containing the phantomjs command-line executable. Move it to a directory on your $PATH, for example:

mv phantomjs /usr/local/bin/

Restart the terminal and verify:

➜  phantomjs -v
2.1.1

The version is 2.1.1, so the installation succeeded!
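
The same check can also be done from Python before launching the browser. This is only a minimal sketch using the standard library; the helper name `find_phantomjs` is made up for illustration and is not part of selenium:

```python
import shutil


def find_phantomjs(name='phantomjs'):
    """Return the path of the executable found on $PATH, or None if missing."""
    return shutil.which(name)


if __name__ == '__main__':
    path = find_phantomjs()
    if path is None:
        print('phantomjs not found on $PATH')
    else:
        print('phantomjs found at', path)
```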

Full-page screenshot

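Here I use the selenium library with the PhantomJS engine to take full-page screenshots. The selenium library simply wraps the interface for driving PhantomJS, which is very convenient! Capturing a full-page screenshot of a web page takes very little code: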

#!/usr/local/bin/python3
import os
import time
from selenium import webdriver

URL = 'https://www.smslit.top'
PIC_PATH = os.path.join(os.path.expanduser('~'), 'Pictures/python/screenshot') 
PIC_NAME = 'screenshot.png'

def screenshot_web(url=URL, path=PIC_PATH, name=PIC_NAME):
    '''Capture the whole web page.
    :param url: the website url
    :type url: str
    :param path: the directory for saving the picture
    :type path: str
    :param name: the file name of the picture
    :type name: str
    '''
    # Create the output directory if it does not exist yet
    os.makedirs(path, exist_ok=True)
    # webdriver.PhantomJS needs the phantomjs binary on $PATH
    browser = webdriver.PhantomJS()
    try:
        browser.set_window_size(1200, 800)
        browser.get(url)
        # Crude fixed wait to give the page time to render
        time.sleep(3)
        pic_path = os.path.join(path, name)
        print(pic_path)
        if browser.save_screenshot(pic_path):
            print('Done!')
        else:
            print('Failed!')
    finally:
        # quit() also shuts down the PhantomJS process; close() only closes the window
        browser.quit()
        
if __name__ == '__main__':
    screenshot_web()

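The script first creates a browser instance backed by the PhantomJS engine, then calls get on the URL, waits 3 seconds, and calls the function that saves the screenshot. selenium provides WebDriverWait for waiting until a page has loaded, but this script does not use it and relies on a crude fixed sleep instead.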
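
The fixed sleep could be swapped for an explicit wait. The following is a minimal sketch, assuming that "loaded" means `document.readyState` has reached `complete`; pages that keep rendering through JavaScript afterwards may need a different condition. The helper names `page_is_ready` and `wait_until_loaded` are hypothetical:

```python
def page_is_ready(driver):
    """Return True once the browser reports the document as fully loaded."""
    return driver.execute_script('return document.readyState') == 'complete'


def wait_until_loaded(driver, timeout=10):
    """Block until page_is_ready(driver) is true, or raise TimeoutException."""
    # Imported here so page_is_ready can be used without selenium installed.
    from selenium.webdriver.support.ui import WebDriverWait

    # WebDriverWait repeatedly calls page_is_ready(driver) until it returns
    # a truthy value or the timeout (in seconds) expires.
    WebDriverWait(driver, timeout).until(page_is_ready)
```

In the script above, `time.sleep(3)` would then become `wait_until_loaded(browser)`.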

Result

Run the script:

python3 screenshot.py

⚠️

After running the script you may see the following:

/usr/local/lib/python3.6/site-packages/selenium/webdriver/phantomjs/webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
  warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '

selenium recommends using the --headless mode of Chrome or Firefox instead of PhantomJS, but the warning does not stop the script from working.
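
For reference, here is a rough sketch of the same idea with headless Chrome. It is not the method used in this post. It assumes Chrome and a matching chromedriver are installed, and it assumes that enlarging the window to the document's scroll height is enough to capture the whole page, which may not hold for lazy-loaded content or pages that scroll inside a fixed-height container. The function names are hypothetical:

```python
def resize_to_full_page(driver, width=1200):
    """Resize the window so its height matches the page's scroll height."""
    height = driver.execute_script(
        'return document.documentElement.scrollHeight')
    driver.set_window_size(width, height)
    return height


def screenshot_with_headless_chrome(url, pic_path, width=1200):
    """Save a screenshot of url to pic_path using headless Chrome."""
    # Imported here so resize_to_full_page stays usable without selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # run Chrome without a UI
    browser = webdriver.Chrome(options=options)
    try:
        browser.set_window_size(width, 800)
        browser.get(url)
        # Grow the window to the full page height before capturing
        resize_to_full_page(browser, width)
        return browser.save_screenshot(pic_path)
    finally:
        browser.quit()
```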

In Finder you can see that the directory has been created and that it contains the screenshot.