Python: Saving HTML as PDF, a Study Note

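Notes:

Today I suddenly wanted to save scraped HTML pages as PDF, as a learning exercise. The idea goes back to an article I read a long time ago about scraping WeChat official account posts and keeping them as PDFs. I wanted to implement it myself, but lazy people always find excuses, and a month or two slipped by. Today it came back to me, so I gave it a go, and it turned out to be more trouble than I expected. Code like this really has to be written by hand. Once it works, a summary that I can understand myself is enough: it is written from a beginner's starting point, so it may also help others avoid the pits I fell into. Reimplement other people's examples yourself, try the same idea on a different case, or read a few more articles.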

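I. Environment setup:

1. wkhtmltopdf download for Windows

Without it installed, you get an error.
I am on Windows, so I also had to install an exe:
Download link 1:
Download link 2
Download instructions for each platform

Run the downloaded exe to install it; I suggest changing the install location to the drive where you keep your software.

After installing, remember to add the install location's bin directory to the PATH environment variable.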
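A quick way to confirm that the bin directory really ended up on PATH is to ask Python where the executable is. This is only a minimal sketch using the standard library; the helper name `find_wkhtmltopdf` is my own, not part of any tool.

```python
import shutil


def find_wkhtmltopdf(executable="wkhtmltopdf"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(executable)


if __name__ == "__main__":
    location = find_wkhtmltopdf()
    if location is None:
        print("wkhtmltopdf is not on PATH; check the bin directory entry")
    else:
        print("wkhtmltopdf found at", location)
```

If this prints a path, a newly opened terminal should be able to run the wkhtmltopdf command directly. Terminals that were already open before PATH was changed usually need to be restarted.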

2. Install the pdfkit module:

pip install --upgrade pdfkit

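II. Code implementation:

After reading several blogs, I found the following ways to do this. The blogs I referred to are listed at the bottom for anyone who wants to read them.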

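Method 1 – wkhtmltopdf command with a URL:

Once wkhtmltopdf is installed, you can convert a single URL straight from the command line.
Command format: wkhtmltopdf + url + output name (absolute or relative path)

wkhtmltopdf https://www.liaoxuefeng.com/wiki/1016959663602400/1016959735620448 demo1.pdf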

Method 2 – wkhtmltopdf command with an HTML file:

Command format: wkhtmltopdf + html file (absolute or relative path) + output pdf path (absolute or relative path)

wkhtmltopdf .\0.html demo2.pdf

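Method 3 – pdfkit's from_url (the URL version does not raise an error):

Things to note:

You need to pass in the location of the wkhtmltopdf you just installed. I had already added it to the system environment variables, but without this configuration it still raised an error. One blogger's code worked without it, and I don't know why; on my machine the configuration is required.

path_wk = r'd:\tools\wkhtmltopdf\bin\wkhtmltopdf.exe'  # install location
config = pdfkit.configuration(wkhtmltopdf=path_wk)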

Code demo:

import pdfkit


path_wk = r'd:\tools\wkhtmltopdf\bin\wkhtmltopdf.exe'  # install location
config = pdfkit.configuration(wkhtmltopdf=path_wk)
# pdfkit.from_url(['google.com', 'yandex.ru', 'engadget.com'], 'out1.pdf', configuration=config)
pdfkit.from_url(['https://www.liaoxuefeng.com/wiki/1016959663602400/1016959735620448'], 'demo3.pdf', configuration=config)
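To avoid hard-coding the install path on every machine, the configuration step can be wrapped so that an explicit path is used when it exists and PATH is searched otherwise. This is a rough sketch rather than the way pdfkit itself resolves the binary; `resolve_wkhtmltopdf` and `url_to_pdf` are names I made up for illustration.

```python
import os
import shutil


def resolve_wkhtmltopdf(preferred_path=None):
    """Pick the wkhtmltopdf binary: explicit path first, then a PATH lookup."""
    if preferred_path and os.path.isfile(preferred_path):
        return preferred_path
    return shutil.which("wkhtmltopdf")  # None when it is not on PATH


def url_to_pdf(url, output_path, preferred_path=None):
    """Convert one URL to a PDF with pdfkit, passing the binary explicitly."""
    import pdfkit  # imported here so resolve_wkhtmltopdf works without pdfkit

    binary = resolve_wkhtmltopdf(preferred_path)
    if binary is None:
        raise FileNotFoundError("wkhtmltopdf not found; install it or pass its path")
    config = pdfkit.configuration(wkhtmltopdf=binary)
    return pdfkit.from_url(url, output_path, configuration=config)
```

A call would then look like url_to_pdf('https://www.liaoxuefeng.com/wiki/1016959663602400/1016959735620448', 'demo3.pdf', r'd:\tools\wkhtmltopdf\bin\wkhtmltopdf.exe').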

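Method 4 – pdfkit's from_file (mine does generate the PDF, but it also raises an error; I spent several hours without finding a fix, so if anyone knows, please enlighten me):

The code can convert a single HTML file or combine several:

# -*- coding: utf-8 -*-
import pdfkit

path_wk = r'd:\tools\wkhtmltopdf\bin\wkhtmltopdf.exe'  # install location
config = pdfkit.configuration(wkhtmltopdf=path_wk)
pdfkit.from_file(['0.html', '1.html'], 'demo5.pdf', configuration=config)

The PDF is generated and opens fine, so the result is actually acceptable; it just raises an error. My one regret is that I never found a fix. When I have time I will test on my own computer at home to see whether something else in my work computer's environment is to blame:

The problem:

I am leaving the error here. If anyone understands it, please leave a comment and explain it to me.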

Exception in thread Thread-2:
Traceback (most recent call last):
  File "D:\tools\Python3.6\lib\threading.py", line 916, in _bootstrap_inner
    self.run()
  File "D:\tools\Python3.6\lib\threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "D:\tools\Python3.6\lib\subprocess.py", line 1084, in _readerthread
    buffer.append(fh.read())
  File "D:\tools\Python3.6\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 639: invalid continuation byte

Traceback (most recent call last):
  File "D:/zjf_workspace/000、爬蟲代碼-基礎的/scrapy_100_工具/27、將網頁html轉存成pdf/3、pdfkit模塊/2、pdfkit模塊--from_file.py", line 6, in <module>
    pdfkit.from_file(['0.html', '1.html'], 'demo5.pdf', configuration=config)
  File "D:\tools\Python3.6\lib\site-packages\pdfkit\api.py", line 49, in from_file
    return r.to_pdf(output_path)
  File "D:\tools\Python3.6\lib\site-packages\pdfkit\pdfkit.py", line 164, in to_pdf
    raise IOError("wkhtmltopdf exited with non-zero code {0}. error:\n{1}".format(exit_code, stderr))
OSError: wkhtmltopdf exited with non-zero code 1. error:
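
One way to dig further is to run wkhtmltopdf directly, capture its stderr as raw bytes, and decode it leniently so the real message becomes readable. The sketch below rests on an unverified guess: that byte 0xd5 comes from a message in the Windows console encoding (GBK on a Chinese system) being decoded as UTF-8. Both function names are my own.

```python
import subprocess


def decode_console_bytes(data, encodings=("utf-8", "gbk")):
    """Decode bytes with the first encoding that works; never raise."""
    for name in encodings:
        try:
            return data.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going and mark the bytes that cannot be decoded
    return data.decode("utf-8", errors="replace")


def run_and_show_stderr(args):
    """Run a command, capture raw stderr bytes, and decode them safely."""
    completed = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return completed.returncode, decode_console_bytes(completed.stderr)
```

Calling run_and_show_stderr(['wkhtmltopdf', '0.html', '1.html', 'demo5.pdf']) returns the exit code together with whatever wkhtmltopdf wrote to stderr, which may show why it exits with code 1 even though the PDF is produced.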

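Method 5 – run the first two methods from Python as system commands, which makes batch processing possible.

Python has three main ways to run a system command:

os.system()
os.popen()
subprocess.Popen()
Added later: for the differences between these three and how to use them, see my other blog post:
Summary of ways to run system commands from Python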
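
As a small illustration of batch processing with system commands, here is a sketch that uses subprocess.run (a convenience wrapper in the same module as subprocess.Popen) to convert a list of URLs or HTML files one after another. The function names and the 1.pdf, 2.pdf, ... output naming are assumptions made for this example.

```python
import subprocess


def build_command(source, output_pdf, executable="wkhtmltopdf"):
    """Build the argument list for one conversion (source is a URL or HTML file)."""
    return [executable, source, output_pdf]


def convert_all(sources, executable="wkhtmltopdf"):
    """Convert sources one after another; return the sources that failed."""
    failed = []
    for index, source in enumerate(sources, start=1):
        command = build_command(source, "{}.pdf".format(index), executable)
        result = subprocess.run(command)  # waits for this conversion to finish
        if result.returncode != 0:
            failed.append(source)
    return failed
```

Passing the command as a list instead of one formatted string avoids quoting problems when a path or URL contains spaces or characters such as &.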

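Added later: III. Saving Liao Xuefeng's 129-page blog as a single PDF, my own way:

I won't explain this one in detail. It simply combines everything above, so here is the code: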

import os
import subprocess

import requests
from PyPDF2 import PdfFileWriter, PdfFileReader
from lxml import etree


class Merge_LiaoXueFeng:
    def __init__(self, pdf_name,path):
        self.headers = {
            "Cookie": "Hm_lvt_2efddd14a5f2b304677462d06fb4f964=1571883576; Hm_lpvt_2efddd14a5f2b304677462d06fb4f964=1571884481",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
        }
        self.urls = self.get_url_list()
        self.pdf_name = pdf_name
        self.path = path

    def get_url_list(self):
        """
        Get the list of all article URLs from the index page
        :return:
        """
        response = requests.get("https://www.liaoxuefeng.com/wiki/1016959663602400", headers=self.headers)
        html = etree.HTML(response.text)
        with open('ret.html', 'w', encoding='utf-8') as file:
            file.write(response.text)
        href_list = html.xpath('//*[@id="x-wiki-index"]//a/@href')
        print("myself_href", href_list)
        urls = []
        for href in href_list:
            url = "http://www.liaoxuefeng.com" + href
            urls.append(url)
        return urls

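    def merge_pdf(self, infnList, outfn):
        """
        Merge several PDFs into one
        :return:
        """
        pdf_output = PdfFileWriter()
        opened_files = []
        # Append every page of every input PDF to the output (the merge itself)
        for infn in infnList:
            fh = open(infn, 'rb')  # must stay open until the output is written
            opened_files.append(fh)
            pdf_input = PdfFileReader(fh)
            # Number of pages in this PDF; every one of them is copied over
            page_count = pdf_input.getNumPages()
            print(page_count)
            for i in range(page_count):
                pdf_output.addPage(pdf_input.getPage(i))
        with open(outfn, 'wb') as out_file:
            pdf_output.write(out_file)
        # Close the inputs only after the merged file has been written
        for fh in opened_files:
            fh.close()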

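    def get_pdf_list(self):
        """
        Collect the absolute paths of the PDFs in the pdf directory under self.path
        :return: list of PDF paths, in article order
        """
        # path = os.getcwd(r"D:\zjf_workspace\000、爬蟲代碼-基礎的\scrapy_100_工具\27、將網頁html轉存成pdf\1、批量處理")
        pdf_dir = os.path.join(self.path, 'pdf')
        # Keep only the numbered PDFs written by run() (1.pdf, 2.pdf, ...)
        pdf_names = [name for name in os.listdir(pdf_dir)
                     if name.endswith('.pdf') and name[:-4].isdigit()]
        # os.listdir gives no useful order and plain sorting puts 10.pdf before 2.pdf,
        # so sort by the number to keep the articles in their original order
        pdf_names.sort(key=lambda name: int(name[:-4]))
        pdf_list = []
        for file_one in pdf_names:
            pdf_file = os.path.join(pdf_dir, file_one)
            pdf_list.append(pdf_file)
        return pdf_list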

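    def run(self):
        num = 0
        subprocess_list = []
        # Make sure the output directory exists before wkhtmltopdf writes into it
        pdf_dir = os.path.join(self.path, 'pdf')
        os.makedirs(pdf_dir, exist_ok=True)
        # 1. Save each article as a PDF
        for article_url in self.urls:
            num += 1
            # Start the next conversion without waiting for this one to finish.
            # Too many at once can freeze the machine, so wait after every 10 processes.
            output_pdf = os.path.join(pdf_dir, '{}.pdf'.format(num))
            subprocess_one = subprocess.Popen(['wkhtmltopdf', article_url, output_pdf])
            subprocess_list.append(subprocess_one)
            if len(subprocess_list) >= 10:
                for i in subprocess_list:
                    i.wait()
                subprocess_list = []
            # os_one = os.popen(r'wkhtmltopdf {} ./pdf2/{}.pdf'.format(article_url, num))
            # os_one.close()
            # print(dir(os_one))
            # time.sleep(20)
            # Run one after another (coroutines could speed this up)
            # os.system(r'wkhtmltopdf {} ./pdf3/{}.pdf'.format(article_url, num))

        # Fewer than 10 may be left at the end, so wait for those too
        for i in subprocess_list:
            i.wait()

        # 2. Collect all PDF files in the pdf directory
        pdf_list = self.get_pdf_list()
        print("pdf_list", pdf_list)

        # 3. Merge the PDFs
        print('PDF download finished, merging PDFs:')
        self.merge_pdf(pdf_list, self.pdf_name)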


if __name__ == '__main__':
    path = os.getcwd()
    print(path)
    liaoxuefeng = Merge_LiaoXueFeng(u"廖雪峯Python_all.pdf", path)
    liaoxuefeng.run()
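
The script above starts 10 processes and then waits for the whole batch before starting more. An alternative worth trying is a thread pool, which starts a new conversion as soon as any running one finishes. This is only a sketch of that idea, not what the script does; `run_command`, `run_limited`, and the default of 5 workers are my own choices.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_command(command):
    """Run one command to completion and return its exit code."""
    return subprocess.run(command).returncode


def run_limited(commands, max_workers=5):
    """Run commands with at most max_workers at once; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_command, commands))
```

Each command is an argument list such as ['wkhtmltopdf', article_url, './pdf/1.pdf']. The returned exit codes line up with the input order, so a non-zero entry identifies the article that failed.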

Final result:

Reference articles:

https://blog.csdn.net/hubaoquanu/article/details/66973149
https://blog.csdn.net/y101101025/article/details/62461115
https://blog.csdn.net/u012561176/article/details/83655247
https://blog.csdn.net/xc_zhou/article/details/80952168
