Converting .doc to PDF using Python

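I'm tasked with converting tons of .doc files to .pdf, and the only way my supervisor wants me to do this is through MS Word 2010. I know I should be able to automate this with Python COM automation. The only problem is that I don't know how or where to start. I tried searching for tutorials but couldn't find any (maybe I did, but I didn't know what I was looking for).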

Right now I'm reading through this. I don't know how useful it is going to be.

A simple example using comtypes, converting a single file, with the input and output filenames given as command-line arguments:

    import sys
    import os
    import comtypes.client

    wdFormatPDF = 17  # Word's save-format value for PDF

    # Word needs absolute paths
    in_file = os.path.abspath(sys.argv[1])
    out_file = os.path.abspath(sys.argv[2])

    word = comtypes.client.CreateObject('Word.Application')
    doc = word.Documents.Open(in_file)
    doc.SaveAs(out_file, FileFormat=wdFormatPDF)
    doc.Close()
    word.Quit()

You could also use pywin32, which would be the same except for:

 import win32com.client

and then:

 word = win32com.client.Dispatch('Word.Application')

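If you don't mind using PowerShell, have a look at this Hey, Scripting Guy! article. The code presented there could be adapted to use the wdFormatPDF enumeration value (see here). This blog article presents a different implementation of the same idea.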

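It is worth noting that Steven's answer works, but if you use a for loop to export multiple files, make sure to place the CreateObject or Dispatch statement before the loop, since the Word object only needs to be created once. See my question: Python win32com.client.Dispatch looping through Word documents and export to PDF; fails when next loop occurs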
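The point about creating the Word object once can be sketched as follows. This is only a minimal sketch, assuming comtypes on Windows with MS Word installed. The names `convert_all`, `default_word_factory` and the `word_factory` parameter are hypothetical and do not come from any of the answers; the factory argument exists only so the loop can be exercised without Word.

```python
import os

WD_FORMAT_PDF = 17  # same value as wdFormatPDF in the answers above


def default_word_factory():
    """Start Word through comtypes (assumes Windows with MS Word installed)."""
    import comtypes.client  # imported lazily so this module loads anywhere
    return comtypes.client.CreateObject('Word.Application')


def convert_all(in_files, out_dir, word_factory=default_word_factory):
    """Convert each file in in_files to a PDF in out_dir; return the PDF paths."""
    word = word_factory()  # created once, before the loop
    pdf_paths = []
    try:
        for in_file in in_files:
            base = os.path.splitext(os.path.basename(in_file))[0]
            out_file = os.path.join(os.path.abspath(out_dir), base + '.pdf')
            doc = word.Documents.Open(os.path.abspath(in_file))
            try:
                doc.SaveAs(out_file, FileFormat=WD_FORMAT_PDF)
            finally:
                doc.Close()  # close each document but keep Word running
            pdf_paths.append(out_file)
    finally:
        word.Quit()  # quit once, after the loop
    return pdf_paths
```

Quitting Word in a `finally` block is a design choice of this sketch: it avoids leaving a hidden Word process behind if one of the documents fails to convert.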

unoconv (written in Python) with OpenOffice running as a headless daemon. http://dag.wiee.rs/home-made/unoconv/

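It works very nicely for doc, docx, ppt, pptx, xls and xlsx. It is very useful if you need to convert documents, or save/convert them to certain formats, on a server.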
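For reference, calling unoconv from Python might look roughly like this. It is a sketch that assumes the `unoconv` executable is installed and on the PATH; the helper names `unoconv_command` and `convert_with_unoconv` are made up for illustration.

```python
import subprocess


def unoconv_command(in_file, fmt='pdf'):
    """Build the unoconv command line for converting one file."""
    return ['unoconv', '-f', fmt, in_file]


def convert_with_unoconv(in_file, fmt='pdf'):
    """Run unoconv; raises CalledProcessError on a non-zero exit status."""
    subprocess.check_call(unoconv_command(in_file, fmt))
```

With these helpers, `convert_with_unoconv('report.doc')` should leave the PDF next to the input file, which is unoconv's default behaviour as far as I know.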

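I have worked on this problem for half a day, so I think I should share some of my experience with it. Steven's answer is right, but it fails on my computer. There are two key points to fix it: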

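(1) When creating the 'Word.Application' object for the first time, I have to make it (the Word app) visible before opening any document. (Actually, I cannot explain why this works. If I do not do this on my computer, the program crashes when I try to open a document in invisible mode, and the 'Word.Application' object is then deleted by the OS.)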

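(2) After doing (1), the program sometimes works well but may still fail often. The crash error "COMError: (-2147418111, 'Call was rejected by callee.', (None, None, None, 0, None))" means that the COM server may not be able to respond that quickly. So I added a delay before trying to open a document.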

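After these two steps, the program works without any failures. The demo code is below. If you run into the same problems, try following these two steps. Hope it helps.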

    import os
    import time
    import comtypes.client

    wdFormatPDF = 17

    # absolute path is needed
    # be careful about the slash '\', use '\\' or '/' or raw string r"..."
    in_file = r'absolute path of input docx file 1'
    out_file = r'absolute path of output pdf file 1'

    in_file2 = r'absolute path of input docx file 2'
    out_file2 = r'absolute path of output pdf file 2'

    # print out filenames
    print(in_file)
    print(out_file)
    print(in_file2)
    print(out_file2)

    # create COM object
    word = comtypes.client.CreateObject('Word.Application')
    # key point 1: make word visible before opening a new document
    word.Visible = True
    # key point 2: wait for the COM server to get ready
    time.sleep(3)

    # convert docx file 1 to pdf file 1
    doc = word.Documents.Open(in_file)  # open docx file 1
    doc.SaveAs(out_file, FileFormat=wdFormatPDF)  # conversion
    doc.Close()  # close docx file 1
    word.Visible = False

    # convert docx file 2 to pdf file 2
    doc = word.Documents.Open(in_file2)  # open docx file 2
    doc.SaveAs(out_file2, FileFormat=wdFormatPDF)  # conversion
    doc.Close()  # close docx file 2
    word.Quit()  # close Word Application
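As a variation on the fixed delay in key point 2, the rejected call could be retried a few times instead. This is a different technique from the one in the answer, shown only as a sketch; `call_with_retry` and its parameters are hypothetical names.

```python
import time


def call_with_retry(func, retry_on=Exception, attempts=5, delay=1.0):
    """Call func(); on a retry_on exception, wait and try again.

    The last failure is re-raised once all attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)  # wait a little longer each time
```

With comtypes, opening a document could then be written as `call_with_retry(lambda: word.Documents.Open(in_file), retry_on=comtypes.COMError)`, assuming the "Call was rejected by callee" failure surfaces as `comtypes.COMError`, as the traceback quoted above suggests.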

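I tried the accepted answer but wasn't particularly keen on the bloated PDFs Word was producing, which were usually an order of magnitude bigger than expected. After looking into how to disable the dialogs when using a virtual PDF printer, I came across Bullzip PDF Printer and have been rather impressed with its features. It has now replaced the other virtual printers I used previously. You'll find a "free community edition" on their download page.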

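The COM API can be found here, and a list of the usable settings can be found here. The settings are written to a "runonce" file, which is used for one print job only and is then removed automatically. When printing multiple PDFs, we need to make sure one print job completes before starting the next, so that the settings are applied correctly to each file.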

    import os, re, time, datetime, win32com.client

    def print_to_Bullzip(file):
        util = win32com.client.Dispatch("Bullzip.PDFUtil")
        settings = win32com.client.Dispatch("Bullzip.PDFSettings")
        settings.PrinterName = util.DefaultPrinterName  # make sure we're controlling the right PDF printer

        # swap the file extension (raw strings keep the regex escapes intact)
        outputFile = re.sub(r"\.[^.]+$", ".pdf", file)
        statusFile = re.sub(r"\.[^.]+$", ".status", file)

        settings.SetValue("Output", outputFile)
        settings.SetValue("ConfirmOverwrite", "no")
        settings.SetValue("ShowSaveAS", "never")
        settings.SetValue("ShowSettings", "never")
        settings.SetValue("ShowPDF", "no")
        settings.SetValue("ShowProgress", "no")
        settings.SetValue("ShowProgressFinished", "no")  # disable balloon tip
        settings.SetValue("StatusFile", statusFile)  # created after print job
        settings.WriteSettings(True)  # write settings to the runonce.ini
        util.PrintFile(file, util.DefaultPrinterName)  # send to Bullzip virtual printer

        # wait until print job completes before continuing
        # otherwise settings for the next job may not be used
        timestamp = datetime.datetime.now()
        while (datetime.datetime.now() - timestamp).seconds < 10:
            if os.path.exists(statusFile) and os.path.isfile(statusFile):
                error = util.ReadIniString(statusFile, "Status", "Errors", '')
                if error != "0":
                    raise IOError("PDF was created with errors")
                os.remove(statusFile)
                return
            time.sleep(0.1)
        raise IOError("PDF creation timed out")

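You should start by investigating so-called virtual PDF print drivers. As soon as you find one, you should be able to write a batch file that prints your DOC files to PDF files. You can probably do this in Python too (set up the printer driver output and issue the document/print command in MS Word; the latter can be done from the command line, AFAIR).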

I would suggest ignoring your supervisor and using OpenOffice, which has a Python API. OpenOffice has built-in support for Python, and someone has created a library specifically for this purpose (PyODConverter).

If he isn't happy with the output, tell him it could take you weeks to do it with Word.
