Requirement to assemble PDF file into docx file

Query #1

We have a requirement to assemble a PDF file into a DOCX file. To do this, we convert the PDF file into multiple images (one image per PDF page) using Aspose’s ImageDevice (JpegDevice), and then insert these images into the DOCX using the Open XML library. However, the PDF-to-image conversion takes too much time: approx. 7.5 seconds for a 5 MB PDF file with 21 pages containing text and images. As file size and content complexity increase, performance degrades heavily.

Can you please let us know if there is a better way to fulfill the requirement, or whether we are doing anything wrong here? Below is the sample code of the POC done for PDF-to-image conversion.

Query #2

Another approach we were considering is to merge two PDFs to satisfy the above requirement (the main DOCX is converted to PDF using the Aspose library, and we need to merge another PDF into it). In Open XML we have the content control block, which tells us where we need to insert/merge another file. In the same way, is there anything in the Aspose library that lets us decide after which page to merge another PDF’s pages? Please refer to the sample code of the POC below; currently we pass a static value in the variable “insertAfterPage”. Can we have some type of tagging in the PDF (similar to the content control block of Open XML) so that we know where to merge the other PDF’s pages?

Please let us know if more information is required. Kindly note that we have already tried Aspose’s PDF to Word converter API, but it won’t work for us as it does not preserve formatting properly for complex PDF files.

@iticertis

Thank you for contacting support.

If you want to keep several files in one unified container, you may consider creating a PDF portfolio. Please refer to How to Create PDF Portfolio for further information. Moreover, rendering every page of a PDF document as an image takes time that depends on the content of each page. So, another approach is to convert the DOCX file to a PDF document and then concatenate the PDF files. You can insert an empty page at any index of the PDF document as a marker for concatenating the other file. Later, at any stage, you can detect the blank page using the code snippet below and then concatenate the other PDF file with this document. You can also delete a particular page, the blank page for instance, as per your requirements.

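// Load the PDF document
string inputFileName = @"c:\sample.pdf";
Document pdfDocument = new Document(inputFileName);
// Page numbers are 1-based; the argument to IsBlank is the fill threshold factor
bool isBlank = pdfDocument.Pages[1].IsBlank(0.01d);
isBlank = pdfDocument.Pages[2].IsBlank(0.01d);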
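
As a rough sketch of how the marker page could be located and replaced, the loop below scans for the first blank page, inserts the pages of a second document at that position, and then removes the marker. The MarkerMerge class, the FindBlankPage helper and the paths c:\other.pdf and c:\merged.pdf are hypothetical names used only for illustration. The sketch also assumes that PageCollection.Insert accepts a page number together with a page collection; please verify this against your Aspose.PDF version before relying on it.

```csharp
using Aspose.Pdf;

public static class MarkerMerge
{
    // Returns the 1-based number of the first blank page, or -1 if none is found.
    public static int FindBlankPage(Document document, double threshold)
    {
        for (int pageNumber = 1; pageNumber <= document.Pages.Count; pageNumber++)
        {
            if (document.Pages[pageNumber].IsBlank(threshold))
            {
                return pageNumber;
            }
        }
        return -1;
    }

    public static void Main()
    {
        Document mainDocument = new Document(@"c:\sample.pdf");
        // Hypothetical path of the PDF whose pages should replace the marker page.
        Document otherDocument = new Document(@"c:\other.pdf");

        int markerPage = FindBlankPage(mainDocument, 0.01d);
        if (markerPage > 0)
        {
            int insertedCount = otherDocument.Pages.Count;

            // Assumption: Insert(pageNumber, pages) places the pages at markerPage,
            // which shifts the blank marker to markerPage + insertedCount.
            mainDocument.Pages.Insert(markerPage, otherDocument.Pages);

            // Remove the blank marker page from its shifted position.
            mainDocument.Pages.Delete(markerPage + insertedCount);
        }

        // Hypothetical output path.
        mainDocument.Save(@"c:\merged.pdf");
    }
}
```

Inserting first and deleting afterwards keeps the target position inside the existing page range even when the marker is the last page of the document.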

Furthermore, we cannot find any sample code attached; a file will not attach if its size is more than 3 MB. You can upload bigger files to Google Drive, Dropbox etc. and share the download link with us.

Please let us know if we have not properly understood your requirements or have missed some point and we would be more than happy to assist you in this regard.

pdfquery1.png (5.1 KB)
pdfquery2.png (10.8 KB)

Attaching the snippet for reference.

@iticertis

Thank you for sharing the code snippet.

Please follow the suggested approach and feel free to contact us if you need any further assistance.

The blank-page approach will not work for us, as our DOCX is very dynamic and we cannot obtain the page number in advance using Open XML. Is there any way to add some tagging to the target PDF file while converting from DOCX (similar to the content control block in Open XML/Word), so that we can find the PDF page containing that tag and do the concatenation there? Kindly note that we’re using Aspose.Words to convert the DOCX into PDF.

Please also suggest whether there is any custom approach we can take to complete this requirement.

@iticertis

We would like to share with you that you do not need any page number or page count for inserting a blank page. It does not matter how dynamic the DOCX file is. As soon as you convert the DOCX file to a PDF document, use the lines of code below and a blank page will be added at the end of that PDF document, irrespective of the number of pages.

        // Load source PDF document
        Aspose.Pdf.Document document = new Aspose.Pdf.Document("SavedByWordsAPI.pdf");

        // Add a page at the end of the PDF document; no page count/page number is required
        document.Pages.Add();

        // Save the document containing the blank page at the end of it
        document.Save("ReadyToConcatenate.pdf");

Later, you can detect the blank page and save its index/page number, and then concatenate the other PDF document with this document.

Can you please add some details about this, including any document or diagram for our reference.

Furthermore, we are gathering information on how to insert a control block or tag a page before generating a PDF document using the Aspose.Words API. In the meantime, please try the suggested approach as a quick and simple solution for your requirements.
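
One possible custom approach, offered only as an untested sketch and not confirmed anywhere in this thread, is to place a unique placeholder text in the DOCX at the point where the content control sits, convert to PDF as usual, and then search the PDF for that text to obtain the page number. The marker string "##INSERT_PDF_HERE##", the TextMarkerLookup class and the FindMarkerPage helper are hypothetical. The sketch assumes the placeholder survives the DOCX-to-PDF conversion as a single searchable text run, which may not hold for every font or layout.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class TextMarkerLookup
{
    // Returns the 1-based number of the first page containing markerText, or -1 if absent.
    public static int FindMarkerPage(Document document, string markerText)
    {
        // Search all pages for the exact marker phrase.
        TextFragmentAbsorber absorber = new TextFragmentAbsorber(markerText);
        document.Pages.Accept(absorber);

        foreach (TextFragment fragment in absorber.TextFragments)
        {
            // The first match is enough to locate the insertion point.
            return fragment.Page.Number;
        }
        return -1;
    }

    public static void Main()
    {
        Document document = new Document("SavedByWordsAPI.pdf");

        // Hypothetical marker text typed into the DOCX before conversion.
        int markerPage = FindMarkerPage(document, "##INSERT_PDF_HERE##");

        Console.WriteLine(markerPage > 0
            ? "Marker found on page " + markerPage
            : "Marker not found");
    }
}
```

The page number found this way could then play the role of the static “insertAfterPage” value mentioned in the original query.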

@iticertis

In addition to our previous post, you can simply concatenate a document at the end of another document without finding the last page. Please try the code snippet below for your reference.

// dataDir is assumed to hold the path of the folder containing the input files
// Open first document
Document pdfDocument1 = new Document(dataDir + "Concat1.pdf");

// Open second document
Document pdfDocument2 = new Document(dataDir + "Concat2.pdf");

// Add pages of second document to the first
pdfDocument1.Pages.Add(pdfDocument2.Pages);
dataDir = dataDir + "ConcatenatePdfFiles_out.pdf";

// Save concatenated output file
pdfDocument1.Save(dataDir);

It will concatenate the second PDF document with the first PDF document without needing to specify any page number. We hope this will be helpful.

@iticertis,

Please ZIP and attach your input Word document and expected output PDF here for our reference. Please also share some details about the “tagging” that you want in the output PDF. We will then provide you with more information about your query.