Whenever I find text that needs to be replaced in my main document, how can I insert a PDF document at that location?

alexey.noskov · July 28, 2023, 5:27pm

@panCognity The provided code does exactly what you need. The find/replace mechanism searches for keywords that match the provided regular expression:

new Regex(@"\[(\d+_\w+)\]")

i.e. placeholders like [11111111_sometext], and the IReplacingCallback implementation inserts the document named 11111111_sometext.pdf at the matched placeholder:

// Put the document before the current DocumentBuilder section.
string path = $@"C:\Temp\{e.Match.Groups[1].Value}.pdf";
if (!File.Exists(path))
    return ReplaceAction.Skip;

Document replacementDocument = new Document(path);
foreach (Section s in replacementDocument.Sections)
{
    Section dstSection = (Section)doc.ImportNode(s, true, ImportFormatMode.UseDestinationStyles);
    builder.CurrentSection.ParentNode.InsertBefore(dstSection, builder.CurrentSection);
}

If there is no document with the specified name, the placeholder is skipped.
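
As a minimal sketch, the placeholder pattern can be checked on its own, without Aspose.Words, to see which name ends up in `e.Match.Groups[1].Value`. The class name `PlaceholderDemo` and the helper `TryExtractName` are hypothetical names used only for this illustration; the pattern and the `C:\Temp` folder are taken from the snippet above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PlaceholderDemo
{
    // Same pattern as above: digits, an underscore, then word characters,
    // all enclosed in square brackets. Group 1 captures the text inside the brackets.
    private static readonly Regex Placeholder = new Regex(@"\[(\d+_\w+)\]");

    // Hypothetical helper: returns true and the captured name when the text
    // contains a placeholder, otherwise false and an empty string.
    public static bool TryExtractName(string text, out string name)
    {
        Match m = Placeholder.Match(text);
        name = m.Success ? m.Groups[1].Value : string.Empty;
        return m.Success;
    }

    public static void Main()
    {
        if (TryExtractName("Insert here: [11111111_sometext]", out string name))
        {
            // The callback above builds the PDF path from this captured name.
            Console.WriteLine($@"C:\Temp\{name}.pdf");
        }

        // No leading digits and no underscore, so this is not a placeholder.
        Console.WriteLine(TryExtractName("[no digits]", out _));
    }
}
```

Text in brackets that does not start with digits followed by an underscore is never passed to the callback, so it stays in the document unchanged.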

panCognity · July 28, 2023, 8:27pm

You are absolutely right! Earlier, you mentioned these points, though the results weren’t quite right. The code works fine now that I have rewritten it! Thank you so much! I really appreciate your help!

panCognity · July 28, 2023, 8:37pm

I cannot get the right result if I use the actual files rather than the samples I sent you. Please find attached the actual files and the result file, named Sample without Bookmarks_Out.docx:
4097362984_Kinhsh.pdf (862.4 KB)
4097362984_Simvasi.pdf (58.1 KB)
4097362984_Kinhsh.pdf (862.4 KB)
4097376683_Simvasi.pdf (57.9 KB)
Sample without Bookmarks.pdf (242.8 KB)
Sample without Bookmarks_Out.docx (166.7 KB)

alexey.noskov · July 29, 2023, 4:15am

@panCognity Could you please describe the problem in more detail? Unfortunately, it is not quite clear what your expected result is. If possible, please provide screenshots of the problems and specify on which page each problem is observed.

Also, please note that Aspose.Words is designed to work with MS Word documents. MS Word documents are flow documents, and their structure is very similar to the Aspose.Words Document Object Model. PDF documents, on the other hand, are fixed-page format documents. While loading a PDF document, Aspose.Words converts the fixed-page document structure into the flow Document Object Model. Unfortunately, such a conversion does not guarantee 100% fidelity.

panCognity · July 29, 2023, 7:17am

A shrunken version of the input document is displayed. Additionally, their image appears on all pages. The inserted PDF document does not show the letters; the text is unreadable. Attached are some pictures showing the problem.

However, converting all PDF files (including the main document) to DOCX first seems to work. The result is fine! I attach it as Sample without Bookmarks_Out.docx.
Screenshot 2023-07-29 101203.png (14.6 KB)
Screenshot 2023-07-29 101130.png (36.9 KB)
Screenshot 2023-07-29 101059.png (7.8 KB)
Sample without Bookmarks_Out.docx (7.0 MB)

alexey.noskov · July 29, 2023, 2:57pm

@panCognity The problem occurs when the PDF document is read into the Aspose.Words DOM.
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): WORDSNET-25731,WORDSNET-25732

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.

panCognity · July 31, 2023, 3:29pm

I look forward to hearing back from you. In the meantime, I have forwarded my email once again to my client, so that he will purchase the licenses necessary to continue developing their .NET application.

alexey.noskov · July 31, 2023, 5:14pm

@panCognity Sure, we will keep you updated and let you know once the issues are resolved or we have more information for you.

aspose.notifier · September 5, 2023, 11:17am

The issues you have found earlier (filed as WORDSNET-25732) have been fixed in this Aspose.Words for .NET 23.9 update also available on NuGet.

panCognity · October 16, 2023, 2:25pm

Thank you! Sorry for the late reply. I had an accident. Anyway, I will try it out and get back to you if I have any questions.

panCognity · October 16, 2023, 2:57pm

The problem still remains. Do you want me to open a different ticket for the issue?

alexey.noskov · October 16, 2023, 3:03pm

@panCognity Two issues were reported in this topic:

WORDSNET-25732 - Font size is changed after converting PDF to DOCX
WORDSNET-25731 - Content is damaged after converting PDF to DOCX.

WORDSNET-25732 has been resolved. WORDSNET-25731 is not resolved yet.

panCognity · October 16, 2023, 3:28pm

OK. I will wait for it to be resolved. Thank you for your answer.

panCognity · January 12, 2024, 4:32pm

Documents.zip (2.3 MB)

Hi.

I convert all PDF files to DOCX files. Then I replace all placeholders in the main file (converted from the ΤΣΙΓ_ΝΔ_3_0025.pdf file) with the additional documents, which produces a bigger DOCX file. When finished, I save both a DOCX file and a PDF file. Finally, I compress the PDF file to reduce its size. I saw that the “ΤΣΙΓ_ΝΔ_3_0025.docx” file has some issues in tables (lines, fields). Is there a way I can fix those problems?

alexey.noskov · January 12, 2024, 6:04pm

@panCognity Could you please specify where the problem is? Your documents are quite big, and it is not clear which problem you mean. It would be great if you could simplify the example to demonstrate the problem and provide simple code that allows us to reproduce it.

panCognity · January 15, 2024, 11:05am

ΤΣΙΓ_ΝΔ_3_0025_Output.zip (4.9 MB)

I am sending you the code for the conversion to DOCX, the code for finding and replacing text in the document, the code for compressing the output file, and the output file itself. In the output file, I have drawn a red rectangle where the problem occurs. It is exactly the same in the converted DOCX file, before I save it to PDF. There are other files that are much larger than this one, but the problem is the same. I had some issues with Greek fonts and images, for which there is an open ticket (see your reply: WORDSNET-25731 is not resolved yet). Therefore, I have chosen to convert all PDF files to DOCX first, replace the placeholders in the DOCX file, and then save it as both DOCX and PDF, as you can see from the code I have attached.

The main template is the ΤΣΙΓ_ΝΔ_3_0025.pdf file. All other PDF files will be embedded in it, replacing the placeholders.

alexey.noskov · January 15, 2024, 12:10pm

@panCognity Thank you for the additional information. Unfortunately, I cannot reproduce the problem on my side. Could you please create a simple console application that includes all required documents and resources and allows us to reproduce the problem on our side?

Also, please note that Aspose.Words is designed to work with MS Word documents. MS Word documents are flow documents, and their structure is very similar to the Aspose.Words Document Object Model. PDF documents, on the other hand, are fixed-page format documents. When a PDF document is converted to an MS Word document, the fixed-page document structure is converted into the flow Document Object Model. Unfortunately, such a conversion does not guarantee 100% fidelity, so it is not always possible to retain the PDF document layout in the MS Word document or after a PDF-DOCX-PDF roundtrip.

panCognity · January 15, 2024, 12:28pm

Agwges.zip (17.3 KB)

Project_Horizon.zip (2.1 MB)

Attached are the console app (Agwges.zip) and the sample files my application uses (Project_Horizon.zip).

After you inspect this problem, I would also like advice on how I could improve compression for larger files.

alexey.noskov · January 15, 2024, 6:26pm

@panCognity Thank you for the additional information. The problem is not in Aspose.Words. It is caused by Aspose.PDF during the conversion from PDF to DOCX. The problem can be reproduced with the following simple code:

Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(@"C:\Temp\ΤΣΙΓ_ΝΔ_3_0025.pdf");
Aspose.Pdf.DocSaveOptions saveOptions = new Aspose.Pdf.DocSaveOptions
{
    Format = Aspose.Pdf.DocSaveOptions.DocFormat.DocX,
    Mode = Aspose.Pdf.DocSaveOptions.RecognitionMode.Flow,
    RelativeHorizontalProximity = 5.5f,
    // Enable the value to recognize bullets during conversion process
    RecognizeBullets = true
};
pdfDocument.Save(@"C:\Temp\out.docx", saveOptions); // The problem is already there in the DOCX produced by Aspose.PDF

Aspose.Words.Document doc = new Aspose.Words.Document(@"C:\Temp\out.docx");
doc.Save(@"C:\Temp\out.pdf");

So you should report the problem to the Aspose.PDF team in the appropriate forum.

I am afraid there is no way to speed up the document processing, since you are converting from PDF to DOCX. As I mentioned, the models of these formats are quite different, so the conversion process is complex and time-consuming.

panCognity · January 16, 2024, 8:31am

I have understood this. That’s why I asked if there is any workaround/solution to this. Could you suggest the appropriate forum? I want to make sure that I will report it in the right place. Thank you in advance!