Word-to-searchable-pdf-headers-are-missing

Gpatil · April 5, 2024, 10:54am

Hi Team,

As suggested in the support request below:
https://forum.aspose.com/t/word-to-searchable-pdf-headers-are-missing/280590/4

Will it be possible to fix this issue in Aspose.Words? I have attached the sample documents in the above request.

alexey.noskov · April 5, 2024, 11:57am

@Gpatil Could you please describe your problem in more detail? The MS Word document attached in the mentioned thread has only two pages, while the attached PDF has 12 pages, so it is not clear how to produce the problematic PDF from the attached MS Word document.

Gpatil · April 5, 2024, 12:56pm

Hi @alexey.noskov

You can check the 12-page searchable PDF attached to the earlier ticket.
The headers are not consistent: the header appears on the first page and then on the last 2 pages.
The header contents (red bold text, Name, DOB, MEMBER ID) are missing on some of the middle pages.

This happens after we convert the docx to a searchable PDF as described in the earlier ticket.

The original doc file is attached.

A210118625-MRN-NOPHI.docx (46.0 KB)

alexey.noskov · April 5, 2024, 1:24pm

@Gpatil Thank you for the additional information. Unfortunately, I cannot reproduce the problem using the latest 24.4 version of Aspose.Words. Here is the PDF produced on my side using the following simple code:

Document doc = new Document(@"C:\Temp\in.docx");

PdfSaveOptions pdfSaveOptions = new PdfSaveOptions()
{
    ZoomFactor = 100,
    ZoomBehavior = PdfZoomBehavior.FitPage,
    SaveFormat = Aspose.Words.SaveFormat.Pdf,
    ColorMode = ColorMode.Normal,
    FontEmbeddingMode = PdfFontEmbeddingMode.EmbedAll,
    EmbedFullFonts = true,
    Compliance = PdfCompliance.PdfUa1,
    ImageCompression = PdfImageCompression.Auto,
    TextCompression = PdfTextCompression.None,
    MemoryOptimization = true,
    JpegQuality = 50,
    PageMode = PdfPageMode.FullScreen,
    UseHighQualityRendering = true,
    OptimizeOutput = true,
    HeaderFooterBookmarksExportMode = HeaderFooterBookmarksExportMode.All,
    AdditionalTextPositioning = true,
    DisplayDocTitle = true,
    ExportDocumentStructure = true,
    UseCoreFonts = true
};

doc.Save(@"C:\Temp\out.pdf", pdfSaveOptions);

out.pdf (7.6 MB)

Gpatil · September 18, 2024, 7:32pm

Hi @alexey.noskov

Can you please assist me? I am not sure if this is an Aspose.Words issue, an Aspose.OCR issue, or an issue in my attached code snippet.

For some reason, I have to create a searchable PDF by dividing the document pages into individual pages, performing OCR on each page, then combining the results.

The code snippet below creates a searchable PDF in which some in-between pages lack headers.

I need the header on all pages, as it is there in the original doc file.

I am using Aspose.OCR 24.8.4

A210118625-MRN.docx (45.6 KB)
NO_PHIA210118625-MRN_rendered.pdf (5.9 MB)
code_snippet1.zip (1.1 KB)

Regards,
Gajanan

alexey.noskov · September 19, 2024, 4:48am

@Gpatil First of all, it is not quite correct to use the Document.ExtractPages method when saving to a fixed-page format like PDF. It is better to use the PdfSaveOptions.PageSet property:

Document doc = new Document(@"C:\Temp\in.docx");
PdfSaveOptions saveOptions = new PdfSaveOptions();
for (int i = 0; i < doc.PageCount; i++)
{
    saveOptions.PageSet = new PageSet(i);
    doc.Save($@"C:\Temp\page_{i}.pdf", saveOptions);
}

Also, it is not quite clear why you use Aspose.OCR here, since the PDF produced by Aspose.Words is already searchable. So you can simply use the following code:

Document doc = new Document(@"C:\Temp\in.docx");
doc.Save(@"C:\Temp\out.pdf");

Gpatil · September 19, 2024, 12:29pm

Thanks @alexey.noskov that helped.

In our case we get images in doc files, and some of them are rotated or skewed.

We leverage OCR to get them straightened. Right now the accuracy is not 100%, but eventually it will get better.

We process lots of TIFF/JPG/BMP/XPS files along with docs. Since we have a common workflow for all document types, we run OCR on all of them as a standard process.

Regards,
Gajanan

alexey.noskov · September 19, 2024, 12:33pm

@Gpatil Thank you for the additional information. I think the logic should be adjusted to avoid the OCR operation on files that do not require it, like the document attached in the initial post.

Gpatil · September 19, 2024, 1:40pm

Hi @alexey.noskov

If a 50-page document contains no images, OCR processing will be skipped; but if even a single page contains an image, OCR processing will be applied.

Is it possible to detect images in doc files? If yes, can you please share a code snippet?

alexey.noskov · September 19, 2024, 4:47pm

@Gpatil Images in MS Word documents are represented as shapes. Please see our documentation to learn more:
https://docs.aspose.com/words/net/working-with-images/
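
Since images are represented as shapes, one way to decide whether a document needs OCR is to walk its Shape nodes and check Shape.HasImage. The following is only a minimal sketch, not a tested solution for your workflow: the ContainsImages helper name and the file paths are placeholders chosen for illustration, and the check also counts images that sit in headers and footers, which may or may not be what you want.

```csharp
using System;
using System.Linq;
using Aspose.Words;
using Aspose.Words.Drawing;

class ImageDetection
{
    // Hypothetical helper: returns true if at least one shape in the
    // document (body, headers, or footers) carries image data.
    static bool ContainsImages(Document doc)
    {
        return doc.GetChildNodes(NodeType.Shape, true)
                  .Cast<Shape>()
                  .Any(shape => shape.HasImage);
    }

    static void Main()
    {
        // Placeholder path, as in the snippets earlier in this thread.
        Document doc = new Document(@"C:\Temp\in.docx");

        if (ContainsImages(doc))
        {
            // The document has images: hand it to the existing OCR workflow.
            Console.WriteLine("Images found, route the document to OCR.");
        }
        else
        {
            // No images: the PDF produced by Aspose.Words is already searchable.
            doc.Save(@"C:\Temp\out.pdf");
        }
    }
}
```

With this check, a document without images is saved directly to PDF, and only documents where at least one shape has an image go through the OCR path.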