Converting a large PDF file to a single Excel sheet: data comes out without any spaces

I am trying to convert a large PDF file containing multiple tables to Excel. After conversion, all the data comes out without any spaces: the spaces are removed during conversion, and some text is missing from sentences. The formatting is also very poor.

I am using a Function App on .NET 6.0 with C#. A snapshot of the resulting Excel output is attached.
Screenshot_2024-03-16-22-46-31-659_com.whatsapp-edit.jpg (130.5 KB)

@Rashidkhan1107

Would you please share the sample PDF document for our reference so that we can test the scenario in our environment and address it accordingly.

@asad.ali (Edit:- mentioned you in the post)

I had raised the issue from my personal email ID. Here I am attaching the input PDF file and a snapshot of the output Excel file, in which spaces are missing and the content is corrupted, i.e. some parts of the text are missing. I cannot upload the output file itself because this forum does not allow .xlsx attachments, so I am attaching a snapshot instead.

Aspose issue.PNG (67.5 KB)

P123879.pdf (732.5 KB)

I am using the code below to convert the PDF file to Excel (XLSX):

// workbook, worksheet and rowIndex are not used by the conversion below.
Workbook workbook = new Workbook();
Worksheet worksheet = workbook.Worksheets[0];
int rowIndex = 0;

// Load the source PDF and configure the Excel export.
Aspose.Pdf.Document pdf = new Aspose.Pdf.Document(pdfFilePath);
Aspose.Pdf.ExcelSaveOptions exl = new ExcelSaveOptions();

//exl.ConversionEngine = ExcelSaveOptions.ConversionEngines.NewEngine;
exl.Format = ExcelSaveOptions.ExcelFormat.XLSX;
exl.InsertBlankColumnAtFirst = true;
exl.MinimizeTheNumberOfWorksheets = true;
exl.ExtractOcrSublayerOnly = true;

// Convert the PDF and write the XLSX file.
pdf.Save(outputPath, exl);

@rashidkhan1

We are checking it and will get back to you shortly.

@asad.ali

Ok, waiting for the update.
Thank You.

@Rashidkhan1107

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-56844

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

@asad.ali

Kindly let me know how much more time it will take to resolve the issue, as I raised it one week ago.

@rashidkhan1

Issues in the free support model are prioritized on a first come, first served basis, and the resolution time of a ticket depends on the number of issues logged before it. We are afraid we cannot share an ETA at the moment, as the ticket has not yet been investigated. As soon as it is analyzed, we will be able to share updates regarding its resolution ETA. Please be patient and spare us some time.

We are sorry for the inconvenience.

I understand that, based on the aspose.com Free Support Policies, we cannot get an exact estimate of when a fix can or will be delivered. How can we gauge where our ticket sits in the “first come, first served” queue, and whether there is a possibility that a fix might be included in this month’s release? Has the described issue also been reported by others?

@asad.ali
Kindly respond to the query Bart has raised.

@barttirino2024

All issues, including yours, are logged in our internal issue tracking system, to which we are afraid you cannot have access. The resolution time in the free support model usually depends on the number of issues logged before it, as well as on the complexity and nature of the issue itself. When we say first come, first served, it does not necessarily mean that the queue contains the same kind of issue reported by others.

An issue can be specific to one document, or it may relate to a particular type of document. Therefore, issues with the same description or in the same forum thread are not necessarily the same issue. Whether an issue is document-specific is determined only after investigation.

Nevertheless, your concerns have been recorded in our issue tracking system under the logged ticket and we will surely consider them during analysis. As soon as the ticket is investigated, we will share updates with you about its ETA. Please spare us some time.

We apologize for the inconvenience.

Thanks for the reply, Asad. Just so I am clear: the fact that our issue has been logged in your internal issue tracking system is not a commitment that you will be able to deliver a fix in the coming weeks or months, correct?

Our struggle is that, until this issue is resolved, it is difficult to move on to our next internal development sprint for a project we intend to deliver in June 2024.

Yes, your understanding is correct. Sadly, we cannot share a reliable ETA at the moment. Once the issue has been investigated, we will be in a position to share how soon it can be resolved.

We do understand your concerns and have updated the ticket to the next level of priority. We will surely let you know if some progress is made. Please also note that the paid support option is recommended in cases where an issue or enhancement happens to be a showstopper. Unlike free support, in paid support your issue gets the highest priority and is resolved on an urgent basis.

Asad, I am just following up to see whether there has been any traction on the ticket your team is investigating for us.

@barttirino2024

The issue lies in the extremely small size of the text, which barely reaches 2 pixels in size, while the spaces are less than 1 pixel wide. Our table recognition engine cannot work with such small values. As a workaround, we recommend enlarging the page size before converting. Please refer to the code snippet below.

This snippet could resolve the spacing issues and some others. If you need more help with formatting, please describe what you need in detail.

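// testdata (base folder) and version (output file name) are assumed to be defined elsewhere.
Document doc = new Document(testdata + "PDFNET_56844/P123879.pdf");

// Enlargement factors for page width and height.
int ScaleX = 9;
int ScaleY = 5;

PdfFileEditor fileEditor = new PdfFileEditor();
// Target page size: the dimensions of page 2 multiplied by the scale factors.
PdfFileEditor.ContentsResizeParameters parameters = PdfFileEditor.ContentsResizeParameters.PageResize(
         doc.Pages[2].PageInfo.Width * ScaleX,
         doc.Pages[2].PageInfo.Height * ScaleY);

// Collect page numbers 2..Count (page numbers are 1-based; page 1 is left unchanged).
int[] pages = new int[doc.Pages.Count - 1];
for (int i = 2; i <= doc.Pages.Count; i++)
{
    pages[i - 2] = i;
    Console.WriteLine(i);
}

// Resize the selected pages, then convert to XLSX.
fileEditor.ResizeContents(doc, pages, parameters);
ExcelSaveOptions options = new ExcelSaveOptions();
doc.Save(testdata + "PDFNET_56844/" + version + ".xlsx", options);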

You can also experiment with the ScaleX and ScaleY coefficients to find the most appropriate values for your documents.

The KPMG dev team reviewed your notes, but has some questions that need clarification from the Aspose team.

  • Can ScaleX and ScaleY be set dynamically in code, since the PDFs we receive will be of different sizes?
  • What is the maximum allowed value of ScaleX and ScaleY, so that we can at least determine the maximum allowed PDF size?
  • What is the maximum number of pages allowed in a single PDF?

We also observed that some unnecessary text appears when converting the PDF to Excel.

  • How can we prevent unnecessary text from appearing in Excel?

@barttirino2024

Please allow us to investigate your questions above. We will let you know as soon as the analysis is done.

Can you please share a screenshot of the unnecessary text in Excel? Is it observed with the same file that you already shared here?

@barttirino2024

  • Can ScaleX and ScaleY be set dynamically in code, since the PDFs we receive will be of different sizes?

For our table recognition engine (and for the people who will read the xlsx files), it would be more comfortable if the normal font size started from 8-10pt (if it’s not superscript or subscript). So, you can try to determine the scaleY factor by analyzing the font sizes in your document.

For example,

Document doc = new Document("P123879.pdf");
float minDesiredFontSize = 8;

foreach (Page page in doc.Pages)
{
    float minTextHeight = GetMinFontSize(page);

    // Resize only pages whose smallest regular text is below the desired minimum.
    // GetMinFontSize returns -1 when no regular text is found, so such pages are skipped.
    if (minTextHeight > 0 && minTextHeight < minDesiredFontSize)
    {
        double originalWidth = page.GetPageRect(false).Width;
        double originalHeight = page.GetPageRect(false).Height;
        double ScaleY = (minDesiredFontSize * 1.25) / minTextHeight;

        // The width gets an extra factor of 1.5.
        ResizePage(doc, page.Number, originalWidth * ScaleY * 1.5, originalHeight * ScaleY);
    }
}

ExcelSaveOptions options = new ExcelSaveOptions();
options.MinimizeTheNumberOfWorksheets = true;
doc.Save("output.xlsx", options);

// Returns the smallest font size of regular (non-superscript, non-subscript) text on the page.
private static float GetMinFontSize(Page page)
{
    float minTextHeight = -1;

    TextFragmentAbsorber absorber = new TextFragmentAbsorber();
    absorber.Visit(page);

    foreach (TextFragment fragment in absorber.TextFragments)
    {
        // Superscript and subscript text is expected to be small, so ignore it.
        if (fragment.TextState.Superscript || fragment.TextState.Subscript)
        {
            continue;
        }

        minTextHeight = minTextHeight < 0
            ? fragment.TextState.FontSize
            : Math.Min(minTextHeight, fragment.TextState.FontSize);
    }

    return minTextHeight;
}

// Resizes one page and its contents to the given target size, with zero margins.
private static void ResizePage(Document doc, int pageNumber, double targetWidth, double targetHeight)
{
    Console.WriteLine("Resize. Target Width: " + targetWidth + " Target Height: " + targetHeight);

    // targetWidth and targetHeight are absolute sizes, so PageResize is used here;
    // PageResizePct expects percentages of the original page size.
    PdfFileEditor.ContentsResizeParameters par = PdfFileEditor.ContentsResizeParameters.PageResize(targetWidth, targetHeight);
    par.TopMargin = PdfFileEditor.ContentsResizeValue.Units(0);
    par.BottomMargin = PdfFileEditor.ContentsResizeValue.Units(0);
    par.LeftMargin = PdfFileEditor.ContentsResizeValue.Units(0);
    par.RightMargin = PdfFileEditor.ContentsResizeValue.Units(0);
    new PdfFileEditor().ResizeContents(doc, new int[] { pageNumber }, par);
}

This code was written as a sample and was not tested on other documents. For X, we scaled by 1.5 times (originalWidth * ScaleY * 1.5) because the input font is condensed, causing the text to go out of the borders when default non-condensed Windows fonts are used.

  • What is the maximum allowed value of ScaleX and ScaleY, so that we can at least determine the maximum allowed PDF size?

In PDF version 1.4 (your sample) and 1.5 (Acrobat 6.0), the maximum PDF page size is 200 x 200 inches. In PDF version 1.6 (Acrobat 7.0) and newer, the theoretical maximum page size is larger, but in practice most programs do not properly support sizes above 200" x 200". When these values are exceeded, our software may also not behave as expected.
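
As a rough illustration of how this limit could bound a dynamically computed scale factor, here is a minimal sketch. It assumes page dimensions are given in points (1/72 inch, the default PDF user unit), so 200 inches corresponds to 14,400 points. The class PageScaleLimits, the method ClampScale and the xStretch parameter are hypothetical names used only for this example and are not part of Aspose.PDF.

```csharp
using System;

// Hypothetical helper for illustration only; not part of Aspose.PDF.
public static class PageScaleLimits
{
    // 200 inches at 72 points per inch (assumes the default PDF user unit).
    public const double MaxPageSidePoints = 200.0 * 72.0;

    // Returns desiredScale, reduced if necessary so that
    // widthPoints * scale * xStretch and heightPoints * scale
    // both stay within MaxPageSidePoints.
    public static double ClampScale(double desiredScale, double widthPoints,
                                    double heightPoints, double xStretch = 1.0)
    {
        if (desiredScale <= 0 || widthPoints <= 0 || heightPoints <= 0 || xStretch <= 0)
        {
            throw new ArgumentException("Scale, page dimensions and stretch must all be positive.");
        }

        double maxByWidth = MaxPageSidePoints / (widthPoints * xStretch);
        double maxByHeight = MaxPageSidePoints / heightPoints;
        return Math.Min(desiredScale, Math.Min(maxByWidth, maxByHeight));
    }
}
```

For a 612 x 792 point page with xStretch = 1.5, a desired scale of 5 is returned unchanged, while a desired scale of 20 is reduced to 14400 / 918, roughly 15.7. The clamped value could then be used in place of the raw scale factor before resizing a page.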

  • What is the maximum number of pages allowed in a single PDF?

There is no limit to the page count in a PDF, but there are limits on the number of objects used on the pages. In general, a PDF can have thousands of pages, but significant performance issues may arise as the page count increases, especially when opening or converting the document.

Thanks, I have shared this with the KPMG dev team for review.