I am trying to convert a large PDF file containing multiple tables to Excel. However, after conversion all the data comes out without any spaces: the spaces are removed during conversion, and some text is missing from the sentences. The formatting is also very bad.
I had raised the issue from my personal email ID. Here I am attaching the input PDF file. In the output Excel file the spaces are missing and the content is corrupted, i.e. part of the text is missing. I am not able to upload the output file because this forum does not allow XLSX attachments, so I am attaching a snapshot instead.
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.
Issue ID(s): PDFNET-56844
You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.
Issues in the free support model are prioritized on a first come, first served basis, and the resolution time of a ticket depends on the number of issues logged before it. We are afraid that we cannot share any ETA at the moment, as the ticket has not yet been investigated. As soon as it is analyzed, we will be able to share some updates with you regarding its resolution ETA. Please be patient and spare us some time.
I understand that, based on the aspose.com Free Support Policies, we cannot get an exact estimate of when a fix can or will be delivered. How can we gauge where our ticket sits in the “first come, first served” queue, and is there a possibility that a fix might be included in this month’s release? Is the described issue something that has also been reported by others?
All issues, including yours, are logged in our internal issue tracking system, to which we are afraid you cannot have access. The resolution time in the free support model usually depends on the number of issues logged before it, as well as on the complexity and nature of the issue itself. When we say first come, first served, it does not necessarily mean that the queue contains the same kind of issues reported by others.
An issue can be specific to one document, or it may be related to a particular type of document. Therefore, issues with the same description or in the same forum thread are not necessarily the same issue. Whether an issue is document specific or not is determined after the investigation.
Nevertheless, your concerns have been recorded in our issue tracking system under the logged ticket and we will surely consider them during analysis. As soon as the ticket is investigated, we will share updates with you about its ETA. Please spare us some time.
Thanks for the reply, Asad. Just so I am clear: the fact that our issue has been logged in your internal issue tracking system is not a commitment that you will be able to deliver a fix in the coming weeks or months, correct?
Our struggle is that until this issue is resolved, it is difficult to move on to our next internal development sprint for a project we intend to deliver in June 2024.
Yes, your understanding is correct. Sadly, we cannot share any reliable ETA at the moment. Once the issue investigation has been carried out, we will be in a position to share how soon it can be resolved.
We do understand your concerns and have raised the ticket to the next level of priority. We will surely let you know as soon as some progress is made. Please also note that the paid support option is recommended in cases where an issue or enhancement happens to be a showstopper. Unlike in free support, in paid support your issue gets the highest priority and is resolved on an urgent basis.
The issue lies in the extremely small size of the text, which barely reaches 2 pixels, while the spaces are less than 1 pixel wide. Our table recognition engine is unable to operate with such small values. As a workaround, we recommend enlarging the page size before converting. Please refer to the code snippet below.
This snippet could resolve the issues related to spacing and some others. If you need more help with formatting, please share more details and describe what you need.
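using System;
using Aspose.Pdf;
using Aspose.Pdf.Facades;
// Note: testdata (base folder) and version (output file name) must be defined by the caller.
Document doc = new Document(testdata + "PDFNET_56844/P123879.pdf");
// Enlargement factors for the page width and height.
int ScaleX = 9;
int ScaleY = 5;
PdfFileEditor fileEditor = new PdfFileEditor();
// Target page size, computed from the dimensions of page 2.
PdfFileEditor.ContentsResizeParameters parameters = PdfFileEditor.ContentsResizeParameters.PageResize(
    doc.Pages[2].PageInfo.Width * ScaleX,
    doc.Pages[2].PageInfo.Height * ScaleY);
// Collect the 1-based page numbers 2..Count; page 1 is not resized.
int[] pages = new int[doc.Pages.Count - 1];
for (int i = 2; i <= doc.Pages.Count; i++)
{
    pages[i - 2] = i;
    Console.WriteLine(i);
}
// Enlarge the selected pages, then save the document as XLSX.
fileEditor.ResizeContents(doc, pages, parameters);
ExcelSaveOptions options = new ExcelSaveOptions();
doc.Save(testdata + "PDFNET_56844/" + version + ".xlsx", options);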
You could also experiment with the ScaleX and ScaleY coefficients to find the most appropriate values for your targets.
Can ScaleX and ScaleY be set dynamically in code, as the PDFs we will be receiving are of different sizes?
For our table recognition engine (and for the people who will read the xlsx files), it would be more comfortable if the normal font size started from 8-10pt (if it’s not superscript or subscript). So, you can try to determine the scaleY factor by analyzing the font sizes in your document.
This code was written as a sample and was not tested on other documents. For X, we scaled by 1.5 times (originalWidth * ScaleY * 1.5) because the input font is condensed, causing the text to go out of the borders when default non-condensed Windows fonts are used.
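As a rough illustration of deriving the factors from font sizes, the arithmetic could look like the sketch below. The class and method names (`ScaleEstimator`, `ComputeScale`) and the 9pt default target are assumptions made for this example, not part of any library; the 1.5 width multiplier is the value mentioned above for the condensed font in the sample and may not suit other documents. How the smallest body font size is obtained from the PDF is left out here.

```csharp
using System;

public static class ScaleEstimator
{
    // Hypothetical helper: returns factors that enlarge text of size
    // smallestFontPt to roughly targetFontPt (8-10pt is the suggested range).
    public static (double ScaleX, double ScaleY) ComputeScale(
        double smallestFontPt,
        double targetFontPt = 9.0,
        double widthMultiplier = 1.5)
    {
        if (smallestFontPt <= 0)
            throw new ArgumentOutOfRangeException(nameof(smallestFontPt));

        // Never shrink pages whose text is already at or above the target size.
        double scaleY = Math.Max(1.0, targetFontPt / smallestFontPt);

        // Extra width leaves room for non-condensed replacement fonts.
        double scaleX = scaleY * widthMultiplier;
        return (scaleX, scaleY);
    }
}
```

For example, `ScaleEstimator.ComputeScale(2.0)` gives a ScaleY of 4.5 and a ScaleX of 6.75. The results are doubles, so the `int` declarations in the sample would need to become `double` to use them.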
What are the maximum allowed values of ScaleX and ScaleY? Knowing them, we could at least determine the maximum allowed PDF page size.
In PDF version 1.4 (your sample) and 1.5 (Acrobat 6.0), the maximum PDF page size is 200x200 inches. In PDF version 1.6 (Acrobat 7.0) and newer, the theoretical maximum page size is larger, but in reality most programs do not properly support sizes above 200"x200". When these values are exceeded, our software may also not behave as expected.
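Assuming the default PDF user unit of 1/72 inch, 200 inches corresponds to 14400 points, so an upper bound for each factor can be computed from the original page size. The sketch below only shows that arithmetic; `PageSizeLimit` and `Clamp` are hypothetical names chosen for this example.

```csharp
using System;

public static class PageSizeLimit
{
    // 200 inches expressed in points (default user unit of 1/72 inch).
    public const double MaxPagePoints = 200.0 * 72.0;

    // Hypothetical helper: reduces each factor so that the resized page
    // stays within 200x200 inches. widthPt and heightPt are the original
    // page dimensions in points.
    public static (double ScaleX, double ScaleY) Clamp(
        double widthPt, double heightPt, double scaleX, double scaleY)
    {
        if (widthPt <= 0 || heightPt <= 0)
            throw new ArgumentOutOfRangeException("Page dimensions must be positive.");

        double clampedX = Math.Min(scaleX, MaxPagePoints / widthPt);
        double clampedY = Math.Min(scaleY, MaxPagePoints / heightPt);
        return (clampedX, clampedY);
    }
}
```

For an A4 page of roughly 595 x 842 points, this caps ScaleX at about 24.2 and ScaleY at about 17.1, so the factors 9 and 5 used in the sample would pass through unchanged for a page of that size.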
What is the maximum number of pages allowed within a single PDF?
There is no limit to the page count in a PDF, but there are limits on the number of objects used on the pages. In general, a PDF can have thousands of pages, but significant performance issues may arise as the page count increases, especially when opening or converting the document.