Aspose.Words save to PDF - original tables created in Word are not recognized by Aspose.PDF.TableAbsorber

mwillisad6a8 · August 1, 2017, 7:26pm

I am trying to convert a Word document to a PDF using Aspose.Words. The Word document contains pre-existing tables that I need to access once it is in PDF format. When I create the PDF using the Aspose.Words.Document.Save function, the PDF file is created. But when I create a new Aspose.PDF.Document object from the resulting file and use TableAbsorber to retrieve the table list, it always comes back with a count of 0. If I save the Word file manually as a PDF file, TableAbsorber returns a table count greater than 0 and I can iterate through the tables. I noticed that the PDF file is much larger when I save it manually. Is there a setting in the SaveOptions for Aspose.Words that will create the PDF file so that the tables are recognizable to Aspose.PDF.TableAbsorber?

tilal.ahmad · August 2, 2017, 5:50am

@mwillisad6a8

Thanks for your inquiry. We would appreciate it if you could share your input Word document, the output PDF document, and the expected output PDF document, along with your sample code. You may ZIP these documents and attach them to the post. We will look into the issue and guide you accordingly.

mwillisad6a8 · August 2, 2017, 2:00pm

Word and PDF Files.zip (352.0 KB)

Attached are the files. Below is the sample code:

Import Word doc and save it as PDF using Aspose.Words:
var filename = @"C:\beanstalk\pdfGeneratorTestFiles\Original Word Doc with Tables";
var doc = new Document(filename + ".docx");
doc.Save(filename + " Saved with Aspose Words.pdf", SaveFormat.Pdf);

Import resulting PDF doc and attempt to access tables using Aspose.PDF:
Document pdfDocument = new Document(@"C:\beanstalk\pdfGeneratorTestFiles\Original Word Doc with Tables Saved with Aspose Words.pdf");
TableAbsorber absorber = new TableAbsorber();
absorber.Visit(pdfDocument.Pages[1]);
//table count is 0
var tableCount = absorber.TableList.Count;

If I import the attached file "Original Word Doc with Tables Saved Manually From Word.pdf", the table count is 8, which is somewhat confusing as well, since there should only be 3 tables.

To give you a little background, we have a requirement to concatenate many PDF files into one PDF file and provide a linked TOC. The TOC has to be in a specific format, and we would like to allow the TOC to be created as a "template" in a Word document with various placeholders or tokens for data. From there I would need to insert data into the table to create the TOC for the file (using document names and page numbers). I would then need to convert the Word document to a PDF and prepend this PDF as the TOC. However, I also need a way to put hyperlinks on the TOC that point to the appropriate page numbers. So I need a way to accurately access the text data in the PDF file so that I can put the correct page number on each hyperlink.

Also, if you can think of another way to meet this requirement I am open to suggestions. Thanks!
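
The page-number part of the requirement above can be worked out independently of table recognition. The following is only a minimal sketch, assuming the TOC is prepended as a known number of pages and that each source document's page count is already known. `SourceDoc`, `TocEntry`, and `TocPlanner` are hypothetical names used for illustration; they are not Aspose types.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical input: one source PDF and its page count.
public class SourceDoc
{
    public string Title { get; }
    public int PageCount { get; }
    public SourceDoc(string title, int pageCount)
    {
        Title = title;
        PageCount = pageCount;
    }
}

// Hypothetical output: the page a TOC row should link to.
public class TocEntry
{
    public string Title { get; }
    public int StartPage { get; }
    public TocEntry(string title, int startPage)
    {
        Title = title;
        StartPage = startPage;
    }
}

public static class TocPlanner
{
    // Returns the 1-based page in the merged file on which each document
    // starts, assuming the TOC occupies the first tocPageCount pages and
    // the documents follow in the given order.
    public static List<TocEntry> Plan(IReadOnlyList<SourceDoc> docs, int tocPageCount)
    {
        if (tocPageCount < 0)
            throw new ArgumentOutOfRangeException(nameof(tocPageCount));

        var entries = new List<TocEntry>();
        int nextPage = tocPageCount + 1;
        foreach (var doc in docs)
        {
            entries.Add(new TocEntry(doc.Title, nextPage));
            nextPage += doc.PageCount;
        }
        return entries;
    }
}
```

Each `StartPage` is the value that would go into the page-number token of the TOC template and would be the target of the corresponding hyperlink. If the TOC itself can grow to more than one page, its page count has to be determined first, because it shifts every entry.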

tilal.ahmad · August 3, 2017, 4:09am

@mwillisad6a8

Thanks for sharing the additional information. We have tested the scenario and observed the reported table recognition issue in the PDF produced by Aspose.Words. We have logged ticket WORDSNET-15742 in our issue tracking system for further investigation and rectification. We will notify you as soon as it is resolved.

Furthermore, we are looking into the rest of your query, namely the count of 8 tables reported for the manually saved PDF and the linked TOC requirement, and will update you soon.

We are sorry for the inconvenience.

mwillisad6a8 · August 7, 2017, 7:00pm

Thanks for the reply. In an attempt to find another way to satisfy our requirement of having a TOC "template" rather than constructing it in code (essentially hard-coding every aspect of the TOC page), I am trying to use an XML template. I have had minor success, but when I bind the XML file using Aspose.PDF, the "GetObjectById" call does not return all objects. I have attached my test XML template, and the sample code is below:

XMLTemplateTest.zip (830 Bytes)

Document pdfDocument = new Document();
pdfDocument.BindXml(@"C:\beanstalk\pdfGeneratorTestFiles\XMLTemplateTest.xml", null);

// These two are returned properly
var page = pdfDocument.GetObjectById("pageid");
var table = pdfDocument.GetObjectById("mainTocTable");
// All of these below come back null
var header = pdfDocument.GetObjectById("header");
var caseTitle = pdfDocument.GetObjectById("caseTitle");
var caseTitleFrag = pdfDocument.GetObjectById("caseTitleFrag");
var defaultRow = pdfDocument.GetObjectById("defaultRow");
var cellTitle = pdfDocument.GetObjectById("cellDocTitle");

Are only certain objects supported by this method? Also, the documentation is quite frustrating, as I keep landing on older links that point to Aspose.PDF.Generator, but Generator is obsolete now and there aren't any updated examples. I am currently using version 17.5.

Thanks

codewarior · August 8, 2017, 7:36am

@mwillisad6a8,

Thanks for sharing the details.

I have tested the scenario and managed to reproduce the issue described above. The problem has been logged as PDFNET-43171 and communicated to the development team. They will look further into the details of this problem, and as soon as we have definite updates regarding its resolution, we will let you know. We are sorry for the inconvenience.