Extract table from PDF and insert it in a Word document by replacing a placeholder

Hi.

I’m using Aspose.Total for .NET.

I have this requirement: to extract a table from a PDF file. But first I need to find a piece of text. This text indicates that the table that comes after it is the table I want extracted. I know how to find a piece of text in a PDF file. I don’t know how to extract the table that’s right after the text. Can you help me with that?

Once I have the table, I need to insert it in a Word document by searching for a placeholder and replacing the placeholder with the table itself. I’ve seen examples on the support forum of how to find a placeholder and use a replace callback to insert a table, but that example inserts a new table. I want to insert an existing table. Furthermore, the PDF table is an Aspose.Pdf.Text.AbsorbedTable and the Word table is an Aspose.Words.Tables.Table. How do I convert one to the other?

Would it be easier just to convert the PDF to Word and extract the table? And if I do that, how do I find the table by searching for a piece of text and getting the table that’s right after that piece of text?

Thanks!

@dustyenterprise

Please note that PDF and Word are quite different file formats, and Aspose.Pdf.Text.AbsorbedTable and Aspose.Words.Tables.Table belong to different products. However, you can extract the content of the table and insert it into a Word document. The following article explains how to extract the content from a PDF file.
Extract Table from Existing PDF Document

After reading the text from the PDF, you can create a new table in the MS Word document using Aspose.Words. Please read the following article.
Introduction and Creating Tables
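
As a rough illustration of the two steps above, the sketch below reads tables from one PDF page with TableAbsorber and rebuilds each one in a Word document with DocumentBuilder. The class and method names (PdfTableToWord, AppendTable, Convert) are hypothetical, only cell text is copied (no formatting), and how well TableAbsorber detects tables depends on the PDF, so please treat it as a starting point rather than a finished solution.

```csharp
using Aspose.Pdf.Text;
using Aspose.Words;

public static class PdfTableToWord
{
    // Hypothetical helper: copies the text of one absorbed PDF table
    // into a new Word table at the builder's current position.
    public static void AppendTable(AbsorbedTable pdfTable, DocumentBuilder builder)
    {
        if (pdfTable.RowList.Count == 0)
            return; // nothing to build

        builder.StartTable();
        foreach (AbsorbedRow row in pdfTable.RowList)
        {
            if (row.CellList.Count == 0)
                continue; // a Word row needs at least one cell

            foreach (AbsorbedCell cell in row.CellList)
            {
                builder.InsertCell();
                // A cell can hold several text fragments; join them with spaces.
                foreach (TextFragment fragment in cell.TextFragments)
                    builder.Write(fragment.Text + " ");
            }
            builder.EndRow();
        }
        builder.EndTable();
    }

    // Hypothetical entry point: converts every table found on one page.
    public static void Convert(string pdfPath, int pageNumber, string docxPath)
    {
        var pdf = new Aspose.Pdf.Document(pdfPath);
        var absorber = new TableAbsorber();
        absorber.Visit(pdf.Pages[pageNumber]); // PDF pages are 1-based

        var doc = new Aspose.Words.Document();
        var builder = new DocumentBuilder(doc);
        foreach (AbsorbedTable table in absorber.TableList)
            AppendTable(table, builder);

        doc.Save(docxPath);
    }
}
```

A call such as PdfTableToWord.Convert("test.pdf", 1, "tables.docx") would write every table detected on the first page; the file names are placeholders.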

Moreover, Aspose.Words for .NET supports importing a PDF into its DOM. You can read the PDF’s table from the Aspose.Words DOM and insert the Table node into another document at any location. You can use the NodeImporter.ImportNode method to import a node from one document into another.

Thank you for the information but your answer is not really addressing my biggest problem.
How do I select the table that comes after a piece of text? For example: A table has a title that is just a paragraph that sits before the table. I have an extraction rule that says: find this text, extract the table that is under this text and keep the formatting.
How can I achieve this?

I was able to convert the PDF to Word, so the problem with the two table types is no longer an issue. However, I need to know how to extract the table after I find a piece of text, and how to take that Aspose.Words.Tables.Table and insert it in another Word document by replacing a placeholder (just text, not a bookmark). This is not clear to me and I cannot find any info about it.

So basically, imagine I traverse all the elements of the Word document and look for a piece of text. Once I find it, I want to ask: give me your next element that is a table.

Thanks.

Here is some code I found on the forum:

        ReplaceAction IReplacingCallback.Replacing(ReplacingArgs e) {
            // This is a Run node that contains either the beginning or the complete match.
            var currentNode = e.MatchNode;

            // The first (and may be the only) run can contain text before the match,
            // in this case it is necessary to split the run.
            if (e.MatchOffset > 0)
                currentNode = SplitRun((Run)currentNode, e.MatchOffset);

            // This array is used to store all nodes of the match for further removing.
            var runs = new ArrayList();

            // Find all runs that contain parts of the match string.
            var remainingLength = e.Match.Value.Length;
            while (remainingLength > 0 && currentNode != null && currentNode.GetText().Length <= remainingLength) {
                runs.Add(currentNode);
                remainingLength -= currentNode.GetText().Length;

                // Select the next Run node.
                // Have to loop because there could be other nodes such as BookmarkStart etc.
                do {
                    currentNode = currentNode.NextSibling;
                }
                while (currentNode != null && currentNode.NodeType != NodeType.Run);
            }

            // Split the last run that contains the match if there is any text left.
            if (currentNode != null && remainingLength > 0) {
                SplitRun((Run)currentNode, remainingLength);
                runs.Add(currentNode);
            }

            // HERE: INSERT THE EXISTING TABLE THAT I EXTRACTED

            foreach (Run run in runs)
                run.Remove();

            return ReplaceAction.Skip;
        }

@dustyenterprise

In this case, you need to implement the IReplacingCallback interface, find the text before the table that you want to extract, and insert that table into another document. Please read the following article.
Find and Replace

After getting the text (the matched node), get the next table node below it using the Node.NextSibling property or the Node.NextPreOrder method. Once you have the Table node, you can use the NodeImporter.ImportNode method to import the table into another document.
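
A minimal sketch of the NextSibling variant, assuming the title paragraph and the table are siblings in the same parent (for example, both sit directly in the body). The matched node is a Run inside a paragraph, so the walk has to start from the paragraph, not from the Run. TableFinder and FindTableAfter are hypothetical names used only for illustration.

```csharp
using Aspose.Words;
using Aspose.Words.Tables;

public static class TableFinder
{
    // Hypothetical helper: returns the first table that follows a paragraph
    // containing `title`, or null if no such table exists.
    public static Table FindTableAfter(Document doc, string title)
    {
        foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true))
        {
            if (!para.GetText().Contains(title))
                continue;

            // Walk forward over the paragraph's siblings until a table appears.
            for (Node node = para.NextSibling; node != null; node = node.NextSibling)
            {
                if (node.NodeType == NodeType.Table)
                    return (Table)node;
            }
        }
        return null;
    }
}
```

If the table is nested deeper than the title paragraph, Node.NextPreOrder is the safer choice because it walks the whole tree rather than one sibling list.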

If you still face problems, please ZIP and attach your input documents (PDF and Word) along with your expected output document. Please also share the text that precedes the table. We will then provide a code example that matches your requirement.

Thank you. I think I know what I have to do now, but I have another problem. I can’t save the PDF to Word. I’ve even made a simple console application to test, and I get the same error as in my main project.

using System;
using Aspose.Pdf;
using System.IO;

namespace TestSaveToWord {
    class Program {
        static void Main(string[] args) {
            Aspose.Pdf.License pdfLicense = new Aspose.Pdf.License();
            pdfLicense.SetLicense("Aspose.Total.lic");

            var pdfPath = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "test.pdf");
            var convertedDocumentPath = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "converted.docx");
            var pdfDocument = new Aspose.Pdf.Document(pdfPath);
            pdfDocument.Save(convertedDocumentPath, SaveFormat.DocX);
        }
    }
}

Here’s the error:
System.NullReferenceException: “Object reference not set to an instance of an object.”
at #=ziKqhTtbZyw4BkKUoT3mSpy$$his87ZdTcN0Ipx1P5Xl1.#=zY0HQtGNGVXQL()\n at #=zYZ5l8bOG4jJ0dhDSp4m6BvfA7peTLd8o7VEFOvOW$nIViAB4GA==.#=z_6qXGUg=(#=z5wmixPgpid2CI$g7rMTfGvRodRnNvyTQCLJbq93Lh3sL #=zyhysdIE=, #=zT402Il0$NabGYcvoJ7Px6jagchRV6Y26r288wTszdHS5 #=zKLaXDfM3B9co, #=zu6AdRgQQR8r3Yj$A4nwTdC$ENBMeulK1gg== #=zccMJ$Sk=)\n at #=zNZBu6ev$eg5$HS2CqfQEhMbcocL5SQSW3o6JucC__CVL5VK$kg==.#=z_6qXGUg=(#=z5wmixPgpid2CI$g7rMTfGvRodRnNvyTQCLJbq93Lh3sL #=zyhysdIE=, #=zT402Il0$NabGYcvoJ7Px6jagchRV6Y26r288wTszdHS5 #=zKLaXDfM3B9co)\n at #=ztxquhYRowjnsZQxP8GiFPl9msiEAGiFvny2PTFP6sJjS.#=zBudKJpbY2zTPqyCaRw==(#=z5wmixPgpid2CI$g7rMTfGvRodRnNvyTQCLJbq93Lh3sL #=zyhysdIE=, List1 #=zKwldhkE=)\n at #=ztxquhYRowjnsZQxP8GiFPl9msiEAGiFvny2PTFP6sJjS.#=zQhPGlV$TEzLjLEdYag==(#=z5wmixPgpid2CI$g7rMTfGvRodRnNvyTQCLJbq93Lh3sL #=zyhysdIE=, List1 #=zKwldhkE=)\n at #=ztxquhYRowjnsZQxP8GiFPl9msiEAGiFvny2PTFP6sJjS.#=zU7PTtTBj985L(#=z5wmixPgpid2CI$g7rMTfGvRodRnNvyTQCLJbq93Lh3sL #=zyhysdIE=, List`1 #=zapfjkeHIyIOq)\n at #=ztxquhYRowjnsZQxP8GiFPl9msiEAGiFvny2PTFP6sJjS.#=ziRIh4sc=(#=zxomKc57lhYWWyabT41VMWKEgQxuu #=z94utm9E=, #=z8JzRwMDnRYyPoUkGXOTktEFHW4$NFkcKwQ== #=z78kbUMo=, #=z5wmixPgpid2CI$g7rMTfGvRodRnNvyTQCLJbq93Lh3sL #=zyhysdIE=)\n at #=zpzdvbeStk5TK_Z4qaPrct0JUvLhq.#=zYuQR3N5vpScV(Document #=z94utm9E=, #=z5wmixPgpid2CI$g7rMTfGvRodRnNvyTQCLJbq93Lh3sL& #=z2P9l6qDLdfYyHHWn1A==, UnifiedSaveOptions #=zMLVC6rY=, Int32& #=zWj9Zjj7SgTpd)\n at #=zpzdvbeStk5TK_Z4qaPrct0JUvLhq.#=znI6ZvrMYMhbQHVaCJQ==(Document #=z6skZ2su9HLnq, #=z5wmixPgpid2CI$g7rMTfGvRodRnNvyTQCLJbq93Lh3sL& #=zyhysdIE=, UnifiedSaveOptions #=zMLVC6rY=, Int32& #=zp9R_w8KHi3$f)\n at #=znNxpe9SuOaMhbr9aRPqKf7g=.#=zIy49nFs=(Document #=z6skZ2su9HLnq, Stream #=zdtdDQcQJo9RS, DocSaveOptions #=zMLVC6rY=)\n at Aspose.Pdf.Document.#=z4laxV7ngq0$_(String #=zNo3Cfw2oI6lO, SaveFormat #=zLP9ruF0=)\n at Aspose.Pdf.Document.Save(String outputFileName, SaveFormat format)\n at TestSaveToWord.Program.Main(String[] args) in 
/Users/[redacted]/Documents/Development/TestSaveToWord/TestSaveToWord/Program.cs:14

I basically get a null reference exception every time I try to save to a Word document. Can you help, please? Thank you.

@dustyenterprise

Could you please ZIP and attach your input PDF here for testing? We will investigate the issue and provide you more information on it.

I’ve attached the file. Thanks.
test.pdf.zip (223.9 KB)

@dustyenterprise You can achieve what you need using Aspose.Words and IReplacingCallback. For example, see the following code, which extracts the table after the specified text into a separate document.

Document doc = new Document(@"C:\Temp\test.pdf");

FindReplaceOptions opt = new FindReplaceOptions();
ExtractTableCallback tableExtractor = new ExtractTableCallback();
opt.ReplacingCallback = tableExtractor;
doc.Range.Replace("Amestecuri", "", opt);

// Put the table into a separate document. Here you can put the table into your template.
Document testDocument = new Document();
if (tableExtractor.Table != null)
    testDocument.FirstSection.Body.AppendChild(testDocument.ImportNode(tableExtractor.Table, true));

testDocument.Save(@"C:\Temp\out.docx");

private class ExtractTableCallback : IReplacingCallback
{
    public ReplaceAction Replacing(ReplacingArgs args)
    {
        Node currentNode = args.MatchNode;
        while (Table == null && currentNode != null)
        {
            Table = currentNode as Table;
            currentNode = currentNode.NextPreOrder(args.MatchNode.Document);
        }

        return ReplaceAction.Skip;
    }

    public Table Table { get; set; }
}
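
To put the extracted table into a template in place of a text placeholder, a second callback along the following lines might work. This is only a sketch: InsertTableCallback, TemplateFiller, and the "[TABLE_PLACEHOLDER]" text are hypothetical names, and it assumes the placeholder sits in a paragraph of its own. The table is imported into the template and inserted before that paragraph, and the placeholder text itself is replaced with an empty string.

```csharp
using Aspose.Words;
using Aspose.Words.Replacing;
using Aspose.Words.Tables;

// Hypothetical callback: inserts an existing table (from another document)
// before the paragraph that contains the placeholder text.
public class InsertTableCallback : IReplacingCallback
{
    private readonly Table sourceTable;

    public InsertTableCallback(Table sourceTable)
    {
        this.sourceTable = sourceTable;
    }

    public ReplaceAction Replacing(ReplacingArgs args)
    {
        // MatchNode is a Run; the table has to go next to its paragraph.
        CompositeNode paragraph = args.MatchNode.GetAncestor(NodeType.Paragraph);
        if (paragraph == null)
            return ReplaceAction.Skip;

        // A node must be imported before it can live in another document.
        DocumentBase template = args.MatchNode.Document;
        Node imported = template.ImportNode(sourceTable, true, ImportFormatMode.KeepSourceFormatting);
        paragraph.ParentNode.InsertBefore(imported, paragraph);

        // Let the engine replace the placeholder text with the (empty) replacement.
        return ReplaceAction.Replace;
    }
}

public static class TemplateFiller
{
    // Hypothetical usage: `table` is the Table found by ExtractTableCallback.
    public static void Fill(Document template, Table table)
    {
        FindReplaceOptions options = new FindReplaceOptions();
        options.ReplacingCallback = new InsertTableCallback(table);
        template.Range.Replace("[TABLE_PLACEHOLDER]", "", options);
    }
}
```

Inserting the table before the placeholder paragraph leaves an empty paragraph after it, which keeps the body valid even when the placeholder is the last paragraph in the document.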

Hi @alexey.noskov . Thanks for the reply.
The code below throws an error.

Aspose.Words.UnsupportedFileFormatException has been thrown

As far as I can tell, that line creates a Word document by loading a PDF. But in practice I get the error I mentioned.

Since you mentioned this

I assumed that new Document() is an Aspose.Words document. What am I doing wrong?

Thanks!

@dustyenterprise Most likely you are using an old version of Aspose.Words. The ability to load PDF documents was introduced in version 20.2.0 for the .NET Standard 2.0 DLL, and in version 20.4.0 we added a .NET Framework 4.6.1 DLL that also includes this feature.

OK. I have an older version. Is there a way to do this with an earlier version? Is there no way I can convert a PDF to a Word document? It seems everything I have tried so far ends in an exception. And I find it very strange that one cannot extract content from a PDF file and insert it in a Word file. My license expiry date is 20190401. I think I’m on 19.7, if I recall correctly.

@dustyenterprise

Please use the latest version of Aspose.Words for .NET, 21.2.0, to get the desired output.

Please note that we do not provide support for older released versions of Aspose.Words. Moreover, we do not provide any fixes or patches for old versions of Aspose products either. All fixes and new features are always added into new versions of our products.

We always encourage our customers to use the latest version of Aspose.Words as it contains newly introduced features, enhancements and fixes to the issues that were reported earlier.

Yeah, that’s a real shame. Because converting one file format to another seems to me like basic functionality which you are unable to provide. And you are asking a ton of money for this. I cannot just upgrade my license because this project has a budget and the allocated amount was already spent on the original license. Now you are asking me to pay 1800 dollars (the license upgrade price) for something that should have been there in the first place? There is no such budget. I find this very disappointing.

@dustyenterprise

You are using Aspose.Words 19.7, and this version does not support importing PDF into the Aspose.Words DOM.

PDF loading was introduced in later versions. You can use those versions of Aspose.Words to avoid the exception.

@dustyenterprise Is 19.7 the latest version you can update to? You can check the expiration date of your license file by opening it in any text editor and finding the SubscriptionExpiry tag. It will look like this:

 <SubscriptionExpiry>20210930</SubscriptionExpiry>

It means that you can upgrade for free to any version of the Aspose product published before 09/30/2021.
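
If you prefer to check this in code, the small sketch below reads the same tag with System.Xml.Linq. It assumes the license file is well-formed XML containing a SubscriptionExpiry element in yyyyMMdd form, as in the snippet above; LicenseInfo and ReadSubscriptionExpiry are hypothetical names and are not part of any Aspose API.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

public static class LicenseInfo
{
    // Hypothetical helper: returns the SubscriptionExpiry date found in the
    // license XML, or null when the tag is missing.
    public static DateTime? ReadSubscriptionExpiry(string licenseXml)
    {
        XElement tag = XDocument.Parse(licenseXml)
            .Descendants("SubscriptionExpiry")
            .FirstOrDefault();
        if (tag == null)
            return null;

        // The value is stored as yyyyMMdd, for example 20210930.
        return DateTime.ParseExact(tag.Value.Trim(), "yyyyMMdd", CultureInfo.InvariantCulture);
    }
}
```

For example, LicenseInfo.ReadSubscriptionExpiry(System.IO.File.ReadAllText("Aspose.Total.lic")) would return the expiry date of that license file.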

19.4 or 19.7 is the latest I can upgrade to. I can’t remember right now. My subscription expiry date is 20190401. I understand that this feature was added later and while it is my responsibility to check before I buy, I still consider file format conversion basic functionality and it should have been present already in much much lower versions than 20.4.0. Right now I have a very disappointed client and a lot of trouble for me. I have to find a way to manually convert some stuff or find a solution from a competing product that hopefully is not expensive.

@dustyenterprise I understand your disappointment, but I cannot agree that conversion from PDF to Word is basic functionality.
Aspose.Pdf also provides functionality to convert PDF to MS Word documents; however, the structure of the output Word document stays close to the PDF, which is why it is hard to edit. You can use the following code to achieve the conversion using Aspose.Pdf:

Aspose.Pdf.Document pdfDoc = new Aspose.Pdf.Document(@"C:\Temp\test.pdf");
pdfDoc.Save(@"C:\Temp\out.docx", Aspose.Pdf.SaveFormat.DocX);

However, if you open the output document, you will notice that the tables only look like tables but are actually built using frames. As you can understand, such a ‘table’ is very hard to recognize programmatically.

@alexey.noskov I decided to do the conversion myself, from an AbsorbedTable to a Words.Tables.Table, by going through all rows and cells and creating a new Words table with the same rows and cells. But even this does not work. With the PDF file I provided in an earlier post, the table absorber returns 8-9 tables on the page that I’m looking at, when in reality there’s only one table on that page. What the table absorber considers tables are actually columns and rows. It treats the table header row as a table, it treats some columns as tables, and it even returns some columns twice, so the collection of tables provided by the table absorber has duplicates, or what seem like duplicates. So I’m really out of options. I cannot reliably convert a PDF table to a Word table, not even with the limited functionality provided in the 19.X versions. Do you have any solution for this? Is there ANY way I can do this without buying a new subscription to have access to 2.X? Buying a subscription is out of the question. I don’t have that kind of money and my client doesn’t have that kind of money (the money for the license was provided through a grant). I’m just by myself, a single developer. Thank you!

@dustyenterprise,

Please first request a 30-day Temporary License.

Then test on your end and confirm whether you can actually meet your requirement using the latest 21.2 version of the Aspose.Words for .NET API alone (without using the Aspose.PDF API, etc.).

If the latest version of Aspose.Words resolves the problem on your end, we will ask our sales team to see if they can help you with the license upgrade.