Extract text from xml file using Aspose.Pdf.Text.TextAbsorber

diego.tosato · May 30, 2014, 10:28am

Hi there,

Is there any reason why it is not possible to extract text from an XML file using Aspose.Pdf.Text.TextAbsorber?

MemoryStream stream = ;

Aspose.Pdf.LoadOptions options = new XmlLoadOptions();

using (Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(stream, options))
{
    // Create TextAbsorber object to extract text.
    Aspose.Pdf.Text.TextAbsorber textAbsorber = new Aspose.Pdf.Text.TextAbsorber();

    // Apply TextAbsorber.
    string extractedText = string.Empty;
    pdfDocument.Pages.Accept(textAbsorber);
    extractedText = textAbsorber.Text;
}

Here is a straightforward example of the XML file.

<?xml version="1.0" encoding="UTF-8"?>

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum

Thanks,

Diego

codewarior · June 1, 2014, 1:50pm

Hi Diego,

Thanks for contacting support.

No text is extracted by the TextAbsorber object because the source XML is not in the correct format, so its contents are never loaded into the document. The source XML has to be in the Aspose.Pdf compatible format. If you need to keep using your existing XML file, please try using an XSLT to transform it into that format. Please try the following code snippet and XML to accomplish the desired result.

I would also suggest you to visit the following links for further details on

[C#]

using System;
using Aspose.Pdf;

// Load options for XML input in the Aspose.Pdf compatible format.
Aspose.Pdf.LoadOptions options = new XmlLoadOptions();

string extractedText = string.Empty;

using (Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document("c:/pdftest/source.xml", options))
{
    // Create TextAbsorber object to extract text.
    Aspose.Pdf.Text.TextAbsorber textAbsorber = new Aspose.Pdf.Text.TextAbsorber();

    // Apply TextAbsorber to all pages and collect the extracted text.
    pdfDocument.Pages.Accept(textAbsorber);
    extractedText = textAbsorber.Text;
}

Console.WriteLine(extractedText);

[XML]

<?xml version="1.0" encoding="utf-8" ?>
<Pdf xmlns="Aspose.Pdf">
  <Section>
    <Text>
      <Segment>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum</Segment>
    </Text>
  </Section>
</Pdf>
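
If the existing XML has to stay as it is, the XSLT route mentioned above could look roughly like the sketch below. It uses only the standard .NET System.Xml.Xsl classes, not any Aspose API. The input layout (a `document` root with `paragraph` children), the stylesheet, and the `AsposeXmlSketch` class name are all assumptions made for illustration, since the element names of the original file are not known; adjust the `match` and `select` expressions to the real structure.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class AsposeXmlSketch
{
    // Hypothetical stylesheet: assumes input shaped like
    // <document><paragraph>...</paragraph></document> and maps each
    // paragraph to a Text/Segment pair inside a single Section.
    private const string Stylesheet = @"
<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""xml"" indent=""yes"" omit-xml-declaration=""yes"" />
  <xsl:template match=""/document"">
    <Pdf xmlns=""Aspose.Pdf"">
      <Section>
        <xsl:for-each select=""paragraph"">
          <Text>
            <Segment><xsl:value-of select=""."" /></Segment>
          </Text>
        </xsl:for-each>
      </Section>
    </Pdf>
  </xsl:template>
</xsl:stylesheet>";

    // Applies the stylesheet to the given XML and returns the result as a string.
    public static string Transform(string sourceXml)
    {
        var xslt = new XslCompiledTransform();
        using (var xslReader = XmlReader.Create(new StringReader(Stylesheet.Trim())))
        {
            xslt.Load(xslReader);
        }

        var output = new StringWriter();
        using (var input = XmlReader.Create(new StringReader(sourceXml)))
        using (var writer = XmlWriter.Create(output, xslt.OutputSettings))
        {
            xslt.Transform(input, writer);
        }
        return output.ToString();
    }

    public static void Main()
    {
        // Made-up sample input in the assumed layout.
        string sourceXml =
            "<document><paragraph>Lorem ipsum dolor sit amet</paragraph></document>";
        Console.WriteLine(Transform(sourceXml));
    }
}
```

The transformed XML can then be written to a file or a MemoryStream and passed to the Aspose.Pdf.Document constructor together with XmlLoadOptions, as in the snippets earlier in this thread.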