BindXml Generating Pages?

Mitsobar · July 11, 2017, 12:27am

After I call BindXml to create a document, I look to see how many pages got generated, before I save, and I get a page count of zero. Why?

How can I verify the BindXml function worked before I save the PDF? It returns no result.

asad.ali · July 11, 2017, 8:38am

@Mitsobar

Thanks for contacting support.

I tested the scenario by generating a PDF from a sample XML file with the following code snippet, using Aspose.Pdf for .NET 17.7, and could not reproduce the issue. The API returned the correct page count.

// Bind the DOM-based XML template to a new document.
Document doc = new Document();
doc.BindXml(dataDir + "XMLDOM.xml");
// Read the page count before saving.
int count = doc.Pages.Count;
doc.Save(dataDir + "DOMPdf.pdf");

We would appreciate it if you could share some more details, i.e., your code snippet, the sample input document(s), and your environment details, so that we can test the scenario in our environment and address it accordingly.

Best Regards,
Asad Ali

Mitsobar · July 11, 2017, 2:27pm

Has the XML schema changed from when we used the Generator BindXML?
RevisionHistoryPage.zip (651 Bytes)
BTW…It is a pain with this new format to always need to zip any uploads.

Is there documentation to detail the differences?

I’m seriously disappointed in the nonexistence of documentation that has accompanied the dropping of the Generator object.

Your API reference documentation is sorely missing details. The API Reference seems to have only the method signatures and property names.
I can already get all that from .NET’s Object Browser.

asad.ali · July 11, 2017, 9:14pm

@Mitsobar

Thanks for sharing sample file.

I have checked the file and observed that its structure is based on the old Aspose.Pdf.Generator, which is obsolete and no longer recommended.

Because the legacy Aspose.Pdf.Generator approach has been deprecated, the old XML schema is not supported in the latest version(s) of the API. In the new version, an XML template for PDF generation must follow the new, DOM-based XML schema. For your reference, I have attached a sample DOM-based XML template and the PDF document generated from it.

XMLDOM.zip (512 Bytes)
DOMPdf.pdf (40.7 KB)

Furthermore, you can find the new XML schema in the XML folder in the installation directory of Aspose.Pdf for .NET. Please follow the new schema when creating DOM-based XML templates for PDF generation.

We are really sorry for the inconvenience you have faced. We have recorded your feedback and will try to make the upload feature more convenient.

Currently, very few articles about the new DOM structure (e.g., Convert XML file to PDF) have been added to the product documentation. Please note that updating the documentation is an ongoing process, and we are working to add topics on new features as well as remove old/deprecated references.

We apologize for the trouble caused by the missing details in the documentation. We will add the necessary details to the API documentation as soon as possible. Meanwhile, if you face any issue creating a DOM-based XML template, please feel free to let us know and we will assist you.

We are sorry for this inconvenience.

Best Regards,
Asad Ali

Mitsobar · July 12, 2017, 2:51pm

I get an internal server error 500 trying to download the zip and PDF.

asad.ali · July 12, 2017, 7:39pm

@Mitsobar

Thanks for writing back.

Please double-check that you are logged in before downloading the attachments, and if the issue persists, please let us know so that we can address it. Furthermore, for your convenience, I have uploaded the files to Dropbox and shared the links here as well. Please download the files from the links below.

DOMXML
DOMPDF

We are sorry for the inconvenience faced.

Best Regards,
Asad Ali

Mitsobar · July 12, 2017, 8:39pm

So I got the XML changed and it is generating a page.
Now I’m trying to use a TextAbsorber to get all the text fragments so I can replace some text.
The Absorber is coming back with no fragments.

    HistoryDoc.BindXml(ComponentPath + "RevisionHistoryPage.xml", null);

    Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber();

    HistoryDoc.Pages[1].Accept(textFragmentAbsorber);

    Aspose.Pdf.Text.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

    Log(String.Format("Fragments {0}", textFragmentCollection.Count));

    foreach (Aspose.Pdf.Text.TextFragment textFragment in textFragmentCollection)
    {
        LogRequest(textFragment.Text);

        switch (textFragment.Text)
        ....

RevisionHistoryPage.zip (510 Bytes)

Mitsobar · July 12, 2017, 8:45pm

The text I’m looking for is in a table. From what I read absorber should be traversing the DOM looking for all the TextFragments.

Notice that the text in the fragments is in Segments. Will that prevent the absorber from finding the text?

The schema does not allow text to be directly in the TextFragment element. It only allows segments.

Mitsobar · July 12, 2017, 8:47pm

Is there a setting I need to make to allow searching? I can find no place in the TextAbsorber documentation that tells me if that is so.

Mitsobar · July 12, 2017, 9:34pm

I may need to punt with using the TextAbsorber.

There is an overload for BindXml that takes a single string.
The documentation is not clear what the parameter is for. It is named “file”.
Is this a file path or the XML in string form?

There is another that I’m using that takes a file path for the XML and XSL.

asad.ali · July 13, 2017, 9:09am

@Mitsobar

The “file” parameter of the single-string BindXml overload is the path to the XML file, not the XML content itself.

Furthermore, we are looking into the scenario related to fragment absorbing and will get back to you shortly. Please be patient.

Best Regards,
Asad Ali

asad.ali · July 13, 2017, 5:44pm

@Mitsobar

Thanks for your patience.

I have tested the scenario with your code snippet and the shared file, and observed that TextFragmentAbsorber does not absorb any text right after the BindXml() method is called. However, when I first saved the document to a MemoryStream object by calling the Document.Save() method, the text was absorbed successfully from the page(s) of the document. Please check the following code snippet, which I used to get text from the PDF after generating it from XML.

Document HistoryDoc = new Document();
HistoryDoc.BindXml(dataDir + "RevisionHistoryPage.xml");

// Intermediate save to memory so that the bound content is generated
// before the absorber runs (MemoryStream requires "using System.IO;").
HistoryDoc.Save(new MemoryStream());

Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber();
HistoryDoc.Pages[1].Accept(textFragmentAbsorber);
Aspose.Pdf.Text.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
foreach (Aspose.Pdf.Text.TextFragment textFragment in textFragmentCollection)
{
    Console.WriteLine(textFragment.Text);
}
// Final save to disk.
HistoryDoc.Save(dataDir + "RevisionHistoryPage.pdf");

For your reference, I have also attached the output RevisionHistoryPage.pdf (2.0 KB) generated by the above code snippet. Furthermore, I have created an investigation ticket, PDFNET-43039, in our issue tracking system to confirm whether it is really necessary to save the document after the BindXml() method to make text absorption possible.

Our product team will look further into the details of the logged ticket, and we will update you as soon as we receive feedback from them. Please be patient and allow us a little time.

We are sorry for this inconvenience.

Best Regards,
Asad Ali

Mitsobar · July 21, 2017, 2:12pm

Adding the save worked. Too bad it is a workaround.

Is there more documentation on the schema…having just the schema alone is not really enough to go on.

For instance: The FontStyle attribute in the TextState element is an integer but there is no documentation on what the values mean.

In a TextSegment how does one include a line break?

There seem to be things in the old PDF schema that I could not find a way of doing in the new one.

asad.ali · July 21, 2017, 6:15pm

@Mitsobar

Thanks for writing back.

We have logged an investigation ticket for the behavior we observed, so until we receive feedback from the product team, we cannot say for sure whether it is the default behavior or a workaround.

We are sorry for the inconvenience caused; we are working on adding more documentation for the new DOM-based XML structure. Furthermore, you may send us your old template and the PDF generated from it. We will try to convert it to the new structure in our environment and share our findings with you.

asad.ali · October 21, 2019, 7:21pm

@Mitsobar

Please note that the BindXml method only binds the XML to the PDF document. For performance reasons, the actual processing (adding text, etc.) is performed when the document is saved. To make the Document object flush the added content without a final save, please invoke the ProcessParagraphs() method. Alternatively, save the document to an intermediate target before the TextFragmentAbsorber is accepted.

We used the following code for testing:

Document HistoryDoc = new Document();
HistoryDoc.BindXml(dataDir + "RevisionHistoryPage.xml");
// Flush the bound content into the document without a final save.
HistoryDoc.ProcessParagraphs();

Aspose.Pdf.Text.TextFragmentAbsorber textFragmentAbsorber = new Aspose.Pdf.Text.TextFragmentAbsorber();
HistoryDoc.Pages[1].Accept(textFragmentAbsorber);
Aspose.Pdf.Text.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

foreach (Aspose.Pdf.Text.TextFragment textFragment in textFragmentCollection)
{
    Console.WriteLine(textFragment.Text);
    // DrawRectangleOnPage is a test helper whose definition is not shown here.
    DrawRectangleOnPage(textFragment.Rectangle, textFragment.Page);
}
HistoryDoc.Save(dataDir + "RevisionHistoryPage_fragments.pdf");

Output: RevisionHistoryPage_fragments.pdf (2.1 KB)