How do I create an XML file from a PDF file?


#1

Hello,

Using the latest version of Aspose.Pdf (18.10), I am trying to write out a PDF document converted to XML using SaveXml(). However, the file is almost empty. It only contains:

<?xml version="1.0" encoding="UTF-8"?>

I came across a hint that the PDF document must be tagged. What does this mean?

Alternatively I also tried:

document.Save(new FileStream("d:\\test\\test.xml", FileMode.Create), SaveFormat.Xml);

but it crashes with “Object reference not set to an instance of an object”:

at Aspose.Pdf.Structure.TextElement.(OperatorCollection , String )
at .(Element , XmlWriter )
at .(Element , XmlWriter )
at .(Document , Stream , XmlSaveOptions )
at PdfAsposer.Program.Main(String[] args) in D:\Intern\Projects\PdfAsposer\PdfAsposer\Program.cs:line 26.

What am I doing wrong?

Thank you in advance,

Eckhard Zucker


#2

@eckhardzucker

Thanks for contacting support.

The Aspose.PDF API can convert PDF into XML, where the XML schema corresponds to the ebook MobiXML standard. If you are facing any issue, please share your sample PDF document with us so that we can test the scenario in our environment and address it accordingly.


#3

Good morning, @asad.ali and thank you for your reply.

We did manage to create an output file in the MobiXML format, but this is not what we need.

Our intent is as follows: We want to convert one particular PDF document into its Aspose document model XML representation. We’d like to use this XML document as a template. On the basis of this template, by removing, adding, or modifying tables, forms, and so on, we would like to create new XML documents. These new XML documents will then be converted into PDF. The new PDF documents will resemble the original PDF document, but with differences in their contents.

We tried to serialize the PDF document using the .NET XmlSerializer, but it aborted because of missing parameterless constructors:

private static void Serialize1(Aspose.Pdf.Document document)
{
    var xmlSerializer = new XmlSerializer(document.GetType());
    using (var xmlTextWriter = new XmlTextWriter("d:\\test\\test.xml", Encoding.UTF8))
    {
        xmlSerializer.Serialize(xmlTextWriter, document);
    }
}

System.InvalidOperationException
HResult=0x80131509
Message=There was an error reflecting type 'Aspose.Pdf.Document'.
Source=System.Xml
StackTrace:
at System.Xml.Serialization.XmlReflectionImporter.ImportTypeMapping(TypeModel model, String ns, ImportContext context, String dataType, XmlAttributes a, Boolean repeats, Boolean openModel, RecursionLimiter limiter)
at System.Xml.Serialization.XmlReflectionImporter.ImportElement(TypeModel model, XmlRootAttribute root, String defaultNamespace, RecursionLimiter limiter)
at System.Xml.Serialization.XmlReflectionImporter.ImportTypeMapping(Type type, XmlRootAttribute root, String defaultNamespace)
at System.Xml.Serialization.XmlSerializer..ctor(Type type, String defaultNamespace)
at PdfAsposer.Program.Serialize1(Document document) in D:\dev\cand\law\law-cand\LAW\Anwalt\Intern\Projects\PdfAsposer\PdfAsposer\Program.cs:line 42
at PdfAsposer.Program.Main(String[] args) in D:\dev\cand\law\law-cand\LAW\Anwalt\Intern\Projects\PdfAsposer\PdfAsposer\Program.cs:line 21

Inner Exception 1:
InvalidOperationException: There was an error reflecting property 'PageInfo'.

Inner Exception 2:
InvalidOperationException: There was an error reflecting type 'Aspose.Pdf.PageInfo'.

Inner Exception 3:
InvalidOperationException: There was an error reflecting property 'DefaultTextState'.

Inner Exception 4:
InvalidOperationException: There was an error reflecting type 'Aspose.Pdf.Text.TextState'.

Inner Exception 5:
InvalidOperationException: Cannot serialize member 'Aspose.Pdf.Text.TextState.Font' of type 'Aspose.Pdf.Text.Font', see inner exception for more details.

Inner Exception 6:
InvalidOperationException: 'Aspose.Pdf.Text.Font' cannot be serialized because it does not have a parameterless constructor.
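
The innermost exception is not specific to Aspose: XmlSerializer refuses any type that lacks a parameterless constructor, and it does so while the serializer itself is being constructed. The following is a minimal sketch that reproduces the behaviour with hypothetical stand-in types; NoDefaultCtor, FontDto, and SerializerDemo are made up for illustration and are not Aspose classes.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical stand-in for a type like Aspose.Pdf.Text.Font:
// its only constructor takes an argument.
public class NoDefaultCtor
{
    public string Name { get; set; }
    public NoDefaultCtor(string name) { Name = name; }
}

// Hypothetical plain data class that XmlSerializer accepts.
public class FontDto
{
    public string Name { get; set; }
    public FontDto() { }
}

public static class SerializerDemo
{
    // Returns true if an XmlSerializer can be built for the given type.
    // The reflection error surfaces in the XmlSerializer constructor,
    // before any object is actually serialized.
    public static bool CanSerialize(Type type)
    {
        try
        {
            new XmlSerializer(type);
            return true;
        }
        catch (InvalidOperationException)
        {
            return false;
        }
    }

    // Serializes a FontDto to an XML string.
    public static string ToXml(FontDto dto)
    {
        var serializer = new XmlSerializer(typeof(FontDto));
        using (var writer = new StringWriter())
        {
            serializer.Serialize(writer, dto);
            return writer.ToString();
        }
    }
}
```

Under these assumptions, SerializerDemo.CanSerialize(typeof(NoDefaultCtor)) returns false and SerializerDemo.CanSerialize(typeof(FontDto)) returns true. Because the serializer walks the whole object graph, a single nested type without such a constructor is enough to make the root type fail.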

Our template technique described above has been operational for years, but it uses the old Aspose PDF object model of version 11.7 (I believe), which is no longer supported. Here is a snippet of that XML document template:

<?xml version="1.0" encoding="utf-8" ?>
<Pdf CultureInfo="de-DE" IsUnicode="true" xmlns="Aspose.Pdf">
  <!--IsContentsModifyingAllowed="true">-->
  <!--left margin: 2.5cm, top margin: 2.1cm, bottom margin: 1.9cm, right margin: 1.3cm-->
  <!--<Section PageWidth="598.29" PageHeight="846.15" FontName="Helvetica" PageMarginLeft="71.23" PageMarginTop="59.83" PageMarginBottom="54.13" PageMarginRight="36.79">-->
  <Section PageWidth="598.29" PageHeight="846.15" PageMarginLeft="55" PageMarginTop="59.83" PageMarginBottom="54.13" PageMarginRight="37.04" FontName="Helvetica" FontSize="9">

    <!--PageBorderMarginBottom="0" PageBorderMarginTop="0" PageBorderMarginRight="0" PageBorderMarginLeft="0">-->

    <!-- Fold marks -->
    <!-- 1st mark: ~ 10.5cm from the top edge of the page
         2nd mark: ~ 21cm from the top edge of the page -->
    <Graph  Width="10" Height="0.1">
      <Line Color="gray" Position="-35, -238, -25, -238"/>
      <!-- alternative start position at the far left edge of the page -->
      <!--<Line Color="Black" Position="-55, -238, -35, -238"/> -->
    </Graph>
    <Graph  Width="10" Height="0.1">
      <Line Color="gray" Position="-35, -536, -25, -536"/>
      <!--<Line Color="Black" Position="-55, -536, -35, -536"/>-->
    </Graph>

    <Text Alignment="Left" MarginLeft="12" MarginTop="0">
      <Segment FontName="Helvetica-Bold" FontSize="13">Title of the template document</Segment>
    </Text>
    <Text Alignment="Center">
      <Segment FontName="Helvetica" FontSize="9">- for demonstration purposes -</Segment>
    </Text>

    <!--Cover page -->
    <Table ColumnWidths="220 271" MarginTop="20">
      <Row>
        <Cell>

          <!-- left area-->
          <Table ID="Empfaenger" ColumnWidths="220" MarginLeft="2" DefaultCellPaddingBottom="2" DefaultCellPaddingTop="2">

            <!--Receiver row-->
            <Row FixedRowHeight="14">
              <Cell>
                <Text ID="txtReceivrName">
                  <Segment ID="Receiver.Name" FontName="Helvetica" FontSize="9">{FF}</Segment>
                  <Segment FontName="Helvetica" FontSize="9">Receiver</Segment>
                </Text>

                <Text ID="Receiver.OldName" MarginBottom="13" TextWidth="147" Left="68" Top="-11" FontName="Helvetica" PositioningType="ParagraphRelative" ReferenceParagraphID="txtReceiver">
                  <Segment FontSize="9">{ST_147_14_Receiver.Name}</Segment>
                </Text>
              </Cell>
            </Row>
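
The template workflow described above (load the XML, change elements identified by their ID, then write out a new document) can be sketched with LINQ to XML. This is only an illustration under assumptions: TemplateFiller and SetSegmentText are hypothetical names, the XML in Demo is a cut-down, well-formed excerpt of the snippet above, and the step that turns the resulting XML into a PDF is not shown.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class TemplateFiller
{
    // The template declares xmlns="Aspose.Pdf", so element names must be
    // qualified with that namespace; attributes such as ID are unqualified.
    private static readonly XNamespace Ns = "Aspose.Pdf";

    // Replaces the text of the <Segment> whose ID attribute equals segmentId.
    // Returns false if the template contains no such segment.
    public static bool SetSegmentText(XDocument template, string segmentId, string text)
    {
        XElement segment = template
            .Descendants(Ns + "Segment")
            .FirstOrDefault(s => (string)s.Attribute("ID") == segmentId);
        if (segment == null)
        {
            return false;
        }
        segment.Value = text;
        return true;
    }

    // Small usage example on a cut-down excerpt of the template.
    public static string Demo()
    {
        const string xml =
            @"<Pdf xmlns=""Aspose.Pdf""><Section><Text ID=""txtReceivrName"">" +
            @"<Segment ID=""Receiver.Name"">{FF}</Segment></Text></Section></Pdf>";
        XDocument template = XDocument.Parse(xml);
        SetSegmentText(template, "Receiver.Name", "Jane Doe");
        return template.ToString(SaveOptions.DisableFormatting);
    }
}
```

Looking elements up by ID keeps the code independent of where a segment sits in the table layout, which is the property the template approach relies on.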

And lastly, could you please explain what “tagged PDF” means? Can we tag the original PDF document ourselves in order to have it written out with SaveXml()? What does SaveXml() expect in order to produce more than:

<?xml version='1.0' encoding='utf-8'?><Document xmlns="Aspose.Pdf" />

Thank you,

Eckhard Zucker


#4

One more idea:

Maybe we could operate on the original PDF document itself, using it as the template and modifying it with the Aspose PDF API.

We investigated further, but there is one point that we don’t understand: We couldn’t find out how GetObjectById() works. What ID does it expect? In the debugger, inspecting an Aspose.Pdf.Document, there’s no ID to be found.

In order to operate on an existing Aspose PDF document, we need to identify entities, right?

Could this approach be promising?

Thank you again,

Eckhard Zucker


#5

@eckhardzucker

Thanks for sharing more details.

Please note that the old XML schema was based on the Aspose.Pdf.Generator model, which is obsolete and has been replaced by the Aspose.Pdf DOM model. The current XML to PDF conversion feature follows the new XML schema structure and element sequence. You may check the “Generate PDF from XML” article(s) in the API documentation for more information.

We have logged investigation tickets PDFNET-45737 and PDFNET-45735 in our issue tracking system for SaveXml() and the Tagged PDF exception, respectively.

I am afraid the API can currently convert PDF to XML only according to the MobiXML standard. However, we have logged feature request PDFNET-45736 in our issue tracking system for implementation of the required feature.

As soon as we have definite updates on the logged tickets, we will inform you in this forum thread. Please be patient and spare us a little time.

We are sorry for the inconvenience.


#6

@asad.ali

Thank you for your reply.

We followed the link “Generate PDF from XML” but unfortunately didn’t get anywhere. Neither the documentation nor the XML schema file seems to be complete.

For example, what is the XML representation for something like this:

Document document = new Document();
Page page = document.Pages.Add();
RadioButtonField radioButtonField = new RadioButtonField(document.Pages[1]);
RadioButtonOptionField radioButtonOptionField = new RadioButtonOptionField(page, new Aspose.Pdf.Rectangle(0, 0, 22, 22));

Could you please point us to some more detailed documentation regarding XML representation of Aspose objects?

Thank you very much!


#7

@eckhardzucker

Thanks for getting back to us.

We have also noticed that the current XML schema provided by Aspose.PDF does not contain an element sequence for creating form fields. However, enhancement request PDFNET-45782 is already logged in our system for its implementation. We will investigate and implement the feature, and we will let you know of any updates. Please spare us a little time.

We are sorry for this inconvenience.


#8

@eckhardzucker

Our PDF to XML converter currently works only with Tagged PDF, because it saves the document’s logical structure into XML, and this is not possible if the PDF document is not a Tagged PDF. Regarding the other logged tickets, we will let you know once updates are available.


#9

All right, @asad.ali, thank you!