Can't convert PDF to XML

eyalsadeh · November 15, 2023, 3:12pm

Hi,
We tried the "Convert PDF to XML via Java | Aspose.PDF" example and got an error: Exception in thread “main” class com.aspose.pdf.exceptions.PdfException: Tagged pdf expected. Please use tagged pdf file for converting to xml format or use MobiXml for untagged pdf.

this is the code

alexey.noskov · November 15, 2023, 3:16pm

@eyalsadeh Your question is related to Aspose.PDF, so I have moved your request into the appropriate forum category. My colleagues from the Aspose.PDF team will help you shortly.

asad.ali · November 15, 2023, 10:19pm

@eyalsadeh

Can you please share what type of output XML you expect from the API? Can you also share your sample source and expected output files for our reference? We will investigate the feasibility and share our feedback with you.

eyalsadeh · November 16, 2023, 6:39am

We expect to get the text of the PDF along with its location, size, and page, plus other relevant data in the PDF. Our PDF is attached:
template.pdf (410.1 KB)

This is the code that we used.
import com.aspose.pdf.Document;
import com.aspose.pdf.SaveFormat;

public class Main {
    public static void main(String[] args) throws Exception {
        // load PDF with an instance of Document
        Document document = new Document("template.pdf");
        // save document in XML format
        document.save("output.xml", SaveFormat.Xml);
    }
}

Please let us know what we should do.

asad.ali · November 16, 2023, 2:03pm

@eyalsadeh

We are afraid that this feature is not yet available in the API. As the error message states, you can generate MobiXml from an untagged PDF, but that output may not contain all the information you need at the moment. Therefore, we have logged a feature request as PDFJAVA-43311 in our issue tracking system.

We will look into its details and keep you posted with the status of its resolution. Please be patient and spare us some time.

We are sorry for the inconvenience.

eyalsadeh · November 16, 2023, 2:16pm

OK, would this feature work if the PDF were simpler?

asad.ali · November 16, 2023, 6:57pm

@eyalsadeh

We are afraid that it would not work, because this conversion has not been implemented in the API. Can you please share an expected output XML for our reference? It would help us in investigating the ticket.

eyalsadeh · November 19, 2023, 7:16am

OK, then why do you provide a code example that does this? Are you sure about this?

Basically, we need to see the locations of the PDF text in the document.

asad.ali · November 19, 2023, 10:24pm

@eyalsadeh

We are checking it and will get back to you shortly.

asad.ali · December 3, 2023, 8:44pm

@eyalsadeh

Several algorithms are used to convert a PDF to XML, and which one applies depends on the type of the resulting XML, which can be tagged as document XML, Excel XML, SVG XML, or APS-XML.
The attached XML (which was received from the online converter) corresponds to APS-XML, so we recommend using the following code to obtain the same result from the library:

Document document = new Document(dataDir + "template.pdf");
// save document in APS-XML format
document.save(dataDir + "output_23_10_.xml", SaveFormat.Aps);
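
Once the APS-XML file is saved, a quick inventory of its elements can help locate where the text and position data live. The thread does not describe the APS-XML schema, so the sketch below makes no assumptions about it: it uses only the JDK's DOM parser and counts how often each element name occurs. The class name XmlInventory and the method countElements are hypothetical names chosen for illustration.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: lists which element names occur in an XML file.
// It does not rely on any particular schema (the APS-XML layout is not
// documented in this thread).
class XmlInventory {

    // Parses the stream and returns element name -> number of occurrences,
    // sorted alphabetically by element name.
    static Map<String, Integer> countElements(InputStream xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(xml);

        Map<String, Integer> counts = new TreeMap<>();
        // "*" matches every element in the document, including the root.
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element element = (Element) all.item(i);
            counts.merge(element.getTagName(), 1, Integer::sum);
        }
        return counts;
    }

    // Usage: java XmlInventory output_23_10_.xml
    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream(args[0])) {
            countElements(in).forEach(
                    (name, n) -> System.out.println(name + ": " + n));
        }
    }
}
```

Running it against the saved file prints one line per element name, which gives a starting point for finding the elements and attributes that carry text and coordinates.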