Convert PDF to XML to JSON

vmamilla · November 20, 2020, 3:39pm

I am not able to extract XML with the Convert PDF to XML | Online and Free app. Does that mean we can’t use the product, or do we need to buy it before testing?

I can provide the PDF file if required.

I am trying to convert PDF to JSON. The conversion runs, but each word in the output comes with coordinates and font details such as font size:
{"@x":"52.875","@y":"168.469","@width":"100.835","@height":"47.131","#text":
{"@width":"612","@height":"792","font":[{"@size":"42","@face":"RFPLAD+Arial","@src":

asad.ali · November 22, 2020, 6:38pm

@vmamilla

Would you please provide a sample PDF document along with the expected output XML file in .zip format? We will test the scenario in our environment and share our feedback with you accordingly.

vmamilla · November 23, 2020, 3:55pm

Hi Asad,

These are the documents. When I try to convert PDF to XML, I get a 404 error.

Thanks,
Venkat.

employeeguide.pdf (6.5 MB)
guide-to-fmla.pdf (2.3 MB)

asad.ali · November 24, 2020, 10:33am

@vmamilla

We converted your files using Aspose.PDF for .NET 20.11 and the following code snippet, and did not notice any issue.

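// Requires: using Aspose.Pdf; dataDir is the folder that holds the PDF
Document doc = new Document(dataDir + "guide-to-fmla.pdf");
// MobiXml is the XML save format used for this conversion
doc.Save(dataDir + "guide-to-fmla.xml", SaveFormat.MobiXml);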

guide-to-fmla.zip (4.1 MB)

If you are facing an issue with the free online apps, please create a post in the respective forum, where you will be assisted accordingly.

vmamilla · November 25, 2020, 4:38pm

Hi Asad,

My question is that I am not able to parse the file with the Convert PDF to XML | Online and Free app.

From code I am able to parse only 4 pages, and every entry shows this kind of font size and coordinate data:

{"@x":"52.875","@y":"168.469","@width":"100.835","@height":"47.131","#text":
{"@width":"612","@height":"792","font":[{"@size":"42","@face":"RFPLAD+Arial","@src":

How can we remove these? I need to convert PDF to JSON, so I am converting PDF to XML and then to JSON, but I am getting all the special characters and fonts.

If I try PDF to TXT from .NET, it also converts only one line.

I am using this code in .NET:

    // Requires: using System.IO; using Aspose.Pdf; using Aspose.Pdf.Text;
    // Open document
    Document pdfDocument = new Document(_dataDir + "demo.pdf");
    // Collect the text of the document
    TextAbsorber ta = new TextAbsorber();
    ta.Visit(pdfDocument);
    // Save the extracted text in a text file
    File.WriteAllText(_dataDir + "input_Text_Extracted_out.txt", ta.Text);

guide-to-fmla.pdf (2.3 MB)

Thanks,
Venkat.

asad.ali · November 26, 2020, 8:05am

@vmamilla

As requested earlier, you need to post this issue in the respective forum to get it addressed properly.

Furthermore, the 4-page limitation is due to the trial version. You can get a free 30-day temporary license to evaluate the API without any restriction.

The font sizes and coordinates appear because the API supports conversion to the MobiXml format only. We are afraid that you cannot convert to XML or JSON at the moment. However, we will investigate the feasibility further if you could share a sample of the expected output format.

The single-line text output is also due to the limitation of the trial version. Please apply a valid license to use the API without any restriction.

vmamilla · December 1, 2020, 9:28pm

Hi Asad,

I need to break the PDF document into text (I am already getting this). After that, I need each section header and each paragraph separated, broken down by section and sub-section.

guide-to-fmla.pdf (2.3 MB)

Thanks,
Venkat.

asad.ali · December 2, 2020, 4:50pm

@vmamilla

Would you please share a few more details of your requirements? How do you want to break the PDF? Do you want to convert each section into a separate PDF document? What are the parameters on which you want to split the PDF file?

vmamilla · December 2, 2020, 5:07pm

Hi Asad,

I need to break out each paragraph’s text as a section (either as a PDF file or a Word document), with the header, title, and sub-section as text. We are getting things like font sizes and images. We don’t want images and fonts; we only need the text at paragraph level.

Thanks,
Venkat.

asad.ali · December 3, 2020, 4:31pm

@vmamilla

Generally, you can extract only the text from a PDF document using the code given in the linked article. As per our understanding, you wish to generate an XML document from the PDF, but the output XML should contain only the definition of the text. Please let us know if our understanding is correct.

Also, please share a sample output file in your desired format. We will try to generate one using the API and share our feedback with you.
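
For text at paragraph level, Aspose.PDF for .NET also has a ParagraphAbsorber class in the Aspose.Pdf.Text namespace. The sketch below is only an illustration and is not the code from the linked article. It assumes the Aspose.PDF for .NET package is referenced; ParagraphText and Extract are hypothetical names, and whether the detected paragraphs line up with the headers and sub-sections of a given document has not been checked here.

```csharp
using System.Collections.Generic;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Hypothetical helper for illustration; not part of Aspose.PDF.
public static class ParagraphText
{
    // Returns one string per paragraph that ParagraphAbsorber detects.
    public static List<string> Extract(string pdfPath)
    {
        var paragraphs = new List<string>();
        using (Document doc = new Document(pdfPath))
        {
            ParagraphAbsorber absorber = new ParagraphAbsorber();
            absorber.Visit(doc);

            // Page -> section -> paragraph -> line -> text fragment
            foreach (PageMarkup markup in absorber.PageMarkups)
            foreach (MarkupSection section in markup.Sections)
            foreach (MarkupParagraph paragraph in section.Paragraphs)
            {
                var text = new StringBuilder();
                foreach (List<TextFragment> line in paragraph.Lines)
                {
                    foreach (TextFragment fragment in line)
                    {
                        text.Append(fragment.Text);
                    }
                    text.AppendLine();
                }
                paragraphs.Add(text.ToString().TrimEnd());
            }
        }
        return paragraphs;
    }
}
```

Under the trial version, the page limitation mentioned earlier in this thread would still apply to the result.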

vmamilla · December 4, 2020, 2:25pm

Hi Asad,

I am able to extract text from the Word document. If I convert to XML, it gives font1.ttf, font2.ttf, font2.ttf… and each word comes with font size, width and height, and x and y coordinates. The output XML should contain only the definition of the text.

Military Leave

Thanks,
Venkat.
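
One possible way to get text-only output is to post-process the generated XML with the standard System.Xml.Linq and System.Text.Json APIs: drop the font elements, strip every attribute (x, y, width, height, size), and serialize the remaining text. This is a minimal sketch under assumptions. The element name "font" is inferred from the JSON fragments earlier in this thread, the real MobiXml layout may differ, and TextOnlyXml, StripLayout and ToJson are hypothetical names, not part of Aspose.PDF.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;   // assumption: .NET Core 3.0+ or the System.Text.Json package
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of Aspose.PDF.
public static class TextOnlyXml
{
    // Drops the named elements (for example "font") and strips every
    // attribute, so only element names and their text remain.
    public static XDocument StripLayout(string xml, params string[] elementsToDrop)
    {
        XDocument doc = XDocument.Parse(xml);

        // Extensions.Remove snapshots the matches before detaching them,
        // so the tree is not modified while it is being enumerated.
        doc.Descendants()
           .Where(e => elementsToDrop.Contains(e.Name.LocalName))
           .Remove();

        foreach (XElement element in doc.Descendants())
        {
            element.RemoveAttributes();   // x, y, width, height, size, ...
        }
        return doc;
    }

    // Collects the non-empty text of every leaf element, in document order,
    // and serializes it as a JSON array of strings.
    public static string ToJson(XDocument doc)
    {
        List<string> texts = doc.Descendants()
            .Where(e => !e.HasElements)
            .Select(e => e.Value.Trim())
            .Where(t => t.Length > 0)
            .ToList();
        return JsonSerializer.Serialize(texts);
    }
}
```

Calling TextOnlyXml.StripLayout(xml, "font") and passing the result to TextOnlyXml.ToJson yields a flat JSON array of strings. Grouping the text by page or section would need knowledge of the actual element names in the generated file.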

asad.ali · December 6, 2020, 8:11pm

@vmamilla

We need to investigate further whether this requirement can be achieved. For the sake of analysis, we have logged a ticket as PDFNET-49119 in our issue tracking system. We will look into its details and keep you posted on the status of its resolution. Please be patient and spare us some time.

We are sorry for the inconvenience.