Converting a 1500-page PDF to Word

manuel.corini · November 20, 2018, 4:26pm

Hi guys,

I’m trying to convert a file from .pdf to .doc format using Aspose.PDF 18.9.1 with a temporary license and Java 8. The PDF is 42 MB and has 1500 pages, with few pictures and few tables. After 45 minutes of processing I get this error:

java.lang.StackOverflowError
    at com.aspose.pdf.internal.l212I.I6l.l0I(Unknown Source)
    at com.aspose.pdf.internal.l212I.I6l.lif(Unknown Source)
    at com.aspose.pdf.internal.l2127.I1.ll(Unknown Source)
    at com.aspose.pdf.internal.l212I.I7.lif(Unknown Source)
    at com.aspose.pdf.internal.l2127.I1.ll(Unknown Source)
    at com.aspose.pdf.internal.l212I.I6l.lif(Unknown Source)
    ...

I previously got an OutOfMemoryError, which I solved by increasing the heap space to 2048 MB. What could I try?
This is the code:

    Document pdfDocument = new Document(pdf);
    // Create DocSaveOptions object
    DocSaveOptions saveOption = new DocSaveOptions();
    // Set format DOC
    saveOption.setFormat(DocSaveOptions.DocFormat.Doc);
    // Set the recognition mode as Flow
    saveOption.setMode(DocSaveOptions.RecognitionMode.Flow);
    // Enable the value to recognize bullets during conversion process
    saveOption.setRecognizeBullets(true);
    // Save the resultant DOC file (save() writes to the path itself,
    // so no separate FileOutputStream needs to be opened on it)
    pdfDocument.save(pathDOC, saveOption);

Where pathDOC is the String path where the .doc file is saved.
This is the file: [drive.google.com/open?id=18XU_058OHcOfWY68NnpNDjBGa95OHbeO](https://drive.google.com/open?id=18XU_058OHcOfWY68NnpNDjBGa95OHbeO)

asad.ali · November 20, 2018, 11:23pm

@manuel.corini

Thanks for contacting support.

We have tested the scenario in our environment and examined the PDF document you shared. Please note that API performance depends on the structure, size and complexity of a document. Larger PDF documents may need more memory to be processed, because the API needs to load all required resources into memory. Increasing the heap size is the recommended way to prevent such issues.

Furthermore, we improve memory consumption and API performance in every monthly release, so we always recommend using the latest version. Please try Aspose.PDF for Java 18.10 in your environment, and let us know if increasing the heap size does not help either. We will then assist you further.
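
When raising the heap size, it can help to confirm that the setting actually reached the JVM that runs the conversion, for example when it is started by an application server with its own startup script. A minimal sketch using only the standard library; the class name `HeapCheck` is made up for illustration:

```java
public class HeapCheck {

    /** Maximum heap the JVM will attempt to use, in MiB (reflects -Xmx when set). */
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

The reported value is usually somewhat below the `-Xmx` figure, since the JVM may exclude part of the heap from the total.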

manuel.corini · November 21, 2018, 8:16am

Thank you for the answer. Let me check that I understood correctly.
I have to increase the heap space from 2048 MB to 4096 MB and download the new version 18.10, keeping the same temporary license?

manuel.corini · November 21, 2018, 10:27am

I did as you said, upgrading the library and increasing the heap space to 4096 MB. I’m getting the same error after 45 minutes of processing:

nested exception is java.lang.StackOverflowError] with root cause
java.lang.StackOverflowError
at com.aspose.pdf.internal.l87p.l1f.lc(Unknown Source)
at com.aspose.pdf.internal.l87p.l1f.lI(Unknown Source)
at com.aspose.pdf.internal.l87t.lt.lf(Unknown Source)
at com.aspose.pdf.internal.l87p.lI.lI(Unknown Source)
at com.aspose.pdf.internal.l87t.lt.lf(Unknown Source)
… etc …
Any help ?
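
A note on why more heap did not change the outcome: a `StackOverflowError` concerns the call stack of the executing thread, not the heap, so `-Xmx` has no effect on it. The stack size is controlled by the JVM option `-Xss` or, per thread, by the `stackSize` argument of the `Thread` constructor. Below is a minimal sketch of the per-thread variant. It assumes the recursion is deep but finite, which this thread does not establish; `BigStackRunner` and `depth` are made-up names, and `depth` only stands in for a deeply recursive library call.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

public class BigStackRunner {

    /** Runs the task on a dedicated thread created with the requested stack size (bytes). */
    public static <T> T callWithStack(Supplier<T> task, long stackBytes) {
        AtomicReference<T> result = new AtomicReference<>();
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Thread worker = new Thread(null, () -> {
            try {
                result.set(task.get());
            } catch (Throwable t) {
                // Includes StackOverflowError if the stack is still too small
                failure.set(t);
            }
        }, "conversion-worker", stackBytes);
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for worker", e);
        }
        if (failure.get() != null) {
            throw new RuntimeException("Task failed on worker thread", failure.get());
        }
        return result.get();
    }

    // Stand-in for deep (but finite) recursion inside a library call.
    public static int depth(int n) {
        return n == 0 ? 0 : 1 + depth(n - 1);
    }

    public static void main(String[] args) {
        int reached = callWithStack(() -> depth(50_000), 64L * 1024 * 1024);
        System.out.println("Recursion depth reached: " + reached);
    }
}
```

In the scenario above, the supplier would wrap the `pdfDocument.save(pathDOC, saveOption)` call. The JVM treats the stack size as a hint, so the effect can vary by platform, and it does not help if the recursion never terminates.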

asad.ali · November 21, 2018, 6:00pm

@manuel.corini

Thanks for sharing your feedback.

We have reproduced the issue in our environment and logged it as PDFJAVA-38180 in our issue tracking system. We will look into the details and keep you posted on the status of its correction. Please be patient and allow us some time.

We are sorry for the inconvenience.

manuel.corini · December 27, 2018, 10:23am

Hi asad.ali,

I’m still waiting for an answer.
Is there a way to split the PDF file using the Aspose.PDF API?
I would like to split it, for example every 300 pages, convert each part to .doc format,
and merge all the files at the end.

asad.ali · December 27, 2018, 7:36pm

@manuel.corini

Thanks for your inquiry.

I am afraid that the earlier logged issue is not yet resolved due to other pending issues in the queue. As the issue was logged under the free support model, it has low priority and will be resolved on a first come, first served basis. However, we have started investigating it, and as soon as we have a definite update regarding its resolution, we will let you know. Please allow us some time.

You may use the following code snippet to generate documents of 300 pages each and save them as .docx. Later, you can merge them with your own procedure in your environment.

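Document doc = new Document(dataDir + "input.pdf");
// DocSaveOptions takes its format from DocSaveOptions.DocFormat
DocSaveOptions doptions = new DocSaveOptions();
doptions.setFormat(DocSaveOptions.DocFormat.DocX);
int pageCount = doc.getPages().size();
int part = 0;
Document newDoc = new Document();
for(int i = 1; i <= pageCount; i++)
{
  // Every page is added, so none is skipped at a block boundary
  newDoc.getPages().add(doc.getPages().get_Item(i));
  // Save after every 300 pages and after the last page
  if(i % 300 == 0 || i == pageCount) {
    // A numbered file name keeps each part from overwriting the previous one
    newDoc.save("WordFile_" + (++part) + ".docx", doptions);
    newDoc = new Document();
  }
}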
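
The block boundaries can also be worked out separately from the Aspose calls, which makes the off-by-one cases (a page dropped at a boundary, a final partial block never saved) easy to check. A minimal sketch in plain Java; `PageChunks` and `ranges` are illustrative names, not part of any Aspose API:

```java
import java.util.ArrayList;
import java.util.List;

public class PageChunks {

    /** Returns 1-based, inclusive {first, last} page ranges covering pages 1..pageCount. */
    public static List<int[]> ranges(int pageCount, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<int[]> result = new ArrayList<>();
        for (int first = 1; first <= pageCount; first += chunkSize) {
            // The last block may be shorter than chunkSize
            int last = Math.min(first + chunkSize - 1, pageCount);
            result.add(new int[] {first, last});
        }
        return result;
    }

    public static void main(String[] args) {
        for (int[] r : ranges(1500, 300)) {
            System.out.println(r[0] + "-" + r[1]);
        }
    }
}
```

For 1500 pages and a block size of 300 this gives five ranges, from 1-300 up to 1201-1500.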

manuel.corini · January 17, 2019, 9:57am

I would like to inform you that I solved the problem by dividing the document into blocks
and then merging the converted blocks with the appendDocument method:

public void getContent(String pathDOC,InputStream pdf) throws Exception {
	
	// Create DocSaveOptions object
	DocSaveOptions saveOption = new DocSaveOptions();
	// Set format DOC
	saveOption.setFormat(DocSaveOptions.DocFormat.Doc);
	// Set the recognition mode as Flow
	saveOption.setMode(DocSaveOptions.RecognitionMode.Flow);
	// Enable the value to recognize bullets during conversion process
	saveOption.setRecognizeBullets(true);
	// Load source PDF file
	Document pdfDocument = new Document(pdf);	
	// Create a new Document object to divide the entire document
	Document newDoc = new Document();
	// Create new Document to append each block converted
	com.aspose.words.Document dstDocx = new com.aspose.words.Document();
	// Delete first default page
	dstDocx.removeAllChildren();	
	// Dividing block of 400 pages
	int counter = 0, index = 0;
	for(int i = 1; i <= pdfDocument.getPages().size(); i++)
	{
		newDoc.getPages().add(pdfDocument.getPages().get_Item(i));
		counter++;
		// Convert once the block holds 400 pages or the last page is reached
		if(counter == 400 || i == pdfDocument.getPages().size()) {
			// Conversion from PDF to .doc with Aspose.PDF
			newDoc.save(pathDOC + "_" + ++index + ".doc", saveOption);
			// Load the file to be converted from .doc to .docx with Aspose.Words
			com.aspose.words.Document document = new com.aspose.words.Document(pathDOC + "_" + index + ".doc");	
			// Save document in Words
			document.save(pathDOC + "_" + index + ".docx", SaveFormat.DOCX);
			// Load source & destination documents
			com.aspose.words.Document srcDoc = new com.aspose.words.Document(pathDOC + "_" + index + ".docx");
			// set the appended document to start from a new page
			srcDoc.getFirstSection().getPageSetup().setSectionStart(SectionStart.NEW_PAGE);
			// append the source document using its original styles
			dstDocx.appendDocument(srcDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);
			// Create a new instance to save performance
			newDoc = new Document();
			counter = 0;
		 }
	}
	// Save final result
	dstDocx.save(pathDOC + ".docx");
	/* Delete temp file */
	deleteTempFile(pathDOC, index);
}

/* Delete temp file doc and docx */
public void deleteTempFile(String pathDOC, int index) {
	for(; index > 0; index--) {
		try {
			Files.delete(Paths.get(pathDOC + "_" + index + ".doc"));
			Files.delete(Paths.get(pathDOC + "_" + index + ".docx"));
		} catch (NoSuchFileException x) {
			System.err.format("%s: no such" + " file or directory%n", pathDOC + "_" + index );
		} catch (DirectoryNotEmptyException x) {
			System.err.format("%s not empty%n", pathDOC + "_" + index );
		} catch (IOException x) {
			// File permission problems
			System.err.println(x);
		}
	}
}

}

Thank you for your help.
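
As a side note on the cleanup step, `Files.deleteIfExists` from the standard library returns `false` instead of throwing when a file is missing, which removes the need for the `NoSuchFileException` branch and lets the .doc and .docx deletions succeed or fail independently. A minimal sketch under the same file-naming scheme as above; the class name `TempPartCleaner` is made up for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TempPartCleaner {

    /**
     * Deletes basePath_1.doc/.docx through basePath_parts.doc/.docx if present.
     * Returns how many files were actually removed.
     */
    public static int deleteParts(String basePath, int parts) {
        int deleted = 0;
        for (int index = 1; index <= parts; index++) {
            for (String ext : new String[] {".doc", ".docx"}) {
                Path file = Paths.get(basePath + "_" + index + ext);
                try {
                    if (Files.deleteIfExists(file)) {
                        deleted++;
                    }
                } catch (IOException e) {
                    // e.g. permission problems: report and continue with the rest
                    System.err.println("Could not delete " + file + ": " + e);
                }
            }
        }
        return deleted;
    }
}
```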

asad.ali · January 17, 2019, 3:33pm

@manuel.corini

Thanks for your acknowledgment.

It is good to know that your issue has been resolved. Please keep using our API, and if you need any further assistance, feel free to let us know.

aspose.notifier · May 26, 2020, 7:02pm

The issues you have found earlier (filed as PDFJAVA-38180) have been fixed in Aspose.PDF for Java 20.5.