Pdf to word conversion - ignore spacing difference

imsureshbala · May 18, 2020, 9:12am

Hi Team,

When I convert a PDF to Word (using the save method), I get a lot of runs (one for every word) in a paragraph, even after I call joinRunsWithSameFormatting. This complicates my processing logic, as every run is converted to a span.

When I analyzed the output (after saving the converted document object in WORD_ML format), the only difference I could see between the runs was the w:spacing value. Is there any way to configure joinRunsWithSameFormatting, or the PDF to Word conversion itself, to ignore these differences?

The paragraph in the PDF starts as follows, and the WordML produced for its first words is shown after it:

In late 1996 year, there are lot of conflicts between countries on legal transaction of money laundering

			<w:r>
				<w:rPr>
					<w:rFonts w:ascii="Garamond" />
					<w:color w:val="000000" />
					<w:spacing w:val="0" />
					<w:sz w:val="20" />
				</w:rPr>
				<w:t>In</w:t>
			</w:r>
			<w:r>
				<w:rPr>
					<w:rFonts w:ascii="Garamond" />
					<w:color w:val="000000" />
					<w:spacing w:val="11" />
					<w:sz w:val="20" />
				</w:rPr>
				<w:t> </w:t>
			</w:r>
			<w:r>
				<w:rPr>
					<w:rFonts w:ascii="Garamond" />
					<w:color w:val="000000" />
					<w:spacing w:val="0" />
					<w:sz w:val="20" />
				</w:rPr>
				<w:t>late</w:t>
			</w:r>
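
A quick way to confirm that w:spacing is the only property that varies is to list the value for every run in the saved WordML. The following is a minimal sketch using only the JDK's DOM parser; the class name `SpacingReport` and the helper methods are hypothetical and not part of any Aspose API, and the fragment in `main` is a trimmed copy of the runs shown above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class SpacingReport {

    /** Returns the w:spacing value of every run in document order ("" if a run has none). */
    public static List<String> runSpacings(String wordMlFragment) {
        try {
            // The default factory is not namespace-aware, so "w:r" is treated as a
            // plain element name and the fragment need not declare the w: prefix.
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            Document dom = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(wordMlFragment.getBytes(StandardCharsets.UTF_8)));

            List<String> spacings = new ArrayList<>();
            NodeList runs = dom.getElementsByTagName("w:r");
            for (int i = 0; i < runs.getLength(); i++) {
                spacings.add(spacingOf((Element) runs.item(i)));
            }
            return spacings;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse WordML fragment", e);
        }
    }

    /** Looks only at w:r/w:rPr/w:spacing, so paragraph-level w:spacing is ignored. */
    private static String spacingOf(Element run) {
        for (Node rPr = run.getFirstChild(); rPr != null; rPr = rPr.getNextSibling()) {
            if (!"w:rPr".equals(rPr.getNodeName())) {
                continue;
            }
            for (Node prop = rPr.getFirstChild(); prop != null; prop = prop.getNextSibling()) {
                if ("w:spacing".equals(prop.getNodeName())) {
                    return ((Element) prop).getAttribute("w:val");
                }
            }
        }
        return "";
    }

    /** Builds one run element with the given text and spacing value. */
    private static String run(String text, String spacing) {
        return "<w:r><w:rPr><w:spacing w:val=\"" + spacing + "\" /></w:rPr>"
                + "<w:t>" + text + "</w:t></w:r>";
    }

    public static void main(String[] args) {
        String fragment = "<w:p>" + run("In", "0") + run(" ", "11") + run("late", "0") + "</w:p>";
        System.out.println(runSpacings(fragment)); // prints [0, 11, 0]
    }
}
```

If the list alternates between 0 for words and a non-zero value for the spaces, as it does here, the spacing attribute alone is what keeps adjacent runs from being considered identical.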

asad.ali · May 18, 2020, 4:38pm

@imsureshbala

Would you kindly share your sample input and output documents with us along with the code snippet that you are using for conversion. We will test the scenario in our environment and address it accordingly.

imsureshbala · May 19, 2020, 11:01am

sample_pdf.pdf (29.8 KB)

com.aspose.pdf.Document mypdf = new com.aspose.pdf.Document(inStream);
DocSaveOptions myoption = new DocSaveOptions();
myoption.setMode(DocSaveOptions.RecognitionMode.Flow);
myoption.setMaxDistanceBetweenTextLines(3.5f);
myoption.setFormat(DocSaveOptions.DocFormat.DocX);
myoption.setAddReturnToLineEnd(false);
mypdf.save(outStream, myoption );

Hi team,
I have attached the sample PDF and the code used to convert it to DOCX. After the conversion, if you process the Word document paragraph by paragraph and run by run, you can see that a separate run is created for every word.

rameshkb · May 19, 2020, 5:01pm

Hi Team,
I am working with @imsureshbala. We convert the PDF to Word (DOCX) and then walk the resulting document object to extract details. As mentioned in the original post, the problem is that the text is split into a separate run for each word. In the intermediate WordML, the spaces between words are emitted as separate runs whose "rPr" has "w:spacing" set to a non-zero number, whereas the "rPr" of the runs holding actual text has it set to zero. We are not sure whether this points towards a solution.

Can you please help us understand what data in the PDF contributes to this spacing value, and how we could work around it in our processing? We tried checking the WordSpacing of the text fragments in the PDF content, but that did not help either.

Interestingly, there is a similar thread that was raised in the Aspose.Words forum years ago, and it closely matches what we are seeing in our PDF to DOCX conversion (the first stage of our flow). Perhaps we need an equivalent solution in PDF processing as well?

asad.ali · May 19, 2020, 8:26pm

@rameshkb

Thanks for sharing the sample file and details.

Would you please share how you are processing the output document and what are the steps to notice the issue?

rameshkb · May 19, 2020, 8:37pm

Thanks @asad.ali. Here is the flow of steps:

1. Get an Aspose.PDF document object from the input file.
2. Save it as DOCX with the options @imsureshbala mentioned in his response.
3. From the converted stream, create an Aspose.Words document object.
4. Process the document object model by iterating through sections/paragraphs/runs. (The idea is to run the same algorithm we already have for Word documents.)

The issue is that, when a PDF document is processed, the Run nodes coming out of the above steps after the DOCX conversion are split one run per word, instead of one run per sequence of similarly formatted text.

Please let us know if additional information is needed to understand the issue.

asad.ali · May 19, 2020, 8:40pm

@rameshkb

Thanks for sharing the details.

Would you please share the code snippet for the above steps. This would help us investigate the issue accordingly.

rameshkb · May 19, 2020, 8:46pm

Sure @asad.ali.

Here is the snippet, continuing after what @imsureshbala had given:

byte[] docBytes = outStream.toByteArray();
ByteArrayInputStream transformedInStream = new ByteArrayInputStream(docBytes);
// Create an Aspose.Words document object from the DOCX stream saved from the PDF earlier.
doc = new com.aspose.words.Document(transformedInStream);

This doc object is then used to iterate/navigate through the structure.

asad.ali · May 20, 2020, 5:04pm

@rameshkb

We were able to observe multiple splits in output DOCX while processing the nodes. We have logged an investigation ticket as PDFJAVA-39424 in our issue tracking system in order to find whether the issue is related to PDF to DOCX conversion or within the Aspose.Words API. We will further let you know as soon as we have some updates in this regard. Please be patient and spare us some time.

We are sorry for the inconvenience.

tahir.manzoor · May 20, 2020, 5:28pm

@rameshkb

Please note that all text in a Word document is stored in runs of text. One paragraph can contain one or multiple Run nodes. Could you please share your expected output? We will then provide you with more information about your query.

rameshkb · May 20, 2020, 5:50pm

Thanks @asad.ali for the acknowledgement. It would be great, if you could get us a solution/workaround at the earliest. We are relying on Aspose for our content processing, and it is critical for a nearing product release.

@tahir.manzoor
Yes. That's exactly what we are looking for: multiple runs inside a paragraph, with each Run holding a block of text with the same/similar formatting. This works well using the Aspose.Words API on original DOCX documents. The scenario we are seeing here is with PDF: we convert the PDF to DOCX using Aspose.PDF and then pass the resulting Words document object along for further processing.

It is this document object, created from the converted PDF, that has the issue: we see a Run element for each word/space within the paragraph.

Please let us know if you need further information.

tahir.manzoor · May 20, 2020, 10:41pm

@rameshkb

When a PDF is converted into DOC/DOCX using Aspose.PDF, it creates a Run node for each word. We will investigate this issue and provide you with more information on it.

You can use the Document.joinRunsWithSameFormatting method, as shown below, to join Run nodes that have the same font formatting. Hope this helps you.

com.aspose.words.Document doc = new com.aspose.words.Document(MyDir + "in.docx");
doc.joinRunsWithSameFormatting();

rameshkb · May 21, 2020, 4:20am

Sure @tahir.manzoor. We have been using joinRunsWithSameFormatting() already in our code, which again works well in the original docx stream, but not on the converted stream from PDF we have in our scenario.

tahir.manzoor · May 22, 2020, 10:24am

@rameshkb

We have logged this problem in our issue tracking system as WORDSNET-20494. You will be notified via this forum thread once this issue is resolved.

We apologize for your inconvenience.

rameshkb · May 26, 2020, 1:57am

Thanks @asad.ali @tahir.manzoor for the acknowledgement.

Any workarounds or resolutions to this problem at the earliest could help us. Thanks for understanding.

asad.ali · May 26, 2020, 7:00pm

@rameshkb

Thanks for your feedback.

We will surely inform you as soon as the logged ticket is resolved. However, please note that it is logged under normal support and will be investigated on a first-come, first-served basis. We have recorded your concerns and will surely consider them during the investigation. Please spare us some time.

We are sorry for the inconvenience.

rameshkb · June 2, 2020, 4:22pm

@tahir.manzoor

One follow-up question on joinRunsWithSameFormatting(). The method was able to join the text in some cases but not in others. Please see the two sets of runs (rPr in WordML) below.

The runs in the first block were joined, whereas the runs in the second block were not joined by the API. The only difference we see in the markup is the "w:spacing" val, which we pointed out earlier.

Can you please look at it and let us know whether this can be controlled, or whether there is a workaround to get this working?

<w:r>
<w:rPr>
<w:rFonts w:ascii="VTDBIP+Times-Bold" />
<w:b />
<w:color w:val="000000" />
<w:spacing w:val="0" />
<w:sz w:val="20" />
</w:rPr>
<w:t>This</w:t>
</w:r>
<w:r>
<w:rPr>
<w:rFonts w:ascii="VTDBIP+Times-Bold" />
<w:b />
<w:color w:val="000000" />
<w:spacing w:val="0" />
<w:sz w:val="20" />
</w:rPr>
<w:t> </w:t>
</w:r>
<w:r>
<w:rPr>
<w:rFonts w:ascii="VTDBIP+Times-Bold" />
<w:b />
<w:color w:val="000000" />
<w:spacing w:val="0" />
<w:sz w:val="20" />
</w:rPr>
<w:t>is</w:t>
</w:r>
<w:r>
<w:rPr>
<w:rFonts w:ascii="VTDBIP+Times-Bold" />
<w:b />
<w:color w:val="000000" />
<w:spacing w:val="0" />
<w:sz w:val="20" />
</w:rPr>
<w:t> </w:t>
</w:r>
<w:r>
<w:rPr>
<w:rFonts w:ascii="VTDBIP+Times-Bold" />
<w:b />
<w:color w:val="000000" />
<w:spacing w:val="0" />
<w:sz w:val="20" />
</w:rPr>
<w:t>dummy</w:t>
</w:r>
<w:r>
<w:rPr>
<w:rFonts w:ascii="VTDBIP+Times-Bold" />
<w:b />
<w:color w:val="000000" />
<w:spacing w:val="0" />
<w:sz w:val="20" />
</w:rPr>
<w:t> </w:t>
</w:r>
<w:r>
<w:rPr>
<w:rFonts w:ascii="VTDBIP+Times-Bold" />
<w:b />
<w:color w:val="000000" />
<w:spacing w:val="0" />
<w:sz w:val="20" />
</w:rPr>
<w:t>text</w:t>
</w:r>
</w:p>

<w:p>
<w:pPr>
<w:spacing w:before="200" w:after="0" w:line="260" w:line-rule="exact" />
<w:ind w:left="0" w:right="858" w:first-line="0" />
<w:jc w:val="left" />
<w:rPr>
<w:rFonts w:ascii="BLEBMU+Times-Roman" />
<w:color w:val="000000" />
<w:spacing w:val="0" />
<w:sz w:val="20" />
</w:rPr>
</w:pPr>
<w:r>
<w:rPr>
<w:rFonts w:ascii="VTDBIP+Times-Bold" />
<w:b />
<w:color w:val="000000" />
<w:spacing w:val="0" />
<w:sz w:val="20" />
</w:rPr>
<w:t>Second</w:t>
</w:r>
<w:r>
<w:rPr>
<w:rFonts w:ascii="VTDBIP+Times-Bold" />
<w:b />
<w:color w:val="000000" />
<w:spacing w:val="38" />
<w:sz w:val="20" />
</w:rPr>
<w:t> </w:t>
</w:r>
<w:r>
<w:rPr>
<w:rFonts w:ascii="VTDBIP+Times-Bold" />
<w:b />
<w:color w:val="000000" />
<w:spacing w:val="0" />
<w:sz w:val="20" />
</w:rPr>
<w:t>bold.</w:t>
</w:r>
<w:r>
<w:rPr>
<w:rFonts w:ascii="VTDBIP+Times-Bold" />
<w:b />
<w:color w:val="000000" />
<w:spacing w:val="38" />
<w:sz w:val="20" />
</w:rPr>
<w:t> </w:t>
</w:r>
<w:r>
<w:rPr>
<w:rFonts w:ascii="BLEBMU+Times-Roman" />
<w:color w:val="000000" />
<w:spacing w:val="0" />
<w:sz w:val="20" />
</w:rPr>
<w:t>This</w:t>
</w:r>
<w:r>

tahir.manzoor · June 2, 2020, 7:15pm

@rameshkb

Could you please ZIP and attach the Word document here for testing? We will investigate this issue and share the workaround if possible. Thanks for your cooperation.

tahir.manzoor · July 1, 2020, 9:19pm

@rameshkb

We would like to inform you that we have closed this issue WORDSNET-16325 as ‘Not a Bug’.

The Word document generated by Aspose.PDF has many Run nodes, and every run contains a spacing attribute with slightly different values such as 0, 1, 2. Aspose.Words joins only runs whose attributes are strictly equal, so these runs remain as is.
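
To illustrate why strict equality leaves these runs unjoined, here is a small self-contained sketch in plain Java. It does not use Aspose.Words; the `RunJoinSketch` class, its `Run` type, and the `join` method are hypothetical stand-ins that model a run as a few formatting fields plus text. It compares adjacent runs either strictly or with the spacing field ignored, using the values from the second paragraph posted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RunJoinSketch {

    /** Hypothetical stand-in for a Word run: a few formatting fields plus text. */
    public static final class Run {
        final String font;
        final boolean bold;
        final int size;
        final int spacing;
        final String text;

        public Run(String font, boolean bold, int size, int spacing, String text) {
            this.font = font;
            this.bold = bold;
            this.size = size;
            this.spacing = spacing;
            this.text = text;
        }

        boolean sameFormatting(Run other, boolean ignoreSpacing) {
            return font.equals(other.font)
                    && bold == other.bold
                    && size == other.size
                    && (ignoreSpacing || spacing == other.spacing);
        }
    }

    /** Merges adjacent runs whose formatting matches and returns the merged texts. */
    public static List<String> join(List<Run> runs, boolean ignoreSpacing) {
        List<String> joined = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Run previous = null;
        for (Run run : runs) {
            // A formatting change closes the current merged run and starts a new one.
            if (previous != null && !previous.sameFormatting(run, ignoreSpacing)) {
                joined.add(current.toString());
                current.setLength(0);
            }
            current.append(run.text);
            previous = run;
        }
        if (previous != null) {
            joined.add(current.toString());
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
                new Run("VTDBIP+Times-Bold", true, 20, 0, "Second"),
                new Run("VTDBIP+Times-Bold", true, 20, 38, " "),
                new Run("VTDBIP+Times-Bold", true, 20, 0, "bold."),
                new Run("VTDBIP+Times-Bold", true, 20, 38, " "),
                new Run("BLEBMU+Times-Roman", false, 20, 0, "This"));

        System.out.println(join(runs, false).size()); // prints 5: strict comparison joins nothing
        System.out.println(join(runs, true).size());  // prints 2: "Second bold. " and "This"
    }
}
```

With Aspose.Words, the analogous workaround would presumably be to reset the character spacing of every run to a single value (for example through each run's Font spacing property) before calling joinRunsWithSameFormatting. That is an assumption not confirmed in this thread, and discarding the spacing may slightly change how the text is laid out.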