Extract document's content based on heading styles using Java

Hi Team,

As per our requirement, we need to extract the contents of a Word document based on the Heading 1 style, but the code does not extract the content for every Heading 1.

Sometimes it extracts only 5 of the 9 Heading 1 sections.

I have attached the Java code and the Word document for your reference.

Extraction with heading 1.zip (6.2 KB)

Thank you!

@purusadh

To ensure a timely and accurate response, please attach the following resources here for testing:

  • Your input Word document.
  • Please attach the output Word file that shows the undesired behavior.
  • Please attach the expected output Word file that shows the desired behavior.
  • Please create a simple Java application (source code without compilation errors) that helps us reproduce your problem on our end, and attach it here for testing.

As soon as you have this information ready, we will start investigating your issue and provide you with more information. Thanks for your cooperation.

PS: To attach these resources, please zip and upload them.

Hi Tahir,

I have shared the input document and sample Java code without compilation errors.
We save the extracted data in a database based on Heading 1, so I am not attaching an output Word file.
As you can see in the code below, the if condition does not find every Heading 1 in the input document.

// This condition does not match every Heading 1 paragraph in the input document
if (para.getParagraphFormat().isHeading()
        && para.getParagraphFormat().getStyle().getName().equals(Constants.HEADING_STYLE)) {
    System.out.println("Document heading 1 :: " + para.getText());
    // Insert an empty paragraph before the heading and mark it with a bookmark
    Paragraph paragraph = new Paragraph(doc);
    para.getParentNode().insertBefore(paragraph, para);
    builder.moveTo(paragraph);
    builder.startBookmark(Constants.BOOKMARK_NAME + i);
    builder.endBookmark(Constants.BOOKMARK_NAME + i);
    i++;
}

Expected output: there are 9 Heading 1 paragraphs in the attached document, so execution should enter the if block 9 times.

Extraction with heading 1.zip (30.9 KB)

Thanks,
Purushottam

@purusadh

Unfortunately, we have not found the input Word document in your shared ZIP file. Please ZIP and attach your input document here for testing. Thanks for your cooperation.

Hi Tahir,

Please find attached the Java code with both input documents.

Extraction with heading 1.zip (1.3 MB)

Thank you!

@purusadh

Your input document contains eight paragraphs with the “Heading 1” style. In “9 heading 1.png” you indicated that the following text has a heading style. However, its style is not “Heading 1”. Please check the attached image for details. Style Heading.png (60.2 KB)

Content of the AIR6516 RSA Model
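
A quick way to confirm which style each paragraph really carries is to tally the style names before filtering on “Heading 1”. The sketch below is only an illustration: the class `StyleTally` and the sample style names are hypothetical, and plain strings stand in for the values that `para.getParagraphFormat().getStyle().getName()` would return, so the example runs without the library.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Diagnostic sketch: tally paragraphs by style name to see what a document contains. */
class StyleTally {

    /**
     * Counts how often each style name occurs, keeping first-seen order.
     * In real code the names would be collected from each paragraph via
     * para.getParagraphFormat().getStyle().getName().
     */
    static Map<String, Integer> tally(List<String> styleNames) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : styleNames) {
            counts.merge(name, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical style names standing in for a loaded document.
        List<String> styleNames = Arrays.asList(
                "Heading 1", "Normal", "Heading 1", "Title", "Normal");
        for (Map.Entry<String, Integer> entry : tally(styleNames).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

If the count for “Heading 1” is lower than expected, the remaining entries show which style the missing paragraphs actually use.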

Hi Tahir,

When you run the attached Java program with “AIR6516.docx”, it extracts only 4 Heading 1 paragraphs instead of 8.

Please check and let me know if I missed something.

Thank you!

@purusadh

Perhaps you are using an old version of Aspose.Words. Please use the latest version, Aspose.Words for Java 20.2. We have attached the test result to this post for your reference.
styles.png (30.5 KB)

Hi Tahir,
I have updated Aspose.Words to the latest version 20.0, but I still get the same issue.
Sometimes it extracts all Heading 1 paragraphs, but most of the time the results are incorrect.
Please check the attached screenshot.

result with new aspose lib 20.2.png (55.1 KB)

Thank you!

@purusadh

Please make sure that you are using the same documents that you shared in this thread. We have not found this issue at our end.

Could you please share the steps that you are using at your end? We will investigate the issue and provide you more information on it.

Hi Tahir,

I checked the existing code and found that the “checkLicence()” method was not called in the right place.
I moved the call to the beginning of my logic, and the code now works fine.
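
A minimal sketch of one way to keep that ordering from slipping again: wrap the license setup in a run-once guard and call it as the first statement of the extraction logic. The `LicenseGuard` class is hypothetical and not part of Aspose.Words; the `Runnable` stands in for the `checkLicence()` call so the example runs without the library. A plausible explanation for the earlier inconsistent results, not confirmed in this thread, is that the document was sometimes processed before the license was applied, so the library was running in evaluation mode, which limits how much of a document is processed.

```java
/** Sketch of a run-once guard so license setup always precedes document work. */
class LicenseGuard {
    private final Runnable licenseSetup; // e.g. a call to checkLicence()
    private boolean applied = false;

    LicenseGuard(Runnable licenseSetup) {
        this.licenseSetup = licenseSetup;
    }

    /** Runs the setup on the first call; later calls do nothing. */
    synchronized void ensureApplied() {
        if (!applied) {
            licenseSetup.run();
            applied = true; // only reached if the setup did not throw
        }
    }

    synchronized boolean isApplied() {
        return applied;
    }

    public static void main(String[] args) {
        // Placeholder setup; real code would call checkLicence() here.
        LicenseGuard guard = new LicenseGuard(() -> System.out.println("license applied"));
        guard.ensureApplied(); // first statement of the extraction logic
        guard.ensureApplied(); // safe to repeat; the setup is not run again
        // Load the document and scan for Heading 1 paragraphs after this point.
    }
}
```

Because the flag is set only after the setup returns normally, a failed license load is retried on the next call instead of being silently skipped.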

Thanks,
Purushottam

@purusadh

Thanks for your feedback. Please feel free to ask if you have any questions about Aspose.Words; we will be happy to help you.