Save document structure in PDF

troy.l.parrish · February 17, 2022, 6:21pm

Hi,

We are working on saving our documents as accessible PDFs. We export the document structure and can see the tags, but the tags do not carry the language even though it is specified in the document. I have attached a sample Word document and the resulting PDF. Interestingly, when inspecting the tags in Adobe Acrobat Pro, the language appears in the “content” tab of the tag inspector but not in the “tag” section. This is the code that was used to create the attached PDF:

    com.aspose.words.Document docx = new com.aspose.words.Document("C:/path/to/Desktop/hello.docx");
    PdfSaveOptions pdfSaveOptions = new PdfSaveOptions();
    pdfSaveOptions.setExportDocumentStructure(true);
    docx.save("C:/path/to/Desktop/hello.pdf", pdfSaveOptions);

Are we missing something, or is there a way to get the language from the content tab into the tag tab?

tag.PNG (3.1 KB)
content.PNG (3.3 KB)
Hello.docx (11.3 KB)
hello.pdf (14.6 KB)

Konstantin.Kornilov · February 17, 2022, 6:46pm

@troy.l.parrish Writing the language as an attribute of the PDF marked content sequence (which Acrobat displays on the Content tab) is a valid way to export the language. Accessibility-conforming PDF readers should handle it.
However, you can use the getExportLanguageToSpanTag/setExportLanguageToSpanTag option to export the language as an attribute of the Span tag instead of the marked content sequence, if that is more convenient for you.

troy.l.parrish · February 17, 2022, 7:24pm

Konstantin,
Thanks for the reply. I have discovered setExportLanguageToSpanTag, and it gets me most of the way there. One thing to note about marking the language in the marked content sequence is that I found it impossible to read that data back to determine the language set on that tag. I was iterating over the structure elements obtained with the following Aspose.PDF code:

    Element rootElement = pdfDoc.getTaggedContent().getRootElement();
    List<StructureElement> structuredElements = rootElement.findElements(StructureElement.class, true);

StructureElement.getLanguage() is always null, since it is not set in the tag but in the content sequence. Is there a way to get the language attribute out of the marked content sequence?

When setExportLanguageToSpanTag is used, the language is set in the “tag” tab itself and I can interrogate and find the language, allowing me to manipulate the tag as needed. It would be helpful to not have to set the span tag to accomplish this.
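
For context, a language attached to a marked content sequence is stored in the page content stream as a `/Lang` entry in the property list of a `BDC` operator, not on a structure element, which is consistent with `getLanguage()` returning null. The following is only a rough sketch in plain Java of how such entries could be located. It assumes the content stream has already been decoded to text, that the property list is written inline (not referenced by name through the page's `/Properties` resources), and that the language is a simple literal string without escaped parentheses. The class `MarkedContentLang` and the method `findLanguages` are hypothetical names for illustration and are not part of Aspose.PDF.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an Aspose API: scans decoded content-stream text.
public class MarkedContentLang {
    // Matches "/Tag << ...inline property list... >> BDC".
    private static final Pattern BDC_WITH_PROPS =
        Pattern.compile("/(\\w+)\\s*<<([^>]*)>>\\s*BDC");
    // Matches a literal-string language entry such as "/Lang (en-US)".
    private static final Pattern LANG_ENTRY =
        Pattern.compile("/Lang\\s*\\(([^)]*)\\)");

    // Returns the /Lang value of every marked content sequence that declares one.
    public static List<String> findLanguages(String contentStream) {
        List<String> languages = new ArrayList<>();
        Matcher bdc = BDC_WITH_PROPS.matcher(contentStream);
        while (bdc.find()) {
            Matcher lang = LANG_ENTRY.matcher(bdc.group(2));
            if (lang.find()) {
                languages.add(lang.group(1));
            }
        }
        return languages;
    }

    public static void main(String[] args) {
        // Illustrative content stream: one sequence with a language, one without.
        String stream =
            "/Span <</Lang (en-US) /MCID 0>> BDC\nBT (Hello) Tj ET\nEMC\n"
          + "/P <</MCID 1>> BDC\nBT (World) Tj ET\nEMC";
        System.out.println(findLanguages(stream)); // prints [en-US]
    }
}
```

A regular expression is used here only to keep the sketch short. A real implementation would need a proper content-stream parser to cope with compressed streams, hex strings, and named property lists.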

Konstantin.Kornilov · February 18, 2022, 12:36pm

@troy.l.parrish This question relates to the Aspose.PDF product family, so I have moved this topic to the appropriate forum.

asad.ali · February 18, 2022, 6:56pm

@troy.l.parrish

We need to investigate the feasibility of this requirement, so a ticket, PDFJAVA-41347, has been logged in our issue management system. We will analyze this case further and let you know as soon as the ticket is resolved. Please be patient and allow us some time.

We are sorry for the inconvenience.

aspose.notifier · March 29, 2022, 5:05pm

The issues you have found earlier (filed as PDFJAVA-41347) have been fixed in Aspose.PDF for Java 22.3.