Issues with converting Word to PDF

serendipity.zhq · December 9, 2024, 1:34am

I converted a Word file to PDF with Aspose.Words and then used the following code to turn all the text in the PDF black. However, the Chinese double quotes were lost in the processed file. What is the reason? When I generated the PDF with other products and processed it with the same code, the double quotes did not disappear.

jar:words-24.3-jdk17.jar
jdk:1.8
code:

    private static void convertContentToBlackAndWhite(PDPage page, PDDocument document) throws IOException {

        COSDictionary pageDict = page.getCOSObject();
        COSBase contents = pageDict.getDictionaryObject(COSName.CONTENTS);

        if (contents instanceof COSStream) {
            processContentStream((COSStream) contents, page, document);
        } else if (contents instanceof COSArray) {
            for (COSBase base : (COSArray) contents) {
                if (base instanceof COSStream) {
                    processContentStream((COSStream) base, page, document);
                }
            }
        }
    }

    private static void processContentStream(COSStream cosStream, PDPage page, PDDocument document) throws IOException {

        byte[] streamBytes = toByteArray(cosStream.getUnfilteredStream());
        String content = new String(streamBytes, StandardCharsets.UTF_8);

        Pattern pattern = Pattern.compile("([0-9.]+) ([0-9.]+) ([0-9.]+) rg");
        Matcher matcher = pattern.matcher(content);

        StringBuffer result = new StringBuffer();

        while (matcher.find()) {

            String replacement;
            if (Objects.equals(matcher.group(1), "1") && Objects.equals(matcher.group(2), "0") && Objects.equals(matcher.group(3), "0")){
                replacement = String.format("%d %d %d rg", 0, 0, 0);
            }else {
                float r = Float.parseFloat(matcher.group(1)) * 0.2989f;
                float g = Float.parseFloat(matcher.group(2)) * 0.5870f;
                float b = Float.parseFloat(matcher.group(3)) * 0.1140f;

                replacement = String.format("%.4f %.4f %.4f rg", r+g+b, r+g+b, r+g+b);
            }

            matcher.appendReplacement(result, replacement);
        }
        matcher.appendTail(result);
        PDPageContentStream contentStream = new PDPageContentStream(document, page, PDPageContentStream.AppendMode.OVERWRITE, true, true);

        contentStream.appendRawCommands(result.toString().getBytes());

        contentStream.close();
    }

Why are there different results? Can you help analyze it?

Professionalize.Discourse · December 9, 2024, 1:35am

@serendipity.zhq

It seems you are experiencing an issue with the conversion of a Word document to PDF, specifically regarding the loss of Chinese double quotes. This can occur due to various reasons related to font embedding, character encoding, or the way the text is processed during the conversion.

Here are a few suggestions to troubleshoot and potentially resolve the issue:

1. Font Embedding: Ensure that the fonts used in your Word document are embedded in the PDF. If the font containing the Chinese double quotes is not embedded, it may not render correctly in the PDF. You can check the font settings in your Word document and ensure that the necessary fonts are included during the conversion.
2. Character Encoding: Verify that the character encoding settings are correctly configured. Sometimes, issues with character encoding can lead to the loss of specific characters during conversion. Make sure that the encoding supports the characters you are using.
3. Update Aspose.Words: Ensure that you are using the latest version of Aspose.Words for Java. Updates often include bug fixes and improvements that may resolve issues related to document conversion.
4. Test with Different Documents: Since you mentioned that other products do not have this issue, try converting different Word documents with similar content to see if the problem persists. This can help identify if the issue is document-specific.
5. Consult Documentation: For detailed guidance on handling different document formats and conversion processes, refer to the Aspose.Words documentation. It provides comprehensive information on loading, modifying, and saving documents, which may help you identify any misconfigurations in your code.

If the problem continues after trying these suggestions, consider reaching out to the Aspose support team with specific details about your implementation and the documents you are working with.

For further reference, see the Aspose.Words for Java documentation.

If you have any more questions or need additional assistance, feel free to ask!

serendipity.zhq · December 9, 2024, 1:50am

I had already tried all of the above and ruled out these causes before posting here.

alexey.noskov · December 9, 2024, 6:42am

@serendipity.zhq Unfortunately, it is not quite clear how to reproduce the problem. As far as I can see, a simple conversion of your input document to PDF gives the correct result:

Document doc = new Document("C:\\Temp\\in.docx");
doc.save("C:\\Temp\\out.pdf");

The code you provided above does not use Aspose.Words. Could you please create a simple code example that will allow us to reproduce the problem? Or describe which of your requirements are not fulfilled by the simple Word to PDF conversion.

serendipity.zhq · December 12, 2024, 8:10am

Why do garbled characters appear in the content stream of the PDF file when I read it with the above code? How can I solve it?

alexey.noskov · December 12, 2024, 8:34am

@serendipity.zhq Could you please attach your input and output documents here for testing?
Usually such problems occur when the fonts used in the document are not available. If Aspose.Words cannot find the fonts used in the document, the fonts are substituted. This can lead to layout differences because of differences in font metrics. You can implement IWarningCallback to get a notification when font substitution is performed.
The following articles can be useful for you:
https://docs.aspose.com/words/java/specify-truetype-fonts-location/
https://docs.aspose.com/words/java/install-truetype-fonts-on-linux/

serendipity.zhq · December 12, 2024, 8:41am

Original word document:
test (2).docx (10.1 KB)

pdf generated by aspose:
test1.pdf (7.2 KB)

code:

WordsConvert wordsConvert = new WordsConvert();
wordsConvert.setFontPath(fontPath);
wordsConvert.convertPdf(srcPath, dstFile);

fontPath is a folder containing multiple fonts

alexey.noskov · December 12, 2024, 8:57am

@serendipity.zhq As far as I can see, the symbols are displayed correctly.

serendipity.zhq · December 12, 2024, 9:03am

The PDF file displays normally, but when I use the code to view the PDF content stream, the content around the PDF operators appears garbled. I want to know if this is normal.

alexey.noskov · December 12, 2024, 9:10am

@serendipity.zhq Yes, this is normal. You are reading PDF as TXT.
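
To illustrate why reading a content stream as text can look garbled, and why writing it back can damage characters: a decoded content stream is a sequence of bytes, and string operands may hold raw glyph codes that are not valid UTF-8. The following is a minimal sketch that uses only the Java standard library. It is not taken from the code above. The class name `ContentStreamBytes`, its helper methods, and the sample bytes are hypothetical. It compares a UTF-8 round trip, which replaces invalid sequences, with an ISO-8859-1 round trip, which maps every byte to exactly one char and back, so a regex replacement leaves all other bytes untouched. Whether this is what removed the quotes in the file from this thread is an assumption, not something confirmed here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Pattern;

final class ContentStreamBytes {

    // Matches a non-stroking RGB colour operator such as "1 0 0 rg".
    private static final Pattern RG =
            Pattern.compile("([0-9.]+) ([0-9.]+) ([0-9.]+) rg");

    /** Hypothetical fragment: a colour operator, then a string operand with raw glyph bytes. */
    static byte[] sample() {
        return new byte[] {
            '1', ' ', '0', ' ', '0', ' ', 'r', 'g', ' ', '(',
            (byte) 0x81, (byte) 0xA1, (byte) 0xFE,   // not valid UTF-8
            ')', ' ', 'T', 'j'
        };
    }

    /** Lossy: bytes that are not valid UTF-8 are replaced with U+FFFD on decode. */
    static byte[] roundTripUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
    }

    /** Lossless: ISO-8859-1 maps each byte value 0..255 to one char and back. */
    static byte[] roundTripLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1)
                .getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Sets every "r g b rg" operator to black without altering any other byte. */
    static byte[] blackenFill(byte[] raw) {
        String content = new String(raw, StandardCharsets.ISO_8859_1);
        String result = RG.matcher(content).replaceAll("0 0 0 rg");
        return result.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] raw = sample();
        System.out.println("UTF-8 round trip preserved bytes:   "
                + Arrays.equals(raw, roundTripUtf8(raw)));
        System.out.println("Latin-1 round trip preserved bytes: "
                + Arrays.equals(raw, roundTripLatin1(raw)));
        System.out.println("Length before/after blackenFill:    "
                + raw.length + "/" + blackenFill(raw).length);
    }
}
```

The same idea would apply to the posted method if the stream bytes were decoded and re-encoded with `StandardCharsets.ISO_8859_1` on both sides instead of UTF-8 on the way in and the platform default charset on the way out. That change is a suggestion under the assumptions above and has not been tested against the attached files.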

serendipity.zhq · December 12, 2024, 9:11am

Thanks and have a nice day!