AsposeWords extract images from pdf error

David_Matin · March 10, 2025, 5:43am

I’m using the API to extract images from a PDF file, but nothing is extracted. However, when I use Aspose.Words to convert the same PDF to Markdown, image files are produced alongside the .md file. Why is that? Also, the images obtained from the Markdown conversion do not match what is shown in the PDF: the background becomes black, which causes image recognition to fail.
Extraction code:

import io
import os
import uuid

import aspose.pdf as ap

file_path = '7cdbfd2e-5893-43f1-a892-4eff48069881.pdf'
document = ap.Document(file_path)
# Output folder named after the source file, e.g. "<name>_output"
img_base_path = f"{file_path[:file_path.rfind('.pdf')]}_output"
os.makedirs(img_base_path, exist_ok=True)
for pdfPage in document.pages:
    for image in pdfPage.resources.images:
        try:
            image_format = str(image.grayscaled.raw_format).lower()
        except Exception:
            # Fall back to PNG if the format cannot be determined
            image_format = "png"
        image_name = str(uuid.uuid4()) + '.' + image_format
        # Save each image into the output folder
        image_path = os.path.join(img_base_path, image_name)
        with io.FileIO(image_path, "w") as outputImage:
            image.save(outputImage)

Conversion code:

import aspose.words as aw

# source_file and result are defined elsewhere in my script
document = aw.Document(source_file)
save_options = aw.saving.MarkdownSaveOptions()
save_options.image_resolution = 300
document.save(f"{result}.md", save_options)

7cdbfd2e-5893-43f1-a892-4eff48069881_page1.pdf (163.5 KB)

Professionalize.Discourse · March 10, 2025, 5:43am

@David_Matin

It seems you are experiencing issues with extracting images from a PDF file, and you have also noted inconsistencies in the images produced when converting the PDF to Markdown with Aspose.Words. Here are some insights that may help you resolve these issues:

Image Extraction: The code you provided attempts to extract images from the PDF by accessing the resources of each page. However, if no images are being extracted, it could be due to the way the images are embedded in the PDF. Some images may not be directly accessible as resources, or they may be embedded in a way that requires different handling.
Image Format Handling: In your extraction code, you are trying to determine the image format using image.grayscaled.raw_format. If this fails, you default to PNG. Ensure that the images in the PDF are indeed in a format that can be processed. You might want to check if the images are in a supported format or if they are being processed correctly.
Markdown Conversion: When converting the PDF to Markdown, the images may be processed differently, which could lead to discrepancies in appearance (like background color changes). This can happen if the conversion process applies different rendering settings or if the images are altered during the conversion.
Image Background Color: If the background color of the images changes to black during conversion, it may be due to the color space settings or rendering options used during the conversion process. Since the output here is Markdown, you can try reviewing the image-related settings on the MarkdownSaveOptions used when saving the document to see if that resolves the issue.
Debugging Steps:
- Verify the image extraction logic by checking if the images are indeed present in the PDF using a PDF viewer.
- Experiment with different image formats and extraction methods.
- Review the conversion settings used for Markdown to ensure they align with your expectations for image rendering.
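
For the first debugging step, a rough check along the following lines can show whether the PDF contains image XObjects at all. This is a minimal standard-library sketch, not an Aspose API: the helper name count_image_markers is hypothetical, and it only sees objects that are not stored inside compressed object streams, so a count of zero is not conclusive.

```python
import re

# Image XObjects are declared with "/Subtype /Image" in their dictionary.
# Assumption: the dictionaries are stored uncompressed in the file; objects
# packed into compressed object streams will not be matched by this scan.
IMAGE_XOBJECT = re.compile(rb"/Subtype\s*/Image\b")


def count_image_markers(pdf_bytes: bytes) -> int:
    """Return how many '/Subtype /Image' markers appear in the raw PDF bytes."""
    return len(IMAGE_XOBJECT.findall(pdf_bytes))


# Usage (the path is whatever file you are inspecting):
# with open("input.pdf", "rb") as f:
#     print(count_image_markers(f.read()))
```

A non-zero count suggests the images exist as separate objects in the file, in which case the extraction logic is the more likely place to look.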

If you continue to face issues, consider providing more details about the specific types of images in the PDF and any error messages you encounter. This information can help in diagnosing the problem more effectively.

Feel free to reach out if you need further assistance!

alexey.noskov · March 10, 2025, 5:57am

@David_Matin As far as I can see, you are using Aspose.PDF for image extraction, so I would suggest reporting this problem in the Aspose.PDF forum.
https://forum.aspose.com/c/pdf/10

You are using Aspose.Words for conversion from PDF to MD. You can use Aspose.Words to extract images from your document:

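import aspose.words as aw

doc = aw.Document("C:\\Temp\\in.pdf")

# Walk every shape in the document and save those that carry image data.
i = 0
for node in doc.get_child_nodes(aw.NodeType.SHAPE, True):
    shape = node.as_shape()
    if shape.has_image:
        # Pick the file extension that matches the stored image type
        extension = aw.FileFormatUtil.image_type_to_extension(shape.image_data.image_type)
        shape.image_data.save("C:\\Temp\\image_" + str(i) + extension)
        i = i + 1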

David_Matin · March 10, 2025, 6:01am

OK, let me try your code.

David_Matin · March 10, 2025, 6:08am

@alexey.noskov Why does the extracted picture look like this? It looks different in the PDF.

alexey.noskov · March 10, 2025, 6:52am

@David_Matin Could you please elaborate on what you mean? As far as I can see, the image is extracted as is from the PDF.

David_Matin · March 10, 2025, 7:14am

@alexey.noskov
image.jpg (71.2 KB)

In this picture, the thumbnail has a white background, but after zooming in the background is black. In the PDF, the image appears on a white background.

alexey.noskov · March 10, 2025, 7:18am

@David_Matin The extracted image actually has transparent background.
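
A transparent background is drawn over whatever the viewer puts behind it, which would explain white in one place and black in another. The sketch below is a standard-library illustration only, not part of Aspose.Words: the names png_has_transparency and blend_over_white are hypothetical, and it assumes the extracted file is a PNG. The first function checks the PNG header for an alpha channel or a tRNS chunk, and the second shows the per-pixel arithmetic for flattening an RGBA pixel onto white.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_has_transparency(data: bytes) -> bool:
    """Return True if the PNG declares an alpha channel or a tRNS chunk."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    # Walk the chunks that precede the pixel data: length, type, body, CRC.
    while pos + 8 <= len(data):
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if chunk_type == b"IHDR":
            # Byte 9 of IHDR is the color type: 4 = gray+alpha, 6 = RGBA.
            if len(body) >= 10 and body[9] in (4, 6):
                return True
        elif chunk_type == b"tRNS":
            return True
        elif chunk_type in (b"IDAT", b"IEND"):
            break
        pos += 12 + length  # 4 length + 4 type + body + 4 CRC
    return False


def blend_over_white(pixel):
    """Composite one (r, g, b, a) pixel, 0-255 per channel, onto white."""
    r, g, b, a = pixel
    alpha = a / 255
    return tuple(round(c * alpha + 255 * (1 - alpha)) for c in (r, g, b))
```

If the check returns True, flattening the image onto a white background before passing it to image recognition should reproduce what the PDF viewer shows. In practice an imaging library would apply this blend to the whole image rather than pixel by pixel in Python.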