Is there a Word parsing demo that extracts the full text, images, flowcharts, tables, and hyperlinks together?

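I need to extract the complete contents of a Word document and output them as structured data, for example as JSON or Markdown. Taking JSON as the example: does Aspose.Words have a complete demo that gathers the text, images, flowcharts, tables, and hyperlinks in one place?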

@kevinsen

Cause

To extract comprehensive content from a Word document, including text, images, flowcharts, tables, and hyperlinks, you need to utilize the capabilities of Aspose.Words. However, there may not be a single demo that covers all these elements in one go.

Solution

You can achieve this by combining various features of the Aspose.Words library. Here’s a structured approach to extract the required elements and output them in a structured format like JSON or Markdown.

Code

Here’s a sample code snippet that demonstrates how to extract text, images, tables, and hyperlinks from a Word document. It does not handle flowcharts:

import com.aspose.words.*;

import java.util.ArrayList;

public class ExtractContent {
    public static void main(String[] args) throws Exception {
        Document doc = new Document("input.docx");
        ArrayList<String> extractedContent = new ArrayList<>();

        // Extract text
        for (Paragraph paragraph : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
            extractedContent.add(paragraph.getText());
        }

        // Extract images
        for (Shape shape : (Iterable<Shape>) doc.getChildNodes(NodeType.SHAPE, true)) {
            if (shape.hasImage()) {
                extractedContent.add("Image: " + shape.getImageData().getImageType());
            }
        }

        // Extract tables
        for (Table table : (Iterable<Table>) doc.getChildNodes(NodeType.TABLE, true)) {
            extractedContent.add("Table: " + table.toString(SaveFormat.TEXT));
        }

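        // Extract hyperlinks. Fields are read from the document range, because
        // FIELD_START nodes are FieldStart objects and cannot be cast to Field.
        for (Field field : doc.getRange().getFields()) {
            if (field.getType() == FieldType.FIELD_HYPERLINK) {
                FieldHyperlink hyperlink = (FieldHyperlink) field;
                // getResult() is the displayed text, getAddress() is the link target
                extractedContent.add("Hyperlink: " + hyperlink.getResult() + " -> " + hyperlink.getAddress());
            }
        }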

        // Print the collected items (this is Java's list notation, not JSON)
        System.out.println(extractedContent);
    }
}
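Printing the list with System.out.println gives Java's list notation rather than JSON. As a rough sketch of the JSON step, using only the standard library: the class name JsonOutputSketch, the helper item(), and the "type"/"value" keys are made up for illustration and are not part of Aspose.Words. Control characters are escaped because extracted Word text can contain them (paragraph breaks, for example). In a real project a JSON library would likely be a better choice than hand-written escaping.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class JsonOutputSketch {
    // Escape a string so it can sit inside a JSON string literal.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters become six-character hex escapes.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Hypothetical record shape: one entry per extracted element.
    static Map<String, String> item(String type, String value) {
        Map<String, String> m = new LinkedHashMap<>(); // keeps insertion order
        m.put("type", type);
        m.put("value", value);
        return m;
    }

    // Serialize the items as a compact JSON array of flat objects.
    static String toJson(List<Map<String, String>> items) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < items.size(); i++) {
            if (i > 0) sb.append(",");
            sb.append("{");
            int k = 0;
            for (Map.Entry<String, String> e : items.get(i).entrySet()) {
                if (k++ > 0) sb.append(",");
                sb.append('"').append(escape(e.getKey())).append("\":\"")
                  .append(escape(e.getValue())).append('"');
            }
            sb.append("}");
        }
        return sb.append("]").toString();
    }

    public static void main(String[] args) {
        // Placeholder values standing in for what the extraction loops collect.
        List<Map<String, String>> items = new ArrayList<>();
        items.add(item("paragraph", "First paragraph\r"));
        items.add(item("image", "PNG"));
        items.add(item("hyperlink", "https://example.com"));
        System.out.println(toJson(items));
    }
}
```

To use it with the extraction code, each extractedContent.add(...) call would add an item(type, value) to a List<Map<String, String>> instead of a prefixed string.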

Next Steps

  1. Adjust the code to fit your specific requirements, such as the output format (JSON or Markdown).
  2. Test the code with different Word documents to ensure it captures all desired elements.
  3. If you need further customization or specific features, refer to the Aspose.Words documentation for additional methods and classes.
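
For the Markdown option in step 1, a table could be rendered roughly as follows. This is only a sketch under assumptions: the class MarkdownTableSketch and its methods are hypothetical, the cell texts are assumed to have been collected into rows of strings already, and the first row is treated as the header because Markdown pipe tables need one.

```java
import java.util.Arrays;
import java.util.List;

final class MarkdownTableSketch {
    // Escape pipes and flatten line breaks so a cell stays on one table row.
    static String cell(String text) {
        return text.replace("|", "\\|").replaceAll("[\\r\\n]+", " ").trim();
    }

    // Render rows as a Markdown pipe table, treating the first row as the header.
    static String toMarkdown(List<List<String>> rows) {
        if (rows.isEmpty()) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (int r = 0; r < rows.size(); r++) {
            sb.append("|");
            for (String c : rows.get(r)) {
                sb.append(" ").append(cell(c)).append(" |");
            }
            sb.append("\n");
            if (r == 0) {
                // Separator line required between the header and the body rows.
                sb.append("|");
                for (int i = 0; i < rows.get(0).size(); i++) {
                    sb.append(" --- |");
                }
                sb.append("\n");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Placeholder cell texts standing in for values read from a Word table.
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("Name", "Value"),
            Arrays.asList("a|b", "1"));
        System.out.print(toMarkdown(rows));
    }
}
```

Merged cells and nested tables are not handled here; they would need extra logic or a different output format.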

This approach will effectively help you extract all necessary content from your Word documents. If you have any further questions or need assistance, feel free to ask!

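Thank you!
I could not find anything in the official documentation about getting the coordinates (position) of images in a Word document. Could you provide a code demo? Thanks!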

@kevinsen In this case, you need to use LayoutCollector and LayoutEnumerator. Here is the code (C#):

Document doc = new Document("input.docx");
LayoutCollector layoutCollector = new LayoutCollector(doc);
LayoutEnumerator layoutEnumerator = new LayoutEnumerator(doc);

Shape shape = (Shape)doc.GetChild(NodeType.Shape, 0, true);
Object entity = layoutCollector.GetEntity(shape);

layoutEnumerator.Current = entity;

// Get rendered rectangle of shape (position and size on the page).
System.Drawing.RectangleF rect = layoutEnumerator.Rectangle;

Console.WriteLine($"X={rect.Left}; Y={rect.Top}; Width={rect.Width}; Height={rect.Height}");
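
As far as I understand the Aspose.Words documentation, the rectangle is reported in points (1/72 inch) relative to the top-left corner of the page; please verify this for your version. If you need pixel coordinates, for example to match a rendered page image, a conversion could look like the sketch below. The class name, the sample numbers, and the 96 DPI resolution are assumptions for illustration only.

```java
final class PointsToPixelsSketch {
    // Convert a length in points (1/72 inch) to pixels at the given resolution.
    static double toPixels(double points, double dpi) {
        return points * dpi / 72.0;
    }

    public static void main(String[] args) {
        // Hypothetical values in points, standing in for rect.Left, rect.Top,
        // rect.Width and rect.Height from the layout code.
        double left = 72.0, top = 144.0, width = 216.0, height = 108.0;
        double dpi = 96.0; // assumed target resolution

        System.out.println("X=" + toPixels(left, dpi)
            + "; Y=" + toPixels(top, dpi)
            + "; Width=" + toPixels(width, dpi)
            + "; Height=" + toPixels(height, dpi));
    }
}
```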