Evaluation of the Aspose.Pdf API

Deepsa · July 28, 2017, 5:18am

Hi,

We are evaluating “Aspose.Pdf” and would like to know whether it supports the following features:

PDF page information like width, height
Word level text extraction
Word style information like Font-family, Font-size, Font-Color
Font extraction
Background image extraction
Hidden text identification
Vertical text identification
PDF update/create

It would be great if you could provide samples for these.

This Topic is created by imran.rafique using the Email to Topic plugin.

imran.rafique · July 28, 2017, 10:58am

@Niit_deependra.khangarot,
Thank you for contacting support.

Sample code to get the page size of a PDF document:
[C#]

// import a PDF document
var document = new Document("blah.pdf");
// get width and height of the page by index
double width = document.Pages[1].Rect.Width;
double height = document.Pages[1].Rect.Height;

Please refer to this help topic: Manipulate Page in a PDF File
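
The width and height read above are expressed in PDF points, assuming the default user space unit of 1/72 inch. A minimal sketch of converting such values to inches, or to pixels at a chosen resolution, could look like the following. The `PageUnits` class and its method names are hypothetical helpers for illustration and are not part of Aspose.Pdf.

```csharp
using System;

// Hypothetical helper, not part of Aspose.Pdf.
// Assumes the default PDF user space unit of 1/72 inch per point.
public static class PageUnits
{
    private const double PointsPerInch = 72.0;

    // Length in inches for a length given in points.
    public static double PointsToInches(double points)
    {
        return points / PointsPerInch;
    }

    // Length in pixels for a length given in points, at the given resolution (dots per inch).
    public static double PointsToPixels(double points, double dpi)
    {
        return points * dpi / PointsPerInch;
    }
}
```

For example, a US Letter page is 612 x 792 points, so `PageUnits.PointsToPixels(612, 300)` gives 2550 pixels and `PageUnits.PointsToInches(792)` gives 11 inches.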

For word-level text extraction and word style information, please refer to these help topics: Extract Text from PDF and Formatting PDF Document
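
As a rough sketch of how word-level extraction with style information might look, the following uses a `TextFragmentAbsorber` with a regular expression that matches each run of non-whitespace characters. The file name `input.pdf` and the class name `WordStyleDump` are placeholders, and treating every `\S+` match as a "word" is an assumption; a match that spans differently styled segments may report only one style.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class WordStyleDump
{
    static void Main()
    {
        // "input.pdf" is a placeholder path
        Document document = new Document("input.pdf");

        // Regular-expression search: \S+ matches each run of non-whitespace characters
        TextFragmentAbsorber absorber =
            new TextFragmentAbsorber(@"\S+", new TextSearchOptions(true));
        document.Pages.Accept(absorber);

        foreach (TextFragment word in absorber.TextFragments)
        {
            TextFragmentState state = word.TextState;
            // The color is printed as-is; later in this thread ToRgb() is suggested for an RGB value
            Console.WriteLine("{0} | font={1} size={2} color={3}",
                word.Text, state.Font.FontName, state.FontSize, state.ForegroundColor);
        }
    }
}
```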

Sample code to retrieve fonts:
[C#]

Document pdf = new Document(@"c:\temp\test_pdfextractor.pdf");
for (int i = 1; i <= pdf.Pages.Count; i++)
{
    foreach (Aspose.Pdf.Text.Font font in pdf.Pages[i].Resources.Fonts)
        Console.WriteLine(font.FontName);
}

Please refer to this help topic: Introduction to the DOM API

Kindly send us your source PDF; we will investigate and share our findings with you. We await your response.

Best Regards,
Imran Rafique

codewarior · July 31, 2017, 6:00pm

@Deepsa,

Your queries are answered in the post above. Should you have any further queries, please feel free to contact us.

Deepsa · August 3, 2017, 4:54am

@imran.rafique , Thanks for the reply.

@codewarior,

Most of our queries are resolved, but we still have a few more:

Font style (Weight, Color, Size):

How can we get the font style for a particular word?
We tried to get the color, but the extracted color differs from the original.

Vertical text identification

A sample file is attached. There are three lines of vertical text near figure 1.
How can we get the transformation information for them?

Line height calculation

How can we calculate the line height?

Word spacing

Can word spacing be extracted?

Output for different resolution

Is it possible to get extraction for different resolutions?

I have tried Aspose PDF to HTML conversion, and its output contains all the required information, which means there must be a way to get all of this. Maybe we are just not finding the right methods or properties.

Please provide some samples too.

Thanks

VerticalText.pdf (61.8 KB)

codewarior · August 3, 2017, 7:24pm

@Deepsa,

Thanks for sharing the details.

Regarding font style, please visit the following link for the required information on how to search and get text from all the pages of a PDF document, along with formatting information. If you still face any issue, please share the input PDF file.

Regarding vertical text, I have tested text extraction, but I am afraid the API is currently not able to extract rotated text instances. To have this corrected, I have logged it as PDFNET-43152 in our issue tracking system.

Regarding line height calculation, I am afraid this feature is currently not supported. However, to have it implemented, I have logged it as PDFNET-43151. We will look further into the details and keep you updated on the status. Please be patient and spare us a little time. We are sorry for this inconvenience.

For word spacing, please try using the following property:
textFragment.TextState.WordSpacing
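
To show where this property sits in context, here is a minimal sketch that prints the word spacing of every text fragment on the first page. The file name `input.pdf` and the class name `WordSpacingDump` are placeholders; the absorber and property names are those already used elsewhere in this thread.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class WordSpacingDump
{
    static void Main()
    {
        // "input.pdf" is a placeholder path
        Document document = new Document("input.pdf");

        // With no search phrase, the absorber collects every text fragment
        TextFragmentAbsorber absorber = new TextFragmentAbsorber();
        document.Pages[1].Accept(absorber); // pages are indexed from 1

        foreach (TextFragment textFragment in absorber.TextFragments)
        {
            Console.WriteLine("{0} | word spacing = {1}",
                textFragment.Text, textFragment.TextState.WordSpacing);
        }
    }
}
```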

Regarding output for different resolutions, can you please share some further details about this requirement so that we can reply accordingly?

Deepsa · August 10, 2017, 10:20am

@codewarior, thanks for the reply.

We have one more issue:

We are not able to extract the actual font color of the text. For the PDFs we have, it comes out different from the original color.

Is there any conversion formula to get the exact color?
We have tried some other PDFs, and their text color was extracted properly.

I have attached the sample page.

Sample.pdf (1.8 MB)

imran.rafique · August 10, 2017, 8:16pm

@Deepsa,
You can get the original color of each text element as below:

[C#]

// Open document
Document pdfDocument = new Document(@"C:\Pdf\test213\Sample.pdf");

// Create a TextFragmentAbsorber with no search phrase so that it collects all text fragments
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();

// Accept the absorber for all the pages
pdfDocument.Pages.Accept(textFragmentAbsorber);

// Get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

// Loop through the fragments
foreach (TextFragment textFragment in textFragmentCollection)
{
    Aspose.Pdf.Color color = textFragment.TextState.ForegroundColor;
    //color.ToRgb();
    Console.WriteLine("Text : {0} ", textFragment.Text);
}

The ToRgb() method of the Color class converts the color to RGB. We have tested your PDF with the latest version, 17.8, and could not reproduce the issue of incorrect color codes. If this does not help, then kindly share your code snippet; we will investigate and share our findings with you.

Best Regards,
Imran Rafique