Text Transformation Matrix

Gaurav.A · October 31, 2014, 2:03am

Hi,

I am extracting PDF text to create an HTML document. I am able to extract the text but am facing an issue with rotated text. I need to get the transformation matrix of rotated/skewed text so that I can replicate it in HTML.

I am using the TextFragmentAbsorber class for text extraction. Also, is there any way to get the bold/italic styling information for the text?

Attached is the sample document.

codewarior · October 31, 2014, 6:44am

Gaurav.A:

I am extracting PDF text to create an HTML document. I am able to extract the text but am facing an issue with rotated text. I need to get the transformation matrix of rotated/skewed text so that I can replicate it in HTML.

Hi Gaurav,

Thanks for contacting support.

I am afraid the current release of Aspose.Pdf for .NET cannot return the rotation angle of a text instance. In other words, it cannot determine whether the text is placed vertically or not. To have this addressed, I have logged it in our issue tracking system as PDFNEWNET-37710.

Gaurav.A:

I am using the TextFragmentAbsorber class for text extraction. Also, is there any way to get the bold/italic styling information for the text?

Currently we can set TextFragment formatting to Bold/Italic, but I am afraid the API does not support getting or determining whether a particular TextFragment has Bold/Italic formatting. I have logged this separately as PDFNEWNET-37711 in our issue tracking system. We will investigate these requirements in detail and will keep you updated on the status of a correction.
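
Until such a feature is available, one possible stopgap is to guess the style from the font name, since many PDF fonts carry names such as "Arial-BoldMT" or "Helvetica-Oblique". The sketch below is only a heuristic and not part of the Aspose API: the FontStyleGuess class and its methods are hypothetical helpers, and it assumes you can obtain the font name of a fragment as a plain string. Fonts whose names do not mention their style will not be detected.

```csharp
using System;

// Hypothetical helper (not an Aspose API): guesses bold/italic from a font name.
public static class FontStyleGuess
{
    // True when the font name contains "Bold" (case-insensitive).
    public static bool LooksBold(string fontName) =>
        fontName != null &&
        fontName.IndexOf("Bold", StringComparison.OrdinalIgnoreCase) >= 0;

    // True when the font name contains "Italic" or "Oblique" (case-insensitive).
    public static bool LooksItalic(string fontName) =>
        fontName != null &&
        (fontName.IndexOf("Italic", StringComparison.OrdinalIgnoreCase) >= 0 ||
         fontName.IndexOf("Oblique", StringComparison.OrdinalIgnoreCase) >= 0);
}
```

For example, FontStyleGuess.LooksBold("ABCDEF+Arial-BoldMT") returns true, while a font named simply "Arial" is reported as neither bold nor italic.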

We apologize for the inconvenience.

codewarior · October 31, 2014, 6:47am

Hi Gaurav,

As a workaround, you may consider saving the PDF file in HTML format using the following code snippet. With this approach, the text rotation is preserved, but the contents appear as an image in the resultant file.

[C#]

using Aspose.Pdf;

// open document
Document pdfDocument = new Document("c:/pdftest/sample2.pdf");
// save output in HTML format
pdfDocument.Save("resultant.html", SaveFormat.Html);

asad.ali · August 19, 2018, 7:55pm

@Gaurav.A

Thanks for your patience.

The logged ticket (PDFNET-37710) has been resolved in Aspose.PDF for .NET 18.8. Please download the latest version of the API and use the following code snippet to extract the rotation of the text:

using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// open document (myDir is the folder that contains the sample file)
Document pdfDocument = new Document(myDir + "sample2.pdf");
// create a TextFragmentAbsorber that matches every run of non-whitespace characters
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"[\S]+");
// set text search option to specify regular expression usage
TextSearchOptions textSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.TextSearchOptions = textSearchOptions;
// accept the absorber for all the pages
pdfDocument.Pages.Accept(textFragmentAbsorber);
// get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
// loop through the fragments and print each fragment's text and rotation
foreach (TextFragment textFragment in textFragmentCollection)
{
    Console.WriteLine("Text : {0} ", textFragment.Text);
    Console.WriteLine("Rotation : {0} ", textFragment.TextState.Rotation);
}
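
Since the original goal was to replicate the rotation in HTML, the rotation value can be turned into a CSS transform. The following is a minimal sketch, not an Aspose feature: CssRotation and ToCssMatrix are hypothetical names, and it assumes the rotation is expressed in degrees, counterclockwise, in PDF space (y axis pointing up). If your values turn out to follow a different convention, drop the sign flip. It covers pure rotation only, not skew or scaling.

```csharp
using System;
using System.Globalization;

// Hypothetical helper (not an Aspose API): converts a rotation angle to a CSS matrix().
public static class CssRotation
{
    // Rounds to 6 decimals; adding 0.0 turns a negative zero into a plain zero.
    private static double Clean(double value) => Math.Round(value, 6) + 0.0;

    // Assumption: pdfDegrees is counterclockwise with the y axis pointing up,
    // so the angle is negated for CSS, where the y axis points down.
    public static string ToCssMatrix(double pdfDegrees)
    {
        double radians = -pdfDegrees * Math.PI / 180.0;
        double cos = Math.Cos(radians);
        double sin = Math.Sin(radians);
        // CSS matrix(a, b, c, d, e, f) with no translation (e = f = 0).
        return string.Format(CultureInfo.InvariantCulture,
            "matrix({0}, {1}, {2}, {3}, 0, 0)",
            Clean(cos), Clean(sin), Clean(-sin), Clean(cos));
    }
}
```

The returned string can be used as the value of the CSS transform property on the element holding the text. Note that CSS rotates around the element's center by default, so transform-origin may need adjusting to match the position of the text in the PDF.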

If you still face any issue, please feel free to let us know.

aspose.notifier · February 7, 2019, 4:46pm

The issues you have found earlier (filed as ) have been fixed in this update. This message was posted using BugNotificationTool from Downloads module by MuzammilKhan