Bold Content Extraction

Hamza_Ghojaria · January 22, 2025, 12:52pm

We currently hold a permanent license for Aspose.PDF. We would like to know whether Aspose.PDF has a feature for extracting bold text that we can use under our current license, or whether you can suggest an approach.

It would be of great help. Documentation links or suggestions would also help. Please let me know if you have any questions.

asad.ali · January 22, 2025, 5:03pm

@Hamza_Ghojaria

Can you please share a sample PDF document with us so that we can test a code sample for extracting bold text from it and share our feedback with you accordingly?

Hamza_Ghojaria · January 23, 2025, 1:26pm

Could you please download the PDF from this link and look at the bold text on page 97?

asad.ali · January 23, 2025, 4:06pm

@Hamza_Ghojaria

Please check whether the sample code snippet below helps:

// Requires: using System; using Aspose.Pdf; using Aspose.Pdf.Text;
Document pdfDocument = new Document(dataDir + "pcaob-release-no.-2023-003---noclar.pdf");

// Collect the text fragments on page 97 (the Pages collection is 1-based).
// To scan the whole document, wrap this in: foreach (Page page in pdfDocument.Pages)
var textFragmentAbsorber = new TextFragmentAbsorber();
pdfDocument.Pages[97].Accept(textFragmentAbsorber);
TextFragmentCollection textFragments = textFragmentAbsorber.TextFragments;

foreach (TextFragment item in textFragments)
{
    // FontStyles values can be combined (e.g. bold italic), so test the Bold bit
    if ((item.TextState.FontStyle & FontStyles.Bold) != 0)
    {
        Console.WriteLine(item.Text);
    }
}

Hamza_Ghojaria · January 23, 2025, 5:07pm

Can you please help with the same in Python?

asad.ali · January 24, 2025, 12:11pm

@Hamza_Ghojaria

Please try the code sample below:

import aspose.pdf as ap

# Load the PDF document
document = ap.Document("input.pdf") 

# Instantiate a TextFragmentAbsorber object
txtAbsorber = ap.text.TextFragmentAbsorber()

# Search text
document.pages[97].accept(txtAbsorber) 

# Get reference to the found text fragments
textFragmentCollection = txtAbsorber.text_fragments

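# Print the text of each fragment whose font style includes bold
for txtFragment in textFragmentCollection:
    # 1 is the integer value of Bold; testing the bit also matches 3,
    # which appears to be bold combined with italic
    if txtFragment.text_state.font_style & 1:
        print(txtFragment.text)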

Hamza_Ghojaria · January 24, 2025, 1:53pm

import aspose.pdf as ap

# Load the PDF document
document = ap.Document("input.pdf")

# Instantiate a TextFragmentAbsorber object
txtAbsorber = ap.text.TextFragmentAbsorber()

# Search text
document.pages[97].accept(txtAbsorber)

# Get reference to the found text fragments
textFragmentCollection = txtAbsorber.text_fragments

# Parse all the found text fragments and print the bold ones
for txtFragment in textFragmentCollection:
    print(txtFragment.text_state.font_style)
    print(txtFragment.text)

    if txtFragment.text_state.font_style == FontStyles.BOLD:
        print(txtFragment.text)

txtFragment.text_state.font_style returns a value of 0, 1, 2, or 3, as you can see from the print statements above. How can it be compared against FontStyles.Bold? Also, FontStyle.Bold is not valid. Can you please share updated code?

asad.ali · January 24, 2025, 8:08pm

@Hamza_Ghojaria

The integer values for the styles are as follows:

Regular = 0
Bold = 1
Italic = 2

We have updated the code snippet as well.
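
For anyone wondering about the value 3 reported earlier in the thread: a minimal sketch, assuming the style values combine as bit flags (so 3 would be Bold + Italic, i.e. 1 | 2). The constants and the helper names `is_bold` and `bold_texts` are hypothetical and not part of Aspose.PDF; the stand-in fragments only mimic the `text` and `text_state.font_style` attributes used in the snippets above, so the example runs without a PDF.

```python
from types import SimpleNamespace

# Integer values as listed in this thread (local constants, not the Aspose enum)
REGULAR = 0
BOLD = 1
ITALIC = 2


def is_bold(font_style) -> bool:
    """Return True when the Bold bit is set in a font_style value.

    Assumes styles combine as bit flags, so 3 means Bold | Italic.
    """
    return bool(int(font_style) & BOLD)


def bold_texts(fragments):
    """Return the text of every fragment whose style includes Bold."""
    return [f.text for f in fragments if is_bold(f.text_state.font_style)]


def make_fragment(text, font_style):
    """Build a stand-in object shaped like a text fragment (illustration only)."""
    return SimpleNamespace(text=text, text_state=SimpleNamespace(font_style=font_style))


sample = [
    make_fragment("plain", REGULAR),
    make_fragment("Heading", BOLD),
    make_fragment("emphasis", ITALIC),
    make_fragment("Note", BOLD | ITALIC),
]

print(bold_texts(sample))  # ['Heading', 'Note']
```

With a real document, passing `txtAbsorber.text_fragments` to `bold_texts` in place of `sample` should give the bold text on the page, provided `font_style` behaves like an integer as the printed values suggest.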