The attached Word document already has searchable text. Could you please share why you want to process it with Aspose.OCR to generate a PDF document? You can simply convert such files to PDF using Aspose.Words and obtain a searchable PDF, because the input Word document already contains text content rather than images.
The attached document was just a sample to start with. We receive multi-page DOCX files that contain images, and we want to convert them to searchable PDFs so that any text inside those images is searchable too.
Also, our application converts most images to searchable PDF format, so we want the same consistency across all major formats.
Like this 1PagewithImage.docx (34.7 KB)
Hi @asad.ali
As suggested, I tried this too, but I am getting an empty or incorrect PDF. It seems the MemoryStream overload of RecognizePdf is not working as expected when given this input to recognize.
public static void wordtopdf()
{
    AsposeOcr api = new AsposeOcr();
    License lic = new License();
    Aspose.Words.License licw = new Aspose.Words.License();
    lic.SetLicense("Aspose.Total.NET.lic");
    licw.SetLicense("Aspose.Total.NET.lic");
    List<RecognitionResult> recognitionResult = new List<RecognitionResult>();
    string path = @"D:\OCR\TIFFs\2PagewithImage.docx";
    // Read the DOCX file into an input stream
    using (System.IO.MemoryStream DocInputStream = new MemoryStream(File.ReadAllBytes(path)))
    {
        // Initialize PDF output stream
        using (MemoryStream WritePdfStream = new MemoryStream())
        {
            Aspose.Words.Loading.LoadOptions op = new Aspose.Words.Loading.LoadOptions();
            op.Encoding = System.Text.Encoding.UTF8;
            op.LoadFormat = Aspose.Words.LoadFormat.Docx;
            // Saving normal PDF to disk
            Document docDisk = new Document(DocInputStream, op);
            docDisk.Save(@"D:\OCR\TIFFs\NonSearchable_Disk.pdf", Aspose.Words.SaveFormat.Pdf); // <<=== This works, but using this PDF as input does not produce a searchable PDF either (we are more interested in the memory overload)
            DocInputStream.Position = 0;
            Document docMem = new Document(DocInputStream, op);
            docMem.Save(WritePdfStream, Aspose.Words.SaveFormat.Pdf); // <<=== This may be working, but there is some issue
            // Making searchable PDF
            WritePdfStream.Position = 0;
            recognitionResult.AddRange(api.RecognizePdf(WritePdfStream, new DocumentRecognitionSettings() { Language = Language.Eng, AllowedCharacters = CharactersAllowedType.ALL, UpscaleSmallFont = true }));
            AsposeOcr.SaveMultipageDocument(@"D:\OCR\TIFFs\Searchble_Memory.pdf", Aspose.OCR.SaveFormat.Pdf, recognitionResult);
        }
    }
}
We are checking and investigating the task as it has been logged under the ticket ID OCRNET-620 in our issue tracking system. We will be updating you in this forum thread as soon as the ticket is resolved. Please be patient and spare us some time.
Hi @asad.ali
Looking at the issue status below, it seems to be resolved, but when I tried Aspose.Words 22.12 it still shows the same result. Could you please tell me about any new settings you have added?
We do not specialize in extracting text from PDFs. After using Aspose.Words, you get a PDF with combined (text and image) content.
Aspose.OCR can extract the images and recognize the text on them, but it cannot extract the text that is already there. The only setting you need to use is the number of pages:
recognitionResult.AddRange(api.RecognizePdf(WritePdfStream, new DocumentRecognitionSettings(0,3) { Language = Language.Eng, AllowedCharacters = CharactersAllowedType.ALL, UpscaleSmallFont = true }));
Your image is on the second page. The resulting PDF is attached.
To get one complete PDF with both text and images using Aspose.OCR, we advise converting your .docx file into images and then recognizing those images. For example:
using (System.IO.MemoryStream DocInputStream = new MemoryStream(File.ReadAllBytes(path)))
{
    using (MemoryStream WritePdfStream = new MemoryStream())
    {
        Aspose.Words.Loading.LoadOptions op = new Aspose.Words.Loading.LoadOptions();
        op.Encoding = System.Text.Encoding.UTF8;
        op.LoadFormat = Aspose.Words.LoadFormat.Docx;
        // Saving image
        Document docDisk = new Document(DocInputStream, op);
        docDisk.Save(@"D:\imgs\ISSUES\NET620\res\NonSearchable_Disk.tiff", Aspose.Words.SaveFormat.Tiff); // <<=== for example, TIFF is a multipage format
        recognitionResult.AddRange(api.RecognizeTiff(@"D:\imgs\ISSUES\NET620\res\NonSearchable_Disk.tiff", new DocumentRecognitionSettings(0,3) { Language = Language.Eng, DetectAreasMode = DetectAreasMode.TABLE })); // TABLE mode is better for mixed text and table content
        AsposeOcr.SaveMultipageDocument(@"D:\imgs\ISSUES\NET620\res\Searchble_Memory.pdf", Aspose.OCR.SaveFormat.Pdf, recognitionResult);
    }
}
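Since the original question was about working in memory rather than on disk, the same TIFF approach might be sketched without the intermediate file. This is only a rough sketch under stated assumptions: it assumes that RecognizeTiff has a MemoryStream overload analogous to the RecognizePdf call used earlier in this thread, the class and method names (DocxToSearchablePdf, Convert) are hypothetical, and licensing is left out. The page count is taken from the document's PageCount instead of the hard-coded 3.

```csharp
using System.Collections.Generic;
using System.IO;
using Aspose.OCR;

public static class DocxToSearchablePdf
{
    // Hypothetical helper, not part of either library.
    public static void Convert(string docxPath, string pdfPath)
    {
        AsposeOcr api = new AsposeOcr();

        // Aspose.Words types are fully qualified because SaveFormat exists in both libraries.
        Aspose.Words.Document doc = new Aspose.Words.Document(docxPath);

        using (MemoryStream tiffStream = new MemoryStream())
        {
            // Render all pages of the document into one multipage TIFF held in memory.
            doc.Save(tiffStream, Aspose.Words.SaveFormat.Tiff);
            tiffStream.Position = 0;

            // Assumption: RecognizeTiff accepts a MemoryStream, like RecognizePdf does.
            // Start at page 0 and process as many pages as the document has.
            List<RecognitionResult> results = api.RecognizeTiff(
                tiffStream,
                new DocumentRecognitionSettings(0, doc.PageCount)
                {
                    Language = Language.Eng,
                    DetectAreasMode = DetectAreasMode.TABLE
                });

            // Write all recognized pages into a single searchable PDF.
            AsposeOcr.SaveMultipageDocument(pdfPath, Aspose.OCR.SaveFormat.Pdf, results);
        }
    }
}
```

If the stream overload is not available in your version, saving the TIFF to a temporary file and passing its path, as in the example above, should behave the same way.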
Alternatively, you can choose another approach. Just keep in mind that Aspose.OCR is an image recognition library and, for now, it cannot extract text from a PDF that is already searchable or mixed. Searchble_Memory.pdf (110.7 KB) Searchble_Memory1.pdf (78.9 KB)