Get rotation degree of scanned PDF page

sacramentoda · August 2, 2017, 6:47pm

Hi,

I am trying to get the rotation angle of a scanned PDF page using PdfPageEditor. After reading a lot of answers on the Aspose forum, here is what I tried.

Since you cannot get the rotation (orientation) of a scanned PDF image directly, I used the workaround mentioned here (Check orientation of a (scanned) pdf document - #3 by raabw). I used Aspose.OCR to convert the page to a searchable PDF page.

But after OCR’ing the page I could only read the image or text using the ocrEngine object. So, as mentioned here (Pdf To Searchable PDF mantaining original format), the workaround is to create a searchable PDF from the text read in the previous step.

But if the PDF is rotated by, say, 90 degrees, the searchable PDF comes out as gibberish. To avoid this, I used FloatBox instead, as mentioned here (Add rotated text to existing PDF - Rotate Text to 90 Degrees in PDF using Aspose.PDF for .NET), to create the searchable PDF. This brings in the text, but none of the orientation from the original PDF.

So, I can’t use this searchable PDF to get the orientation of the original PDF.

Is there an easier way to achieve this?

Thanks

codewarior · August 3, 2017, 10:52am

@sacramentoda,

Thanks for contacting support.

Can you please share the input PDF file and the code snippet causing this problem, so that we can test the scenario in our environment? We are sorry for this inconvenience.

sacramentoda · August 7, 2017, 8:44pm

Please see attached the code, input file and output file.

Thanks

Code.zip (320.3 KB)

codewarior · August 8, 2017, 8:09am

@sacramentoda,

Thanks for sharing the sample files and code snippet.

As per my understanding, you are using the legacy Aspose.Pdf.Generator to create the PDF file after the text content is extracted from the scanned document. I am working on updating the code snippet according to the latest Document Object Model and will keep you updated with my findings. We are sorry for this inconvenience.

codewarior · August 8, 2017, 8:13pm

@sacramentoda,

Thanks for your patience.

I have tested the scenario using the code logic you shared earlier, using the Aspose.Pdf namespace instead of Aspose.Pdf.Generator. As per my observations, the resultant file does not contain proper content. However, another approach is to use Aspose.Pdf with Tesseract-OCR; with it, all the content is rendered in the PDF file, but I am afraid not all of it is searchable. To have this corrected, I have logged it as PDFNET-43177 in our issue tracking system. We will look further into the details of this problem and keep you updated on the status of the correction. Please be patient and spare us a little time. We are sorry for this inconvenience.

For your reference, I have also attached the output generated with the following code snippet. input_searchable.pdf (375.8 KB)

[C#]

using System.IO;
using Aspose.Pdf;

// Convert the scanned PDF into a searchable PDF. The callback receives
// each page image and returns the hOCR markup produced by Tesseract.
Document doc = new Document(@"C:\pdftest\Code\input.pdf");
doc.Convert(CallBackGetHocr);
doc.Save(@"C:\pdftest\Code\input_searchable.pdf");

static string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"C:\pdftest\Code\";
    // Write the page image to disk so that tesseract.exe can read it.
    img.Save(dir + "ocrtest.jpg");
    // Tesseract v3.02
    System.Diagnostics.ProcessStartInfo info = new System.Diagnostics.ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
    info.WindowStyle = System.Diagnostics.ProcessWindowStyle.Hidden;
    // Arguments: input image, output base name, and the "hocr" config.
    // The string is verbatim (@), so backslashes must not be doubled.
    info.Arguments = @"C:\pdftest\Code\ocrtest.jpg C:\pdftest\Code\out hocr";
    System.Diagnostics.Process p = new System.Diagnostics.Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();
    // Read the hOCR output that Tesseract wrote to out.html.
    using (StreamReader streamReader = new StreamReader(dir + "out.html"))
    {
        return streamReader.ReadToEnd();
    }
}
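
On the original question of reading the rotation angle itself: Tesseract also offers an orientation and script detection (OSD) mode, page segmentation mode 0, whose report includes a line such as "Orientation in degrees: 270". The following is only a minimal sketch of parsing that report; it is not part of the attached code and has not been confirmed for this scenario. OsdParser and TryGetOrientation are hypothetical names, and the exact report format, where it is written (console or file), the flag spelling (-psm 0 or --psm 0), and the direction convention of the angle are assumptions to check against the installed Tesseract version.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper: extracts the orientation angle from the text of a
// Tesseract OSD report. It does not run Tesseract itself.
public static class OsdParser
{
    // Matches a report line such as "Orientation in degrees: 270".
    // The line format is an assumption; verify it for your Tesseract version.
    private static readonly Regex OrientationLine = new Regex(
        @"^\s*Orientation in degrees:\s*([0-9]+)\s*$",
        RegexOptions.Multiline | RegexOptions.IgnoreCase);

    // Returns true and sets 'degrees' when the report contains an
    // orientation line; otherwise returns false and leaves 'degrees' at 0.
    public static bool TryGetOrientation(string osdReport, out int degrees)
    {
        degrees = 0;
        if (string.IsNullOrEmpty(osdReport))
        {
            return false;
        }

        Match match = OrientationLine.Match(osdReport);
        return match.Success
            && int.TryParse(match.Groups[1].Value, NumberStyles.None,
                            CultureInfo.InvariantCulture, out degrees);
    }
}
```

The report text could presumably be captured in the same way as the hOCR output above, by starting tesseract.exe with the OSD page segmentation mode instead of the hocr config and reading back what it writes. It is worth confirming the result on a page whose rotation is already known before relying on it.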