Do OCR'd TIFFs converted into PDF wrap the original TIFF, or convert the image type?

RokeJulianLockhart · December 4, 2023, 2:24pm

I had the same question as Making a new PDF with an embedded Tiff searchable - #6 by bns_group, but located Free Online PDF OCR - Convert PDF to Text as the solution.

Unfortunately, the tool isn’t usable for me due to You're invited to talk on Matrix, so I can’t test any advice provided yet. Regardless, I would like to confirm that when I convert a TIFF file to an OCR PDF,

the PDF retains the original TIFF image
at its original quality (not upscaled or downscaled)
and with all of its metadata (at least, all which can be stored inside a PDF, so not none, and hopefully the rest transferred into the PDF)

inside it rather than stripping it and converting it into a different image type.

If you wonder why I ask, imagine if converting an OGG file into a FLAC or MKV container removed the attributes and degraded the quality of the file by converting it into something like an MP4, merely wrapped by the aforementioned container files. It would of course be unacceptable to any audiophile.

Thanks.

asad.ali · December 4, 2023, 8:00pm

@RokeJulianLockhart

We need to gather the technical details in order to answer your questions. Therefore, an investigation ticket as OCRNET-765 has been generated in our issue tracking system to carry out the investigation. We will look into its details and let you know as soon as the ticket is resolved. Please be patient and spare us some time.

asad.ali · February 1, 2024, 8:36am

@RokeJulianLockhart

Unfortunately, we can’t guarantee that all metadata will be saved in the output PDF. But we try to create an output PDF with the same original size and resolution. You can test our solution and decide whether it is suitable for you.

RokeJulianLockhart · February 1, 2024, 3:08pm

@asad.ali,

asad.ali:

Unfortunately, we can’t guarantee that all metadata will be saved in the output PDF.

Do you support transferring all the metadata that both TIFF and PDF support? Additionally, do you support extended attributes (I expect not)? I ask per Extended attributes - ArchWiki:

All major Linux file systems including Ext4, Btrfs, ZFS, and XFS support extended attributes.

asad.ali:

But we try to create an output PDF with the same original size and resolution.

Apologies for being pedantic, but try and guarantee are different. Is there a reason why image quality might be diminished?

asad.ali:

You can test our solution and decide whether it is suitable for you.

Apologies — I’m trying. The 20 MiB limit at Convert TIFF image to searchable PDF online has made it a bit difficult for the moment, though…

asad.ali · February 1, 2024, 10:20pm

@RokeJulianLockhart

Thanks for the feedback. We have kept the ticket open for further investigation and we will soon return to you with the feedback against your recent comments.

RokeJulianLockhart · February 3, 2024, 2:38pm

I’ve uploaded a TIFF with the configuration depicted in the attached screenshot (506×673, 47.4 KB) to Convert TIFF image to searchable PDF online. The resultant PDF does not appear to be searchable. Additionally, its resolution is not even comparable to that of the original.

asad.ali · February 4, 2024, 4:19pm

@RokeJulianLockhart

Would you kindly share the same TIFF for our reference too?

RokeJulianLockhart · February 4, 2024, 10:48pm

@asad.ali, I don’t appear to be able to upload it to this forum, so it’s available at 20240203T141155GMT.

asad.ali · February 5, 2024, 8:57am

@RokeJulianLockhart

Thanks for providing the sample input. Online Aspose App implements Aspose.OCR Cloud API. However, this information will definitely help us in understanding your requirements more clearly and implement them in On-Premise API. We will let you know once some progress is made towards the resolution of the logged ticket. Please spare us some time.

anna.pylaieva · February 6, 2024, 3:40pm

Hi, I’m a developer on the Aspose.OCR team (downloadable libraries). I converted your TIFF file to PDF using our library. You can evaluate the result in the attached PDF file. I can’t transfer your additional information from the TIFF file, but the size, physical size and resolution are the same as in the TIFF file.
Also, I want to note that we do not support every TIFF variant, so some specific formats may not work. However, we work with the majority.
Please use our downloadable library to test your files and share your thoughts.
The code and the output PDF are attached.
tiff_pdf.pdf (150.7 KB)

tiff_pdf.zip (1.0 KB)

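      using Aspose.OCR;

      AsposeOcr api = new AsposeOcr();

      // Apply the license before running recognition.
      License lic = new License();
      lic.SetLicense(@"Aspose.OCR.Product.Family2024.lic");

      // Optional preprocessing; add filters only if you need them, for example:
      // filter.Add(PreprocessingFilter.AutoSkew());
      PreprocessingFilter filter = new PreprocessingFilter();

      // Load the TIFF and pass the (currently empty) filter set along with it.
      OcrInput input = new OcrInput(InputType.TIFF, filter);
      input.Add("your.tiff");

      // Recognize the text using the PHOTO area detection mode.
      var result = api.Recognize(input, new RecognitionSettings
      {
          DetectAreasMode = DetectAreasMode.PHOTO
      });

      // Save all recognized pages into a single PDF.
      AsposeOcr.SaveMultipageDocument("D://result.pdf", SaveFormat.Pdf, result);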
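
To check the "same size and resolution" claim against the source file independently of any viewer, the TIFF's own header can be read directly. The following is only a rough sketch in plain .NET (it is not part of Aspose.OCR): it assumes .NET 6 or later, a classic (non-BigTIFF) file, and looks only at the first page. `ReadTiffTags` is a hypothetical helper name. The tag numbers come from the TIFF 6.0 specification: 256 ImageWidth, 257 ImageLength, 259 Compression, 282 XResolution, 283 YResolution, 296 ResolutionUnit.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Reads a few baseline tags from the first IFD (first page) of a classic TIFF.
static Dictionary<uint, double> ReadTiffTags(byte[] data)
{
    if (data.Length < 8) throw new InvalidDataException("Too short to be a TIFF file.");

    bool littleEndian;
    if (data[0] == 0x49 && data[1] == 0x49) littleEndian = true;        // "II"
    else if (data[0] == 0x4D && data[1] == 0x4D) littleEndian = false;  // "MM"
    else throw new InvalidDataException("Missing TIFF byte-order mark.");

    // Byte-order-aware readers for 16-bit and 32-bit unsigned values.
    uint U16(int o) => littleEndian
        ? (uint)(data[o] | data[o + 1] << 8)
        : (uint)(data[o] << 8 | data[o + 1]);
    uint U32(int o) => littleEndian
        ? U16(o) | U16(o + 2) << 16
        : U16(o) << 16 | U16(o + 2);

    if (U16(2) != 42) throw new InvalidDataException("Not a classic TIFF (BigTIFF is not handled).");

    var wanted = new HashSet<uint> { 256, 257, 259, 282, 283, 296 };
    var tags = new Dictionary<uint, double>();
    int ifd = (int)U32(4);            // offset of the first IFD
    int entries = (int)U16(ifd);      // number of 12-byte entries in it
    for (int i = 0; i < entries; i++)
    {
        int e = ifd + 2 + i * 12;
        uint tag = U16(e), type = U16(e + 2);
        if (!wanted.Contains(tag)) continue;
        if (type == 3) tags[tag] = U16(e + 8);        // SHORT, stored inline
        else if (type == 4) tags[tag] = U32(e + 8);   // LONG, stored inline
        else if (type == 5)                           // RATIONAL, stored at an offset
        {
            int o = (int)U32(e + 8);
            tags[tag] = (double)U32(o) / U32(o + 4);
        }
    }
    return tags;
}

// Usage: dotnet run -- your.tiff
if (args.Length > 0)
{
    foreach (var entry in ReadTiffTags(File.ReadAllBytes(args[0])))
        Console.WriteLine($"tag {entry.Key}: {entry.Value}");
}
```

The printed width, height and resolution can then be compared with what a PDF inspection tool reports for the image inside the output PDF, rather than with a viewer's rendering, which depends on zoom level and rasterisation settings.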

RokeJulianLockhart · February 21, 2024, 1:36am

@anna.pylaieva, I appear to get wildly different resolutions based upon the viewer:

https://download.opensuse.org/repositories/openSUSE:/Factory/standard/x86_64/MozillaFirefox-122.0.1-1.1.x86_64.rpm (screenshot: 2690×1526, 575 KB)
https://download.opensuse.org/repositories/openSUSE:/Factory/standard/x86_64/gwenview5-23.08.4-2.1.x86_64.rpm (screenshot: 2690×1526, 133 KB)
https://download.opensuse.org/repositories/openSUSE:/Factory/standard/x86_64/okular-23.08.4-1.4.x86_64.rpm (screenshot: 2690×1526, 632 KB)

but regardless, they’re definitely all lower than the original’s (as the converted PDFs demonstrate), and that black-and-white border certainly wasn’t there beforehand. It’s as if they’ve been converted to JPEG (non-XL) and back to TIFF.
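
One way to test the JPEG suspicion without relying on a viewer is to look at how the image is stored in the PDF. PDF has no TIFF image type as such: page images are stored as image XObjects whose data is encoded with a stream filter, and the filter name sits unencoded in the stream's dictionary. `DCTDecode` means JPEG, `JPXDecode` means JPEG 2000, `FlateDecode` and `LZWDecode` are lossless, and `CCITTFaxDecode` and `JBIG2Decode` are bilevel codecs. The sketch below is an assumption-laden heuristic in plain .NET (6 or later), not an Aspose.OCR API: `CountPdfFilters` is a hypothetical helper that simply counts occurrences of those names in the raw bytes, so the totals cover every stream in the file (fonts and page content included), not just images.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Counts how often each known stream filter name appears in a PDF's raw bytes.
static Dictionary<string, int> CountPdfFilters(byte[] pdfBytes)
{
    // Latin-1 maps every byte to exactly one char, so binary stream data decodes safely.
    string text = Encoding.Latin1.GetString(pdfBytes);
    string[] filters =
    {
        "DCTDecode", "JPXDecode", "FlateDecode",
        "LZWDecode", "CCITTFaxDecode", "JBIG2Decode"
    };

    var counts = new Dictionary<string, int>();
    foreach (string name in filters)
    {
        string needle = "/" + name;
        int count = 0;
        int index = text.IndexOf(needle, StringComparison.Ordinal);
        while (index >= 0)
        {
            count++;
            index = text.IndexOf(needle, index + 1, StringComparison.Ordinal);
        }
        counts[name] = count;
    }
    return counts;
}

// Usage: dotnet run -- result.pdf
if (args.Length > 0)
{
    foreach (var pair in CountPdfFilters(File.ReadAllBytes(args[0])))
        Console.WriteLine($"{pair.Key}: {pair.Value}");
}
```

A non-zero `DCTDecode` count would support the idea that the page image was recompressed as JPEG; a zero count would point towards downsampling or some other cause instead.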

Regardless, I’m incredibly thankful for the code excerpt.

asad.ali · February 21, 2024, 3:56pm

@RokeJulianLockhart

Thanks for the feedback. The ticket status is still open and we will further try to improve the functionality while keeping your comments in view. We will update you here once we have some more updates to share.

RokeJulianLockhart · August 4, 2024, 5:21pm

@asad.ali, have you had a chance to diagnose this?

To summarise, because it’s been a while, it appears that:

https://products.aspose.app/ocr/pdf-ocr doesn’t usually output OCR’d PDFs, especially not meaningfully OCR’d PDFs - most text remains unrecognised.
https://api.products.aspose.app/ocr/en/tiff-to-pdf outputs low-quality PDFs.

asad.ali · August 4, 2024, 7:21pm

@RokeJulianLockhart

We are afraid that the earlier logged ticket has not yet been resolved. It is still under investigation, and as soon as we resolve the task, we will update you in this forum thread. We apologize for the delay and the inconvenience.

RokeJulianLockhart · August 4, 2024, 10:44pm

No worries. There’s not really any rush. I’ll wait for you next time. Thanks for the response.