PDF -> HTML conversion with visibility:hidden text

msoliver · February 5, 2018, 2:24pm

We’re using Aspose.Pdf to convert PDF files to HTML, and we’ve run into certain PDF files that are converted with the text displayed as an image, with additional hidden elements in the HTML that contain the actual text. We’d like to understand why the conversion does this and whether we can prevent it. It only happens on certain PDF files. The HTML looks something like this:

	<div class="stl_420">
		<div class="stl_03">
			<img src="4757987_files/img_02.png" alt="" class="stl_421">
		</div>
		<div class="stl_view">
			<div class="stl_05 stl_422">
				<div class="stl_01 stl_423" style="top: 2.9234em; left:4.76em;"><span class="stl_424 stl_09 stl_10" style="visibility:hidden;">our</span><span class="stl_424 stl_09 stl_10" style="visibility:hidden; word-spacing:0.2801em;">&nbsp;</span><span class="stl_425 stl_09 stl_10" style="visibility:hidden; word-spacing:0.2757em;">experience</span><span class="stl_425 stl_09 stl_10" style="visibility:hidden; word-spacing:0.2464em;">&nbsp;</span><span class="stl_426 stl_09 stl_10" style="visibility:hidden; word-spacing:0.2167em;">with</span><span class="stl_426 stl_09 stl_10" style="visibility:hidden; word-spacing:0.2271em;">&nbsp;</span><span class="stl_180 stl_09 stl_10" style="visibility:hidden; word-spacing:0.2229em;"><span class="vocabADVENTMED descriptorADVEN04210 " startpos="70406">infectious</span> &nbsp;</span></div>
				<div class="stl_01 stl_427" style="top: 2.9798em; left:22.78em;"><span class="stl_370 stl_18 stl_29" style="visibility:hidden; word-spacing:0.2634em;">of </span><span class="stl_83 stl_09 stl_10" style="visibility:hidden; word-spacing:0.2953em;">the &nbsp;</span></div>

Thanks!

Mike Oliver
CCC

asad.ali · February 5, 2018, 8:15pm

@msoliver

Thanks for contacting support.

Would you please share the code snippet you are using to convert PDF files to HTML? Also, please make sure that you have tested the scenario with the latest version of the API. We will test the scenario in our environment and address it accordingly.

msoliver · February 5, 2018, 9:16pm

Hi -

We have been using the latest API. The code we’re using is below. Again, this only happens on a specific PDF; all others that we have converted so far have been fine. Could I send this PDF file somehow?

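		// Load the source PDF; the using block disposes the document when done.
		using (Document doc = new Document(filePath))
		{
			// Output path: <DocumentCachePath>\convertDoc\<id>.html
			string htmlFilePath = string.Format("{0}\\convertDoc\\{1}.html", AppState.GetSetting("DocumentCachePath"), _id.ToString());
			HtmlSaveOptions htmlOptions = new HtmlSaveOptions
			{
				// Produce fixed-layout HTML.
				FixedLayout = true,
				// RetainProgressInfo is our own callback for conversion progress.
				CustomProgressHandler = new HtmlSaveOptions.ConversionProgressEventHandler(RetainProgressInfo),
				FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats,
				// Raster images are saved as part of the PNG page background.
				RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground
			};

			doc.Save(htmlFilePath, htmlOptions);
		}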

asad.ali · February 6, 2018, 7:36am

@msoliver

Thanks for getting back to us.

Sometimes issues are related to specific PDF documents, and it would really help if you could share the specific PDF document with us. To share files, you may attach them to your post. If your file is larger than 3MB, please upload it to a public file-sharing service (e.g. Dropbox, Google Drive) and share the link with us. We will investigate the scenario in our environment and share our feedback accordingly.

msoliver · February 6, 2018, 2:56pm

Hi - here’s a link to the PDF in question. Please take a look and let me know. Thanks!

Mike

asad.ali · February 6, 2018, 9:22pm

@msoliver

Thanks for sharing the link to the file. However, the uploaded file is not public and I was unable to access it. I have sent you a request for permission to view/download the file. Please accept the request so that I can download the file and log an issue after testing the scenario.

msoliver · February 6, 2018, 10:09pm

Accepted. Please try again.

asad.ali · February 7, 2018, 4:30am

@msoliver

Thanks for helping us in replicating the issue.

We have managed to reproduce the same issue in our environment and, for the sake of detailed investigation, we have logged it as PDFNET-44150 in our issue tracking system. The product team will investigate the issue further and share their findings. As soon as we receive definite updates on the investigation results, we will share them with you. Please be patient and spare us a little time.

We are sorry for the inconvenience.

asad.ali · February 9, 2018, 11:14am

@msoliver

After an initial investigation, we found that your source PDF is OCRed (i.e. it was obtained by running an OCR operation over images). If you zoom in, you can see that the letters look rather weird/distorted for a vector font, because they are not letters but images. Moreover, there is a hidden text layer over the image, which is why you are able to select text in the PDF.

The product team will look further into the feasibility of this requirement, and as soon as they share some feedback, we will inform you. Please spare us a little time.

msoliver · February 9, 2018, 1:16pm

Understood. Thanks for the quick response and support!

Mike

simoncropp · March 24, 2020, 2:39am

any update on this issue?

asad.ali · March 24, 2020, 4:10pm

@simoncropp

Regretfully, there are no updates yet regarding issue resolution. We will surely inform you as soon as we make some significant progress towards its rectification. Please spare us some time.

We are sorry for the inconvenience.

simoncropp · July 13, 2020, 11:18pm

any update on this?

asad.ali · July 14, 2020, 5:57pm

@simoncropp

We are afraid that the earlier logged ticket is not yet resolved. We will let you know as soon as we have additional updates regarding its resolution. Please spare us some time.

We are sorry for the inconvenience.

sravan.matta · September 16, 2020, 1:14pm

Yes, the same thing is happening for me. I first converted HTML to PDF and then converted the same file back to HTML using the Aspose.PDF Java API, but in the HTML I see that all the div content has its visibility set to hidden, so only images show up in the HTML document and no text content.
ex.image.png (4.7 KB)
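
For anyone who needs the text to show up in the meantime, one possible stopgap is to post-process the generated HTML and drop the `visibility:hidden` declarations. This is only a rough sketch and not an Aspose.PDF feature: `HiddenTextRevealer` and `Reveal` are made-up names, it uses plain string replacement on the saved HTML, and it assumes the hidden text is marked with inline `visibility:hidden` styles as in the markup quoted in the first post. Revealed text will most likely be drawn on top of the page image, so it may look doubled.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Aspose.PDF: post-processes converted HTML.
public static class HiddenTextRevealer
{
    // Matches a "visibility:hidden" declaration, an optional trailing
    // semicolon, and any whitespace that follows it.
    private static readonly Regex HiddenDeclaration = new Regex(
        @"visibility\s*:\s*hidden\s*;?\s*",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the HTML with every visibility:hidden declaration removed.
    // Note: this is a blunt text replacement, so it also touches matching
    // declarations in stylesheets or in the page text itself.
    public static string Reveal(string html)
    {
        return HiddenDeclaration.Replace(html, string.Empty);
    }
}
```

Usage would be something like `File.WriteAllText(path, HiddenTextRevealer.Reveal(File.ReadAllText(path)));` after the conversion has finished.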

asad.ali · September 17, 2020, 6:13pm

@sravan.matta

We apologize for the inconvenience caused by this issue. We have updated the issue information accordingly and logged the details that you have provided. We will provide an update in this forum thread as soon as we have news regarding the ticket resolution. Please give us some time.

arithmer · February 22, 2021, 12:50am

@asad.ali

Hello,
I observe a similar scenario in the attached PDF document and the converted HTML files.
①財務報告に係る内部統制基準・実施基準1.pdf (page 128,129)
sample.zip (899.3 KB)

Does my case fall under the same issue discussed above?
If so, can the conditions under which it arises be described in the same way?

Conditions:
1. What appear to be letters are actually images.
2. There is a hidden text layer over the image (that is why one is able to select text in the PDF).

I think my case is No.2.
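
A rough way to check condition 2 on the converted output is to count how many `<span>` elements in the HTML carry an inline `visibility:hidden` style. The sketch below assumes the hidden text layer is written the way it appears in the first post of this thread; `HiddenTextInspector`, its methods, and the 0.9 threshold are invented for illustration and are not part of Aspose.PDF.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Aspose.PDF: inspects converted HTML and
// reports how many of its <span> tags are hidden via an inline style.
public static class HiddenTextInspector
{
    // Matches any opening <span ...> tag.
    private static readonly Regex SpanTag = new Regex(
        @"<span\b[^>]*>", RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Matches a style attribute whose value contains visibility:hidden.
    private static readonly Regex HiddenStyle = new Regex(
        @"style\s*=\s*""[^""]*visibility\s*:\s*hidden[^""]*""",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static int CountSpans(string html)
    {
        return SpanTag.Matches(html).Count;
    }

    public static int CountHiddenSpans(string html)
    {
        int hidden = 0;
        foreach (Match tag in SpanTag.Matches(html))
        {
            // Only the tag's own inline style is checked; spans nested
            // inside a hidden span are not counted as hidden.
            if (HiddenStyle.IsMatch(tag.Value)) hidden++;
        }
        return hidden;
    }

    // True when at least `threshold` of all spans are hidden, which hints
    // at an OCR-style hidden text layer. The 0.9 default is arbitrary.
    public static bool LooksLikeHiddenTextLayer(string html, double threshold = 0.9)
    {
        int total = CountSpans(html);
        return total > 0 && (double)CountHiddenSpans(html) / total >= threshold;
    }
}
```

If nearly every span on a page is hidden and the page also has a background image, the page probably matches the OCRed-PDF case described earlier in the thread. This only looks at the HTML, so it cannot confirm condition 1 by itself.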

asad.ali · February 22, 2021, 9:13pm

@shun1985

We have logged another issue as PDFNET-49470 in our issue tracking system for your file. We will further look into its details and keep you posted with the status of its correction. Please be patient and spare us some time.

We are sorry for the inconvenience.

simoncropp · August 10, 2023, 9:37pm

any update on this?

asad.ali · August 10, 2023, 11:51pm

@simoncropp

We are afraid that the tickets have not been resolved yet due to their nature and complexity. We will surely update you once the issues are resolved. We apologize for the inconvenience.