Font issue when trying to copy each node from source docx to destination docx

gayudearest · February 17, 2017, 8:38am

Hiii Aspose Team

I am trying to read every node in a source Word document and then merge two paragraphs, based on certain conditions, into the newDocument object.
Both documents report the font as Times 12, but the text in the destination appears larger, which causes an unnecessary page break.
Can you please help me resolve this issue urgently?

Attached the source code and the source and destination documents respectively for your reference.

On page 2 of the destination document you can see a blank page, which appears because the fonts render larger in the destination document than in the source document.

Please help me how can I resolve this issue as it is very urgent…

tahir.manzoor · February 20, 2017, 11:01am

Hi Gayatri,

Thanks for your inquiry. We have tested the scenario using the latest version of Aspose.Words for Java (17.2.0) and have not found the blank page issue in the output document. Please use Aspose.Words for Java 17.2.0. See the attached output document and image for details.

If you still face the problem, please share the problematic section of the output document. Please also create your expected Word document manually in Microsoft Word and attach it here for our reference. We will investigate how you want your final Word output to look and then provide more information.

tahir.manzoor · February 21, 2017, 1:33am

Hi Gayatri,

Please note that Aspose.Words mimics the behavior of MS Word. You are adding section breaks of type "new page" to your document. If the font size in the document changes, a section break may move to the next page (the 3rd page). In that case, a blank page appears in the output document.

We suggest replacing the "new page" section breaks with continuous section breaks. Hope this helps you.

gayudearest · February 21, 2017, 1:52am

Actually, I simply tried copying the source formatting to the destination. The font size still shows as the same but is actually different. I tried the following code and still face the same issue:

private static void copyFullDocument() throws Exception {

String MyDir = "\\\\lngdays-dev069\\Render\\Gayatri\\CC POC\\";
StringBuffer pargraphsText = new StringBuffer();
byte[] buff = new byte[8000];
InputStream is;
try {
is = new FileInputStream(MyDir + "Test-source.docx");
int bytesRead = 0;
ByteArrayOutputStream bao = new ByteArrayOutputStream();
try {
while ((bytesRead = is.read(buff)) != -1) {
bao.write(buff, 0, bytesRead);
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}

byte[] data = bao.toByteArray();
ByteArrayInputStream inStream = new ByteArrayInputStream(data);
ByteArrayOutputStream outStream = new ByteArrayOutputStream();

// String filePath = "\\lngdays-dev069\Render\Gayatri\CC
// POC\Connie1_2016-Ohio-5124 pdf
// 00500000B990MD-with-highlights.docx";

// The document that the content will be appended to.

Document dstDoc = new Document();
dstDoc.removeAllChildren();

// The document to append.

Document srcDoc = new Document(inStream);

// Append the source document to the destination document.

// Pass format mode to retain the original formatting of the source
// document when importing it.

dstDoc.appendDocument(srcDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);

// Save the document.

dstDoc.save(MyDir + "Out3.docx");
} catch (Exception e) {
// TODO Auto-generated catch block
System.out.println("Exception is >>>>" + e.getMessage());
e.printStackTrace();
}

}

Please use the source document attached in my previous post as the source document.
Is it Microsoft Word behaviour that the same font is rendered differently in the two documents? If so, how can I get it resolved?

gayudearest · February 21, 2017, 1:59am

When I ran the same code on my PC, the page number "2" ended up on the next page.

Any idea why this happens, and is it Word behaviour? If it is Word behaviour, how can I rectify it on my system?

tahir.manzoor · February 21, 2017, 11:22am

Hi Gayatri,

Thanks for sharing the detail. Please note that formatting is applied on a few different levels. For example, let’s consider formatting of simple text. Text in documents is represented by Run element and a Run can only be a child of a Paragraph. You can apply formatting

1) to Run nodes by using Character Styles e.g. a Glyph Style,
2) to the parent of those Run nodes i.e. a Paragraph node (possibly via paragraph Styles)
3) you can also apply direct formatting to Run nodes by using Run attributes (Font). In this case the Run will inherit formatting of Paragraph Style, a Glyph Style and then direct formatting.
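To make the order of precedence above concrete, here is a minimal sketch in plain Java. It does not use the Aspose.Words API: the class name, the method name and the documentDefault parameter are all hypothetical and only model the idea that direct formatting on a Run wins over a character style, which in turn wins over the paragraph style.

```java
final class RunFormattingSketch {

    /**
     * Returns the first font size that is actually set, checking the most
     * specific level first. A null argument means "not set at this level".
     * documentDefault is an assumed fallback, not something named in the thread.
     */
    static double effectiveFontSize(Double directOnRun,
                                    Double characterStyle,
                                    Double paragraphStyle,
                                    double documentDefault) {
        if (directOnRun != null) {
            return directOnRun;        // 3) direct Run formatting
        }
        if (characterStyle != null) {
            return characterStyle;     // 1) character style applied to the Run
        }
        if (paragraphStyle != null) {
            return paragraphStyle;     // 2) style of the parent Paragraph
        }
        return documentDefault;        // nothing set: fall back to the default
    }
}
```

Under this model, a Run that carries no direct size and no style resolves to whatever default the containing document supplies, which is one way two documents can show the same nominal font yet lay out differently.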

There is no style in your input document. Please copy the text of input document into new empty document using MS Word and check the output. The output will not be same as input.

You are facing this behavior because there is no style in your input document. The text is formatted with "RAUTRP+TimesNewRoman" and "GUUSPV+TimesNewRoman,Italic", and these should exist in your document to get the correct output.

If you want the correct font name in output document, please apply font formatting according to your requirements. After using the correct font name or style, the page break issue will be resolved.

Moreover, your document contains a page break at the end of each section. See the attached image for details. We suggest removing the page breaks using the following method.

private static void RemovePageBreaks(Document doc) {

    // Retrieve all paragraphs in the document.
    NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);

    // Iterate through all paragraphs. The typed cast is needed so that the
    // loop variable can be declared as Paragraph.
    for (Paragraph para : (Iterable<Paragraph>) paragraphs) {
        // Check all runs in the paragraph for page breaks and remove them.
        for (Run run : para.getRuns()) {
            if (run.getText().contains(ControlChar.PAGE_BREAK))
                run.setText(run.getText().replace(ControlChar.PAGE_BREAK, ""));
        }
    }
}

gayudearest · February 27, 2017, 5:04am

Hi Tahir

The source document here is the document which I converted from PDF to DOCX using the code below:

String filesLocation = "C:\\Documents\\462887.481663.Decision.doc.pdf.00500000B1D1F6.pdf";

com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(filesLocation);
// Instantiate Doc SaveOptions instance
DocSaveOptions saveOptions = new DocSaveOptions();

// Set output file format as DOCX
saveOptions.setFormat(DocSaveOptions.DocFormat.DocX);

saveOptions.setMode(DocSaveOptions.RecognitionMode.Flow);
saveOptions.setMaxDistanceBetweenTextLines(3.5f);

saveOptions.setAddReturnToLineEnd(false);

// Save the file into Microsoft document format
pdfDocument.save("C:\\Documents\\462887.481663.Decision.doc.pdf.00500000B1D1F6_converted.docx",
saveOptions);

Now I am using the same generated DOCX as input again and generating a new document from it. If you look, the font in the generated document differs from the font in the new document which I have generated.
As you have mentioned, I am already removing page breaks and copying all font styles as they are. You can check this in my previously attached code.

Attaching the respective pdf file and the generated and new docx files as well.

Can you please help me figure out why the font differs in this case and, during conversion from PDF to DOCX, on what basis the fonts are created in the DOCX file?

tahir.manzoor · February 28, 2017, 1:55am

Hi Gayatri,

Thanks for sharing the detail. Your query is related to Aspose.Pdf APIs. So, I am moving this forum thread to Aspose.Total forum. My colleagues will answer your query shortly.

fahadadeel · February 28, 2017, 12:18pm

Hi Gayatri,

Thanks for sharing details.

If you notice, the generated document contains font names of the form [six randomly generated characters][plus sign][font name]. This font naming behavior is by design and cannot be turned off. Put simply, a PDF file can contain several fonts with the same name, so prefixes are added to guarantee the uniqueness of font names. You will also notice that "Times New Roman" is the font name and that it remained the same in all documents, only with a new random prefix.

Please feel free to contact us for any further assistance.

We are sorry for the inconvenience.

Best Regards,

gayudearest · March 2, 2017, 5:54am

Hi fahadadeel

Thank you for replying. I understand that there are different fonts in the source document which are not rendered in the new document I am trying to copy into.

I know that some fonts in the PDF file are causing the problem. What I am now trying to do is extract all the fonts from the PDF using FontCollection, save them to a folder, and then convert the PDF to DOCX. Then I install those TTF fonts and create a new DOCX document from the generated one, which actually gives me the correct output.

Here is the code which I am using to convert the PDF to DOCX and extract the fonts into a folder:

private static void singleConvertPdfToDoc() {
// TODO Auto-generated method stub
String filesLocation = "\\\\lngdays-dev069\\Render\\Gayatri\\CC POC\\TestFonts\\462887.481663.Decision.doc.pdf.00500000B1D1F6.pdf";
String fileName = filesLocation.substring(filesLocation.lastIndexOf('\\') + 1, filesLocation.length());
fileNameOnly = fileName.substring(0, fileName.lastIndexOf('.'));
testOut = "\\\\lngdays-dev069\\Render\\Gayatri\\CC POC\\TestFonts\\out\\".concat(fileNameOnly);
byte[] buff = new byte[8000];
InputStream is;
try {
Path path = Paths.get(testOut);
// if directory exists?
if (!Files.exists(path)) {
try {
Files.createDirectories(path);
} catch (IOException e) {
// fail to create directory
e.printStackTrace();
}
}

String fontCacheFolder = testOut + fileNameOnly + "_fonts_preSaved\\";
String cacheFontFileTemplate = fontCacheFolder + "font%1$s.ttf";
Path fontOutFolderPath = Paths.get(testOut, fileNameOnly, "_fonts\\");

// Folder that will contain fonts as a result of the conversion
// procedure
String fontOutFolder = fontOutFolderPath.toAbsolutePath().toString();

Path pathFontCacheFolder = Paths.get(fontCacheFolder);
// if directory exists?
if (!Files.exists(pathFontCacheFolder)) {
try {
Files.createDirectories(pathFontCacheFolder);
} catch (IOException e) {
// fail to create directory
e.printStackTrace();
}
}

Path pathFontOutFolder = Paths.get(fontOutFolder);
// if directory exists?
if (!Files.exists(pathFontOutFolder)) {
try {
Files.createDirectories(pathFontOutFolder);
} catch (IOException e) {
// fail to create directory
e.printStackTrace();
}
}

com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(filesLocation);

FontAbsorber fa = new FontAbsorber();
fa.visit(pdfDocument);
FontCollection fc = fa.getFonts();

ArrayList<String> fontFiles = new ArrayList<>();

// Save all the fonts in the cache folder
int fontNum = 0;
for (com.aspose.pdf.Font font : (Iterable<com.aspose.pdf.Font>) fc) {
String cacheFontFile = String.format(cacheFontFileTemplate, Integer.toString(fontNum++));
// Close each stream so the font file is flushed and the file handle is released
try (FileOutputStream out = new FileOutputStream(cacheFontFile)) {
font.save(out);
}
fontFiles.add(cacheFontFile);
}

// Instantiate Doc SaveOptions instance
DocSaveOptions saveOptions = new DocSaveOptions();

// Set output file format as DOCX
saveOptions.setFormat(DocSaveOptions.DocFormat.DocX);

saveOptions.setMode(DocSaveOptions.RecognitionMode.Flow);
saveOptions.setMaxDistanceBetweenTextLines(3.5f);

saveOptions.setAddReturnToLineEnd(false);

// Save the file into Microsoft document format
pdfDocument.save(
"\\\\lngdays-dev069\\Render\\Gayatri\\CC POC\\TestFonts\\462887.481663.Decision.doc.pdf.00500000B1D1F6_latest.docx",
saveOptions);

System.out.println("conversion PDF To DOC is done");
} catch (Exception e) {
// TODO Auto-generated catch block
System.out.println("Exception is >>>>" + e.getMessage());
e.printStackTrace();
}

}

Now my question is whether there is any way to use the fonts extracted from the PDF in the new document which I am creating. That is, I want to read those fonts from the folder and then create a new document based on them. I tried the following but it did not work:

Document newDoc = new Document();
newDoc.removeAllChildren();
FontSettings f = new FontSettings();
f.setFontsFolder("fontsFolder", true);
newDoc.setFontSettings(f);

Attaching the source code, the source PDF and the generated DOCX again for your reference. Please help me as soon as possible because I have been stuck on this for a very long time.

fahadadeel · March 3, 2017, 4:00am

Hi Gayatri,

Thanks for sharing further details.

I am looking into your problem. I will get back to you with my findings shortly.

We are sorry for the inconvenience.

Best Regards,

fahadadeel · March 6, 2017, 1:08am

Hi Gayatri

Thanks for your patience.

As shared earlier, Aspose uses the [six randomly generated characters][plus sign][font name] scheme for font names when rendering from PDF to DOCX. You may omit [six randomly generated characters][plus sign] and use only [font name] when creating a new document, and assign the [font name] font to the new document. Hopefully this will solve your problem.
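As a rough illustration of dropping the prefix, the sketch below strips a leading [six characters][plus sign] from a font name using only the Java standard library. The class and method names are hypothetical, and the pattern assumes the six characters are uppercase letters, as in the names "RAUTRP+TimesNewRoman" and "GUUSPV+TimesNewRoman,Italic" seen earlier in this thread.

```java
import java.util.regex.Pattern;

final class SubsetFontNames {

    // Assumption: the prefix is exactly six uppercase letters followed by '+',
    // and it only ever appears at the very start of the font name.
    private static final Pattern SUBSET_PREFIX = Pattern.compile("^[A-Z]{6}\\+");

    /** Returns the name without a leading prefix; names without one are returned unchanged. */
    static String stripSubsetPrefix(String fontName) {
        return SUBSET_PREFIX.matcher(fontName).replaceFirst("");
    }
}
```

For example, stripSubsetPrefix("RAUTRP+TimesNewRoman") gives "TimesNewRoman". Note that the remainder has no spaces, so it may still need to be mapped to an installed font name such as "Times New Roman" before it is assigned to text in the new document.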

If you still face any issue, please feel free to contact us.

Best Regards,