Extract text from a Word document line by line and rebuild the document by replacing each line with different text while keeping the original font size and color

Test_Doc.zip (127.9 KB)

Hi,

I want to extract the text, header, and footer from a Word document line by line, save them to a database, and rebuild the Word document by replacing each line with different text from the database while keeping the original font size and color. With the code below I can get the text line by line, but the original fonts and colors are lost when I rebuild the document. Can you please help me with this?

Aspose.Words.Document doc = new Aspose.Words.Document(filePath);

AsposeWordDocToTxtWriter myConverter = new AsposeWordDocToTxtWriter();
doc.Accept(myConverter);
string text = myConverter.GetText();

// Split the extracted text into lines
string[] words = text.Split('\n');
for (int j = 0, len = words.Length; j < len; j++)
{
    // save words[j] to the database
}

I have also tried iterating over the Run nodes, but that does not give me the text line by line.

Aspose.Words.NodeList runs = doc.SelectNodes("//Run");
foreach (Aspose.Words.Run run in runs)
{
}

Thanks!

@manasiak

We suggest you read the following article about the Aspose.Words document object model.
Aspose.Words Document Object Model

The Aspose.Words.Layout namespace provides classes that tell you on which page, and where on that page, particular document elements are positioned once the document has been laid out into pages.

You can use the DocumentLayoutHelper utility to get the text of the document line by line. Hope this helps you.

Test_Doc.zip (43.8 KB)

Hi Tahir,

Thanks for your reply. I have another question; can you please help me with it? I wrote the code below to extract text from a Word document.

Aspose.Words.DocumentBuilder builder = new Aspose.Words.DocumentBuilder(doc);
Aspose.Words.NodeCollection paragraphs = doc.GetChildNodes(Aspose.Words.NodeType.Paragraph, true);
foreach (Aspose.Words.Paragraph paragraph in paragraphs)
{
    if (!String.IsNullOrWhiteSpace(paragraph.GetText()))
    {
        // save text in database
    }
}

The saved text is translated into another language. I want to replace the source text with the translated text without losing formatting, but with the code below all formatting, line breaks, and tabs are lost. Can you please tell me how I can replace the text without losing formatting?

foreach (Aspose.Words.Paragraph paragraph in paragraphs)
{
    paragraph.Range.Replace(sourcetext, translatedText, options);
}
doc.Save(docxFilePath);

Thanks!

@manasiak

You are saving the text of the document to a database. Plain text does not carry any formatting. Could you please share the complete details of your use case, along with a code example that reproduces the issue on our end? We will then provide more information about your query.

Whitepaper_Clarify.zip (107.5 KB)
Please find the attached document.

Thanks for your help.

I am trying to extract all text, headers, and footers from the document line by line. In the attached document, the text “Clarify VE technology was developed to resolve four specific clinical issues inherent in conventional ultrasound imaging” should be extracted as a single line.
This text will be translated into a different language and saved in the database. After that, I need to replace the document text, header, and footer with the translated text from the database while keeping the document's original formatting, fonts, and colors.
I have also tried the code below. It works, but sentences are split in the middle.

Extract

//Extract text run by run
foreach (Aspose.Words.Paragraph paragraph in paragraphs)
{
    foreach (Aspose.Words.Run run in paragraph.GetChildNodes(Aspose.Words.NodeType.Run, true))
    {
        text = run.Text.TrimStart();
        //save text in database
    }
}

//Replace source text with the translated text
foreach (Aspose.Words.Paragraph paragraph in paragraphs)
{
    foreach (Aspose.Words.Run run in paragraph.GetChildNodes(Aspose.Words.NodeType.Run, true))
    {
        if (!String.IsNullOrWhiteSpace(run.GetText()))
        {
            //replace source text with translated text
            run.Text = translatedText;
        }
    }
}

@manasiak

In your case, we suggest the following solution.

  1. Iterate over the paragraph nodes. You can get the paragraphs of the document using Document.GetChildNodes(NodeType.Paragraph, true).
  2. Get the text of each paragraph using Paragraph.ToString(SaveFormat.Text).
  3. Save the text to the database along with its translated text.
  4. Use the Find and Replace feature to find the source text and replace it with the translated text.

Hope this helps you.
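
The four steps above could be sketched roughly as follows. This is only a minimal sketch, not a tested solution: the Translate method and the output file name are hypothetical placeholders, the dictionary stands in for the database, and exact matching may fail whenever the extracted text differs from what is actually stored in the document (for example after Trim()).

```csharp
using System.Collections.Generic;
using Aspose.Words;
using Aspose.Words.Replacing;

class TranslateParagraphs
{
    // Hypothetical placeholder for the real translation / database lookup.
    static string Translate(string source) => source;

    static void Main()
    {
        Document doc = new Document("Whitepaper_Clarify.docx");

        // Steps 1-3: collect the text of every non-empty paragraph.
        // The dictionary stands in for the database (source -> translation).
        var translations = new Dictionary<string, string>();
        foreach (Paragraph paragraph in doc.GetChildNodes(NodeType.Paragraph, true))
        {
            string text = paragraph.ToString(SaveFormat.Text).Trim();
            if (text.Length > 0 && !translations.ContainsKey(text))
                translations[text] = Translate(text);
        }

        // Step 4: find each source text and replace it with its translation.
        FindReplaceOptions options = new FindReplaceOptions();
        foreach (KeyValuePair<string, string> pair in translations)
            doc.Range.Replace(pair.Key, pair.Value, options);

        // Hypothetical output file name.
        doc.Save("Whitepaper_Clarify_translated.docx");
    }
}
```

Because the replacement runs over the whole document range, a short source string that also occurs inside a longer paragraph would be replaced there too; replacing longer strings first, or calling Replace on each paragraph's own Range, is one way to limit that.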

@manasiak

Further to my previous post, the following code example replaces the desired text with the translated text.

Document doc = new Document(MyDir + "Whitepaper_Clarify.docx");
doc.Range.Replace("Clarify VE technology was developed to resolve four specific clinical issues inherent in conventional ultrasound imaging",
    "Clarify VEテクノロジーは、従来の超音波イメージングに固有の4つの特定の臨床問題を解決するために開発されました。", new FindReplaceOptions());

doc.Save(MyDir + @"19.9.docx");

Hi Tahir, thanks for your help. The above code worked!

QA_Whitepaper_Clarify_3.zip (13.2 KB)

Hi,

I want to extract all text, headers, and footers from the document line by line. The extracted text will be translated into a different language and saved in the database. After that, I need to replace the document text, header, and footer with the translated text from the database while keeping the document's original formatting, fonts, and colors.

The code below works fine except for the attached document. The text extracted from it contains a box symbol wherever there is a column break. Removing the box symbols breaks the formatting when the translated text is put back. I do not want the box symbol extracted with the text, but I want the formatting to stay as it is after the text is replaced. Can you please help me with this?

//Extract text from word document
foreach (Aspose.Words.Paragraph paragraph in paragraphs)
{
    string text = paragraph.ToString(Aspose.Words.SaveFormat.Text).Trim();
    //save text in database
}

//Replace source text with translated text in word document
Aspose.Words.Replacing.FindReplaceOptions options = new Aspose.Words.Replacing.FindReplaceOptions();
foreach (Aspose.Words.Paragraph paragraph in paragraphs)
{
    paragraph.Range.Replace(SourceText, TranslatedText, options);
}

Thanks!

@manasiak

To ensure a timely and accurate response, please attach the following resources here for testing:

  • Please attach the output Word file that shows the undesired behavior.
  • Please attach the expected output Word file that shows the desired behavior.
  • Please create a standalone console application (source code without compilation errors) that helps us reproduce your problem on our end, and attach it here for testing.

As soon as you have these pieces of information ready, we will start investigating your issue and provide you with more information. Thanks for your cooperation.

PS: To attach these resources, please zip and upload them.

Box_Symbol_Issue.zip (5.4 MB)

Please find attached the console application, the output Word file, the source document, and a screenshot of the text extracted with the box symbol. Let me know if any more information is required.

Thanks.

@manasiak

We are working on your query and will get back to you soon.

@manasiak

The box symbol is a column break in the paragraph. It is in the first Run node of the paragraph. In your case, we suggest that you do not extract the first Run node, which contains the column break. We suggest the following solution.

  • Check whether the paragraph contains a column break. Please see the following line of code.
    Paragraph.ToString(SaveFormat.Text).Contains(ControlChar.ColumnBreak)
  • Clone the paragraph and remove the Run node that contains the column break.
  • Get the text of the cloned paragraph and save it to the database.
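
The steps above might look roughly like the sketch below. It is an illustration under assumptions rather than a verified fix: the class and helper names are hypothetical, and it assumes the column break sits in a Run of its own, so any other text sharing that Run would be dropped from the extracted string as well.

```csharp
using Aspose.Words;

static class ColumnBreakHelper
{
    // Hypothetical helper: returns the paragraph text without the column break.
    public static string GetTextWithoutColumnBreak(Paragraph paragraph)
    {
        string text = paragraph.ToString(SaveFormat.Text);
        if (!text.Contains(ControlChar.ColumnBreak))
            return text.Trim();

        // Work on a clone so the original paragraph keeps its column break
        // and the layout of the saved document is not affected.
        Paragraph clone = (Paragraph)paragraph.Clone(true);

        // Copy the runs to an array first, because nodes are removed while iterating.
        foreach (Run run in clone.GetChildNodes(NodeType.Run, true).ToArray())
        {
            if (run.Text.Contains(ControlChar.ColumnBreak))
                run.Remove();
        }

        return clone.ToString(SaveFormat.Text).Trim();
    }
}
```

Since only the clone is modified, the source document still contains the column break when the translated text is written back with Find and Replace.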