Is it possible to pre-process a .doc to remove extra paragraphs, spaces, hard line breaks, etc. before converting to .epub?

pubitbn · July 29, 2010, 4:26pm

My source.doc has a number of contiguous paragraphs, for example between items 5 and 6 (see source.doc).

These were obviously added by the author to be pleasing to the human eye.

However, after converting source.doc to output.epub, these extra paragraphs make viewing the EPUB in a reader quite unpleasant.

*** conversion code ***

Document doc = new Document(_sourceFilePath);
doc.Save(_destinationFilePath, SaveFormat.Epub);

Is there an option that tells the document to remove extra paragraphs, or to replace all contiguous paragraphs with a single paragraph?
The same question applies to spaces, hard line breaks, etc.

Basically, I need a way to define a set of rules that is applied to each .doc (the docs are created by a third party) to clean it up, so that the converted .epub looks consistent in the reader.
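
As an illustration of the "set of rules" idea, here is a minimal sketch in plain C# that applies an ordered list of clean-up rules to a string. It does not use Aspose at all: `CleanupRules` and its rules are hypothetical names, and the sketch assumes that hard line breaks appear in text as a vertical tab (`\v`) and tabs as `\t`. In a real document the rules would have to be applied to the text of each run (for example from inside a DocumentVisitor), and paragraphs would still need node-level handling.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: an ordered list of text clean-up rules.
public static class CleanupRules
{
    // Each rule maps a piece of text to its cleaned-up form.
    // Assumption: hard line breaks show up as '\v' and tabs as '\t'.
    private static readonly List<Func<string, string>> Rules = new List<Func<string, string>>
    {
        text => text.Replace("\t", ""),            // drop tabs
        text => text.Replace("\v", " "),           // hard line break -> single space
        text => Regex.Replace(text, " {2,}", " "), // collapse runs of spaces
    };

    // Applies every rule in order and returns the cleaned text.
    public static string Apply(string text)
    {
        foreach (Func<string, string> rule in Rules)
            text = rule(text);
        return text;
    }
}

public static class Program
{
    public static void Main()
    {
        string raw = "Item 5:\t first  line\vsecond line";
        Console.WriteLine(CleanupRules.Apply(raw)); // prints "Item 5: first line second line"
    }
}
```

Keeping the rules in a list means a rule can be added, removed, or reordered without touching the code that walks the document.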

FOLLOW UP QUESTION:
The converted .epub has a single HTML file for its content, with the CSS styles inline.
Can Aspose extract the style information from the .doc into a separate CSS file and have the single HTML file link to that file?

Thanks in advance…

P.S. I could not upload the .epub without adding a .zip extension; .epub was not in the list of allowed extensions.

alexey.noskov · July 30, 2010, 2:47am

Hi

Thanks for your inquiry. I think you can use DocumentVisitor to achieve what you need. For example, please try the following code:

Document doc = new Document(@"Test001\source.doc");
ParagraphResolver resolver = new ParagraphResolver();
doc.Accept(resolver);
doc.Save(@"Test001\out.doc");
doc.Save(@"Test001\out.epub");

private class ParagraphResolver: DocumentVisitor
{
    public override VisitorAction VisitParagraphEnd(Paragraph paragraph)
    {
        // Get next node after the paragraph.
        CompositeNode nextNode = (CompositeNode) paragraph.NextSibling;
        // If the paragraph is empty and the next node is also an empty paragraph, remove the paragraph.
        if (!paragraph.HasChildNodes && nextNode != null && !nextNode.HasChildNodes)
        {
            paragraph.Remove();
        }
        // If both paragraphs are not empty, concatenate them
        else if (paragraph.HasChildNodes && nextNode != null && nextNode.NodeType == NodeType.Paragraph && nextNode.HasChildNodes)
        {
            // If the next paragraph starts with tab, remove it.
            if (nextNode.FirstChild.NodeType == NodeType.Run)
            {
                Run run = (Run) nextNode.FirstChild;
                run.Text = run.Text.StartsWith("\t") ? run.Text.Substring(1) : run.Text;
            }
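            // Move the next paragraph's children one at a time; the child collection changes as nodes are moved.
            while (nextNode.HasChildNodes)
                paragraph.AppendChild(nextNode.FirstChild);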
        }
        return VisitorAction.Continue;
    }
}

Hope this helps.
Best regards,

pubitbn · August 2, 2010, 3:14pm

Thank you for the prompt reply.
It worked great for the paragraphs.

I am trying to take the same approach for other characters (see the attached image, iding chars.JPG).
I am going the route of VisitSpecialChar and GetText;
however, I need to know whether there is documentation that maps Aspose GetText return values
to Word characters (as in iding chars.JPG).

Thanks in advance…

adam.skelton · August 2, 2010, 8:14pm

Hi Brian,
I believe you are looking for the constants defined in the ControlChar class.
Thanks,
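
To make the mapping concrete, here is a small sketch in plain C# that labels control characters in a string with the names of the corresponding ControlChar constants. `ControlCharNames` is a hypothetical helper, and the character values in the table are assumptions that should be checked against the ControlChar reference for your Aspose.Words version.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical lookup from raw characters to ControlChar constant names.
public static class ControlCharNames
{
    // Assumed values; verify against the ControlChar documentation.
    private static readonly Dictionary<char, string> Names = new Dictionary<char, string>
    {
        { '\t', "Tab" },
        { '\v', "LineBreak" },          // hard (manual) line break
        { '\r', "ParagraphBreak" },
        { '\f', "PageBreak" },          // the same character is also used for section breaks
        { '\u00A0', "NonBreakingSpace" },
        { '\a', "Cell" },               // end-of-cell marker in tables
    };

    // Replaces each known control character with "<Name>" and keeps other characters as-is.
    public static string Describe(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text)
        {
            string name;
            if (Names.TryGetValue(c, out name))
                sb.Append('<').Append(name).Append('>');
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}

public static class Program
{
    public static void Main()
    {
        // prints "a<Tab>b<LineBreak>c<ParagraphBreak>"
        Console.WriteLine(ControlCharNames.Describe("a\tb\vc\r"));
    }
}
```

Running something like `Describe` over the text returned by GetText is a quick way to see which characters are actually present before writing removal rules.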

pubitbn · August 3, 2010, 10:17am

I added the following function to the above class as an initial test…

public override VisitorAction VisitSpecialChar(SpecialChar specialChar)
{
    if (specialChar.GetText() == ControlChar.Tab)
    {
        specialChar.Remove();
    }

    return VisitorAction.Continue;
}

However, tabs are not being deleted from the document. What am I doing wrong?

Thanks in advance.

alexey.noskov · August 3, 2010, 11:32am

Hi

Thanks for your request. Tabs are not treated as special characters in Word documents. If you need to remove all tabs, you can try the find-and-replace method, as shown below:

Document doc = new Document(@"Test001\in.doc");
doc.Range.Replace("\t", "", false, false);
doc.Save(@"Test001\out.doc");

Best regards,
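
The same find-and-replace approach might also cover the extra spaces mentioned in the original question. The following is only a sketch: it assumes that your Aspose.Words version provides a `Range.Replace` overload taking a `Regex` and a replacement string, and it reuses the sample paths from the snippet above. The class name `SpaceCleanup` is hypothetical.

```csharp
using System.Text.RegularExpressions;
using Aspose.Words;

// Hypothetical example class; requires the Aspose.Words assembly.
public static class SpaceCleanup
{
    public static void Main()
    {
        // Same sample paths as in the tab-removal snippet.
        Document doc = new Document(@"Test001\in.doc");

        // Assumed overload: Replace(Regex, string).
        // Collapses two or more consecutive spaces into a single space.
        doc.Range.Replace(new Regex(" {2,}"), " ");

        doc.Save(@"Test001\out.doc");
    }
}
```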