Converting HTML E-mail To Plain Text

by Bill Owens 22. July 2008 19:45
Posted: Thursday, July 10, 2008 10:34 PM by Simon Hutson

OK, I admit it. I've caught the CRM development bug. What started as a harmless bit of fun working on document library integration between CRM and SharePoint has now developed into an obsession. In this post I will describe how to build a plug-in that examines the body of any e-mail promoted from Outlook or received through the e-mail router and converts the HTML into plain text.


After a bit of searching, I found a good article, Convert HTML to Plain Text, which showed how to use regular expressions to remove unwanted HTML tags, leaving just the plain text. After converting it from C# to VB (my preferred language) and stripping out the bits I didn't need, I came up with the following code, which forms the basis of this plug-in.



' Requires "Imports System.Text.RegularExpressions" at the top of the file
Private Function ConvertHTMLToText(ByVal Source As String) As String

    Dim result As String = Source

    ' Replace line breaks and form feeds with a space. HTML treats them as
    ' plain whitespace, so deleting them outright would join adjacent words.
    ' \r - Matches a carriage return \u000D.
    ' \n - Matches a line feed \u000A.
    ' \f - Matches a form feed \u000C.
    ' For more details see http://msdn.microsoft.com/en-us/library/4edbef7e.aspx
    result = Regex.Replace(result, "[\r\n\f]", " ")
 
    ' Remove all attributes and whitespace from the <head> tag
    ' (\b stops "head" from also matching longer tag names such as <header>)
    result = Regex.Replace(result, "< *head\b[^>]*>", "<head>", RegexOptions.IgnoreCase)
    ' Remove all whitespace from the </head> tag
    result = Regex.Replace(result, "< */ *head *>", "</head>", RegexOptions.IgnoreCase)
    ' Delete everything between the <head> and </head> tags
    result = Regex.Replace(result, "<head>.*?</head>", String.Empty, RegexOptions.IgnoreCase)

    ' Remove all attributes and whitespace from all <script> tags
    result = Regex.Replace(result, "< *script\b[^>]*>", "<script>", RegexOptions.IgnoreCase)
    ' Remove all whitespace from all </script> tags
    result = Regex.Replace(result, "< */ *script *>", "</script>", RegexOptions.IgnoreCase)
    ' Delete everything between each <script> and </script> pair. The lazy .*?
    ' stops at the first closing tag, so text between two blocks is kept.
    result = Regex.Replace(result, "<script>.*?</script>", String.Empty, RegexOptions.IgnoreCase)

    ' Remove all attributes and whitespace from all <style> tags
    result = Regex.Replace(result, "< *style\b[^>]*>", "<style>", RegexOptions.IgnoreCase)
    ' Remove all whitespace from all </style> tags
    result = Regex.Replace(result, "< */ *style *>", "</style>", RegexOptions.IgnoreCase)
    ' Delete everything between each <style> and </style> pair
    result = Regex.Replace(result, "<style>.*?</style>", String.Empty, RegexOptions.IgnoreCase)

    ' Insert tabs in place of <td> tags
    result = Regex.Replace(result, "< *td\b[^>]*>", vbTab, RegexOptions.IgnoreCase)

    ' Insert single line breaks in place of <br> and <li> tags
    result = Regex.Replace(result, "< *br\b[^>]*>", vbCrLf, RegexOptions.IgnoreCase)
    result = Regex.Replace(result, "< *li\b[^>]*>", vbCrLf, RegexOptions.IgnoreCase)

    ' Insert double line breaks in place of <p>, <div> and <tr> tags
    result = Regex.Replace(result, "< *div\b[^>]*>", vbCrLf & vbCrLf, RegexOptions.IgnoreCase)
    result = Regex.Replace(result, "< *tr\b[^>]*>", vbCrLf & vbCrLf, RegexOptions.IgnoreCase)
    result = Regex.Replace(result, "< *p\b[^>]*>", vbCrLf & vbCrLf, RegexOptions.IgnoreCase)

    ' Remove all remaining HTML tags
    result = Regex.Replace(result, "<[^>]*>", String.Empty)

    ' Handle entities only now that the tags are gone, so escaped markup such
    ' as &lt;b&gt; survives as literal text instead of being stripped as a tag.

    ' Remove numeric character references such as &#nn; and &#nnn;
    result = Regex.Replace(result, "&#[0-9]{2,3};", String.Empty)

    ' Replace the most commonly used special characters
    result = Regex.Replace(result, "&lt;", "<", RegexOptions.IgnoreCase)
    result = Regex.Replace(result, "&gt;", ">", RegexOptions.IgnoreCase)
    result = Regex.Replace(result, "&nbsp;", " ", RegexOptions.IgnoreCase)
    result = Regex.Replace(result, "&quot;", """", RegexOptions.IgnoreCase)

    ' Remove all other special characters (everything except &amp;).
    ' More can be added - see the following for more details:
    ' http://www.degraeve.com/reference/specialcharacters.php
    ' http://www.web-source.net/symbols.htm
    result = Regex.Replace(result, "&(?!amp;)[a-z0-9]{2,8};", String.Empty, RegexOptions.IgnoreCase)

    ' Decode &amp; last, so text such as &amp;lt; ends up as the literal &lt;
    result = Regex.Replace(result, "&amp;", "&", RegexOptions.IgnoreCase)
 
    ' Replace repeating spaces with a single space
    result = Regex.Replace(result, " +", " ")

    ' Remove any trailing spaces and tabs from the end of each line
    result = Regex.Replace(result, "[ \t]+\r\n", vbCrLf)

    ' Remove any leading spaces from the start of each line
    result = Regex.Replace(result, "\r\n +", vbCrLf)

    ' Remove any leading whitespace characters
    result = Regex.Replace(result, "^\s+", String.Empty)

    ' Remove any trailing whitespace characters
    result = Regex.Replace(result, "\s+$", String.Empty)

    ' Remove extra line breaks if there are more than two in a row
    result = Regex.Replace(result, "\r\n\r\n(\r\n)+", vbCrLf & vbCrLf)

    ' That's it.
    Return result
 
End Function
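
One detail worth checking when deleting paired blocks such as <script>...</script> with a regular expression is whether the quantifier is greedy or lazy. The following is a minimal console sketch, not part of the plug-in itself; the sample HTML string and the module name are made up for illustration. It compares the greedy pattern ".*" with the lazy pattern ".*?" on a body that contains two script blocks with real text between them.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedyVersusLazySketch

    Sub Main()
        ' Hypothetical sample: two script blocks with real text between them.
        Dim html As String = "<script>a()</script>Hello<script>b()</script> World"

        ' Greedy: .* runs from the first <script> to the LAST </script>,
        ' so the text "Hello" between the two blocks is deleted as well.
        Dim greedyResult As String = Regex.Replace(html, "<script>.*</script>", String.Empty, RegexOptions.IgnoreCase)

        ' Lazy: .*? stops at the first </script> it meets, so each block
        ' is removed on its own and "Hello" survives.
        Dim lazyResult As String = Regex.Replace(html, "<script>.*?</script>", String.Empty, RegexOptions.IgnoreCase)

        Console.WriteLine("Greedy: [" & greedyResult & "]") ' Greedy: [ World]
        Console.WriteLine("Lazy:   [" & lazyResult & "]")   ' Lazy:   [Hello World]
    End Sub

End Module
```

For an e-mail with a single script or style block both patterns give the same result; the difference only shows up when there are several blocks.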

All that remains is to implement the IPlugin.Execute method. In order to be able to modify the e-mail message before the e-mail activity gets created in the database, I had to figure out which event(s) to intercept. Through a bit of trial and error, I observed that any e-mail promoted from Outlook triggers the "DeliverPromote" event, whereas any incoming e-mail handled by the e-mail router triggers the "DeliverIncoming" event. Interestingly enough, the "Create" event was also called as a child pipeline for these events, but modifying the message here didn't have any effect, even in the pre-processing stage.


Because plug-ins have the potential to introduce significant performance and scalability issues into your environment, it is important to ensure that the code is as efficient as possible. To that end I added checks so that, even if the plug-in is registered on multiple events, the main code will only run if the plug-in:



  1. is running on the 'DeliverPromote' or 'DeliverIncoming' messages

  2. is running synchronously

  3. is running against the 'Email' entity

  4. is running in the 'pre-processing' stage of the pipeline

  5. is running in a 'Parent' pipeline


Public Class ConvertHtmlToText
    Implements IPlugin
 
    Public Sub Execute(ByVal context As IPluginExecutionContext) Implements IPlugin.Execute
 
        ' Exit if any of the following conditions are true:
        '  1. plug-in is not running synchronously (Mode is not 0)
        '  2. plug-in is not running against the 'Email' entity
        '  3. plug-in is not running in the 'pre-processing' stage of the pipeline (Stage is not 10)
        '  4. plug-in is not running in a 'Parent' pipeline (InvocationSource is not 0)
        ' OrElse short-circuits, so evaluation stops at the first failed check.
        If context.Mode <> 0 OrElse context.PrimaryEntityName <> "email" OrElse context.Stage <> 10 OrElse context.InvocationSource <> 0 Then
            Exit Sub
        End If
 
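        If context.MessageName = "DeliverPromote" OrElse context.MessageName = "DeliverIncoming" Then

            ' Look the property up directly rather than assigning to the
            ' collection while enumerating it with For Each.
            If context.InputParameters.Properties.Contains("Body") Then
                Dim body As String = CStr(context.InputParameters.Properties.Item("Body"))
                context.InputParameters.Properties.Item("Body") = ConvertHTMLToText(body)
            End If

        End If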
 
    End Sub
 
End Class

As always, I have included the source code to my project here. Please do bear in mind that I haven't included any error handling or logging, so it's not production-ready. However, it should provide you with a good head start.
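
As one possible starting point for the missing error handling, here is a minimal sketch of a wrapper that falls back to the unmodified body when the conversion throws, so that a conversion failure does not block the e-mail. ConvertOrKeepOriginal and FailingConverter are hypothetical names that are not in the original project, and the sketch assumes .NET 3.5 or later for Func(Of String, String). Whether to swallow the exception like this or rethrow it is a design decision that depends on your environment.

```vbnet
Imports System

Module SafeConversionSketch

    ' Hypothetical helper (not part of the original project): applies a
    ' converter to the body and returns the unmodified body if it throws.
    Function ConvertOrKeepOriginal(ByVal body As String, ByVal converter As Func(Of String, String)) As String
        If String.IsNullOrEmpty(body) Then
            Return body
        End If

        Try
            Return converter(body)
        Catch ex As Exception
            ' A real plug-in would log the exception; this sketch only reports it.
            Console.Error.WriteLine("Conversion failed: " & ex.Message)
            Return body
        End Try
    End Function

    ' Stand-in converter that always fails, used only for this demonstration.
    Private Function FailingConverter(ByVal source As String) As String
        Throw New InvalidOperationException("simulated failure")
    End Function

    Sub Main()
        Dim original As String = "<p>Hello</p>"
        Dim result As String = ConvertOrKeepOriginal(original, AddressOf FailingConverter)

        ' The original body comes back unchanged because the converter threw.
        Console.WriteLine(result = original) ' True
    End Sub

End Module
```

In the plug-in, the same idea would mean passing AddressOf ConvertHTMLToText as the converter.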


This posting is provided "AS IS" with no warranties, and confers no rights.

