Regex Cheat Sheet: Regular Expressions For Cleaning Up HTML

This article presents a collection of regular expressions I frequently use to clean up HTML generated from some tools’ export routines, e.g., Typora. A PowerShell script automates the clean-up task.

TL;DR

This PowerShell script implements the techniques explained in the article. It automates the process of cleaning an entire HTML file.

What This Does

Open links in a new window and replace single with double quotes.

Regex

  • Search for:
    ``
  • Replace with:
    ``

Example

  • Before:
    <a href='https://helgeklein.com/blog/category/home-automation/'>
  • After:
    <a href="https://helgeklein.com/blog/category/home-automation/" target="_blank" rel="noopener">

Headings: Remove ID Attributes

What This Does

Remove ID attributes that some export routines add to every heading.

Regex

  • Search for:
    ``
  • Replace with:
    ``

Example

  • Before:
    <h2 id='what-are-grafana-loki--promtail'>
  • After:
    <h2>

Code: Remove Duplicate Syntax Highlighting Language Attributes

What This Does

Some export routines add duplicate language attributes to code blocks. This replaces the two attributes with a single new attribute that I prefer.

Regex

  • Search for:
    class='language-([^']+)'\s+lang='\1'>
  • Replace with:
    class="lang-\1">

Example

  • Before:
    <code class='language-yaml' lang='yaml'>
  • After:
    <code class="lang-yaml">

Replace Paragraph Tags with Newlines

What This Does

WordPress doesn’t require authors to enclose paragraphs in ``

``

tags. Instead, it adds those automatically to the HTML when it identifies a paragraph.

To optimize readability in the backend, I remove HTML paragraph tags that are added by many tools’ export routines.

Regex

  • Search for:
    ?p>
  • Replace with:
    \n

Example

  • Before:
    <p>Text in a paragraph.</p>
  • After:
    \nText in a paragraph.\n

Replace HTML-Encoded Special Characters With Plain Text

What This Does

Many tools’ export routines replace special characters with their corresponding HTML entities, aka they HTML-encode such special characters. With WordPress, this type of encoding is not necessary, as WordPress does it automatically. To optimize readability in the backend, I convert HTML entities back to plain text.

The regex presented below uses a syntax that is available in Notepad++ (regex docs) and all other applications that use the boost regex library. The regex makes the following replacements:

  • &lt;<
  • &gt;>
  • &#39;'
  • &quot;"
  • &amp;&

Regex

  • Search for:
    (&lt;)|(&gt;)|(&#39;)|(&quot;)|(&amp;)
  • Replace with:
    (?1<)(?2>)(?3')(?4")(?5&)

Example

  • Before:
    &amp;lt;a href=&amp;#39;([^&amp;#39;]+)&amp;#39;&amp;gt;
  • After:
    &lt;a href='([^']+)'&gt;

Comments

Related Posts

Latest Posts