Regex Cheat Sheet: Regular Expressions For Cleaning Up HTML

This article presents a collection of regular expressions I frequently use to clean up HTML generated by the export routines of tools such as Typora. A PowerShell script automates the clean-up task.

TL;DR

This PowerShell script implements the techniques explained in the article. It automates the process of cleaning an entire HTML file.

Links: Open in New Window and Use Double Quotes

What This Does

Make links open in a new window and replace the single quotes around the href value with double quotes.

Regex

  • Search for:
    <a href='([^']+)'>
  • Replace with:
    <a href="\1" target="_blank" rel="noopener">

Example

  • Before:
    <a href='https://helgeklein.com/categories/home-automation-networking--self-hosting/'>
  • After:
    <a href="https://helgeklein.com/categories/home-automation-networking--self-hosting/" target="_blank" rel="noopener">
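
As a minimal sketch, the same search and replace can be expressed with Python's `re` module. The function name `open_links_in_new_window` is hypothetical and not part of the PowerShell script mentioned above. Like the regex itself, this only matches anchors whose sole attribute is a single-quoted href.

```python
import re

# Matches an anchor tag whose only attribute is a single-quoted href.
LINK_PATTERN = re.compile(r"<a href='([^']+)'>")


def open_links_in_new_window(html: str) -> str:
    """Rewrite matching anchors with a double-quoted href plus target and rel."""
    # \1 inserts the URL captured by the first group.
    return LINK_PATTERN.sub(r'<a href="\1" target="_blank" rel="noopener">', html)


before = "<a href='https://helgeklein.com/categories/home-automation-networking--self-hosting/'>"
print(open_links_in_new_window(before))
```

Anchors that already use double quotes or carry additional attributes are left unchanged by this pattern.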

Headings: Remove ID Attributes

What This Does

Remove ID attributes that some export routines add to every heading.

Regex

  • Search for:
    <h(\d)\s+id='[^']+'>
  • Replace with:
    <h\1>

Example

  • Before:
    <h2 id='what-are-grafana-loki--promtail'>
  • After:
    <h2>
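
A short Python sketch of the same replacement, assuming the heading tag carries nothing but a single-quoted id attribute. The helper name `strip_heading_ids` is made up for illustration.

```python
import re

# Captures the heading level (1-6) so it can be reused in the replacement.
HEADING_ID_PATTERN = re.compile(r"<h(\d)\s+id='[^']+'>")


def strip_heading_ids(html: str) -> str:
    """Drop the id attribute from opening heading tags, keeping the level."""
    return HEADING_ID_PATTERN.sub(r"<h\1>", html)


print(strip_heading_ids("<h2 id='what-are-grafana-loki--promtail'>What Are Grafana Loki & Promtail?</h2>"))
```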

Code: Remove Duplicate Syntax Highlighting Language Attributes

What This Does

Some export routines add duplicate language attributes to code blocks. This replaces the two attributes with a single new attribute that I prefer.

Regex

  • Search for:
    class='language-([^']+)'\s+lang='\1'>
  • Replace with:
    class="lang-\1">

Example

  • Before:
    <code class='language-yaml' lang='yaml'>
  • After:
    <code class="lang-yaml">
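
The search pattern relies on a backreference (`\1`) inside the pattern itself, so it only matches when both attributes name the same language. The following Python sketch illustrates that behavior. The function name `merge_language_attributes` is hypothetical.

```python
import re

# \1 inside the pattern requires lang='...' to repeat the captured language.
LANG_PATTERN = re.compile(r"class='language-([^']+)'\s+lang='\1'>")


def merge_language_attributes(html: str) -> str:
    """Collapse the duplicate class/lang pair into a single class attribute."""
    return LANG_PATTERN.sub(r'class="lang-\1">', html)


print(merge_language_attributes("<code class='language-yaml' lang='yaml'>"))
# Mismatched languages do not satisfy the backreference and stay untouched.
print(merge_language_attributes("<code class='language-yaml' lang='json'>"))
```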

Replace Paragraph Tags with Newlines

What This Does

WordPress doesn’t require authors to enclose paragraphs in <p> </p> tags. Instead, it adds those automatically to the HTML when it identifies a paragraph (via its wpautop function).

To optimize readability in the backend, I remove HTML paragraph tags that are added by many tools’ export routines.

Regex

  • Search for:
    </?p>
  • Replace with:
    \n

Example

  • Before:
    <p>Text in a paragraph.</p>
  • After:
    \nText in a paragraph.\n
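
In Python, the replacement might look like the sketch below. The function name `paragraphs_to_newlines` is an assumption for illustration. Note that the pattern only matches bare `<p>` and `</p>` tags, so paragraph tags with attributes are not touched.

```python
import re

# Matches both the opening <p> and the closing </p> tag (no attributes).
PARAGRAPH_TAG_PATTERN = re.compile(r"</?p>")


def paragraphs_to_newlines(html: str) -> str:
    """Replace every bare paragraph tag with a newline character."""
    return PARAGRAPH_TAG_PATTERN.sub("\n", html)


# repr() makes the inserted newline characters visible.
print(repr(paragraphs_to_newlines("<p>Text in a paragraph.</p>")))
```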

Replace HTML-Encoded Special Characters With Plain Text

What This Does

Many tools’ export routines replace special characters with their corresponding HTML entities, i.e., they HTML-encode them. With WordPress, this type of encoding is not necessary because WordPress does it automatically. To optimize readability in the backend, I convert HTML entities back to plain text.

The regex presented below uses a conditional replacement syntax that is available in Notepad++ (regex docs) and all other applications that use the Boost regex library. The regex makes the following replacements:

  • &lt; → <
  • &gt; → >
  • &#39; → '
  • &quot; → "
  • &amp; → &

Regex

  • Search for:
    (&lt;)|(&gt;)|(&#39;)|(&quot;)|(&amp;)
  • Replace with:
    (?1<)(?2>)(?3')(?4")(?5&)

Example

  • Before:
    &lt;a href=&#39;([^&#39;]+)&#39;&gt;
  • After:
    <a href='([^']+)'>
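
Python's `re` module does not support the Boost-style conditional replacement syntax, but a replacement function gives the same single-pass behavior. The sketch below is one way to do this. The names `ENTITY_MAP` and `decode_entities` are hypothetical.

```python
import re

# The five entities handled by the regex above, mapped to their plain text.
ENTITY_MAP = {
    "&lt;": "<",
    "&gt;": ">",
    "&#39;": "'",
    "&quot;": '"',
    "&amp;": "&",
}

# One alternation over all entities; re.escape guards any special characters.
ENTITY_PATTERN = re.compile("|".join(re.escape(entity) for entity in ENTITY_MAP))


def decode_entities(html: str) -> str:
    """Replace each known entity in a single pass over the input."""
    return ENTITY_PATTERN.sub(lambda match: ENTITY_MAP[match.group(0)], html)


print(decode_entities("&lt;a href=&#39;([^&#39;]+)&#39;&gt;"))
```

Because all replacements happen in one pass, a doubly encoded sequence such as `&amp;lt;` is decoded only one level, to `&lt;`, rather than all the way to `<`.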
