by: Helge, published: Aug 7, 2023, updated: May 12, 2024, in

Miscellaneous

Regex Cheat Sheet: Regular Expressions For Cleaning Up HTML

This article presents a collection of regular expressions I frequently use to clean up HTML generated from some tools’ export routines, e.g., Typora. A PowerShell script automates the clean-up task.

TL;DR

This PowerShell script implements the techniques explained in the article. It automates the process of cleaning an entire HTML file.

Links: Open in New Window

What This Does

Open links in a new window and replace single with double quotes.

Regex

Search for:
<a href='([^']+)'>
Replace with:
<a href="\1" target="_blank" rel="noopener">

Example

Before:
<a href='https://helgeklein.com/blog/category/home-automation/'>
After:
<a href="https://helgeklein.com/blog/category/home-automation/" target="_blank" rel="noopener">

Headings: Remove ID Attributes

What This Does

Remove ID attributes that some export routines add to every heading.

Regex

Search for:
<h(\d)\s+id='[^']+'>
Replace with:
<h\1>

Example

Before:
<h2 id='what-are-grafana-loki--promtail'>
After:
<h2>

Code: Remove Duplicate Syntax Highlighting Language Attributes

What This Does

Some export routines add duplicate language attributes to code blocks. This replaces the two attributes with a single new attribute that I prefer.

Regex

Search for:
class='language-([^']+)'\s+lang='\1'>
Replace with:
class="lang-\1">

Example

Before:
<code class='language-yaml' lang='yaml'>
After:
<code class="lang-yaml">

Replace Paragraph Tags with Newlines

What This Does

WordPress doesn’t require authors to enclose paragraphs in <p> </p> tags. Instead, it adds those automatically to the HTML when it identifies a paragraph.

To optimize readability in the backend, I remove HTML paragraph tags that are added by many tools’ export routines.

Regex

Search for:
</?p>
Replace with:
\n

Example

Before:
<p>Text in a paragraph.</p>
After:
\nText in a paragraph.\n

Replace HTML-Encoded Special Characters With Plain Text

What This Does

Many tools’ export routines replace special characters with their corresponding HTML entities, aka they HTML-encode such special characters. With WordPress, this type of encoding is not necessary, as WordPress does it automatically. To optimize readability in the backend, I convert HTML entities back to plain text.

The regex presented below uses a syntax that is available in Notepad++ (regex docs) and all other applications that use the boost regex library. The regex makes the following replacements:

< → <
> → >
' → '
" → "
& → &

Regex

Search for:
(<)|(>)|(')|(")|(&)
Replace with:
(?1<)(?2>)(?3')(?4")(?5&)

Example

Before:
<a href='([^']+)'>
After:
<a href='([^']+)'>

About the Author

Helge Klein (ex CTP, MVP, and vExpert) worked as a consultant and developer before founding vast limits, the uberAgent company, which was acquired by the Citrix business unit of Cloud Software Group in late 2023. Previously, Helge applied his extensive knowledge in IT infrastructure projects and architected a user profile management product, the successor of which is now available as Citrix Profile Management. Helge is the author of the popular tools Delprof2 and SetACL. He has presented at Citrix Synergy, BriForum, E2EVC, Splunk .conf, and many other events.

1 Comment

worktime

October 12, 2023 at 14:21

The provided regular expression patterns cover tasks that often arise when working with HTML content. Programmers can use these templates as a starting point and adapt them to their specific cleaning needs

Regex Cheat Sheet: Regular Expressions For Cleaning Up HTML

Contents

TL;DR

Links: Open in New Window

What This Does

Regex

Example

Headings: Remove ID Attributes

What This Does

Regex

Example

Code: Remove Duplicate Syntax Highlighting Language Attributes

What This Does

Regex

Example

Replace Paragraph Tags with Newlines

What This Does

Regex

Example

Replace HTML-Encoded Special Characters With Plain Text

What This Does

Regex

Example

About the Author

1 Comment

Leave a Reply Cancel reply

Regex Cheat Sheet: Regular Expressions For Cleaning Up HTML

Contents

TL;DR

Links: Open in New Window

What This Does

Regex

Example

Headings: Remove ID Attributes

What This Does

Regex

Example

Code: Remove Duplicate Syntax Highlighting Language Attributes

What This Does

Regex

Example

Replace Paragraph Tags with Newlines

What This Does

Regex

Example

Replace HTML-Encoded Special Characters With Plain Text

What This Does

Regex

Example

About the Author

1 Comment

Leave a Reply Cancel reply

Related Posts

Splunk Search Results: JSON to HTML Table Conversion in PowerShell

Regex Search & Replace in WordPress Posts With WP-CLI

Should I Have Separate Personal & Professional Twitter Accounts?

Renaming Multiple Files With Regular Expressions in Total Commander

Latest Posts

Enertex KNX IP Secure Router: Initial Configuration Without 2nd Interface

How to Sync & Backup Frigate NVR Recordings to Offsite Cloud Storage

Frigate: NVR With Object Detection on Raspberry Pi 5 & Coral TPU

Elasticsearch ES|QL: Energy Consumption Chart With Home Assistant Data