Regex Cheat Sheet: Regular Expressions For Cleaning Up HTML
This article presents a collection of regular expressions I frequently use to clean up HTML generated from some tools’ export routines, e.g., Typora. A PowerShell script automates the clean-up task.
TL;DR
This PowerShell script implements the techniques explained in the article. It automates the process of cleaning an entire HTML file.
Links: Open in New Window
What This Does
Open links in a new window and replace single with double quotes.
Regex
- Search for:
<a href='([^']+)'>
- Replace with:
<a href="\1" target="_blank" rel="noopener">
Example
- Before:
<a href='https://helgeklein.com/blog/category/home-automation/'>
- After:
<a href="https://helgeklein.com/blog/category/home-automation/" target="_blank" rel="noopener">
Headings: Remove ID Attributes
What This Does
Remove ID attributes that some export routines add to every heading.
Regex
- Search for:
<h(\d)\s+id='[^']+'>
- Replace with:
<h\1>
Example
- Before:
<h2 id='what-are-grafana-loki--promtail'>
- After:
<h2>
Code: Remove Duplicate Syntax Highlighting Language Attributes
What This Does
Some export routines add duplicate language attributes to code blocks. This replaces the two attributes with a single new attribute that I prefer.
Regex
- Search for:
class='language-([^']+)'\s+lang='\1'>
- Replace with:
class="lang-\1">
Example
- Before:
<code class='language-yaml' lang='yaml'>
- After:
<code class="lang-yaml">
Replace Paragraph Tags with Newlines
What This Does
WordPress doesn’t require authors to enclose paragraphs in <p>
</p>
tags. Instead, it adds those automatically to the HTML when it identifies a paragraph.
To optimize readability in the backend, I remove HTML paragraph tags that are added by many tools’ export routines.
Regex
- Search for:
</?p>
- Replace with:
\n
Example
- Before:
<p>Text in a paragraph.</p>
- After:
\nText in a paragraph.\n
Replace HTML-Encoded Special Characters With Plain Text
What This Does
Many tools’ export routines replace special characters with their corresponding HTML entities, aka they HTML-encode such special characters. With WordPress, this type of encoding is not necessary, as WordPress does it automatically. To optimize readability in the backend, I convert HTML entities back to plain text.
The regex presented below uses a syntax that is available in Notepad++ (regex docs) and all other applications that use the boost regex library. The regex makes the following replacements:
<
→<
>
→>
'
→'
"
→"
&
→&
Regex
- Search for:
(<)|(>)|(')|(")|(&)
- Replace with:
(?1<)(?2>)(?3')(?4")(?5&)
Example
- Before:
<a href='([^']+)'>
- After:
<a href='([^']+)'>
1 Comment
The provided regular expression patterns cover tasks that often arise when working with HTML content. Programmers can use these templates as a starting point and adapt them to their specific cleaning needs