Build a Markdown to HTML Converter in Python
Markdown powers much of the web — from README files to blog posts to documentation. Converting Markdown to HTML is a common task, and Python gives you excellent tools to do it well.
This guide walks through the full conversion pipeline: reading a Markdown file, parsing it to HTML, adding syntax highlighting for code blocks, sanitizing the output against XSS attacks, and wrapping it all in a working CLI tool.
The Conversion Pipeline
Most Markdown-to-HTML converters follow the same six-stage pipeline, two stages of which are optional:
read .md file
→ parse with a markdown library
→ (optional) apply syntax highlighting to code blocks
→ (optional) sanitize HTML to strip dangerous tags/attributes
→ wrap in an HTML template
→ write to output file
The two optional steps — highlighting and sanitization — are where most real-world converters spend their logic. The parsing step itself is straightforward; it’s the surrounding concerns that require care.
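As a rough sketch of that shape, each stage can be modelled as a function from string to string, with the optional ones skipped when they are not supplied. The name run_pipeline and its parameters are made up for this illustration, and the stand-in lambdas only mimic real stages. In the later sections highlighting actually happens inside the parser's renderer rather than as a separate pass.

```python
from typing import Callable, Optional

Stage = Callable[[str], str]


def run_pipeline(
    md_text: str,
    parse: Stage,
    highlight: Optional[Stage] = None,
    sanitize: Optional[Stage] = None,
    wrap: Optional[Stage] = None,
) -> str:
    """Run the parse stage, then every optional stage that was provided."""
    html = parse(md_text)
    for stage in (highlight, sanitize, wrap):
        if stage is not None:
            html = stage(html)
    return html


# Stand-in stages so the sketch runs without any third-party library
page = run_pipeline(
    "hello",
    parse=lambda text: f"<p>{text}</p>",
    wrap=lambda body: f"<html><body>{body}</body></html>",
)
print(page)
```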
Setting Up
Install the dependencies you need for this guide:
pip install mistune markdown pygments bleach
- mistune — fast pure-Python markdown parser
- markdown — the classic Python-Markdown library with a rich extension ecosystem
- pygments — syntax highlighting engine used by both libraries
- bleach — HTML sanitizer for stripping dangerous tags
Converting Markdown with mistune
mistune is the simplest library for converting Markdown to HTML. Its API is a single function call. Pass your markdown text to mistune.html() and it returns a string of raw HTML:
import mistune
md = """
## Heading
This is **bold** and this is *italic*.
- Item one
- Item two
"""
html = mistune.html(md)
print(html)
<h2>Heading</h2>
<p>This is <strong>bold</strong> and this is <em>italic</em>.</p>
<ul>
<li>Item one</li>
<li>Item two</li>
</ul>
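mistune.html() is a preconfigured converter that already enables a few plugins, including tables and strikethrough, so basic parsing needs no setup. Other syntax, such as task lists and bare-URL autolinks, ships as plugins that you enable yourself through mistune.create_markdown(plugins=[...]).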
Converting Markdown with Python-Markdown
The markdown library takes a different approach: a lean core plus an extension system. Core syntax needs no configuration; optional features are enabled by passing extension names.
import markdown
md = """
## Heading
This is **bold** text.
> A blockquote for emphasis.
"""
html = markdown.markdown(md)
print(html)
<h2>Heading</h2>
<p>This is <strong>bold</strong> text.</p>
<blockquote>
<p>A blockquote for emphasis.</p>
</blockquote>
For fenced code blocks, add the fenced_code extension. For syntax highlighting, also add codehilite:
import markdown
md = """
Here's some Python:

~~~python
def add(a, b):
    return a + b
~~~

And a JavaScript example:

~~~javascript
const add = (a, b) => a + b;
~~~
"""
html = markdown.markdown(md, extensions=['fenced_code', 'codehilite'])
print(html)
Python-Markdown requires explicit extension activation. The codehilite extension adds CSS classes to code blocks — include a Pygments stylesheet in your HTML output to see colors.
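To illustrate explicit activation beyond code blocks, here is a small sketch that turns on two other bundled extensions, tables and toc. It uses the markdown.Markdown class rather than the markdown.markdown() shortcut so that the generated table of contents can be read from the instance afterwards. The sample text is made up for the example.

```python
import markdown

text = """
# Title

## Section

| Name | Value |
|------|-------|
| a    | 1     |
"""

# Extensions are opt-in: without "tables" the pipe rows would stay plain text
md = markdown.Markdown(extensions=["tables", "toc"])
html = md.convert(text)

print(html)     # body HTML; the toc extension also gives headings id attributes
print(md.toc)   # table of contents as an HTML fragment, available after convert()
```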
Adding Syntax Highlighting with Pygments
Neither mistune nor Python-Markdown highlights code by default. Both can delegate code block coloring to Pygments, but they integrate with it differently.
Highlighting with mistune
mistune requires a custom renderer. Subclass mistune.HTMLRenderer and override the block_code method:
import mistune
from pygments import highlight
from pygments.lexers import get_lexer_by_name, TextLexer
from pygments.formatters import HtmlFormatter

class HighlightRenderer(mistune.HTMLRenderer):
    # The base class calls this parameter `info`; keep the name, since newer
    # mistune versions pass it by keyword.
    def block_code(self, code, info=None):
        if info:
            # The info string may carry extra words; the first is the language
            lexer = get_lexer_by_name(info.split()[0], stripall=True)
        else:
            lexer = TextLexer()
        formatter = HtmlFormatter()
        highlighted = highlight(code, lexer, formatter)
        # The Pygments CSS belongs in the page <head>, not inside every block
        return f'<div class="highlight">{highlighted}</div>\n'

renderer = HighlightRenderer()
render = mistune.create_markdown(renderer=renderer)
md = """
Here is Python:

~~~python
def multiply(a, b):
    return a * b
~~~

And JavaScript:

~~~javascript
const multiply = (a, b) => a * b;
~~~
"""
print(render(md))
The block_code method receives the raw code string and the fence's optional info string, which normally holds the language name. It looks up the matching Pygments lexer, highlights the code, and returns the formatted HTML.
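One thing a renderer like this has to cope with is a language tag that Pygments does not know: get_lexer_by_name() raises ClassNotFound in that case. A possible fallback, sketched here with a hypothetical helper named pick_lexer, is to treat such blocks as plain text.

```python
from pygments.lexers import TextLexer, get_lexer_by_name
from pygments.util import ClassNotFound


def pick_lexer(info):
    """Return a lexer for a fence info string, or a plain-text lexer as fallback."""
    if not info or not info.strip():
        return TextLexer()
    name = info.split()[0]  # ignore anything after the language name
    try:
        return get_lexer_by_name(name, stripall=True)
    except ClassNotFound:
        # Unknown language tag: render the block as plain text instead of crashing
        return TextLexer()


print(type(pick_lexer("python")).__name__)
print(type(pick_lexer("no-such-language")).__name__)
```

Inside block_code, the lexer lookup could then be replaced by a single call to this helper.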
Highlighting with Python-Markdown
Python-Markdown’s codehilite extension handles highlighting automatically:
import markdown
md = """
Python:

~~~python
def add(a, b):
    return a + b
~~~

JavaScript:

~~~javascript
const add = (a, b) => a + b;
~~~
"""
html = markdown.markdown(md, extensions=['codehilite', 'fenced_code'])
print(html)
To actually see colors in the browser, include a Pygments stylesheet in your HTML template:
from pygments.formatters import HtmlFormatter
def get_pygments_css():
    formatter = HtmlFormatter(style='monokai')
    return f'<style>{formatter.get_style_defs(".codehilite")}</style>'
Sanitizing HTML Output
Raw Markdown conversion is not safe for untrusted input. Markdown parsers pass through raw HTML, which means a user can inject <script> tags or event handlers like <img onerror="...">.
Render untrusted Markdown through a sanitizer before outputting HTML. This is not optional — it is a security requirement.
Using bleach
bleach.clean() strips dangerous tags and attributes while preserving safe HTML:
import bleach
dirty = """
<p>Hello <script>alert("xss")</script></p>
<p>Click <a href="javascript:alert(1)">here</a></p>
<p><img src=x onerror="alert(1)"></p>
"""
clean = bleach.clean(
    dirty,
    tags=['p', 'a'],
    attributes={'a': ['href']},
    strip=True
)
print(clean)
<p>Hello alert("xss")</p>
<p>Click <a>here</a></p>
<p></p>
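With strip=True, disallowed tags are removed but their text content is kept, which is why alert("xss") survives as harmless plain text; without it, bleach escapes the tags instead. The href attribute is allowed on <a> tags, but bleach drops any href whose scheme is not in its list of allowed protocols, so the javascript: URL disappears.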
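To make the options concrete, the following sketch runs one input through bleach.clean() three ways: escaping disallowed tags, stripping them, and additionally narrowing the protocols argument so that only https links keep their href. The input string and variable names are illustrative.

```python
import bleach

dirty = (
    '<p>Hi <script>alert(1)</script> '
    '<a href="javascript:alert(1)">x</a> '
    '<a href="http://example.com">y</a> '
    '<a href="https://example.com">z</a></p>'
)

allowed = dict(tags={"p", "a"}, attributes={"a": ["href"]})

# strip=False (the default) escapes disallowed tags so they show up as text
escaped = bleach.clean(dirty, strip=False, **allowed)

# strip=True drops the tags themselves but keeps the text between them
stripped = bleach.clean(dirty, strip=True, **allowed)

# protocols narrows which URL schemes may appear in attributes such as href
https_only = bleach.clean(dirty, strip=True, protocols={"https"}, **allowed)

for label, value in [("escaped", escaped), ("stripped", stripped), ("https_only", https_only)]:
    print(label, "->", value)
```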
Full Pipeline: mistune to bleach
Combine parsing and sanitization into a single conversion function:
import mistune
import bleach
def convert_md_to_html(md_text, sanitize=True):
    raw_html = mistune.html(md_text)
    if not sanitize:
        return raw_html
    return bleach.clean(
        raw_html,
        tags=[
            'h1', 'h2', 'h3', 'h4', 'p', 'ul', 'ol', 'li',
            'strong', 'em', 'b', 'i', 'u', 's',
            'a', 'code', 'pre', 'blockquote',
            'div', 'span', 'br', 'hr',
            'img', 'table', 'thead', 'tbody', 'tr', 'th', 'td',
        ],
        attributes={
            'a': ['href', 'title', 'target', 'rel'],
            'img': ['src', 'alt', 'title', 'width', 'height'],
            'code': ['class'],
            'span': ['class'],
            'div': ['class'],
            '*': ['id'],
        },
        strip=True
    )
# Test with a potentially dangerous input
md_input = """
## A Safe Article

Click [here](https://example.com)!

<script>document.cookie</script>

~~~
print("safe")
~~~
"""
result = convert_md_to_html(md_input)
print(result)
<h2>A Safe Article</h2>
<p>Click <a href="https://example.com">here</a>!</p>
document.cookie
<pre><code>print("safe")
</code></pre>
The script tag is stripped and only its text content remains as inert plain text, the link is preserved, and the code block is rendered safely. bleach.clean() does not add rel="nofollow" to links; use bleach.linkify() if you want that.
Building a Full CLI Converter
Now assemble the pieces into a reusable command-line tool. This project structure keeps concerns separate:
markdown_converter/
├── converter/
│ ├── __init__.py
│ ├── core.py
│ ├── renderer.py
│ └── sanitizer.py
├── main.py
└── requirements.txt
converter/renderer.py
import mistune
from pygments import highlight
from pygments.lexers import get_lexer_by_name, TextLexer
from pygments.formatters import HtmlFormatter
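class HighlightRenderer(mistune.HTMLRenderer):
    # The base class calls this parameter `info`; keep the name, since newer
    # mistune versions pass it by keyword.
    def block_code(self, code, info=None):
        if info:
            # The info string may carry extra words; the first is the language
            lexer = get_lexer_by_name(info.split()[0], stripall=True)
        else:
            lexer = TextLexer()
        formatter = HtmlFormatter()
        highlighted = highlight(code, lexer, formatter)
        return f'<div class="highlight">{highlighted}</div>\n'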
converter/sanitizer.py
import bleach
ALLOWED_TAGS = [
    'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
    'p', 'ul', 'ol', 'li',
    'strong', 'em', 'b', 'i', 'u', 's',
    'a', 'code', 'pre', 'blockquote',
    'div', 'span', 'br', 'hr',
    'img', 'table', 'thead', 'tbody', 'tr', 'th', 'td',
]

ALLOWED_ATTRIBUTES = {
    'a': ['href', 'title', 'target', 'rel'],
    'img': ['src', 'alt', 'title', 'width', 'height'],
    'code': ['class'],
    'span': ['class'],
    'div': ['class'],
    '*': ['id'],
}

def sanitize_html(raw_html):
    return bleach.clean(
        raw_html,
        tags=ALLOWED_TAGS,
        attributes=ALLOWED_ATTRIBUTES,
        strip=True
    )
converter/core.py
import mistune
from .renderer import HighlightRenderer
from .sanitizer import sanitize_html
def convert_md_to_html(md_text, sanitize=True):
    renderer = HighlightRenderer()
    markdown = mistune.create_markdown(renderer=renderer)
    raw_html = markdown(md_text)
    if sanitize:
        return sanitize_html(raw_html)
    return raw_html

def convert_file(input_path, output_path, sanitize=True):
    with open(input_path, 'r', encoding='utf-8') as f:
        md_text = f.read()
    html = convert_md_to_html(md_text, sanitize=sanitize)
    full_page = wrap_in_html_template(html)
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(full_page)
    return output_path

def wrap_in_html_template(body_content, title="Document"):
    return f"""<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>{title}</title>
<style>
body {{ font-family: system-ui, sans-serif; line-height: 1.6; max-width: 800px; margin: 0 auto; padding: 2rem; }}
.highlight {{ background: #f4f4f4; padding: 1rem; overflow-x: auto; border-radius: 4px; }}
pre {{ margin: 0; }}
</style>
</head>
<body>
{body_content}
</body>
</html>"""
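The template above styles the .highlight container but carries no rules for the token classes Pygments emits, so highlighted code would still appear in a single color. A possible variant, sketched below with a hypothetical build_page function, embeds the rules from HtmlFormatter.get_style_defs() and escapes the title before interpolating it. The default style name is only an illustrative choice.

```python
import html

from pygments.formatters import HtmlFormatter


def build_page(body_html, title="Document", style="default"):
    """Wrap converted HTML in a page that also carries the Pygments CSS rules."""
    # Scope the generated rules to the .highlight wrapper used by the renderer
    pygments_css = HtmlFormatter(style=style).get_style_defs(".highlight")
    safe_title = html.escape(title)  # the title is plain text, not markup
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n'
        "<head>\n"
        '<meta charset="UTF-8">\n'
        f"<title>{safe_title}</title>\n"
        f"<style>\n{pygments_css}\n</style>\n"
        "</head>\n"
        f"<body>\n{body_html}\n</body>\n"
        "</html>"
    )


page = build_page("<p>Hello</p>", title="Notes & <drafts>")
print(page[:200])
```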
main.py
import argparse
import sys
from pathlib import Path

from converter.core import convert_file

def main():
    parser = argparse.ArgumentParser(description='Convert Markdown to HTML')
    parser.add_argument('input', help='Input Markdown file')
    parser.add_argument('-o', '--output', help='Output HTML file', default=None)
    parser.add_argument(
        '--no-sanitize',
        action='store_true',
        help='Skip HTML sanitization'
    )
    args = parser.parse_args()
    # with_suffix() changes only the final extension; str.replace() would also
    # rewrite any other ".md" in the path and leave extensionless names unchanged
    output = args.output or str(Path(args.input).with_suffix('.html'))
    try:
        result = convert_file(args.input, output, sanitize=not args.no_sanitize)
        print(f"Written: {result}")
    except FileNotFoundError:
        print(f"Error: '{args.input}' not found", file=sys.stderr)
        sys.exit(1)
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)

if __name__ == '__main__':
    main()
Run it from the command line:
python main.py article.md -o article.html
Written: article.html
With --no-sanitize, the converter skips the bleach step — useful when you control the input and want maximum fidelity, but dangerous for user-submitted content.
See Also
- working-with-files — read and write files on disk, which is the first step in the conversion pipeline
- regex-guide — pattern matching and text processing that complements markdown parsing
- requests-library — fetch remote Markdown files over HTTP before converting them to HTML
Summary
You now have a complete picture of converting Markdown to HTML in Python:
| Step | What happens | Key tool |
|---|---|---|
| Read | Load .md file | built-in open() |
| Parse | Convert Markdown syntax to HTML | mistune.html() or markdown.markdown() |
| Highlight | Colorize code blocks | Pygments via custom renderer or codehilite |
| Sanitize | Strip dangerous HTML/attributes | bleach.clean() |
| Output | Wrap in HTML template and write | string formatting |
mistune is the faster, simpler choice for new projects. Python-Markdown’s extension ecosystem is valuable when you need TOC generation, footnotes, or SmartyPants processing. Either way, make sanitization a mandatory step for any untrusted input.
The CLI tool gives you a reusable command for batch conversions or integration into static site generators. Adapt the project structure to your needs — swap the renderer for a different highlighting theme, tighten the allowed tags list for stricter security, or add a watch mode for live preview.
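For the batch-conversion use case, one possible shape is a helper that walks a source directory and mirrors it into an output directory. The function name convert_tree and its convert parameter are assumptions for this sketch. In the project above, convert could be convert_md_to_html combined with wrap_in_html_template; here a stand-in lambda keeps the example free of third-party imports.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def convert_tree(src_dir, out_dir, convert: Callable[[str], str]) -> List[Path]:
    """Convert every .md file under src_dir, mirroring the layout in out_dir."""
    src_root, out_root = Path(src_dir), Path(out_dir)
    written = []
    for md_path in sorted(src_root.rglob("*.md")):
        # Keep the relative location, swapping only the file extension
        target = out_root / md_path.relative_to(src_root).with_suffix(".html")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(convert(md_path.read_text(encoding="utf-8")), encoding="utf-8")
        written.append(target)
    return written


# Demonstration in a throwaway directory with a stand-in converter
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    (src / "guides").mkdir(parents=True)
    (src / "index.md").write_text("Home", encoding="utf-8")
    (src / "guides" / "intro.md").write_text("Intro", encoding="utf-8")
    results = convert_tree(src, Path(tmp) / "out", lambda text: f"<p>{text}</p>")
    relative = [p.relative_to(Path(tmp) / "out").as_posix() for p in results]
    print(relative)
```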