Image/link attributes containing ], ), or spaces produce broken Markdown output #261

@Elijah-J

Description

convert_img, convert_a, and convert_video can emit Markdown that downstream parsers do not read as the original HTML. The failure mode is that markdownify drops raw values into Markdown link/image syntax without escaping them for that context. For img and video that means attribute-backed values like alt, src, and poster; for a and video it also includes generated label text inside [...].

Confirmed on 1.2.2 and current develop (at the time of filing, markdownify/__init__.py was byte-identical on both).

Reproducer

from markdownify import markdownify as md

md('<img src="/a" alt="]">')
# Output:   '![]](/a)'
# Expected: '![\\]](/a)' (image preserved)
# Re-parse: renders as literal text, image destroyed

md('<img src="/a b" alt="x">')
# Output:   '![x](/a b)'
# Expected: '![x](</a b>)' or URL-encoded
# Re-parse: literal text in 3 of 4 parsers

md('<img src="/safe" alt="](http://attacker)">')
# Output:   '![](http://attacker)](/safe)'
# Re-parse: <img src="http://attacker" alt=""/>](/safe)
#            attacker-controlled URL substituted, original destination left as trailing literal text

md('<a href="/a)b">click</a>')
# Output:   '[click](/a)b)'
# Re-parse: <a href="/a">click</a>b)

I re-ran those outputs through Python-Markdown, Mistune, commonmark.py, and markdown-it-py. The delimiter-truncation and URL-substitution cases break in all four. The space-in-destination cases are accepted by Python-Markdown but rendered as literal text by the three CommonMark parsers.

That matches CommonMark §6.3 (links) and §6.4 (images): brackets in labels need to be escaped or balanced, raw destinations cannot contain spaces unless they are written as <...>, and an unescaped ) closes an unbalanced destination early.

escape_misc=True is not a full workaround. It does not help for attribute-backed fields such as img alt/src/title, a href, or video src/poster, because those values bypass escape(). It does help when the broken piece is generated label text. For example, <a href="link">text]</a> becomes [text\]](link) with escape_misc=True.

convert_img is the clearest example: it pulls attributes directly from el.attrs and returns ![%s](%s%s) without routing alt or src through escape().
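
To make the mechanism concrete, here is a simplified stand-in for that formatting step. It is not markdownify's actual code: the function name naive_image_markdown and the attrs dict are hypothetical, and only the format string and the title quoting are taken from the description in this issue.

```python
def naive_image_markdown(attrs):
    """Simplified model of the formatting step described above.

    Hypothetical helper for illustration only; not copied from markdownify.
    """
    alt = attrs.get('alt') or ''
    src = attrs.get('src') or ''
    title = attrs.get('title') or ''
    # The title is the only field that gets any treatment: double quotes
    # are backslash-escaped before it is placed inside "...".
    title_part = ' "%s"' % title.replace('"', r'\"') if title else ''
    # alt lands in label position and src in destination position, both raw.
    return '![%s](%s%s)' % (alt, src, title_part)


print(naive_image_markdown({'src': '/a', 'alt': ']'}))
# ![]](/a)
print(naive_image_markdown({'src': '/a b', 'alt': 'x'}))
# ![x](/a b)
print(naive_image_markdown({'src': '/safe', 'alt': '](http://attacker)'}))
# ![](http://attacker)](/safe)
```

The three printed strings match the outputs in the reproducer, which is consistent with the bug living in the final string interpolation rather than in HTML parsing.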

Failing input patterns

The confirmed input shapes so far are unbalanced [ or ] in alt, ) or a space in src or href, and ](...) appearing in alt or link text. The last case is the URL-substitution variant: the embedded URL becomes the parsed destination and the original src/href is left behind as trailing literal text.
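
A rough way to flag these shapes before conversion is sketched below. The names _unbalanced and risky_patterns are hypothetical, the checks cover only the patterns listed above, and "unbalanced" is interpreted loosely along the lines of the CommonMark rules (backslash-escaped delimiters are skipped). It is a triage aid, not a complete validator.

```python
def _unbalanced(text, opener, closer):
    """Return True if unescaped opener/closer characters do not pair up."""
    depth = 0
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == '\\' and i + 1 < len(text):
            i += 2  # skip a backslash and the character it escapes
            continue
        if ch == opener:
            depth += 1
        elif ch == closer:
            depth -= 1
            if depth < 0:
                return True  # closer with no matching opener
        i += 1
    return depth != 0


def risky_patterns(label, destination):
    """List which of the confirmed failure shapes a label/destination pair hits."""
    findings = []
    if '](' in label:
        findings.append('label contains ](')  # URL-substitution variant
    if _unbalanced(label, '[', ']'):
        findings.append('unbalanced bracket in label')
    if any(ch.isspace() for ch in destination):
        findings.append('whitespace in destination')
    if _unbalanced(destination, '(', ')'):
        findings.append('unbalanced parenthesis in destination')
    return findings


print(risky_patterns(']', '/a'))
# ['unbalanced bracket in label']
print(risky_patterns('](http://attacker)', '/safe'))
# ['label contains ](', 'unbalanced bracket in label']
print(risky_patterns('click', '/a)b'))
# ['unbalanced parenthesis in destination']
```

Note that the ](...) case is reported twice, because an embedded ]( is also an unbalanced ] by definition.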

Security note

The ](http://...) case is the one I would call out separately because it can substitute an attacker-controlled URL into the parsed Markdown output. That seems relevant for any pipeline that treats markdownify output as a trusted source of destinations, including HTML-to-Markdown storage flows or LLM ingest pipelines. I am not filing this as a CVE; I just want the behavior on record.

I have not tested sanitizer behavior here, so I am not making a stronger mitigation claim in this issue body.

Affected functions

Affected code paths include convert_img for raw src, alt, and title; convert_a for raw href plus the surrounding [...] around generated link text; and convert_video for raw src, poster, fallback <source src>, and generated label text. The existing title.replace('"', r'\"') in convert_img is a partial version of the kind of context-aware escaping that is needed here.

Fix shape

If you want a PR, my preference would be a shared escape layer for Markdown labels, destinations, and titles, applied anywhere markdownify emits link/image syntax. A narrower delimiter-by-delimiter patch would fix the immediate repros, but it would keep the escaping rules fragmented across emitters and make this class of bug easy to reintroduce.
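
As a rough illustration of what such a shared layer could look like, here is a minimal sketch. All function names (escape_label, escape_destination, escape_title, image_markdown) are hypothetical and none of this exists in markdownify today. It assumes the inputs are raw attribute values; label text that has already been through escape() would need different handling to avoid double escaping. The outputs follow the CommonMark rules cited earlier but have not been re-run through the four parsers.

```python
import re

# Characters that are safe in a bare (non-<...>) link destination under this
# sketch's conservative policy: no whitespace, control characters,
# parentheses, angle brackets, or backslashes.
_BARE_SAFE = re.compile(r'[^\s()<>\\\x00-\x1f\x7f]+')


def escape_label(text):
    """Backslash-escape backslashes and square brackets for use inside [...]."""
    return re.sub(r'([\\\[\]])', r'\\\1', text)


def escape_destination(url):
    """Return url bare if it is safe, otherwise in the <...> form."""
    if _BARE_SAFE.fullmatch(url):
        return url
    # The <...> form cannot contain line endings, so percent-encode them.
    url = url.replace('\r', '%0D').replace('\n', '%0A')
    return '<%s>' % re.sub(r'([\\<>])', r'\\\1', url)


def escape_title(title):
    """Escape backslashes and double quotes for use inside "..."."""
    return title.replace('\\', '\\\\').replace('"', '\\"')


def image_markdown(alt, src, title=''):
    """Emit image syntax with each field escaped for its own context."""
    title_part = ' "%s"' % escape_title(title) if title else ''
    return '![%s](%s%s)' % (escape_label(alt), escape_destination(src), title_part)


print(image_markdown(']', '/a'))
# ![\]](/a)
print(image_markdown('x', '/a b'))
# ![x](</a b>)
print(image_markdown('](http://attacker)', '/safe'))
# ![\](http://attacker)](/safe)
```

Always falling back to the <...> form for anything outside the safe set is a deliberately conservative choice. URL-encoding spaces and parentheses instead would also work, at the cost of changing the destination string.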
