Opened 12 years ago

Closed 12 years ago

#10474 closed defect (fixed)

AIHTMLDecoder and Unicode newlines

Reported by: rgovostes
Owned by: nobody
Milestone:
Component: Adium Core
Version:
Severity: normal
Keywords:
Cc:
Patch Status:


I've noticed that pasted text with line breaks, especially when copied from Safari or Word, sometimes shows up in the message view stripped of all breaks. More interestingly, although the problem shows up in the message view, it does not appear in the logs or in Growl notifications.

The following HTML, rendered by Safari and pasted into Adium, triggers the issue:

<p>This is on the first line.<br />
This is on the second line.</p>

However, remove the paragraph tags and the resulting paste doesn't show the issue. Render the same HTML in Firefox instead of Safari and the paste works fine too.

As it turns out, when we stick the paragraph tags in, it makes WebKit change its mind about how the characters are encoded when copied. Without the paragraph tags, they show up as regular linefeeds (0x0A), but with the paragraph tags they become line separators (U+2028).

Encoded that way, the line break bypasses AIHTMLDecoder's substitutions in -encodeLooseHTML:imagesPath:, which are only set up to recognize \n and \r:

  • When sending the message to AIM's server, thingsToInclude.nonASCII = false, so it does a very rudimentary find/replace of \r\n, \r, and \n.
  • When sending the message to the message view, thingsToInclude.nonASCII = true, so the line separator ends up (around line 620) being escaped as &#x2028; rather than converted to a break.
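
The two paths above can be modelled with a rough sketch. It is Python rather than Adium's Objective-C, and `encode_loose` and its `escape_non_ascii` flag are hypothetical stand-ins, not AIHTMLDecoder's actual code. The only assumptions are a newline pass limited to \r\n, \r and \n, followed by numeric escaping of anything outside ASCII.

```python
def encode_loose(text: str, escape_non_ascii: bool) -> str:
    """Hypothetical model of a newline pass that only knows \\r\\n, \\r and \\n."""
    # Rudimentary find/replace; \r\n goes first so the pair yields one <br>.
    for newline in ("\r\n", "\r", "\n"):
        text = text.replace(newline, "<br>")
    if not escape_non_ascii:
        return text
    # Anything outside ASCII becomes a numeric character reference,
    # which is where U+2028 turns into &#x2028; instead of <br>.
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)


linefeed_paste = "This is on the first line.\nThis is on the second line."
separator_paste = "This is on the first line.\u2028This is on the second line."

print(encode_loose(linefeed_paste, True))   # break becomes <br>
print(encode_loose(separator_paste, True))  # break becomes &#x2028;
```

In this model the linefeed paste gets its <br>, while the line-separator paste only gets an entity that the message view does not render as a break.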

Ideally this code would be updated to replace all Unicode line breaks with <br>. Wikipedia has an exhaustive list, taken from the Unicode Standard 4.0 guidelines.

Apple's character sets and Unicode utilities don't seem to include all of those (strangely enough), so the list may need to be hardcoded.
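
A hardcoded list could look roughly like the sketch below. It is written in Python for brevity, not in Adium's Objective-C, and the names `UNICODE_LINE_BREAKS` and `breaks_to_br` are made up for illustration. The character set is my understanding of the Unicode newline guidelines: LF, VT, FF, CR, CR+LF, NEL, LS and PS.

```python
import re

# Newline functions per the Unicode newline guidelines. CR+LF is listed
# first so that the pair collapses into one break rather than two.
UNICODE_LINE_BREAKS = re.compile(
    "\r\n"       # CR+LF
    "|[\n"       # LF  U+000A
    "\x0b"       # VT  U+000B
    "\x0c"       # FF  U+000C
    "\r"         # CR  U+000D
    "\x85"       # NEL U+0085
    "\u2028"     # LS  U+2028
    "\u2029]"    # PS  U+2029
)


def breaks_to_br(text: str) -> str:
    """Replace every recognized Unicode line break with a single <br>."""
    return UNICODE_LINE_BREAKS.sub("<br>", text)


print(breaks_to_br("first\u2028second\r\nthird"))
```

A single alternation keeps the list in one place, so adding or dropping a character is a one-line change.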

Change History (1)

comment:1 Changed 12 years ago by Evan Schoenberg

Resolution: fixed
Status: new → closed

(In [24382]) We now encode LINE_SEPARATOR (\u2028) and PARAGRAPH_SEPARATOR (\u2029) to <BR> when encoding HTML. NEXT_LINE (\u0085) and FORM_FEED (\u000C) are noted but not encoded because the compiler claims these are not valid unichar sequences.

Fixes #10474.
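
The behaviour described in the changeset message can be sketched as follows. This is a Python model, not the committed Objective-C, and `encode_separators` is a hypothetical name. The compiler complaint is consistent with C99's rule that universal character names below U+00A0 (other than $, @ and `) are not allowed in source, so a numeric constant such as 0x0085 would presumably be needed instead of a \u escape. That is my reading, not something the changeset states.

```python
# Characters the changeset message says are now encoded as <BR>.
ENCODED_AS_BR = {"\u2028": "LINE_SEPARATOR", "\u2029": "PARAGRAPH_SEPARATOR"}

# Characters the message says are noted but not encoded.
NOTED_ONLY = {"\x85": "NEXT_LINE", "\x0c": "FORM_FEED"}


def encode_separators(text: str) -> str:
    """Model of the described fix: only LS and PS are turned into <BR>."""
    return "".join("<BR>" if ch in ENCODED_AS_BR else ch for ch in text)


pasted = "This is on the first line.\u2028This is on the second line."
print(encode_separators(pasted))
```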
