Modify

Opened 20 months ago

Closed 9 months ago

Last modified 9 months ago

#11050 closed defect (fixed)

An invalid XML character (Unicode: 0x1b) was found in the element content of the document.

Reported by: sanket.modi@… Owned by: osimons
Priority: high Component: XmlRpcPlugin
Severity: major Keywords:
Cc: Trac Release: 1.0

Description

There is a ticket which contains some Escape characters and when we try to get that ticket using API, it gives us error "An invalid XML character (Unicode: 0x1b) was found in the element content of the document".

As there are some invalid XML characters which parser can't parse, we suggest you to remove those before sending XML.

Attachments (1)

dont-remove-surrogate-pairs-r13728.diff (4.3 KB) - added by jun66j5 10 months ago.

Download all attachments as: .zip

Change History (19)

comment:1 Changed 20 months ago by osimons

1) According to XML spec the 0x1b character is indeed not valid in XML and should not be included.

2) However since Trac web output is also X(HT)ML, it would actually be illegal to return it in that response too. Does a regular web response of the ticket include the illegal characters or are they then stripped away?

3) If stripped away - from either web or rpc - how would you handle field updates? That would obviously overwrite the control characters as they would not be included when post'ing back new data?

4) If such control characters are troublesome, would it not be an equally good strategy to ensure that they don't make it into the database in the first place?

Anyway, it could likely be done quite simply in tracrpc/xml_rpc.py by editing _send_reponse() to this (using multiple lines for clarity):

_illegal_xml_chars_RE = re.compile(
           u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]')
...
    def _send_response(self, req, response, content_type='application/xml'):
        response = to_unicode(response)
        response = _illegal_xml_chars_RE.sub('?', response)
        response = response.encode("utf-8")
...

Just leaving notes here for now. Need to think about this some more first.

comment:2 Changed 10 months ago by olemis

Considering this article the fact seems to be that illegal characters should not be removed if included in CDATA blocks ?

comment:3 Changed 10 months ago by olemis

Now , looking at CData section definition I notice that it's constructed using Char , so I guess it's ok to strip those characters .

What about replacing it with Unicode replacement character U+FFFD ? It is allowed afaict .

comment:4 Changed 10 months ago by osimons

Trac just removes them, as implemented in trac:changeset:7718

I'd be OK with just stripping them too, but if a visual marker is considered a better solution for lost byte(s) I'm OK with that too. The visual marker would change the suggested fix above in this line:

response = _illegal_xml_chars_RE.sub(u'\uFFFD', response)

Please try the patch and see if it works OK in actual use.

comment:5 Changed 10 months ago by olemis

There is a patch available in branch t11050 . It considers sys.maxunicode when building invalid chars regex (see why) . The tests results reveal that everything seems to be ok.

comment:6 Changed 10 months ago by olemis

comment:7 Changed 10 months ago by olemis

Similar results for Trac /trunk

comment:8 Changed 10 months ago by osimons

Thanks for the patch. Looks like a very good solution and I like the test too.

However, when testing patch on OSX I get an error:

Fault: <Fault 1: "'unichr() arg not in range(0x10000) (narrow Python build)' while executing 'test_unichr.unichr()'">

Which looks like this problem covered in stackoverflow:7105874

comment:9 Changed 10 months ago by olemis

Thanks for the review !

I used unicode char code points consistently in server side code but not in client side test code. This could be the reason why it's failing ?

I do not have a Mac to test it . @osimons : could you please retry with latest version of the patch ?

comment:10 Changed 10 months ago by olemis

This is the changeset .

comment:11 Changed 10 months ago by osimons

  • Resolution set to fixed
  • Status changed from new to closed

In 13728:

XmlRpcPlugin: Remove invalid XML characters from ouput. Closes #11050.

Thanks olemis.

comment:12 Changed 10 months ago by osimons

(BTW, I reorganized bits of the code.)

comment:13 follow-up: Changed 10 months ago by jun66j5

  • Resolution fixed deleted
  • Status changed from closed to reopened

After the r13728, surrogate pairs are removed on Python narrow build. Please don't remove these characters.

Python 2.4.4 (#71, Oct 18 2006, 08:34:43) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys, re
>>> _illegal_unichrs = [ (0x00, 0x08), (0x0B, 0x1F), (0x7F, 0x84), (0x86, 0x9F),
...                      (0xD800, 0xDFFF), (0xFDD0, 0xFDDF), (0xFFFE, 0xFFFF),
...                      (0x1FFFE, 0x1FFFF), (0x2FFFE, 0x2FFFF),
...                      (0x3FFFE, 0x3FFFF), (0x4FFFE, 0x4FFFF),
...                      (0x5FFFE, 0x5FFFF), (0x6FFFE, 0x6FFFF),
...                      (0x7FFFE, 0x7FFFF), (0x8FFFE, 0x8FFFF),
...                      (0x9FFFE, 0x9FFFF), (0xAFFFE, 0xAFFFF),
...                      (0xBFFFE, 0xBFFFF), (0xCFFFE, 0xCFFFF),
...                      (0xDFFFE, 0xDFFFF), (0xEFFFE, 0xEFFFF),
...                      (0xFFFFE, 0xFFFFF), (0x10FFFE, 0x10FFFF) ]
... _illegal_ranges = ["%s-%s" % (unichr(low), unichr(high))
...                    for (low, high) in _illegal_unichrs
...                                 if low < sys.maxunicode]
... _illegal_xml_chars_RE = re.compile(u'[%s]' % u''.join(_illegal_ranges))
>>> text = u'\U0001D4C1'  # U+1D4C1, http://www.charbase.com/1D4C1
>>> _illegal_xml_chars_RE.sub(u'\uFFFD', text)
u'\ufffd\ufffd'

Changed 10 months ago by jun66j5

comment:14 in reply to: ↑ 13 Changed 10 months ago by jun66j5

Replying to jun66j5:

After the r13728, surrogate pairs are removed on Python narrow build. Please don't remove these characters.

Proposed patch: dont-remove-surrogate-pairs-r13728.diff.

comment:15 Changed 10 months ago by osimons

  • Resolution set to fixed
  • Status changed from reopened to closed

In 13729:

XmlRpcPlugin: Tweak [13728] fix by not removing surrogate pairs. Closes #11050 again.

Thanks Jun.

comment:16 follow-up: Changed 9 months ago by osimons

  • Resolution fixed deleted
  • Status changed from closed to reopened

Seems we are too greedy, and even line-endings are converted. Not so good.... See comment:5:ticket:11635

comment:17 Changed 9 months ago by osimons

  • Resolution set to fixed
  • Status changed from reopened to closed

In 13776:

XmlRpcPlugin: Tweak [13728] fix again to let carriage return pass through. Closes #11050 again. Also closes #11635.

comment:18 in reply to: ↑ 16 Changed 9 months ago by olemis

Replying to osimons:

Seems we are too greedy, and even line-endings are converted. Not so good.... See comment:5:ticket:11635

Sorry . The regex I used in first place is broken . I've been reviewing the standard and now I realize . I apologize for the trouble I caused but really did not notice about matching (and replacing) #xD char .

Add Comment

Modify Ticket

Action
as closed The owner will remain osimons.
The resolution will be deleted. Next status will be 'reopened'.
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.