Modify

Opened 12 years ago

Last modified 10 years ago

#11050 reopened defect

An invalid XML character (Unicode: 0x1b) was found in the element content of the document.

Reported by: sanket.modi@… Owned by: osimons
Priority: high Component: XmlRpcPlugin
Severity: major Keywords:
Cc: Trac Release: 1.0

Description

There is a ticket which contains some Escape characters and when we try to get that ticket using API, it gives us error "An invalid XML character (Unicode: 0x1b) was found in the element content of the document".

As there are some invalid XML characters which parser can't parse, we suggest you to remove those before sending XML.

Attachments (1)

dont-remove-surrogate-pairs-r13728.diff (4.3 KB) - added by Jun Omae 11 years ago.

Download all attachments as: .zip

Change History (21)

comment:1 Changed 12 years ago by osimons

1) According to XML spec the 0x1b character is indeed not valid in XML and should not be included.

2) However since Trac web output is also X(HT)ML, it would actually be illegal to return it in that response too. Does a regular web response of the ticket include the illegal characters or are they then stripped away?

3) If stripped away - from either web or rpc - how would you handle field updates? That would obviously overwrite the control characters as they would not be included when post'ing back new data?

4) If such control characters are troublesome, would it not be an equally good strategy to ensure that they don't make it into the database in the first place?

Anyway, it could likely be done quite simply in tracrpc/xml_rpc.py by editing _send_reponse() to this (using multiple lines for clarity):

_illegal_xml_chars_RE = re.compile(
           u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]')
...
    def _send_response(self, req, response, content_type='application/xml'):
        response = to_unicode(response)
        response = _illegal_xml_chars_RE.sub('?', response)
        response = response.encode("utf-8")
...

Just leaving notes here for now. Need to think about this some more first.

comment:2 Changed 11 years ago by Olemis Lang

Considering this article the fact seems to be that illegal characters should not be removed if included in CDATA blocks ?

comment:3 Changed 11 years ago by Olemis Lang

Now , looking at CData section definition I notice that it's constructed using Char , so I guess it's ok to strip those characters .

What about replacing it with Unicode replacement character U+FFFD ? It is allowed afaict .

comment:4 Changed 11 years ago by osimons

Trac just removes them, as implemented in trac:changeset:7718

I'd be OK with just stripping them too, but if a visual marker is considered a better solution for lost byte(s) I'm OK with that too. The visual marker would change the suggested fix above in this line:

response = _illegal_xml_chars_RE.sub(u'\uFFFD', response)

Please try the patch and see if it works OK in actual use.

comment:5 Changed 11 years ago by Olemis Lang

There is a patch available in branch t11050 . It considers sys.maxunicode when building invalid chars regex (see why) . The tests results reveal that everything seems to be ok.

comment:6 Changed 11 years ago by Olemis Lang

comment:7 Changed 11 years ago by Olemis Lang

Similar results for Trac /trunk

comment:8 Changed 11 years ago by osimons

Thanks for the patch. Looks like a very good solution and I like the test too.

However, when testing patch on OSX I get an error:

Fault: <Fault 1: "'unichr() arg not in range(0x10000) (narrow Python build)' while executing 'test_unichr.unichr()'">

Which looks like this problem covered in stackoverflow:7105874

comment:9 Changed 11 years ago by Olemis Lang

Thanks for the review !

I used unicode char code points consistently in server side code but not in client side test code. This could be the reason why it's failing ?

I do not have a Mac to test it . @osimons : could you please retry with latest version of the patch ?

comment:10 Changed 11 years ago by Olemis Lang

This is the changeset .

comment:11 Changed 11 years ago by osimons

Resolution: fixed
Status: newclosed

In 13728:

XmlRpcPlugin: Remove invalid XML characters from ouput. Closes #11050.

Thanks olemis.

comment:12 Changed 11 years ago by osimons

(BTW, I reorganized bits of the code.)

comment:13 Changed 11 years ago by Jun Omae

Resolution: fixed
Status: closedreopened

After the r13728, surrogate pairs are removed on Python narrow build. Please don't remove these characters.

Python 2.4.4 (#71, Oct 18 2006, 08:34:43) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys, re
>>> _illegal_unichrs = [ (0x00, 0x08), (0x0B, 0x1F), (0x7F, 0x84), (0x86, 0x9F),
...                      (0xD800, 0xDFFF), (0xFDD0, 0xFDDF), (0xFFFE, 0xFFFF),
...                      (0x1FFFE, 0x1FFFF), (0x2FFFE, 0x2FFFF),
...                      (0x3FFFE, 0x3FFFF), (0x4FFFE, 0x4FFFF),
...                      (0x5FFFE, 0x5FFFF), (0x6FFFE, 0x6FFFF),
...                      (0x7FFFE, 0x7FFFF), (0x8FFFE, 0x8FFFF),
...                      (0x9FFFE, 0x9FFFF), (0xAFFFE, 0xAFFFF),
...                      (0xBFFFE, 0xBFFFF), (0xCFFFE, 0xCFFFF),
...                      (0xDFFFE, 0xDFFFF), (0xEFFFE, 0xEFFFF),
...                      (0xFFFFE, 0xFFFFF), (0x10FFFE, 0x10FFFF) ]
... _illegal_ranges = ["%s-%s" % (unichr(low), unichr(high))
...                    for (low, high) in _illegal_unichrs
...                                 if low < sys.maxunicode]
... _illegal_xml_chars_RE = re.compile(u'[%s]' % u''.join(_illegal_ranges))
>>> text = u'\U0001D4C1'  # U+1D4C1, http://www.charbase.com/1D4C1
>>> _illegal_xml_chars_RE.sub(u'\uFFFD', text)
u'\ufffd\ufffd'

Changed 11 years ago by Jun Omae

comment:14 in reply to:  13 Changed 11 years ago by Jun Omae

Replying to jun66j5:

After the r13728, surrogate pairs are removed on Python narrow build. Please don't remove these characters.

Proposed patch: dont-remove-surrogate-pairs-r13728.diff.

comment:15 Changed 11 years ago by osimons

Resolution: fixed
Status: reopenedclosed

In 13729:

XmlRpcPlugin: Tweak [13728] fix by not removing surrogate pairs. Closes #11050 again.

Thanks Jun.

comment:16 Changed 11 years ago by osimons

Resolution: fixed
Status: closedreopened

Seems we are too greedy, and even line-endings are converted. Not so good.... See comment:5:ticket:11635

comment:17 Changed 11 years ago by osimons

Resolution: fixed
Status: reopenedclosed

In 13776:

XmlRpcPlugin: Tweak [13728] fix again to let carriage return pass through. Closes #11050 again. Also closes #11635.

comment:18 in reply to:  16 Changed 11 years ago by Olemis Lang

Replying to osimons:

Seems we are too greedy, and even line-endings are converted. Not so good.... See comment:5:ticket:11635

Sorry . The regex I used in first place is broken . I've been reviewing the standard and now I realize . I apologize for the trouble I caused but really did not notice about matching (and replacing) #xD char .

comment:19 Changed 10 years ago by Jun Omae

I get the following a failure of unit tests with latest xmlrpcplugin.

======================================================================
FAIL: test_xml_encoding_special_characters (tracrpc.tests.xml_rpc.RpcXmlTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/src/xmlrpcplugin.git/tracrpc/tests/xml_rpc.py", line 110, in test_xml_encoding_special_characters
    self.assertEquals('Desc & ription\r\nLine 2', ticket[3]['description'])
AssertionError: 'Desc & ription\r\nLine 2' != 'Desc & ription\nLine 2'

Seems to be caused by xmlrpclib.loads converting CRLF to LF. See SO:14294682.

Python 2.4.6 (#2, Jan 24 2014, 20:03:08)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from xmlrpclib import dumps, loads
>>> t = ("abc\r\ndef",)
>>> dumps(t)
'<params>\n<param>\n<value><string>abc\r\ndef</string></value>\n</param>\n</params>\n'
>>> loads(dumps(t))
(('abc\ndef',), None)

comment:20 in reply to:  19 Changed 10 years ago by osimons

Resolution: fixed
Status: closedreopened

Replying to jun66j5:

Seems to be caused by xmlrpclib.loads converting CRLF to LF. See SO:14294682.

I see the same. It certainly did not use to be like this, and I can't help thinking something has changed leading up to the Python 2.7.10 that I now use. I no longer have a working array of older Python versions, so this is just speculation from me as the failure persists regardless of Trac version used when running the tests (0.11-stable, 0.12-stable, 1.0-stable and trunk).

Modify Ticket

Change Properties
Set your email in Preferences
Action
as reopened The owner will remain osimons.

Add Comment


E-mail address and name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.