Opened 12 years ago
Last modified 10 years ago
#11050 reopened defect
An invalid XML character (Unicode: 0x1b) was found in the element content of the document.
Reported by: | Owned by: | osimons | |
---|---|---|---|
Priority: | high | Component: | XmlRpcPlugin |
Severity: | major | Keywords: | |
Cc: | Trac Release: | 1.0 |
Description
One of our tickets contains escape characters, and when we try to fetch that ticket through the API we get the error "An invalid XML character (Unicode: 0x1b) was found in the element content of the document".
Since the parser cannot handle these invalid XML characters, we suggest removing them before the XML is sent.
Attachments (1)
Change History (21)
comment:1 Changed 12 years ago by
comment:2 Changed 11 years ago by
According to this article, it seems that illegal characters should not be removed when they appear inside CDATA blocks?
comment:3 Changed 11 years ago by
Now, looking at the CDATA section definition, I notice that it is built on Char, so I guess it's OK to strip those characters.
What about replacing them with the Unicode replacement character U+FFFD? It is allowed, AFAICT.
comment:4 Changed 11 years ago by
Trac just removes them, as implemented in trac:changeset:7718
I'd be OK with just stripping them too, but if a visual marker is considered a better solution for lost byte(s) I'm OK with that too. The visual marker would change the suggested fix above in this line:
response = _illegal_xml_chars_RE.sub(u'\uFFFD', response)
Please try the patch and see if it works OK in actual use.
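The two options discussed here (stripping versus leaving a visual marker) can be sketched as follows. This is a minimal illustration written for Python 3, not the plugin's code: the name `sanitize` and the pattern are hypothetical, and the pattern only covers the C0 control characters below 0x20 other than tab, LF and CR.

```python
import re

# Assumption: only C0 control characters are handled here, that is,
# everything below 0x20 except tab (0x09), LF (0x0A) and CR (0x0D).
_CONTROL_CHARS_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def sanitize(text, marker=""):
    """Strip illegal control characters, or replace each with `marker`."""
    return _CONTROL_CHARS_RE.sub(marker, text)


description = "colour \x1b[31mred\x1b[0m text"

# Option 1: silently drop the escape characters.
stripped = sanitize(description)

# Option 2: leave U+FFFD as a visible marker for each lost character.
marked = sanitize(description, "\ufffd")
```

With the marker, a reader of the ticket can at least see that something was lost at that position; with plain stripping the text is shorter but the loss is invisible.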
comment:5 Changed 11 years ago by
There is a patch available in branch t11050. It takes sys.maxunicode into account when building the invalid-characters regex (see why). The test results suggest that everything is OK.
comment:8 Changed 11 years ago by
Thanks for the patch. Looks like a very good solution and I like the test too.
However, when testing the patch on OSX I get an error:
Fault: <Fault 1: "'unichr() arg not in range(0x10000) (narrow Python build)' while executing 'test_unichr.unichr()'">
This looks like the problem covered in stackoverflow:7105874
comment:9 Changed 11 years ago by
Thanks for the review!
I used Unicode code points consistently in the server-side code but not in the client-side test code. Could this be why it's failing?
I do not have a Mac to test on. @osimons: could you please retry with the latest version of the patch?
comment:13 follow-up: 14 Changed 11 years ago by
Resolution: | fixed |
---|---|
Status: | closed → reopened |
After r13728, surrogate pairs are removed on Python narrow builds. Please don't remove these characters.
Python 2.4.4 (#71, Oct 18 2006, 08:34:43) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys, re
>>> _illegal_unichrs = [ (0x00, 0x08), (0x0B, 0x1F), (0x7F, 0x84), (0x86, 0x9F),
...     (0xD800, 0xDFFF), (0xFDD0, 0xFDDF), (0xFFFE, 0xFFFF),
...     (0x1FFFE, 0x1FFFF), (0x2FFFE, 0x2FFFF),
...     (0x3FFFE, 0x3FFFF), (0x4FFFE, 0x4FFFF),
...     (0x5FFFE, 0x5FFFF), (0x6FFFE, 0x6FFFF),
...     (0x7FFFE, 0x7FFFF), (0x8FFFE, 0x8FFFF),
...     (0x9FFFE, 0x9FFFF), (0xAFFFE, 0xAFFFF),
...     (0xBFFFE, 0xBFFFF), (0xCFFFE, 0xCFFFF),
...     (0xDFFFE, 0xDFFFF), (0xEFFFE, 0xEFFFF),
...     (0xFFFFE, 0xFFFFF), (0x10FFFE, 0x10FFFF) ]
... _illegal_ranges = ["%s-%s" % (unichr(low), unichr(high))
...     for (low, high) in _illegal_unichrs
...     if low < sys.maxunicode]
... _illegal_xml_chars_RE = re.compile(u'[%s]' % u''.join(_illegal_ranges))
>>> text = u'\U0001D4C1' # U+1D4C1, http://www.charbase.com/1D4C1
>>> _illegal_xml_chars_RE.sub(u'\uFFFD', text)
u'\ufffd\ufffd'
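One way to avoid the behaviour in the session above is to leave the surrogate range out of the pattern whenever the interpreter stores characters above U+FFFF as surrogate pairs. The sketch below is an assumption-laden illustration and not the attached patch: it is written for Python 3 (which has no narrow builds), so the narrow case is only emulated through a hypothetical `maxunicode` parameter, and `ILLEGAL_RANGES` is a shortened, hypothetical list.

```python
import re
import sys

# Hypothetical, shortened list of illegal ranges; enough to show the idea.
ILLEGAL_RANGES = [
    (0x00, 0x08), (0x0B, 0x0C), (0x0E, 0x1F),
    (0xD800, 0xDFFF), (0xFFFE, 0xFFFF),
    (0x1FFFE, 0x1FFFF),
]
SURROGATES = (0xD800, 0xDFFF)


def build_illegal_re(maxunicode=sys.maxunicode):
    """Build the character class for a given sys.maxunicode value."""
    narrow = maxunicode <= 0xFFFF
    parts = []
    for low, high in ILLEGAL_RANGES:
        if low > maxunicode:
            continue  # range cannot be expressed on this build
        if narrow and (low, high) == SURROGATES:
            continue  # surrogate pairs carry valid astral characters here
        parts.append("\\U%08x-\\U%08x" % (low, min(high, maxunicode)))
    return re.compile("[%s]" % "".join(parts))


wide_re = build_illegal_re()           # the running interpreter
narrow_re = build_illegal_re(0xFFFF)   # emulated narrow build
```

The trade-off is that on a narrow build a lone, unpaired surrogate would also pass through untouched; catching only unpaired surrogates would need a more elaborate pattern.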
Changed 11 years ago by
Attachment: | dont-remove-surrogate-pairs-r13728.diff added |
---|
comment:14 Changed 11 years ago by
Replying to jun66j5:
After r13728, surrogate pairs are removed on Python narrow builds. Please don't remove these characters.
Proposed patch: dont-remove-surrogate-pairs-r13728.diff.
comment:16 follow-up: 18 Changed 11 years ago by
Resolution: | fixed |
---|---|
Status: | closed → reopened |
Seems we are too greedy, and even line-endings are converted. Not so good.... See comment:5:ticket:11635
comment:18 Changed 11 years ago by
Replying to osimons:
Seems we are too greedy, and even line-endings are converted. Not so good.... See comment:5:ticket:11635
Sorry. The regex I used in the first place is broken. I've been reviewing the standard and now I see the problem. I apologize for the trouble I caused; I really did not notice that it was matching (and replacing) the #xD char.
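To make the problem concrete: a range of 0x0B to 0x1F includes carriage return (0x0D), which the XML 1.0 `Char` production explicitly allows alongside #x9 and #xA. The sketch below (Python 3, hypothetical names, not the committed fix) compares that range with a pattern written as the complement of the `Char` production. Note that this strict complement does not treat the discouraged 0x7F to 0x9F block as illegal.

```python
import re

# Too greedy: 0x0B-0x1F also covers carriage return (0x0D).
greedy_re = re.compile(r"[\x00-\x08\x0b-\x1f]")

# Complement of the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
not_char_re = re.compile(
    r"[^\x09\x0a\x0d\x20-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)

text = "Line 1\r\nLine 2\x1b"

greedy_result = greedy_re.sub("\ufffd", text)   # damages the CRLF
strict_result = not_char_re.sub("\ufffd", text)  # keeps the CRLF
```

Expressing the pattern as a negated class of what is allowed makes it harder to sweep up a legal character by accident, since the allowed set can be read directly off the specification.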
comment:19 follow-up: 20 Changed 10 years ago by
I get the following unit test failure with the latest xmlrpcplugin.
======================================================================
FAIL: test_xml_encoding_special_characters (tracrpc.tests.xml_rpc.RpcXmlTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/src/xmlrpcplugin.git/tracrpc/tests/xml_rpc.py", line 110, in test_xml_encoding_special_characters
    self.assertEquals('Desc & ription\r\nLine 2', ticket[3]['description'])
AssertionError: 'Desc & ription\r\nLine 2' != 'Desc & ription\nLine 2'
Seems to be caused by xmlrpclib.loads converting CRLF to LF. See SO:14294682.
Python 2.4.6 (#2, Jan 24 2014, 20:03:08) [GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from xmlrpclib import dumps, loads
>>> t = ("abc\r\ndef",)
>>> dumps(t)
'<params>\n<param>\n<value><string>abc\r\ndef</string></value>\n</param>\n</params>\n'
>>> loads(dumps(t))
(('abc\ndef',), None)
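The same round trip can be reproduced with `xmlrpc.client`, the Python 3 name of `xmlrpclib`. XML parsers normalize a literal CRLF in element content to LF, whereas a carriage return written as the character reference `&#13;` is passed through. The sketch below only illustrates that difference; patching the payload with `str.replace` is an assumption made for the demonstration, not something the plugin is known to do.

```python
from xmlrpc.client import dumps, loads

params = ("abc\r\ndef",)

# dumps() writes the carriage return literally into the payload, so the
# XML parser behind loads() normalizes CRLF to LF on the way back in.
payload = dumps(params)
roundtrip = loads(payload)[0][0]

# Writing CR as a character reference keeps it out of end-of-line
# normalization (illustration only, applied to the serialized payload).
escaped_payload = payload.replace("\r", "&#13;")
preserved = loads(escaped_payload)[0][0]
```

This suggests the conversion happens on the parsing side of the round trip, which would match the observation that the test fails regardless of the Trac version used.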
comment:20 Changed 10 years ago by
Resolution: | fixed |
---|---|
Status: | closed → reopened |
Replying to jun66j5:
Seems to be caused by xmlrpclib.loads converting CRLF to LF. See SO:14294682.
I see the same. It certainly did not use to be like this, and I can't help thinking something has changed leading up to the Python 2.7.10 that I now use. I no longer have a working array of older Python versions, so this is just speculation from me as the failure persists regardless of Trac version used when running the tests (0.11-stable, 0.12-stable, 1.0-stable and trunk).
1) According to XML spec the 0x1b character is indeed not valid in XML and should not be included.
2) However since Trac web output is also X(HT)ML, it would actually be illegal to return it in that response too. Does a regular web response of the ticket include the illegal characters or are they then stripped away?
3) If stripped away - from either web or rpc - how would you handle field updates? That would obviously overwrite the control characters as they would not be included when post'ing back new data?
4) If such control characters are troublesome, would it not be an equally good strategy to ensure that they don't make it into the database in the first place?
Anyway, it could likely be done quite simply in tracrpc/xml_rpc.py by editing _send_reponse() to this (using multiple lines for clarity):
Just leaving notes here for now. Need to think about this some more first.