Opened 17 years ago
Closed 12 years ago
#2252 closed defect (fixed)
Japanese wiki page name doesn't autowikify
Reported by: | redboltz | Owned by: | Ryan J Ollos |
---|---|---|---|
Priority: | normal | Component: | AutoWikifyPlugin |
Severity: | normal | Keywords: | |
Cc: | Ryan J Ollos, Jun Omae | Trac Release: | 0.10 |
Description
A Japanese wiki page (the page name is written in Japanese, not the contents) isn't wikified automatically.
I suspect this is related to UTF-8 multibyte string handling.
In the _update method, the UTF-8 string is escaped. Is that the correct behavior?
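As a quick sanity check on the escaping question (a Python 3 sketch; the original environment was Python 2, and the page name here is just an example), escaping a non-ASCII page name with re.escape still yields a pattern that matches the name itself:

```python
import re

# Hypothetical Japanese page name used only for illustration.
page = "ああああ"

# re.escape only neutralizes regex metacharacters; the escaped
# pattern still matches the original page name inside other text.
pattern = re.escape(page)
assert re.search(pattern, "before ああああ after") is not None
```

So the escaping by itself should not prevent the name from matching; the question is what happens at the word boundaries around it.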
Attachments (2)
Change History (28)
comment:1 Changed 17 years ago by
comment:2 follow-up: 13 Changed 17 years ago by
I use trac version 0.10.4.
I inserted "print(self.pages)" in autowikify.py.
```python
def _all_pages(self):
    self.pages = set(WikiSystem(self.env).get_pages())
    print(self.pages)
```
Output:
set([u'WikiNewPage', u'\u3042\u3042\u3042\u3042', ...snip...])
For example, the page u'\u3042\u3042\u3042\u3042' doesn't autowikify.
The page name below is the same as u'\u3042\u3042\u3042\u3042'. (Can you see it?)
ああああ
comment:4 Changed 16 years ago by
I fixed this problem in my local environment.
- Trac version 0.11.1
The solution consists of two parts.
1. Add a locale flag in the Trac wiki system.
- See http://trac.edgewall.org/ticket/7552 and try locale_add.patch.
- For Trac 0.11 or later.
2. Remove the blank (`\b`) from the patterns in the AutoWikify plugin.
- See remove_blank.patch.
- Japanese words don't separate with blank.
- But I think it should be possible to toggle this behavior with an option.
Changed 16 years ago by
Attachment: | remove_blank.patch added |
---|
comment:5 Changed 16 years ago by
I closed #T7552 as wontfix, since I don't think the problem comes from Trac.
The actual problem is also not related to the way re.escape
deals with unicode objects as I originally thought:
```python
>>> import re
>>> pages = [u"ああああ", 'WikiStart']
>>> pattern = r'\b(?P<autowiki>' + '|'.join([re.escape(page) for page in pages]) + r')\b'
>>> pattern
u'\\b(?P<autowiki>\\\u3042\\\u3042\\\u3042\\\u3042|WikiStart)\\b'
>>> re.search(pattern, u' \u3042\u3042\u3042\u3042 ', re.UNICODE).span()
(1, 5)
>>> re.search(pattern, u' Foo\u3042\u3042\u3042\u3042Bar ', re.UNICODE) is None
True
```
So the original code should theoretically work.
Ah, I've just seen this: "Japanese words don't separate with blank."
Well, \b is not supposed to correspond only to blanks, but to "... whitespace or a non-alphanumeric, non-underscore character". Nevertheless, this reminded me that Japanese sentences look like a stream of characters with no obvious separator between words (at least to the untrained eye ;-) ).
So it looks like the intended behavior here is actually to extract a specific sequence of alphanumeric characters that is part of a larger sequence of characters, in the same way one would extract 'WikiStart' from 'TheWikiStartPage' in English. Then I agree that the only way would be to optionally remove the \b markers.
Wouldn't that be enough? I still don't see the motivation for #T7552.
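The \b limitation discussed above is easy to reproduce (a Python 3 sketch; the page name and the surrounding sentence are made up for illustration):

```python
import re

page = "ああああ"  # hypothetical page name
bounded = r'\b' + re.escape(page) + r'\b'

# Surrounded by spaces, \b works: space -> word character is a boundary.
assert re.search(bounded, " ああああ ") is not None

# Embedded in a Japanese sentence there is no boundary, because the
# neighbouring kana are also word characters, so \b never matches.
assert re.search(bounded, "ここはああああです") is None

# Dropping \b lets the name be found inside the sentence.
assert re.search(re.escape(page), "ここはああああです").span() == (3, 7)
```

This is exactly the 'WikiStart' inside 'TheWikiStartPage' situation, transposed to a language without word separators.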
comment:6 Changed 16 years ago by
Unicode regex operates correctly.
The reason for using re.LOCALE is as follows.
In the _prepare_rules function, WikiParser builds the regex based on the patterns.
The important point is the order of the syntax rules: the plugin's syntax is placed after the internal syntax.
Example:
(?:(?P<bolditalic>!?''''')|(?P<bold>!?''')|(?P<italic>!?'')|(?P<underline>!?__)|(?P<strike>!?~~)|(?P<subscript>!?,,)|(?P<superscript>!?\^)|(?P<inlinecode>!?\{\{\{(?P<inline>.*?)\}\}\})|(?P<inlinecode2>!?`(?P<inline2>.*?)`)|(?P<i0>!?(?<!/)\b\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))+(?:\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))*[\w/](?<![A-Z0-9_]))+(?:@\d+)?(?:#[\w:](?<!\d)(?:[\w:.-]*[\w-])?)?(?=:(?:\Z|\s)|[^:a-zA-Z]|\s|\Z))|(?P<i1>!?\[\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))+(?:\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))*[\w/](?<![A-Z0-9_]))+(?:@\d+)?(?:#[\w:](?<!\d)(?:[\w:.-]*[\w-])?)?(?=:(?:\Z|\s)|[^:a-zA-Z]|\s|\Z)\s+(?:'[^']+'|\"[^\"]+\"|[^\]]+)\])|(?P<i2>!?\[(?:'[^']+'|\"[^\"]+\")\])|(?P<i3>!?(?<!&)#(?P<it_ticket>[a-zA-Z.+-]*?)\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*)|(?P<i4>!?\[(?P<it_changeset>[a-zA-Z.+-]*?\s*)(?:\d+|[a-fA-F\d]{8,})(?:/[^\]]*)?(?:\?[^\]]*)?(?:#[^\]]*)?\]|(?:\b|!)r\d+\b(?!:\d))|(?P<i5>!?\[(?P<it_log>[a-zA-Z.+-]*?\s*)(?P<log_revs>(?:\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*|(?:\d+|[a-fA-F\d]{8,})))(?P<log_path>[/?][^\]]*)?\])|(?P<i6>(?:\b|!)r\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*\b)|(?P<i7>!?\{(?P<it_report>[a-zA-Z.+-]*?\s*)\d+\})|(?P<i8>(?P<autowiki>PageNameA|あいうえお|ああ)))
Japanese wiki page name matches to internal syntax(i0) instead of autowikify syntax(i8). If the re.LOCALE flag is set, it matches to autowikify syntax(i8) according to the expectation.
Check the code below.

```python
# -*- coding: utf-8 -*-
import re
import locale

locale.setlocale(locale.LC_ALL, 'Japan')

def replace(fullmatch):
    """Replace one match with its corresponding expansion"""
    replacement = handle_match(fullmatch)
    if replacement:
        return _markup_to_unicode(replacement)

def handle_match(fullmatch):
    for itype, match in fullmatch.groupdict().items():
        # if match and not itype in self.wikiparser.helper_patterns:
        if match:
            # Check for preceding escape character '!'
            print "match:" + itype + "," + match

# Internal syntax first, autowikify syntax (i8) last
str = ur"(?:(?P<bolditalic>!?''''')|(?P<bold>!?''')|(?P<italic>!?'')|(?P<underline>!?__)|(?P<strike>!?~~)|(?P<subscript>!?,,)|(?P<superscript>!?\^)|(?P<inlinecode>!?\{\{\{(?P<inline>.*?)\}\}\})|(?P<inlinecode2>!?`(?P<inline2>.*?)`)|(?P<i0>!?(?<!/)\b\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))+(?:\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))*[\w/](?<![A-Z0-9_]))+(?:@\d+)?(?:#[\w:](?<!\d)(?:[\w:.-]*[\w-])?)?(?=:(?:\Z|\s)|[^:a-zA-Z]|\s|\Z))|(?P<i1>!?\[\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))+(?:\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))*[\w/](?<![A-Z0-9_]))+(?:@\d+)?(?:#[\w:](?<!\d)(?:[\w:.-]*[\w-])?)?(?=:(?:\Z|\s)|[^:a-zA-Z]|\s|\Z)\s+(?:'[^']+'|\"[^\"]+\"|[^\]]+)\])|(?P<i2>!?\[(?:'[^']+'|\"[^\"]+\")\])|(?P<i3>!?(?<!&)#(?P<it_ticket>[a-zA-Z.+-]*?)\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*)|(?P<i4>!?\[(?P<it_changeset>[a-zA-Z.+-]*?\s*)(?:\d+|[a-fA-F\d]{8,})(?:/[^\]]*)?(?:\?[^\]]*)?(?:#[^\]]*)?\]|(?:\b|!)r\d+\b(?!:\d))|(?P<i5>!?\[(?P<it_log>[a-zA-Z.+-]*?\s*)(?P<log_revs>(?:\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*|(?:\d+|[a-fA-F\d]{8,})))(?P<log_path>[/?][^\]]*)?\])|(?P<i6>(?:\b|!)r\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*\b)|(?P<i7>!?\{(?P<it_report>[a-zA-Z.+-]*?\s*)\d+\})|(?P<i8>(?P<autowiki>PageNameA|あいうえお|ああ)))"

rules = re.compile(unicode(str), re.UNICODE|re.LOCALE)
print "re.UNICODE|re.LOCALE"
line = u"あいうえお"
result = re.sub(rules, replace, unicode(line))
line = u"PageNameA"
result = re.sub(rules, replace, unicode(line))
line = u"ああ"
result = re.sub(rules, replace, unicode(line))

rules = re.compile(unicode(str), re.UNICODE)
print "re.UNICODE"
line = u"あいうえお"
result = re.sub(rules, replace, unicode(line))
line = u"PageNameA"
result = re.sub(rules, replace, unicode(line))
line = u"ああ"
result = re.sub(rules, replace, unicode(line))

# Same pattern, but with the autowikify syntax (i8) moved before i0
str = ur"(?:(?P<bolditalic>!?''''')|(?P<bold>!?''')|(?P<italic>!?'')|(?P<underline>!?__)|(?P<strike>!?~~)|(?P<subscript>!?,,)|(?P<superscript>!?\^)|(?P<inlinecode>!?\{\{\{(?P<inline>.*?)\}\}\})|(?P<inlinecode2>!?`(?P<inline2>.*?)`)|(?P<i8>(?P<autowiki>PageNameA|あいうえお|ああ))|(?P<i0>!?(?<!/)\b\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))+(?:\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))*[\w/](?<![A-Z0-9_]))+(?:@\d+)?(?:#[\w:](?<!\d)(?:[\w:.-]*[\w-])?)?(?=:(?:\Z|\s)|[^:a-zA-Z]|\s|\Z))|(?P<i1>!?\[\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))+(?:\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))*[\w/](?<![A-Z0-9_]))+(?:@\d+)?(?:#[\w:](?<!\d)(?:[\w:.-]*[\w-])?)?(?=:(?:\Z|\s)|[^:a-zA-Z]|\s|\Z)\s+(?:'[^']+'|\"[^\"]+\"|[^\]]+)\])|(?P<i2>!?\[(?:'[^']+'|\"[^\"]+\")\])|(?P<i3>!?(?<!&)#(?P<it_ticket>[a-zA-Z.+-]*?)\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*)|(?P<i4>!?\[(?P<it_changeset>[a-zA-Z.+-]*?\s*)(?:\d+|[a-fA-F\d]{8,})(?:/[^\]]*)?(?:\?[^\]]*)?(?:#[^\]]*)?\]|(?:\b|!)r\d+\b(?!:\d))|(?P<i5>!?\[(?P<it_log>[a-zA-Z.+-]*?\s*)(?P<log_revs>(?:\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*|(?:\d+|[a-fA-F\d]{8,})))(?P<log_path>[/?][^\]]*)?\])|(?P<i6>(?:\b|!)r\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*\b)|(?P<i7>!?\{(?P<it_report>[a-zA-Z.+-]*?\s*)\d+\}))"

rules = re.compile(unicode(str), re.UNICODE)
print "re.UNICODE and i8 before i0"
line = u"あいうえお"
result = re.sub(rules, replace, unicode(line))
line = u"PageNameA"
result = re.sub(rules, replace, unicode(line))
line = u"ああ"
result = re.sub(rules, replace, unicode(line))
```
Result:

```
re.UNICODE|re.LOCALE
match:i8,あいうえお
match:autowiki,あいうえお
match:i8,PageNameA
match:autowiki,PageNameA
match:i8,ああ
match:autowiki,ああ
re.UNICODE
match:i0,あいうえお  <= unexpected
match:i8,PageNameA
match:autowiki,PageNameA
match:i8,ああ
match:autowiki,ああ
re.UNICODE and i8 before i0
match:i8,あいうえお
match:autowiki,あいうえお
match:i8,PageNameA
match:autowiki,PageNameA
match:i8,ああ
match:autowiki,ああ
```
The Japanese string "ああ" matches with both re.UNICODE and re.UNICODE|re.LOCALE, but the string "あいうえお" matches only with re.UNICODE|re.LOCALE. It depends on the string.
This behavior is probably related to the following note from the Python re documentation, but I'm not sure:
re.UNICODE: Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database. New in version 2.0.
Another approach would be to place i8 in front of i0 when re.LOCALE is not set. However, I cannot predict the side effects.
Any ideas?
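The ordering effect described above can be reproduced with a toy pattern (a Python 3 sketch; i0 and i8 here are drastically simplified stand-ins for the real Trac rules):

```python
import re

# Two alternatives that can both match "あいうえお": a generic word
# pattern (standing in for Trac's internal i0 rule) and an explicit
# page-name alternative (standing in for the plugin's i8 rule).
generic_first = re.compile(r'(?P<i0>\w+)|(?P<i8>あいうえお)')
explicit_first = re.compile(r'(?P<i8>あいうえお)|(?P<i0>\w+)')

# Alternation is tried left to right: the earlier group wins.
assert generic_first.search("あいうえお").lastgroup == 'i0'

# Reordering the alternatives changes which handler fires.
assert explicit_first.search("あいうえお").lastgroup == 'i8'
```

This is why "あいうえお" lands in i0 (the internal wiki-name rule) when \w happens to match it, and why moving i8 in front of i0 changes the result.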
comment:7 Changed 16 years ago by
So if I understand you correctly, when using the re.UNICODE flag only, "あいうえお" is matched as a regular wiki page name and not as an autowiki name. But why should that be an issue, as in both cases you'd get a link to that wiki page?
comment:8 follow-up: 9 Changed 16 years ago by
I think the Trac internal wiki link doesn't work for the page "あいうえお", and I have verified this.
The AutoWikify plugin's _page_formatter is as follows:
```python
def _page_formatter(self, f, n, match):
    page = match.group('autowiki')
    return Markup('<a href="%s" class="wiki">%s</a>'
                  % (self.env.href.wiki(page), escape(page)))
```
It uses the 'autowiki' named group.
The Trac internal wiki link system might only work for CamelCase names, or when the link is written explicitly like the one below:
[wiki:pagename]
Should I make some change to the plugin?
comment:9 Changed 16 years ago by
Replying to redboltz:
I think the Trac internal wiki link doesn't work for the page "あいうえお".
I'm afraid I'm not able to follow you... In comment:6, you justify the need for re.LOCALE by saying that without that flag, "あいうえお" gets matched by the internal wiki name regexp i0 (that's what:
re.UNICODE match:i0,あいうえお <= unexpected
shows).
Now you say that the internal wiki name regexp doesn't work for that page, which I can understand if (and only if) that name is part of some longer sentence, like in "あああああいうえおああああああ" (maybe an actual real example would help here). But in that situation, the autowikify regexp should match, provided the \b markers are dropped as discussed in comment:4 and comment:5.
comment:10 Changed 16 years ago by
It's possible that I don't understand this well enough either.
I think there are two topics, and there seems to be a difference in understanding on the first one.
- Why is it insufficient to match it to 'i0'?
- The function 'handle_match' is called:

```python
def handle_match(self, fullmatch):
    for itype, match in fullmatch.groupdict().items():
        if match and not itype in self.wikiparser.helper_patterns:
            # Check for preceding escape character '!'
            if match[0] == '!':
                return escape(match[1:])
            if itype in self.wikiparser.external_handlers:
                external_handler = self.wikiparser.external_handlers[itype]  # <=
                return external_handler(self, match, fullmatch)
            else:
                internal_handler = getattr(self, '_%s_formatter' % itype)
                return internal_handler(match, fullmatch)
```
- The external_handler for 'i0' is the function 'wikipagename_link':

```python
# Regular WikiPageNames
def wikipagename_link(formatter, match, fullmatch):
    if not _check_unicode_camelcase(match):
        return match  # <=
    return self._format_link(formatter, 'wiki', match,
                             self.format_page_name(match),
                             self.ignore_missing_pages)
```
- In 'wikipagename_link', the Japanese string "ああああ" is judged not to be CamelCase.
- As a result, it does not become a wiki link, and the plain string "ああああ" is returned.
- What is the expected behavior when a Japanese wiki page name is part of a longer sentence?
- In Japanese, it is necessary to match it there.
- In addition, the wiki page names in the regex pattern should be ordered by length.
- I am preparing a patch now.
Changed 16 years ago by
Attachment: | namesort.diff added |
---|
comment:11 Changed 16 years ago by
I said:
In addition, the wiki page names in the regex pattern should be ordered by length.
I am preparing a patch now.
I made the patch (namesort.diff).
It operates correctly, though there may be room for further improvement in the use of Python's collection classes.
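The length-ordering idea behind namesort.diff can be illustrated in isolation (a Python 3 sketch with made-up page names; this is not the patch itself):

```python
import re

# Hypothetical page names where one is a prefix of the other.
pages = ["ああ", "ああああ"]

# Naive order: the shorter name comes first in the alternation and
# shadows the longer one.
naive = re.compile('|'.join(re.escape(p) for p in pages))
assert naive.search("ああああ").group() == "ああ"

# Sorting longest first lets the longer page name win.
by_length = re.compile('|'.join(
    re.escape(p) for p in sorted(pages, key=len, reverse=True)))
assert by_length.search("ああああ").group() == "ああああ"
```

Without word boundaries in the pattern, longest-first ordering is what keeps a page named ああ from hijacking the text of a page named ああああ.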
comment:12 Changed 12 years ago by
Cc: | Ryan J Ollos added; anonymous removed |
---|
comment:13 follow-up: 14 Changed 12 years ago by
Cc: | Jun Omae added |
---|
Replying to kondo@t.email.ne.jp:
For example, page
u'\u3042\u3042\u3042\u3042'
doesn't autowikify.
I've tested AutoWikifyPlugin at r11819 with Trac 0.11, and a wiki page named ああああ is NOT autowikified (this is a clean Trac install with Genshi 0.6.0). On a clean 0.12 Trac install with Genshi 0.6.0 but no Babel, the wiki page named ああああ is autowikified. The same behavior is seen for other wiki page names that contain unicode characters, such as ÄÄÄÄ.
The page name below is the same as u'\u3042\u3042\u3042\u3042'. (Can you see it?) ああああ
I don't see attachment:namesort.diff as the solution. The problem with removing the word boundaries from the regex is that, on a Trac install with a wiki page named aaaa, the text aaaaa will be rendered with a link to aaaa inside it.
So, two issues remain:
- Why does this work with Trac 0.12 but not 0.11? I'm really not too concerned about this, and would just suggest anyone experiencing the problem to upgrade.
- comment:4 says Japanese words don't separate with blank. I'm not sure how to deal with that issue because I don't think we want to remove the word boundaries from the regex. We could add an option for that, but I think someone that understands the language and Python locale issues should deal with that. I'd likely just make a mess of the situation.
So I'll leave this ticket open for now, but anyone experiencing issues should first upgrade to r11819 or later.
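The side effect of dropping the word boundaries, mentioned above for the pages aaaa and aaaaa, is easy to demonstrate (a Python 3 sketch):

```python
import re

# With word boundaries, "aaaa" is not found inside the longer word
# "aaaaa": there is no boundary between two letters.
assert re.search(r'\baaaa\b', "aaaaa") is None

# Without them, the page-name pattern matches a substring, which is
# why a page named aaaa would be wrongly linked inside the word aaaaa.
assert re.search(r'aaaa', "aaaaa").span() == (0, 4)
```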
comment:14 Changed 12 years ago by
Thank you for your reply. I agree with you. My patch introduces the side effect you mentioned above. Providing an option is an acceptable solution for me, but Python locale issues are difficult to deal with. I support your decision.
comment:15 follow-up: 16 Changed 12 years ago by
I'm pleasantly surprised to get a reply on such an old ticket :)
What version of Trac are you running? Are you able to test out the latest version of AutoWikifyPlugin?
comment:16 follow-up: 17 Changed 12 years ago by
Replying to rjollos:
I'm pleasantly surprised to get a reply on such an old ticket :)
What version of Trac are you running? Are you able to test out the latest version of AutoWikifyPlugin?
I'm using TracLightning: http://sourceforge.jp/projects/traclight/releases/53615
It is a Japanese translation version based on Trac 0.12.2, a kind of all-in-one package for Windows. It also includes the AutoWikify plugin, and I only just learned that my patch is included in this package: http://sourceforge.jp/ticket/browse.php?group_id=2810&tid=14661
For testing, I replaced the TracLightning version of AutoWikify with the trac-hacks trunk version.
It works correctly for English wiki pages, but doesn't automatically link Japanese wiki page names.
comment:17 follow-up: 18 Changed 12 years ago by
Replying to redboltz:
It works correctly for English wiki pages, but doesn't automatically link Japanese wiki page names.
Just to be sure, is the space-delimited word issue the only problem? That is, if the name of the Japanese wiki page is surrounded by whitespace, does it link okay for you with the latest version of the plugin?
comment:18 follow-up: 19 Changed 12 years ago by
Replying to rjollos:
Replying to redboltz:
It works correctly for English wiki pages, but doesn't automatically link Japanese wiki page names.
Just to be sure, is the space-delimited word issue the only problem? That is, if the name of the Japanese wiki page is surrounded by whitespace, does it link okay for you with the latest version of the plugin?
Ah, I understand what you mean now. I tested it just now: a space-delimited Japanese wiki page name is autowikified correctly.
comment:19 Changed 12 years ago by
Replying to redboltz:
Ah, I understand what you mean now. I tested it just now: a space-delimited Japanese wiki page name is autowikified correctly.
Okay, thanks a lot for testing. I'll make sure we have a solution within a week. If nothing else, I'll add an option for specifying whether word boundaries are applied. Better yet, we might be able to have the locale determine this implicitly. Best of all, Japanese Trac developer jun66j5 may chime in and tell us what the best solution is ;)
comment:20 Changed 12 years ago by
Owner: | changed from Alec Thomas to Ryan J Ollos |
---|---|
Status: | new → assigned |
comment:21 Changed 12 years ago by
I worked on this in https://github.com/jun66j5/autowikifyplugin/tree/ticket2252/no-boundary-if-cjk-blocks.
If the leading or trailing character of a page name is a CJK character, it generates the regexp without \b on that side.
For details, please see the unit tests.
Leading | Trailing | regexp |
---|---|---|
non CJK | non CJK | \b{page-name}\b |
CJK | non CJK | {page-name}\b |
non CJK | CJK | \b{page-name} |
CJK | CJK | {page-name} |
I don't think it's the best solution; however, I think it works well in most cases.
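The per-side boundary rule in the table can be sketched roughly as follows (a Python 3 sketch; the CJK test here is a crude heuristic based on Unicode character names, not the plugin's actual block-range check):

```python
import re
import unicodedata

def is_cjk(ch):
    # Crude stand-in: treat Han, Hiragana and Katakana as CJK.
    return unicodedata.name(ch, '').startswith(
        ('CJK UNIFIED', 'HIRAGANA', 'KATAKANA'))

def page_pattern(name):
    """Add \\b only on the sides of the name that are not CJK."""
    head = '' if is_cjk(name[0]) else r'\b'
    tail = '' if is_cjk(name[-1]) else r'\b'
    return head + re.escape(name) + tail

# non CJK / non CJK -> \b on both sides
assert page_pattern("WikiStart") == r'\b' + re.escape("WikiStart") + r'\b'

# CJK / CJK -> no \b at all, so the name matches inside a sentence
assert re.search(page_pattern("ああああ"), "ここはああああです") is not None

# CJK / non CJK -> \b only on the trailing side
assert page_pattern("ああPage") == re.escape("ああPage") + r'\b'
```

The design keeps the aaaa-inside-aaaaa protection for Latin names while still matching CJK names embedded in running text.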
comment:22 Changed 12 years ago by
comment:23 Changed 12 years ago by
Jun, thanks for the patch. I'm still trying to understand it completely. I gave you commit access in case you want to push the changes yourself, otherwise I'll get to it sometime this weekend.
comment:24 follow-up: 25 Changed 12 years ago by
Thanks, Ryan!
I would like to push it myself. Could you please grant me the necessary access?
comment:25 Changed 12 years ago by
Replying to jun66j5:
I would like to push it myself. Could you please grant me the necessary access?
Sure, I added you with w-access to the autowikifyplugin path :)
comment:26 Changed 12 years ago by
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
re.escape() is necessary to ensure that page names containing characters that are regex operators don't break the regular expression.
Are you running the latest version of Trac? Can you paste a page name that doesn't work?