Opened 9 years ago

Closed 5 years ago

# Japanese wiki page name doesn't autowikify

Reported by: Takatoshi Kondo
Owned by: Ryan J Ollos
Priority: normal
Component: AutoWikifyPlugin
Severity: normal
Cc: Ryan J Ollos, Jun Omae
Trac Release: 0.10

### Description

A Japanese wiki page (where the page name, not the contents, is written in Japanese) isn't wikified automatically.

I think this is related to UTF-8 multibyte string handling.

In the _update method, the UTF-8 string is escaped. Is that the correct behavior?

### comment:1 Changed 9 years ago by Alec Thomas

The re.escape() is necessary to ensure page names with characters that are regex operators don't break the regular expression.
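For illustration (an editor's sketch, not part of the original comment), joining unescaped page names into an alternation can produce an invalid or wrong pattern; the page names here are hypothetical:

```python
import re

# Hypothetical page names containing regex metacharacters.
pages = ['C++Notes', 'What?Page', 'WikiStart']

# Unescaped, 'C++Notes' yields an invalid pattern ('++' is a bad repeat).
try:
    re.compile('|'.join(pages))
    raise AssertionError('expected re.error')
except re.error:
    pass

# re.escape() makes each name safe to embed in the alternation.
pattern = '|'.join(re.escape(p) for p in pages)
assert re.search(pattern, 'see C++Notes here').group(0) == 'C++Notes'
```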

Are you running the latest version of Trac? Can you paste a page name that doesn't work?

### comment:2 follow-up:  13 Changed 9 years ago by Takatoshi Kondo

I use Trac version 0.10.4.

I inserted print(self.pages) in autowikify.py:

    def _all_pages(self):
        self.pages = set(WikiSystem(self.env).get_pages())
        print(self.pages)


Output:

    set([u'WikiNewPage', u'\u3042\u3042\u3042\u3042', ...snip...])


For example, the page u'\u3042\u3042\u3042\u3042' doesn't autowikify.

The page name below is the same as u'\u3042\u3042\u3042\u3042'. (Can you see it?)

ああああ

### comment:3 Changed 9 years ago by Alec Thomas

Yes, thanks! I'll try this out tonight.

### comment:4 Changed 9 years ago by Takatoshi Kondo

I fixed this problem in my local environment.

• Trac version 0.11.1

The solution consists of two parts:

1. Add the locale flag in the Trac wiki system.
2. Remove the word-boundary markers (\b) from the patterns in the AutoWikifyPlugin.

• See remove_blank.patch
• Japanese words aren't separated by spaces.
• But I think this behavior should be toggleable via an option.

### comment:5 Changed 8 years ago by Christian Boos

I closed #T7552 as wontfix, since I don't think the problem comes from Trac.

The actual problem is also not related to the way re.escape deals with unicode objects as I originally thought:

    >>> import re
    >>> pages = [u"ああああ", 'WikiStart']
    >>> pattern = r'\b(?P<autowiki>' + '|'.join([re.escape(page) for page in pages]) + r')\b'
    >>> pattern
    u'\\b(?P<autowiki>\\\u3042\\\u3042\\\u3042\\\u3042|WikiStart)\\b'
    >>> re.search(pattern, u' \u3042\u3042\u3042\u3042 ', re.UNICODE).span()
    (1, 5)
    >>> re.search(pattern, u' Foo\u3042\u3042\u3042\u3042Bar ', re.UNICODE) is None
    True


So the original code should theoretically work.

Ah, I've just seen this: "Japanese words don't separate with blank." Well, \b is not supposed to correspond only to blanks, but to "... whitespace or a non-alphanumeric, non-underscore character.". Nevertheless this reminded me that Japanese sentences look like a stream of characters with no obvious separator between words (at least to the untrained eye ;-) ).

So it looks like the intended behavior here is actually to extract a specific sequence of alphanumeric characters that is part of a larger sequence of characters, in the same way one would extract 'WikiStart' from 'TheWikiStartPage' in English. Then I agree that the only way would be to optionally remove the \b markers.

Wouldn't that be enough? I still don't see the motivation for #T7552.
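The behavior described above can be reproduced in current Python, where str patterns are Unicode-aware by default (an editor's sketch, not from the original thread):

```python
import re

# CJK characters count as word characters under Unicode matching, so \b
# never fires between two adjacent CJK characters.
name = 'ああああ'
pattern = r'\b' + name + r'\b'

# Surrounded by spaces: the \b anchors match.
assert re.search(pattern, ' ああああ ').span() == (1, 5)

# Embedded in a longer run of Japanese text: no boundary, so no match.
assert re.search(pattern, 'ここにああああがある') is None
```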

### comment:6 Changed 8 years ago by Takatoshi Kondo

The Unicode regex operates correctly.

The reason for using re.LOCALE is as follows.

In the _prepare_rules function, WikiParser builds a regex from the patterns.

The important point is the order of the syntax rules: the plugin's syntax is placed after the internal syntax.

Example:

(?:(?P<bolditalic>!?''''')|(?P<bold>!?''')|(?P<italic>!?'')|(?P<underline>!?__)|(?P<strike>!?~~)|(?P<subscript>!?,,)|(?P<superscript>!?\^)|(?P<inlinecode>!?\{\{\{(?P<inline>.*?)\}\}\})|(?P<inlinecode2>!?(?P<inline2>.*?))|(?P<i0>!?(?<!/)\b\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))+(?:\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))*[\w/](?<![A-Z0-9_]))+(?:@\d+)?(?:#[\w:](?<!\d)(?:[\w:.-]*[\w-])?)?(?=:(?:\Z|\s)|[^:a-zA-Z]|\s|\Z))|(?P<i1>!?$\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))+(?:\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))*[\w/](?<![A-Z0-9_]))+(?:@\d+)?(?:#[\w:](?<!\d)(?:[\w:.-]*[\w-])?)?(?=:(?:\Z|\s)|[^:a-zA-Z]|\s|\Z)\s+(?:'[^']+'|\"[^\"]+\"|[^$]+)\])|(?P<i2>!?$(?:'[^']+'|\"[^\"]+\")$)|(?P<i3>!?(?<!&)#(?P<it_ticket>[a-zA-Z.+-]*?)\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*)|(?P<i4>!?$(?P<it_changeset>[a-zA-Z.+-]*?\s*)(?:\d+|[a-fA-F\d]{8,})(?:/[^$]*)?(?:\?[^\]]*)?(?:#[^\]]*)?\]|(?:\b|!)r\d+\b(?!:\d))|(?P<i5>!?$(?P<it_log>[a-zA-Z.+-]*?\s*)(?P<log_revs>(?:\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*|(?:\d+|[a-fA-F\d]{8,})))(?P<log_path>[/?][^$]*)?\])|(?P<i6>(?:\b|!)r\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*\b)|(?P<i7>!?\{(?P<it_report>[a-zA-Z.+-]*?\s*)\d+\})|(?P<i8>(?P<autowiki>PageNameA|あいうえお|ああ)))


A Japanese wiki page name matches the internal syntax (i0) instead of the autowikify syntax (i8). If the re.LOCALE flag is set, it matches the autowikify syntax (i8) as expected.
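The ordering effect can be reduced to a minimal sketch (hypothetical simplified patterns; `i0` stands in for the internal wiki-name rule, `i8` for the autowikify rule):

```python
import re

text = 'PageNameA'

# Alternation in Python's re is ordered: at a given starting position the
# leftmost alternative that matches wins, not the longest or "best" one.
first_i0 = re.match(r'(?P<i0>[A-Za-z]+)|(?P<i8>PageNameA)', text)
assert first_i0.lastgroup == 'i0'   # the internal rule shadows i8

first_i8 = re.match(r'(?P<i8>PageNameA)|(?P<i0>[A-Za-z]+)', text)
assert first_i8.lastgroup == 'i8'   # reordering lets the plugin rule fire
```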

Check the code below.

# -*- coding: utf-8 -*-

import re
import locale

locale.setlocale(locale.LC_ALL, 'Japan')

def replace(fullmatch):
    """Replace one match with its corresponding expansion"""
    replacement = handle_match(fullmatch)
    if replacement:
        return _markup_to_unicode(replacement)

def handle_match(fullmatch):
    for itype, match in fullmatch.groupdict().items():
        # if match and not itype in self.wikiparser.helper_patterns:
        if match:
            # Check for preceding escape character '!'
            print "match:" + itype + "," + match

str = ur"(?:(?P<bolditalic>!?''''')|(?P<bold>!?''')|(?P<italic>!?'')|(?P<underline>!?__)|(?P<strike>!?~~)|(?P<subscript>!?,,)|(?P<superscript>!?\^)|(?P<inlinecode>!?\{\{\{(?P<inline>.*?)\}\}\})|(?P<inlinecode2>!?(?P<inline2>.*?))|(?P<i0>!?(?<!/)\b\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))+(?:\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))*[\w/](?<![A-Z0-9_]))+(?:@\d+)?(?:#[\w:](?<!\d)(?:[\w:.-]*[\w-])?)?(?=:(?:\Z|\s)|[^:a-zA-Z]|\s|\Z))|(?P<i1>!?$\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))+(?:\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))*[\w/](?<![A-Z0-9_]))+(?:@\d+)?(?:#[\w:](?<!\d)(?:[\w:.-]*[\w-])?)?(?=:(?:\Z|\s)|[^:a-zA-Z]|\s|\Z)\s+(?:'[^']+'|\"[^\"]+\"|[^$]+)\])|(?P<i2>!?$(?:'[^']+'|\"[^\"]+\")$)|(?P<i3>!?(?<!&)#(?P<it_ticket>[a-zA-Z.+-]*?)\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*)|(?P<i4>!?$(?P<it_changeset>[a-zA-Z.+-]*?\s*)(?:\d+|[a-fA-F\d]{8,})(?:/[^$]*)?(?:\?[^\]]*)?(?:#[^\]]*)?\]|(?:\b|!)r\d+\b(?!:\d))|(?P<i5>!?$(?P<it_log>[a-zA-Z.+-]*?\s*)(?P<log_revs>(?:\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*|(?:\d+|[a-fA-F\d]{8,})))(?P<log_path>[/?][^$]*)?\])|(?P<i6>(?:\b|!)r\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*\b)|(?P<i7>!?\{(?P<it_report>[a-zA-Z.+-]*?\s*)\d+\})|(?P<i8>(?P<autowiki>PageNameA|あいうえお|ああ)))"

rules = re.compile(unicode(str), re.UNICODE|re.LOCALE)
print "re.UNICODE|re.LOCALE"

line = u"あいうえお"
result = re.sub(rules, replace, unicode(line))

line = u"PageNameA"
result = re.sub(rules, replace, unicode(line))

line = u"ああ"
result = re.sub(rules, replace, unicode(line))

rules = re.compile(unicode(str), re.UNICODE)
print "re.UNICODE"

line = u"あいうえお"
result = re.sub(rules, replace, unicode(line))

line = u"PageNameA"
result = re.sub(rules, replace, unicode(line))

line = u"ああ"
result = re.sub(rules, replace, unicode(line))

str = ur"(?:(?P<bolditalic>!?''''')|(?P<bold>!?''')|(?P<italic>!?'')|(?P<underline>!?__)|(?P<strike>!?~~)|(?P<subscript>!?,,)|(?P<superscript>!?\^)|(?P<inlinecode>!?\{\{\{(?P<inline>.*?)\}\}\})|(?P<inlinecode2>!?(?P<inline2>.*?))|(?P<i8>(?P<autowiki>PageNameA|あいうえお|ああ))|(?P<i0>!?(?<!/)\b\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))+(?:\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))*[\w/](?<![A-Z0-9_]))+(?:@\d+)?(?:#[\w:](?<!\d)(?:[\w:.-]*[\w-])?)?(?=:(?:\Z|\s)|[^:a-zA-Z]|\s|\Z))|(?P<i1>!?$\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))+(?:\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))*[\w/](?<![A-Z0-9_]))+(?:@\d+)?(?:#[\w:](?<!\d)(?:[\w:.-]*[\w-])?)?(?=:(?:\Z|\s)|[^:a-zA-Z]|\s|\Z)\s+(?:'[^']+'|\"[^\"]+\"|[^$]+)\])|(?P<i2>!?$(?:'[^']+'|\"[^\"]+\")$)|(?P<i3>!?(?<!&)#(?P<it_ticket>[a-zA-Z.+-]*?)\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*)|(?P<i4>!?$(?P<it_changeset>[a-zA-Z.+-]*?\s*)(?:\d+|[a-fA-F\d]{8,})(?:/[^$]*)?(?:\?[^\]]*)?(?:#[^\]]*)?\]|(?:\b|!)r\d+\b(?!:\d))|(?P<i5>!?$(?P<it_log>[a-zA-Z.+-]*?\s*)(?P<log_revs>(?:\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*|(?:\d+|[a-fA-F\d]{8,})))(?P<log_path>[/?][^$]*)?\])|(?P<i6>(?:\b|!)r\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*\b)|(?P<i7>!?\{(?P<it_report>[a-zA-Z.+-]*?\s*)\d+\}))"

rules = re.compile(unicode(str), re.UNICODE)
print "re.UNICODE and i8 before i0"

line = u"あいうえお"
result = re.sub(rules, replace, unicode(line))

line = u"PageNameA"
result = re.sub(rules, replace, unicode(line))

line = u"ああ"
result = re.sub(rules, replace, unicode(line))


Result:

re.UNICODE|re.LOCALE
match:i8,あいうえお
match:autowiki,あいうえお
match:i8,PageNameA
match:autowiki,PageNameA
match:i8,ああ
match:autowiki,ああ
re.UNICODE
match:i0,あいうえお  <= unexpected
match:i8,PageNameA
match:autowiki,PageNameA
match:i8,ああ
match:autowiki,ああ
re.UNICODE and i8 before i0
match:i8,あいうえお
match:autowiki,あいうえお
match:i8,PageNameA
match:autowiki,PageNameA
match:i8,ああ
match:autowiki,ああ


The Japanese string "ああ" matches under both re.UNICODE and re.UNICODE|re.LOCALE, but the string "あいうえお" matches only under re.UNICODE|re.LOCALE. It depends on the string.

This behavior is probably related to the following note from the Python re documentation, but I'm not sure:

> Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database. New in version 2.0.

There is another approach: arranging i8 in front of i0 when re.LOCALE is not set. However, I cannot predict the side effects.

Any ideas?

### comment:7 Changed 8 years ago by Christian Boos

So if I understand you correctly, when using the re.UNICODE flag only, "あいうえお" is matched as a regular wiki page name and not as an autowiki name. But why should that be an issue, as in both cases you'd get a link to that wiki page?

### comment:8 follow-up:  9 Changed 8 years ago by Takatoshi Kondo

I think the Trac internal wiki link doesn't work for the page "あいうえお", and I verified this.

The AutoWikifyPlugin's page formatter is below:

    def _page_formatter(self, f, n, match):
        page = match.group('autowiki')
        return Markup('<a href="%s" class="wiki">%s</a>'
                      % (self.env.href.wiki(page),
                         escape(page)))


It uses the 'autowiki' named group.

The Trac internal wiki link system might only produce links for CamelCase names, or when a link is written explicitly, like:

    [wiki:pagename]

Should I make any changes to the plugin?

### comment:9 in reply to:  8 Changed 8 years ago by Christian Boos

> I think the Trac internal wiki link doesn't work for the page "あいうえお".

I'm afraid I'm not able to follow you... In comment:6, you justify the need for re.LOCALE by saying that without that flag, "あいうえお" gets matched by the internal wiki name regexp i0, which is what

    re.UNICODE
    match:i0,あいうえお  <= unexpected

shows.

Now you say that the internal wiki name regexp doesn't work for that page, which I can understand if (and only if) that name is part of some longer sentence, like in "あああああいうえおああああああ" (maybe an actual real example would help here). But in that situation, the autowikify regexp should match, provided the \b markers are dropped as discussed in comment:4 and comment:5.

### comment:10 Changed 8 years ago by Takatoshi Kondo

It's possible that I don't understand this well enough either.

I think there are two topics here, and we seem to have a different understanding of the first one.

1. Why is it insufficient for the page name to match 'i0'?
    1. The function handle_match is called:

            def handle_match(self, fullmatch):
                for itype, match in fullmatch.groupdict().items():
                    if match and not itype in self.wikiparser.helper_patterns:
                        # Check for preceding escape character '!'
                        if match[0] == '!':
                            return escape(match[1:])
                        if itype in self.wikiparser.external_handlers:
            =>              external_handler = self.wikiparser.external_handlers[itype]
                            return external_handler(self, match, fullmatch)
                        else:
                            internal_handler = getattr(self, '_%s_formatter' % itype)
                            return internal_handler(match, fullmatch)

    2. The external_handler for 'i0' is the function wikipagename_link:

                # Regular WikiPageNames
                if not _check_unicode_camelcase(match):
        =>          return match
                self.format_page_name(match),
                self.ignore_missing_pages)

    3. In wikipagename_link, the Japanese string "ああああ" is judged not to be CamelCase.
    4. In the end, no wiki link is created, and the plain string "ああああ" is returned.
2. What behavior is expected when a Japanese wiki page name is part of a longer sentence?
    • In Japanese, it is necessary to match it.
    • In addition, the wiki page names in the regex pattern should be ordered by length.
    • The patch is being made now.
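The length-ordering point can be sketched as follows (an editor's illustration, not the actual patch): without \b anchors, a page name that is a prefix of a longer one would win the alternation unless longer names come first.

```python
import re

pages = ['ああ', 'ああああ']

# Naive order: the shorter name comes first in the alternation and wins.
naive = '|'.join(re.escape(p) for p in pages)
assert re.search(naive, 'Xああああ').group(0) == 'ああ'

# Longest-first order: the full page name is matched instead.
by_length = '|'.join(re.escape(p) for p in
                     sorted(pages, key=len, reverse=True))
assert re.search(by_length, 'Xああああ').group(0) == 'ああああ'
```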

### comment:11 Changed 8 years ago by Takatoshi Kondo

I said:

> In addition, the wiki page names in the regex pattern should be ordered by length.
> The patch is being made now.

This operates correctly, though there may be room for further improvement in the use of Python's collection classes.

### comment:12 Changed 5 years ago by Ryan J Ollos

Cc: Ryan J Ollos added; anonymous removed

### comment:13 in reply to:  2 ; follow-up:  14 Changed 5 years ago by Ryan J Ollos

> For example, the page u'\u3042\u3042\u3042\u3042' doesn't autowikify.

I've tested AutoWikifyPlugin at r11819 with Trac 0.11, and a wiki page named ああああ is NOT autowikified (this is a clean Trac install with Genshi 0.6.0). On a clean 0.12 Trac install with Genshi 0.6.0 but no Babel, the wiki page named ああああ is autowikified. The same behavior is seen for other wiki page names that contain unicode characters, such as ÄÄÄÄ.

> The page name below is the same as u'\u3042\u3042\u3042\u3042'. (Can you see it?)
>
> ああああ

I don't see attachment:namesort.diff as the solution. The problem with removing the word boundaries from the regex is that, on a Trac install with a wiki page named aaaa, the unrelated word aaaaa will be rendered with a link covering its first four characters.
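A sketch of that regression (editor's illustration):

```python
import re

# With word boundaries, the page name 'aaaa' does not match inside the
# unrelated word 'aaaaa'.
assert re.search(r'\baaaa\b', ' aaaaa ') is None

# Without them, the first four characters of 'aaaaa' would be matched
# and turned into a link.
assert re.search(r'aaaa', ' aaaaa ').group(0) == 'aaaa'
```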

So, two issues remain:

• Why does this work with Trac 0.12 but not 0.11? I'm really not too concerned about this, and would just suggest that anyone experiencing the problem upgrade.
• comment:4 says Japanese words aren't separated by spaces. I'm not sure how to deal with that issue, because I don't think we want to remove the word boundaries from the regex. We could add an option for that, but I think someone who understands the language and Python locale issues should deal with it. I'd likely just make a mess of the situation.

So I'll leave this ticket open for now, but anyone experiencing issues should first upgrade to r11819 or later.

### comment:14 in reply to:  13 Changed 5 years ago by Takatoshi Kondo

Thank you for your reply. I agree with you: my patch introduces the side effect you mentioned above. Providing an option is an acceptable solution for me, but Python locale issues are difficult to deal with. I support your decision.

### comment:15 follow-up:  16 Changed 5 years ago by Ryan J Ollos

I'm pleasantly surprised to get a reply on such an old ticket :)

What version of Trac are you running? Are you able to test out the latest version of AutoWikifyPlugin?

### comment:16 in reply to:  15 ; follow-up:  17 Changed 5 years ago by Takatoshi Kondo

> I'm pleasantly surprised to get a reply on such an old ticket :)
>
> What version of Trac are you running? Are you able to test out the latest version of AutoWikifyPlugin?

I'm using TracLightning: http://sourceforge.jp/projects/traclight/releases/53615

It is a Japanese-translated version based on Trac 0.12.2, a kind of all-in-one package for Windows. It also includes the AutoWikifyPlugin, and my patch is included in this package; I only just realized that. http://sourceforge.jp/ticket/browse.php?group_id=2810&tid=14661

For testing, I replaced the TracLightning version of autowikify with the trac-hacks trunk version.

It works correctly for English wiki pages, but doesn't automatically link Japanese wiki page names.

### comment:17 in reply to:  16 ; follow-up:  18 Changed 5 years ago by Ryan J Ollos

> It works correctly for English wiki pages, but doesn't automatically link Japanese wiki page names.

Just to be sure, is the space-delimited word issue the only problem? That is, if the name of the Japanese wiki page is surrounded by whitespace, does it link okay for you with the latest version of the plugin?

### comment:18 in reply to:  17 ; follow-up:  19 Changed 5 years ago by Takatoshi Kondo

> > It works correctly for English wiki pages, but doesn't automatically link Japanese wiki page names.
>
> Just to be sure, is the space-delimited word issue the only problem? That is, if the name of the Japanese wiki page is surrounded by whitespace, does it link okay for you with the latest version of the plugin?

Ah, now I understand what you mean. I tested it just now: the space-delimited Japanese wiki page name is autowikified correctly.

### comment:19 in reply to:  18 Changed 5 years ago by Ryan J Ollos

> Ah, now I understand what you mean. I tested it just now: the space-delimited Japanese wiki page name is autowikified correctly.

Okay, thanks a lot for testing. I'll make sure we have a solution within a week. If nothing else, I'll just add an option for specifying whether the word boundaries are whitespace-separated. Better, we might be able to have the locale determine this implicitly. Best, Japanese Trac developer jun66j5 will chime in and tell us what the best solution is ;)

### comment:20 Changed 5 years ago by Ryan J Ollos

Owner: changed from Alec Thomas to Ryan J Ollos
Status: new → assigned

### comment:21 Changed 5 years ago by Jun Omae

If the leading or trailing character of a page name is a CJK character, the plugin generates the regexp without \b on that side. For details, please see the unit tests.

| Leading | Trailing | regexp |
|---------|----------|--------|
| non CJK | non CJK | `\b{page-name}\b` |
| CJK | non CJK | `{page-name}\b` |
| non CJK | CJK | `\b{page-name}` |
| CJK | CJK | `{page-name}` |

I don't think it's the best solution; however, I think it works well in most cases.
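Jun's rule could be sketched like this (an editor's approximation; the character ranges and the helper names `is_cjk` and `page_pattern` are illustrative, not the plugin's actual code):

```python
import re

def is_cjk(ch):
    # Rough check covering Hiragana, Katakana, and common CJK ideographs
    # (illustrative; the plugin may use different ranges).
    return any(lo <= ord(ch) <= hi for lo, hi in
               [(0x3040, 0x30FF), (0x3400, 0x4DBF), (0x4E00, 0x9FFF)])

def page_pattern(name):
    # Add \b only on the sides where the name starts/ends with a
    # non-CJK character, per the table above.
    head = '' if is_cjk(name[0]) else r'\b'
    tail = '' if is_cjk(name[-1]) else r'\b'
    return head + re.escape(name) + tail

# Non-CJK names keep both anchors, so they don't match inside words.
assert re.search(page_pattern('WikiStart'), 'TheWikiStartPage') is None

# CJK names drop the anchors and match inside running Japanese text.
assert re.search(page_pattern('ああああ'), 'ここにああああがある')
```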

### comment:22 Changed 5 years ago by Ryan J Ollos

(In [11843]) Refs #2252: Refactored, in preparation for applying Jun's patch to support Japanese wiki page names.

### comment:23 Changed 5 years ago by Ryan J Ollos

Jun, thanks for the patch. I'm still trying to understand it completely. I gave you commit access in case you want to push the changes yourself, otherwise I'll get to it sometime this weekend.

### comment:24 follow-up:  25 Changed 5 years ago by Jun Omae

Thanks, Ryan!

I would like to push by myself. Could you please grant the right?

### comment:25 in reply to:  24 Changed 5 years ago by Ryan J Ollos

I would like to push by myself. Could you please grant the right?

Sure, I added you for w-access to the autowikifyplugin path :)

### comment:26 Changed 5 years ago by Jun Omae

Resolution: set to fixed
Status: assigned → closed

(In [11904]) fixed #2252: autowikify works with CJK wiki name
