
Opened 17 years ago

Closed 12 years ago

#2252 closed defect (fixed)

Japanese wiki page name doesn't autowikify

Reported by: redboltz
Owned by: Ryan J Ollos
Priority: normal
Component: AutoWikifyPlugin
Severity: normal
Keywords:
Cc: Ryan J Ollos, Jun Omae
Trac Release: 0.10

Description

Japanese wiki pages (where the page name, not the contents, is written in Japanese) don't wikify automatically.

I think it's related to UTF-8 multibyte strings.

In the _update method, the UTF-8 string is escaped. Is that the correct behavior?

Attachments (2)

remove_blank.patch (814 bytes) - added by redboltz 16 years ago.
namesort.diff (1.3 KB) - added by redboltz 16 years ago.


Change History (28)

comment:1 Changed 17 years ago by Alec Thomas

The re.escape() is necessary to ensure page names with characters that are regex operators don't break the regular expression.
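
For illustration, here is a minimal sketch (not the plugin's actual code) of why that escaping matters; the page name C++Tips below is made up:

    import re

    pages = [u'C++Tips', u'WikiStart']    # hypothetical page names
    unescaped = r'\b(?P<autowiki>' + '|'.join(pages) + r')\b'
    escaped = r'\b(?P<autowiki>' + '|'.join([re.escape(p) for p in pages]) + r')\b'

    re.compile(escaped)      # compiles fine
    re.compile(unescaped)    # raises re.error, because the bare '++' is not valid regex syntax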

Are you running the latest version of Trac? Can you paste a page name that doesn't work?

comment:2 Changed 17 years ago by redboltz

I use Trac version 0.10.4.

I inserted "print(self.pages)" into autowikify.py:

    def _all_pages(self):
        self.pages = set(WikiSystem(self.env).get_pages())
        print(self.pages)

Output:

set([u'WikiNewPage', u'\u3042\u3042\u3042\u3042', ...snip...])

For example, the page u'\u3042\u3042\u3042\u3042' doesn't autowikify.

The page name below is the same as u'\u3042\u3042\u3042\u3042' (can you see it?).

ああああ

comment:3 Changed 17 years ago by Alec Thomas

Yes, thanks! I'll try this out tonight.

comment:4 Changed 16 years ago by redboltz

I fixed this problem in my local environment.

  • Trac version 0.11.1

The solution consists of two parts:

1. Add the locale flag in the Trac wiki system.

2. Remove the blank "\b" from the patterns in the AutoWikify plugin.

  • See remove_blank.patch
  • Japanese words don't separate with blank.
    • But I think this behavior should be toggleable via an option.

Changed 16 years ago by redboltz

Attachment: remove_blank.patch added

comment:5 Changed 16 years ago by Christian Boos

I closed #T7552 as wontfix, since I don't think the problem comes from Trac.

The actual problem is also not related to the way re.escape deals with unicode objects as I originally thought:

>>> import re
>>> pages = [u"ああああ", 'WikiStart']
>>> pattern = r'\b(?P<autowiki>' + '|'.join([re.escape(page) for page in pages]) + r')\b'
>>> pattern
u'\\b(?P<autowiki>\\\u3042\\\u3042\\\u3042\\\u3042|WikiStart)\\b'
>>> re.search(pattern, u' \u3042\u3042\u3042\u3042 ', re.UNICODE).span()
(1, 5)
>>> re.search(pattern, u' Foo\u3042\u3042\u3042\u3042Bar ', re.UNICODE) is None
True

So the original code should theoretically work.

Ah, I've just seen this: "Japanese words don't separate with blank." Well, \b is not supposed to correspond only to blanks, but to "... whitespace or a non-alphanumeric, non-underscore character". Nevertheless, this reminded me that Japanese sentences look like a stream of characters with no obvious separator between words (at least to the untrained eye ;-)).

So it looks like the intended behavior here is actually to extract a specific sequence of alphanumeric characters that is part of a larger sequence of characters, in the same way one would extract 'WikiStart' from 'TheWikiStartPage' in English. Then I agree that the only way would be to optionally remove the \b markers.
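
For illustration, continuing the interactive session above: a minimal sketch of what dropping the \b markers would change (the surrounding text ほげ...ほげ is made up):

>>> pages = [u"ああああ", u"WikiStart"]
>>> with_b = r'\b(?P<autowiki>' + '|'.join([re.escape(p) for p in pages]) + r')\b'
>>> without_b = r'(?P<autowiki>' + '|'.join([re.escape(p) for p in pages]) + r')'
>>> re.search(with_b, u'ほげああああほげ', re.UNICODE) is None
True
>>> re.search(without_b, u'ほげああああほげ', re.UNICODE).span()
(2, 6)
>>> re.search(without_b, u'TheWikiStartPage', re.UNICODE).span()
(3, 12)

Note that without \b the page name also matches inside longer English words such as TheWikiStartPage, which is the trade-off discussed later in this ticket.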

Wouldn't that be enough? I still don't see the motivation for #T7552.

comment:6 Changed 16 years ago by redboltz

The Unicode regex itself operates correctly.

The reason for using re.LOCALE is as follows.

In the _prepare_rules function, WikiParser builds a regex from the patterns.

The important point is the order of the syntax rules.

The plugin's syntax is placed after the internal syntax.

Example:

(?:(?P<bolditalic>!?''''')|(?P<bold>!?''')|(?P<italic>!?'')|(?P<underline>!?__)|(?P<strike>!?~~)|(?P<subscript>!?,,)|(?P<superscript>!?\^)|(?P<inlinecode>!?\{\{\{(?P<inline>.*?)\}\}\})|(?P<inlinecode2>!?`(?P<inline2>.*?)`)|(?P<i0>!?(?<!/)\b\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))+(?:\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))*[\w/](?<![A-Z0-9_]))+(?:@\d+)?(?:#[\w:](?<!\d)(?:[\w:.-]*[\w-])?)?(?=:(?:\Z|\s)|[^:a-zA-Z]|\s|\Z))|(?P<i1>!?\[\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))+(?:\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))*[\w/](?<![A-Z0-9_]))+(?:@\d+)?(?:#[\w:](?<!\d)(?:[\w:.-]*[\w-])?)?(?=:(?:\Z|\s)|[^:a-zA-Z]|\s|\Z)\s+(?:'[^']+'|\"[^\"]+\"|[^\]]+)\])|(?P<i2>!?\[(?:'[^']+'|\"[^\"]+\")\])|(?P<i3>!?(?<!&)#(?P<it_ticket>[a-zA-Z.+-]*?)\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*)|(?P<i4>!?\[(?P<it_changeset>[a-zA-Z.+-]*?\s*)(?:\d+|[a-fA-F\d]{8,})(?:/[^\]]*)?(?:\?[^\]]*)?(?:#[^\]]*)?\]|(?:\b|!)r\d+\b(?!:\d))|(?P<i5>!?\[(?P<it_log>[a-zA-Z.+-]*?\s*)(?P<log_revs>(?:\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*|(?:\d+|[a-fA-F\d]{8,})))(?P<log_path>[/?][^\]]*)?\])|(?P<i6>(?:\b|!)r\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*\b)|(?P<i7>!?\{(?P<it_report>[a-zA-Z.+-]*?\s*)\d+\})|(?P<i8>(?P<autowiki>PageNameA|あいうえお|ああ)))

A Japanese wiki page name matches the internal syntax (i0) instead of the autowikify syntax (i8). If the re.LOCALE flag is set, it matches the autowikify syntax (i8), as expected.

Check the code below.

# -*- coding: utf-8 -*-

import re
import locale

locale.setlocale(locale.LC_ALL, 'Japan')   # locale name is platform-dependent; e.g. 'ja_JP.UTF-8' on Unix

def replace(fullmatch):
    """Replace one match with its corresponding expansion"""
    replacement = handle_match(fullmatch)
    if replacement:
        return replacement
    # return the text unchanged so that re.sub() always gets a string back
    return fullmatch.group(0)

def handle_match(fullmatch):
    for itype, match in fullmatch.groupdict().items():
#       if match and not itype in self.wikiparser.helper_patterns:
        if match:
            # In Trac this checks for a preceding escape character '!';
            # here we only report which syntax rule matched
            print "match:" + itype + "," + match

str = ur"(?:(?P<bolditalic>!?''''')|(?P<bold>!?''')|(?P<italic>!?'')|(?P<underline>!?__)|(?P<strike>!?~~)|(?P<subscript>!?,,)|(?P<superscript>!?\^)|(?P<inlinecode>!?\{\{\{(?P<inline>.*?)\}\}\})|(?P<inlinecode2>!?`(?P<inline2>.*?)`)|(?P<i0>!?(?<!/)\b\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))+(?:\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))*[\w/](?<![A-Z0-9_]))+(?:@\d+)?(?:#[\w:](?<!\d)(?:[\w:.-]*[\w-])?)?(?=:(?:\Z|\s)|[^:a-zA-Z]|\s|\Z))|(?P<i1>!?\[\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))+(?:\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))*[\w/](?<![A-Z0-9_]))+(?:@\d+)?(?:#[\w:](?<!\d)(?:[\w:.-]*[\w-])?)?(?=:(?:\Z|\s)|[^:a-zA-Z]|\s|\Z)\s+(?:'[^']+'|\"[^\"]+\"|[^\]]+)\])|(?P<i2>!?\[(?:'[^']+'|\"[^\"]+\")\])|(?P<i3>!?(?<!&)#(?P<it_ticket>[a-zA-Z.+-]*?)\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*)|(?P<i4>!?\[(?P<it_changeset>[a-zA-Z.+-]*?\s*)(?:\d+|[a-fA-F\d]{8,})(?:/[^\]]*)?(?:\?[^\]]*)?(?:#[^\]]*)?\]|(?:\b|!)r\d+\b(?!:\d))|(?P<i5>!?\[(?P<it_log>[a-zA-Z.+-]*?\s*)(?P<log_revs>(?:\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*|(?:\d+|[a-fA-F\d]{8,})))(?P<log_path>[/?][^\]]*)?\])|(?P<i6>(?:\b|!)r\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*\b)|(?P<i7>!?\{(?P<it_report>[a-zA-Z.+-]*?\s*)\d+\})|(?P<i8>(?P<autowiki>PageNameA|あいうえお|ああ)))"

rules = re.compile(unicode(str), re.UNICODE|re.LOCALE)
print "re.UNICODE|re.LOCALE"

line = u"あいうえお"
result = re.sub(rules, replace, unicode(line))

line = u"PageNameA"
result = re.sub(rules, replace, unicode(line))

line = u"ああ"
result = re.sub(rules, replace, unicode(line))


rules = re.compile(unicode(str), re.UNICODE)
print "re.UNICODE"

line = u"あいうえお"
result = re.sub(rules, replace, unicode(line))

line = u"PageNameA"
result = re.sub(rules, replace, unicode(line))

line = u"ああ"
result = re.sub(rules, replace, unicode(line))

str = ur"(?:(?P<bolditalic>!?''''')|(?P<bold>!?''')|(?P<italic>!?'')|(?P<underline>!?__)|(?P<strike>!?~~)|(?P<subscript>!?,,)|(?P<superscript>!?\^)|(?P<inlinecode>!?\{\{\{(?P<inline>.*?)\}\}\})|(?P<inlinecode2>!?`(?P<inline2>.*?)`)|(?P<i8>(?P<autowiki>PageNameA|あいうえお|ああ))|(?P<i0>!?(?<!/)\b\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))+(?:\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))*[\w/](?<![A-Z0-9_]))+(?:@\d+)?(?:#[\w:](?<!\d)(?:[\w:.-]*[\w-])?)?(?=:(?:\Z|\s)|[^:a-zA-Z]|\s|\Z))|(?P<i1>!?\[\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))+(?:\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))*[\w/](?<![A-Z0-9_]))+(?:@\d+)?(?:#[\w:](?<!\d)(?:[\w:.-]*[\w-])?)?(?=:(?:\Z|\s)|[^:a-zA-Z]|\s|\Z)\s+(?:'[^']+'|\"[^\"]+\"|[^\]]+)\])|(?P<i2>!?\[(?:'[^']+'|\"[^\"]+\")\])|(?P<i3>!?(?<!&)#(?P<it_ticket>[a-zA-Z.+-]*?)\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*)|(?P<i4>!?\[(?P<it_changeset>[a-zA-Z.+-]*?\s*)(?:\d+|[a-fA-F\d]{8,})(?:/[^\]]*)?(?:\?[^\]]*)?(?:#[^\]]*)?\]|(?:\b|!)r\d+\b(?!:\d))|(?P<i5>!?\[(?P<it_log>[a-zA-Z.+-]*?\s*)(?P<log_revs>(?:\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*|(?:\d+|[a-fA-F\d]{8,})))(?P<log_path>[/?][^\]]*)?\])|(?P<i6>(?:\b|!)r\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*\b)|(?P<i7>!?\{(?P<it_report>[a-zA-Z.+-]*?\s*)\d+\}))"

rules = re.compile(unicode(str), re.UNICODE)
print "re.UNICODE and i8 before i0"

line = u"あいうえお"
result = re.sub(rules, replace, unicode(line))

line = u"PageNameA"
result = re.sub(rules, replace, unicode(line))

line = u"ああ"
result = re.sub(rules, replace, unicode(line))

Result:

re.UNICODE|re.LOCALE
match:i8,あいうえお
match:autowiki,あいうえお
match:i8,PageNameA
match:autowiki,PageNameA
match:i8,ああ
match:autowiki,ああ
re.UNICODE
match:i0,あいうえお  <= unexpected
match:i8,PageNameA
match:autowiki,PageNameA
match:i8,ああ
match:autowiki,ああ
re.UNICODE and i8 before i0
match:i8,あいうえお
match:autowiki,あいうえお
match:i8,PageNameA
match:autowiki,PageNameA
match:i8,ああ
match:autowiki,ああ

Japanese string "ああ" matches both re.UNICODE and re.UNICODE|re.LOCALE. But string "あいうえお" matches only re.UNICODE|re.LOCALE. It depends on string.

Probably, this behavior is concerned with the below. But I'm not sure.

Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database. New in version 2.0.

There is also the approach of placing i8 in front of i0 when re.LOCALE is not set. However, I can't predict the side effects.

Any ideas?

comment:7 Changed 16 years ago by Christian Boos

So if I understand you correctly, when using the re.UNICODE flag only, "あいうえお" is matched as a regular wiki page name and not as an autowiki name. But why should that be an issue, as in both cases you'd get a link to that wiki page?

comment:8 Changed 16 years ago by redboltz

I think Trac's internal wiki link doesn't work for the page "あいうえお". I checked it.

The AutoWikify plugin's page_formatter is below:

    def _page_formatter(self, f, n, match):
        page = match.group('autowiki')
        return Markup('<a href="%s" class="wiki">%s</a>'
                      % (self.env.href.wiki(page),
                         escape(page)))

It uses the 'autowiki' group.

Trac's internal wiki link system might only work for CamelCase names, or when a link is written explicitly like below:

[wiki:pagename]

Should I make any changes to the plugin?

comment:9 in reply to:  8 Changed 16 years ago by Christian Boos

Replying to redboltz:

I think Trac's internal wiki link doesn't work for the page "あいうえお".

I'm afraid I'm not able to follow you... In comment:6, you justify the need for re.LOCALE by saying that without that flag, "あいうえお" gets matched by the internal wiki name regexp i0 (that's what:

re.UNICODE
match:i0,あいうえお  <= unexpected

shows).

Now you say that the internal wiki name regexp doesn't work for that page, which I can understand if (and only if) that name is part of some longer sentence, like in "あああああいうえおああああああ" (maybe an actual real example would help here). But in that situation, the autowikify regexp should match, provided the \b markers are dropped as discussed in comment:4 and comment:5.

comment:10 Changed 16 years ago by redboltz

It's possible that I don't understand things well enough either.

I think there are two topics.

It seems our understanding differs on the first topic.

  1. Why is it insufficient for the name to match 'i0'?
    1. The function 'handle_match' is called.
          def handle_match(self, fullmatch):
              for itype, match in fullmatch.groupdict().items():
                  if match and not itype in self.wikiparser.helper_patterns:
                      # Check for preceding escape character '!'
                      if match[0] == '!':
                          return escape(match[1:])
                      if itype in self.wikiparser.external_handlers:
      =>                  external_handler = self.wikiparser.external_handlers[itype]
                          return external_handler(self, match, fullmatch)
                      else:
                          internal_handler = getattr(self, '_%s_formatter' % itype)
                          return internal_handler(match, fullmatch)
      
    2. 'i0' external_handler is the function 'wikipagename_link'.
              # Regular WikiPageNames
              def wikipagename_link(formatter, match, fullmatch):
                  if not _check_unicode_camelcase(match):
      =>              return match
                  return self._format_link(formatter, 'wiki', match,
                                           self.format_page_name(match),
                                           self.ignore_missing_pages)
      
    3. In the function 'wikipagename_link', the Japanese string "ああああ" is judged not to be CamelCase.
    4. As a result, it is not a wiki link, and the plain string "ああああ" is returned.
  2. What is the expected behavior when a Japanese wiki page name is part of some longer sentence?
    • In Japanese, it is necessary to match it.
    • In addition, the wiki page names in the regex pattern should be ordered by length.
      • The patch is being made now.

Changed 16 years ago by redboltz

Attachment: namesort.diff added

comment:11 Changed 16 years ago by redboltz

I said:

In addition, the wiki page names in the regex pattern should be ordered by length.
The patch is being made now.

I made the patch (namesort.diff).

This operates correctly, though there may be room for further improvement in how Python's collection classes are used.
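
For illustration, a minimal sketch of the length-ordering idea (this is not the contents of namesort.diff; the page names are made up, and \b is omitted as in the earlier discussion):

    # -*- coding: utf-8 -*-
    import re

    # Hypothetical page names; u'ああ' is a prefix of u'ああああ'
    pages = set([u'ああ', u'ああああ', u'WikiStart'])

    # Sort longest-first so that u'ああああ' wins over u'ああ' in the alternation
    ordered = sorted(pages, key=len, reverse=True)
    pattern = u'(?P<autowiki>%s)' % u'|'.join([re.escape(p) for p in ordered])

    print re.search(pattern, u'ああああ', re.UNICODE).group()   # -> ああああ (not just ああ)

With an unsorted set, the shorter name u'ああ' could appear first in the alternation, and only the first two characters would be linked.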

comment:12 Changed 12 years ago by Ryan J Ollos

Cc: Ryan J Ollos added; anonymous removed

comment:13 in reply to:  2 ; Changed 12 years ago by Ryan J Ollos

Cc: Jun Omae added

Replying to kondo@t.email.ne.jp:

For example, the page u'\u3042\u3042\u3042\u3042' doesn't autowikify.

I've tested AutoWikifyPlugin at r11819 with Trac 0.11, and a wiki page named ああああ is NOT autowikified (this is a clean Trac install with Genshi 0.6.0). On a clean 0.12 Trac install with Genshi 0.6.0 but no Babel, the wiki page named ああああ is autowikified. The same behavior is seen for other wiki page names that contain unicode characters, such as ÄÄÄÄ.

The page name below is the same as u'\u3042\u3042\u3042\u3042' (can you see it?).

ああああ

I don't see attachment:namesort.diff as the solution. The problem with removing the word boundaries from the regex is that, on a Trac install with a wiki page named aaaa, the text aaaaa will be rendered with a wiki link inside the word (the leading aaaa becomes a link, leaving a stray trailing a).

So, two issues remain:

  • Why does this work with Trac 0.12 but not 0.11? I'm really not too concerned about this, and would just suggest that anyone experiencing the problem upgrade.
  • comment:4 says Japanese words don't separate with blank. I'm not sure how to deal with that issue, because I don't think we want to remove the word boundaries from the regex. We could add an option for that, but I think someone who understands the language and Python locale issues should deal with that. I'd likely just make a mess of the situation.

So I'll leave this ticket open for now, but anyone experiencing issues should first upgrade to r11819 or later.

comment:14 in reply to:  13 Changed 12 years ago by redboltz

rjollos,

Thank you for your reply. I agree with you. My patch introduces the side effect that you mentioned above. Providing an option is an acceptable solution for me, but Python locale issues are difficult to deal with. I support your decision.

comment:15 Changed 12 years ago by Ryan J Ollos

I'm pleasantly surprised to get a reply on such an old ticket :)

What version of Trac are you running? Are you able to test out the latest version of AutoWikifyPlugin?

comment:16 in reply to:  15 ; Changed 12 years ago by redboltz

Replying to rjollos:

I'm pleasantly surprised to get a reply on such an old ticket :)

What version of Trac are you running? Are you able to test out the latest version of AutoWikifyPlugin?

I'm using TracLightning: http://sourceforge.jp/projects/traclight/releases/53615

It is a Japanese-translated distribution based on Trac 0.12.2, a kind of all-in-one package for Windows. It also includes the AutoWikify plugin, and my patch is included in this package; I only just learned that. http://sourceforge.jp/ticket/browse.php?group_id=2810&tid=14661

For testing, I replaced the TracLightning version of AutoWikify with the trac-hacks trunk version.

It works correctly for English wiki pages, but doesn't link automatically to Japanese wiki pages.

comment:17 in reply to:  16 ; Changed 12 years ago by Ryan J Ollos

Replying to redboltz:

It works correctly for English wiki pages, but doesn't link automatically to Japanese wiki pages.

Just to be sure, is the space-delimited word issue the only problem? That is, if the name of the Japanese wiki page is surrounded by whitespace, does it link okay for you with the latest version of the plugin?

comment:18 in reply to:  17 ; Changed 12 years ago by redboltz

Replying to rjollos:

Replying to redboltz:

It works correctly for English wiki pages, but doesn't link automatically to Japanese wiki pages.

Just to be sure, is the space-delimited word issue the only problem? That is, if the name of the Japanese wiki page is surrounded by whitespace, does it link okay for you with the latest version of the plugin?

Ah, now I understand what you mean. I tested it just now: the space-delimited Japanese wiki page name is autowikified correctly.

comment:19 in reply to:  18 Changed 12 years ago by Ryan J Ollos

Replying to redboltz:

Ah, now I understand what you mean. I tested it just now: the space-delimited Japanese wiki page name is autowikified correctly.

Okay, thanks a lot for testing. I'll make sure we have a solution within a week. If nothing else, I'll just add an option for specifying whether words are whitespace-delimited. Better, we might be able to have the locale determine this implicitly. Best, Japanese Trac developer jun66j5 will chime in and tell us what the best solution is ;)

comment:20 Changed 12 years ago by Ryan J Ollos

Owner: changed from Alec Thomas to Ryan J Ollos
Status: new → assigned

comment:21 Changed 12 years ago by Jun Omae

I've worked on this in https://github.com/jun66j5/autowikifyplugin/tree/ticket2252/no-boundary-if-cjk-blocks.

If the leading or trailing character of a page name is a CJK character, it generates the regexp without the corresponding \b. For details, please see the unit tests.

Leading    Trailing   regexp
non-CJK    non-CJK    \b{page-name}\b
CJK        non-CJK    {page-name}\b
non-CJK    CJK        \b{page-name}
CJK        CJK        {page-name}

I don't think it's the best solution; however, I think it works well in most cases.
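
For illustration, a minimal sketch of that per-name boundary rule (this is not the code from the branch above, and the CJK check is simplified to hiragana, katakana and the CJK Unified Ideographs block):

    # -*- coding: utf-8 -*-
    import re

    def is_cjk(c):
        # Simplified: Hiragana/Katakana (U+3040-U+30FF) and CJK Unified Ideographs (U+4E00-U+9FFF)
        return u'\u3040' <= c <= u'\u30ff' or u'\u4e00' <= c <= u'\u9fff'

    def name_pattern(name):
        # Drop \b on whichever side of the page name starts or ends with a CJK character
        lead = u'' if is_cjk(name[0]) else u'\\b'
        trail = u'' if is_cjk(name[-1]) else u'\\b'
        return lead + re.escape(name) + trail

    pages = [u'WikiStart', u'ああああ']    # hypothetical page names
    pattern = u'(?P<autowiki>%s)' % u'|'.join([name_pattern(p) for p in pages])

    print re.search(pattern, u'ほげああああほげ', re.UNICODE).group()   # -> ああああ
    print re.search(pattern, u'TheWikiStartPage', re.UNICODE)          # -> None (boundaries kept)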

comment:22 Changed 12 years ago by Ryan J Ollos

(In [11843]) Refs #2252: Refactored, in preparation for applying Jun's patch to support Japanese wiki page names.

comment:23 Changed 12 years ago by Ryan J Ollos

Jun, thanks for the patch. I'm still trying to understand it completely. I gave you commit access in case you want to push the changes yourself, otherwise I'll get to it sometime this weekend.

comment:24 Changed 12 years ago by Jun Omae

Thanks, Ryan!

I would like to push the changes myself. Could you please grant me the rights?

comment:25 in reply to:  24 Changed 12 years ago by Ryan J Ollos

Replying to jun66j5:

I would like to push the changes myself. Could you please grant me the rights?

Sure, I added you for w-access to the autowikifyplugin path :)

comment:26 Changed 12 years ago by Jun Omae

Resolution: fixed
Status: assigned → closed

(In [11904]) fixed #2252: autowikify works with CJK wiki name
