
Opened 7 years ago

Closed 2 years ago

#2252 closed defect (fixed)

Japanese wiki page name doesn't autowikify

Reported by: kondo@…
Owned by: rjollos
Priority: normal
Component: AutoWikifyPlugin
Severity: normal
Keywords:
Cc: rjollos, jun66j5
Trac Release: 0.10

Description

A Japanese wiki page (the page name is written in Japanese, not the contents) is not wikified automatically.

I think this is related to UTF-8 multibyte strings.

In the _update method, the UTF-8 string is escaped.
Is that the correct behavior?

Attachments (2)

remove_blank.patch (814 bytes) - added by kondo@… 6 years ago.
namesort.diff (1.3 KB) - added by redboltz 6 years ago.


Change History (28)

comment:1 Changed 7 years ago by athomas

The re.escape() call is necessary to ensure that page names containing characters that are regex operators don't break the regular expression.

Are you running the latest version of Trac? Can you paste a page name that doesn't work?
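
A minimal sketch (the page names are hypothetical, not from the plugin) of why the escaping matters when page names are joined into one alternation:

import re

pages = [u'WikiStart', u'Notes.2012']

unescaped = r'\b(?P<autowiki>%s)\b' % u'|'.join(pages)
escaped = r'\b(?P<autowiki>%s)\b' % u'|'.join(re.escape(p) for p in pages)

# The unescaped '.' matches any character, so an unrelated word is wrongly matched.
print(re.search(unescaped, u'Notes_2012 released') is not None)  # True  (false positive)
print(re.search(escaped, u'Notes_2012 released') is not None)    # False (correct)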

comment:2 follow-up: Changed 7 years ago by kondo@…

I use Trac version 0.10.4.

I inserted "print(self.pages)" into autowikify.py:

    def _all_pages(self):
        self.pages = set(WikiSystem(self.env).get_pages())
        print(self.pages)

Output:

set([u'WikiNewPage', u'\u3042\u3042\u3042\u3042', ...snip...])

For example, the page u'\u3042\u3042\u3042\u3042' doesn't autowikify.

The page name below is the same as u'\u3042\u3042\u3042\u3042'. (Can you see it?)

ああああ

comment:3 Changed 7 years ago by athomas

Yes, thanks! I'll try this out tonight.

comment:4 Changed 6 years ago by kondo@…

I fixed this problem in my local environment.

  • Trac version 0.11.1

The solution consists of two parts.

1. Add the locale flag in the Trac wiki system.

2. Remove the word-boundary marker "\b" from the patterns in the AutoWikify plugin (a rough sketch of the effect follows below).

  • See remove_blank.patch.
  • Japanese words are not separated by blanks.
    • But I think this behavior should be toggleable with an option.
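
A rough illustration (not the plugin's actual code) of the second point: a Japanese page name embedded in ordinary text has no word boundaries around it, so a pattern with \b never matches it.

import re

page = u'\u3042\u3042\u3042\u3042'                      # ああああ
text = u'\u3053\u308c\u306f' + page + u'\u3067\u3059'   # これはああああです

with_b = re.compile(r'\b' + re.escape(page) + r'\b', re.UNICODE)
without_b = re.compile(re.escape(page), re.UNICODE)

print(with_b.search(text) is None)         # True: \b finds no boundary between kana
print(without_b.search(text) is not None)  # True: without \b the embedded name is found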

Changed 6 years ago by kondo@…

comment:5 Changed 6 years ago by cboos

I closed #T7552 as wontfix, since I don't think the problem comes from Trac.

The actual problem is also not related to the way re.escape deals with unicode objects, as I originally thought:

>>> import re
>>> pages = [u"ああああ", 'WikiStart']
>>> pattern = r'\b(?P<autowiki>' + '|'.join([re.escape(page) for page in pages]) + r')\b'
>>> pattern
u'\\b(?P<autowiki>\\\u3042\\\u3042\\\u3042\\\u3042|WikiStart)\\b'
>>> re.search(pattern, u' \u3042\u3042\u3042\u3042 ', re.UNICODE).span()
(1, 5)
>>> re.search(pattern, u' Foo\u3042\u3042\u3042\u3042Bar ', re.UNICODE) is None
True

So the original code should theoretically work.

Ah, I've just seen this: "Japanese words are not separated by blanks."
Well, \b is not supposed to correspond only to blanks, but to "... whitespace or a non-alphanumeric, non-underscore character". Nevertheless, this reminded me that Japanese sentences look like a stream of characters with no obvious separator between words (at least to the untrained eye ;-) ).

So it looks like the intended behavior here is actually to extract a specific sequence of alphanumeric characters that is part of a larger sequence of characters, in the same way one would extract 'WikiStart' from 'TheWikiStartPage' in English. Then I agree that the only way would be to optionally remove the \b markers.

Wouldn't that be enough? I still don't see the motivation for #T7552.

comment:6 Changed 6 years ago by redboltz

The Unicode regex operates correctly.

The reason for using re.LOCALE is as follows.

In the _prepare_rules function, WikiParser builds the regex from the patterns.

The important point is the order of the syntax rules.

The plugin's syntax is placed after the internal syntax.

Example:

(?:(?P<bolditalic>!?''''')|(?P<bold>!?''')|(?P<italic>!?'')|(?P<underline>!?__)|(?P<strike>!?~~)|(?P<subscript>!?,,)|(?P<superscript>!?\^)|(?P<inlinecode>!?\{\{\{(?P<inline>.*?)\}\}\})|(?P<inlinecode2>!?`(?P<inline2>.*?)`)|(?P<i0>!?(?<!/)\b\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))+(?:\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))*[\w/](?<![A-Z0-9_]))+(?:@\d+)?(?:#[\w:](?<!\d)(?:[\w:.-]*[\w-])?)?(?=:(?:\Z|\s)|[^:a-zA-Z]|\s|\Z))|(?P<i1>!?\[\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))+(?:\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))*[\w/](?<![A-Z0-9_]))+(?:@\d+)?(?:#[\w:](?<!\d)(?:[\w:.-]*[\w-])?)?(?=:(?:\Z|\s)|[^:a-zA-Z]|\s|\Z)\s+(?:'[^']+'|\"[^\"]+\"|[^\]]+)\])|(?P<i2>!?\[(?:'[^']+'|\"[^\"]+\")\])|(?P<i3>!?(?<!&)#(?P<it_ticket>[a-zA-Z.+-]*?)\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*)|(?P<i4>!?\[(?P<it_changeset>[a-zA-Z.+-]*?\s*)(?:\d+|[a-fA-F\d]{8,})(?:/[^\]]*)?(?:\?[^\]]*)?(?:#[^\]]*)?\]|(?:\b|!)r\d+\b(?!:\d))|(?P<i5>!?\[(?P<it_log>[a-zA-Z.+-]*?\s*)(?P<log_revs>(?:\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*|(?:\d+|[a-fA-F\d]{8,})))(?P<log_path>[/?][^\]]*)?\])|(?P<i6>(?:\b|!)r\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*\b)|(?P<i7>!?\{(?P<it_report>[a-zA-Z.+-]*?\s*)\d+\})|(?P<i8>(?P<autowiki>PageNameA|あいうえお|ああ)))

The Japanese wiki page name matches the internal syntax (i0) instead of the autowikify syntax (i8).
If the re.LOCALE flag is set, it matches the autowikify syntax (i8) as expected.

Check the code below.

# -*- coding: utf-8 -*-

import re
import locale

locale.setlocale(locale.LC_ALL, 'Japan')  # 'Japan' is the Windows locale name; on Unix use e.g. 'ja_JP.UTF-8'

def replace(fullmatch):
    """Replace one match with its corresponding expansion"""
    replacement = handle_match(fullmatch)
    if replacement:
        return replacement
    # handle_match() below only prints the matched groups, so return the
    # original text to keep re.sub() happy.
    return fullmatch.group(0)

def handle_match(fullmatch):
    for itype, match in fullmatch.groupdict().items():
#       if match and not itype in self.wikiparser.helper_patterns:
        if match:
            # Check for preceding escape character '!'
            print "match:" + itype + "," + match

str = ur"(?:(?P<bolditalic>!?''''')|(?P<bold>!?''')|(?P<italic>!?'')|(?P<underline>!?__)|(?P<strike>!?~~)|(?P<subscript>!?,,)|(?P<superscript>!?\^)|(?P<inlinecode>!?\{\{\{(?P<inline>.*?)\}\}\})|(?P<inlinecode2>!?`(?P<inline2>.*?)`)|(?P<i0>!?(?<!/)\b\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))+(?:\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))*[\w/](?<![A-Z0-9_]))+(?:@\d+)?(?:#[\w:](?<!\d)(?:[\w:.-]*[\w-])?)?(?=:(?:\Z|\s)|[^:a-zA-Z]|\s|\Z))|(?P<i1>!?\[\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))+(?:\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))*[\w/](?<![A-Z0-9_]))+(?:@\d+)?(?:#[\w:](?<!\d)(?:[\w:.-]*[\w-])?)?(?=:(?:\Z|\s)|[^:a-zA-Z]|\s|\Z)\s+(?:'[^']+'|\"[^\"]+\"|[^\]]+)\])|(?P<i2>!?\[(?:'[^']+'|\"[^\"]+\")\])|(?P<i3>!?(?<!&)#(?P<it_ticket>[a-zA-Z.+-]*?)\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*)|(?P<i4>!?\[(?P<it_changeset>[a-zA-Z.+-]*?\s*)(?:\d+|[a-fA-F\d]{8,})(?:/[^\]]*)?(?:\?[^\]]*)?(?:#[^\]]*)?\]|(?:\b|!)r\d+\b(?!:\d))|(?P<i5>!?\[(?P<it_log>[a-zA-Z.+-]*?\s*)(?P<log_revs>(?:\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*|(?:\d+|[a-fA-F\d]{8,})))(?P<log_path>[/?][^\]]*)?\])|(?P<i6>(?:\b|!)r\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*\b)|(?P<i7>!?\{(?P<it_report>[a-zA-Z.+-]*?\s*)\d+\})|(?P<i8>(?P<autowiki>PageNameA|あいうえお|ああ)))"

rules = re.compile(unicode(str), re.UNICODE|re.LOCALE)
print "re.UNICODE|re.LOCALE"

line = u"あいうえお"
result = re.sub(rules, replace, unicode(line))

line = u"PageNameA"
result = re.sub(rules, replace, unicode(line))

line = u"ああ"
result = re.sub(rules, replace, unicode(line))


rules = re.compile(unicode(str), re.UNICODE)
print "re.UNICODE"

line = u"あいうえお"
result = re.sub(rules, replace, unicode(line))

line = u"PageNameA"
result = re.sub(rules, replace, unicode(line))

line = u"ああ"
result = re.sub(rules, replace, unicode(line))

str = ur"(?:(?P<bolditalic>!?''''')|(?P<bold>!?''')|(?P<italic>!?'')|(?P<underline>!?__)|(?P<strike>!?~~)|(?P<subscript>!?,,)|(?P<superscript>!?\^)|(?P<inlinecode>!?\{\{\{(?P<inline>.*?)\}\}\})|(?P<inlinecode2>!?`(?P<inline2>.*?)`)|(?P<i8>(?P<autowiki>PageNameA|あいうえお|ああ))|(?P<i0>!?(?<!/)\b\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))+(?:\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))*[\w/](?<![A-Z0-9_]))+(?:@\d+)?(?:#[\w:](?<!\d)(?:[\w:.-]*[\w-])?)?(?=:(?:\Z|\s)|[^:a-zA-Z]|\s|\Z))|(?P<i1>!?\[\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))+(?:\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))*[\w/](?<![A-Z0-9_]))+(?:@\d+)?(?:#[\w:](?<!\d)(?:[\w:.-]*[\w-])?)?(?=:(?:\Z|\s)|[^:a-zA-Z]|\s|\Z)\s+(?:'[^']+'|\"[^\"]+\"|[^\]]+)\])|(?P<i2>!?\[(?:'[^']+'|\"[^\"]+\")\])|(?P<i3>!?(?<!&)#(?P<it_ticket>[a-zA-Z.+-]*?)\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*)|(?P<i4>!?\[(?P<it_changeset>[a-zA-Z.+-]*?\s*)(?:\d+|[a-fA-F\d]{8,})(?:/[^\]]*)?(?:\?[^\]]*)?(?:#[^\]]*)?\]|(?:\b|!)r\d+\b(?!:\d))|(?P<i5>!?\[(?P<it_log>[a-zA-Z.+-]*?\s*)(?P<log_revs>(?:\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*|(?:\d+|[a-fA-F\d]{8,})))(?P<log_path>[/?][^\]]*)?\])|(?P<i6>(?:\b|!)r\d+(?:[-:]\d+)?(?:,\d+(?:[-:]\d+)?)*\b)|(?P<i7>!?\{(?P<it_report>[a-zA-Z.+-]*?\s*)\d+\}))"

rules = re.compile(unicode(str), re.UNICODE)
print "re.UNICODE and i8 before i0"

line = u"あいうえお"
result = re.sub(rules, replace, unicode(line))

line = u"PageNameA"
result = re.sub(rules, replace, unicode(line))

line = u"ああ"
result = re.sub(rules, replace, unicode(line))

Result.

re.UNICODE|re.LOCALE
match:i8,あいうえお
match:autowiki,あいうえお
match:i8,PageNameA
match:autowiki,PageNameA
match:i8,ああ
match:autowiki,ああ
re.UNICODE
match:i0,あいうえお  <= unexpected
match:i8,PageNameA
match:autowiki,PageNameA
match:i8,ああ
match:autowiki,ああ
re.UNICODE and i8 before i0
match:i8,あいうえお
match:autowiki,あいうえお
match:i8,PageNameA
match:autowiki,PageNameA
match:i8,ああ
match:autowiki,ああ

The Japanese string "ああ" matches the autowikify syntax (i8) under both re.UNICODE and re.UNICODE|re.LOCALE.
But the string "あいうえお" matches it only under re.UNICODE|re.LOCALE.
The behavior depends on the string.

This is probably related to the following note from the Python documentation for re.UNICODE, but I'm not sure:

Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database. New in version 2.0.

Another approach would be to place i8 before i0 when re.LOCALE is not set.
However, I cannot predict the side effects.

Any ideas?


comment:7 Changed 6 years ago by cboos

So if I understand you correctly, when using only the re.UNICODE flag, "あいうえお" is matched as a regular wiki page name and not as an autowiki name. But why should that be an issue, since in both cases you'd get a link to that wiki page?

comment:8 follow-up: Changed 6 years ago by redboltz

I think the Trac internal wiki link doesn't work for the page "あいうえお".
And I checked it.

The AutoWikify plugin's _page_formatter is shown below:

    def _page_formatter(self, f, n, match):
        page = match.group('autowiki')
        return Markup('<a href="%s" class="wiki">%s</a>'
                      % (self.env.href.wiki(page),
                         escape(page)))

It uses the 'autowiki' group.

The Trac internal wiki link system seems to work only for CamelCase names, or when a link is written explicitly, like below:

[wiki:pagename]

Should I make any changes to the plugin?

comment:9 in reply to: ↑ 8 Changed 6 years ago by cboos

Replying to redboltz:

I think the Trac internal wiki link doesn't work for the page "あいうえお".

I'm afraid I'm not able to follow you... In comment:6, you justify the need for re.LOCALE by saying that without that flag, "あいうえお" gets matched by the internal wiki name regexp i0 (that's what:

re.UNICODE
match:i0,あいうえお  <= unexpected

shows).

Now you say that the internal wiki name regexp doesn't work for that page, which I can understand if (and only if) that name is part of some longer sentence, like in "あああああいうえおああああああ" (maybe an actual real example would help here). But in that situation, the autowikify regexp should match, provided the \b markers are dropped as discussed in comment:4 and comment:5.

comment:10 Changed 6 years ago by redboltz

It is possible that I do not understand this well enough either.

I think there are two topics.

It seems that there is a difference in our understanding of the first topic.

  1. Why is matching against 'i0' insufficient?
    1. The function handle_match is called:
          def handle_match(self, fullmatch):
              for itype, match in fullmatch.groupdict().items():
                  if match and not itype in self.wikiparser.helper_patterns:
                      # Check for preceding escape character '!'
                      if match[0] == '!':
                          return escape(match[1:])
                      if itype in self.wikiparser.external_handlers:
      =>                  external_handler = self.wikiparser.external_handlers[itype]
                          return external_handler(self, match, fullmatch)
                      else:
                          internal_handler = getattr(self, '_%s_formatter' % itype)
                          return internal_handler(match, fullmatch)
      
    2. The external_handler for 'i0' is the function wikipagename_link:
              # Regular WikiPageNames
              def wikipagename_link(formatter, match, fullmatch):
                  if not _check_unicode_camelcase(match):
      =>              return match
                  return self._format_link(formatter, 'wiki', match,
                                           self.format_page_name(match),
                                           self.ignore_missing_pages)
      
    3. In wikipagename_link, the Japanese string "ああああ" is judged not to be CamelCase (a rough sketch follows after this list).
    4. As a result, it does not become a wiki link, and the plain string "ああああ" is returned.
  2. What behavior is expected when a Japanese wiki page name is part of a longer sentence?
    • In Japanese, it is necessary to match it.
    • In addition, the wiki page names in the regex pattern should be ordered by length.
      • The patch is being made now.
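
A rough approximation (not Trac's actual _check_unicode_camelcase, just an illustration of the idea) of why an all-kana name fails the CamelCase test: kana characters have no upper/lower case, so the "uppercase followed by lowercase" shape can never be satisfied.

def looks_like_camelcase(name):
    # Hypothetical stand-in for Trac's CamelCase check, for illustration only.
    return (len(name) >= 2
            and name[0].isupper()
            and any(c.islower() for c in name)
            and any(c.isupper() for c in name[1:]))

print(looks_like_camelcase(u'WikiStart'))                 # True
print(looks_like_camelcase(u'\u3042\u3042\u3042\u3042'))  # False: kana are caseless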

Changed 6 years ago by redboltz

comment:11 Changed 6 years ago by redboltz

I said:

In addition, the wiki page names in the regex pattern should be ordered by length.
The patch is being made now.

I made the patch (namesort.diff).

This works correctly, though there might be room for further improvement in the use of Python's collection classes.
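
A quick illustration (hypothetical page names) of why the alternation should list longer names first: Python's re module takes the first alternative that matches at a position, not the longest one.

import re

pages = [u'\u3042\u3042', u'\u3042\u3042\u3042\u3042']   # ああ, ああああ
by_insertion = u'|'.join(pages)
by_length = u'|'.join(sorted(pages, key=len, reverse=True))

text = u'\u3042\u3042\u3042\u3042'                        # ああああ
print(len(re.match(by_insertion, text).group()))  # 2: only the shorter name matched
print(len(re.match(by_length, text).group()))     # 4: the full page name matched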

comment:12 Changed 2 years ago by rjollos

  • Cc rjollos added

comment:13 in reply to: ↑ 2 ; follow-up: Changed 2 years ago by rjollos

  • Cc jun66j5 added

Replying to kondo@t.email.ne.jp:

For example, the page u'\u3042\u3042\u3042\u3042' doesn't autowikify.

I've tested AutoWikifyPlugin at r11819 with Trac 0.11, and a wiki page named ああああ is NOT autowikified (this is a clean Trac install with Genshi 0.6.0). On a clean Trac 0.12 install with Genshi 0.6.0 but no Babel, the wiki page named ああああ is autowikified. The same behavior is seen for other wiki page names that contain Unicode characters, such as ÄÄÄÄ.

The page name below is the same as u'\u3042\u3042\u3042\u3042'. (Can you see it?)

ああああ

I don't see attachment:namesort.diff as the solution. The problem with removing the word boundaries from the regex is that, on a Trac install with a wiki page named aaaa, the unrelated word aaaaa would be rendered with a link on its first four characters.
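
A small illustration (hypothetical page names) of that side effect: without \b, the page name aaaa matches inside the unrelated word aaaaa.

import re

no_boundary = re.compile(u'(?P<autowiki>aaaa)')
with_boundary = re.compile(u'\\b(?P<autowiki>aaaa)\\b')

print(no_boundary.sub(u'[\\g<autowiki>]', u'aaaaa'))    # [aaaa]a -- partial, unwanted link
print(with_boundary.sub(u'[\\g<autowiki>]', u'aaaaa'))  # aaaaa  -- left untouched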

So, two issues remain:

  • Why does this work with Trac 0.12 but not 0.11? I'm not too concerned about this, and would just suggest that anyone experiencing the problem upgrade.
  • comment:4 says that Japanese words are not separated by blanks. I'm not sure how to deal with that issue, because I don't think we want to remove the word boundaries from the regex. We could add an option for that, but someone who understands the language and Python locale issues should deal with it; I'd likely just make a mess of the situation.

So I'll leave this ticket open for now, but anyone experiencing issues should first upgrade to r11819 or later.

comment:14 in reply to: ↑ 13 Changed 2 years ago by redboltz

rjollos,

Thank you for your reply. I agree with you. My patch introduces the side effect that you mentioned above. Providing an option is an acceptable solution for me, but Python locale issues are difficult to deal with. I support your decision.

comment:15 follow-up: Changed 2 years ago by rjollos

I'm pleasantly surprised to get a reply on such an old ticket :)

What version of Trac are you running? Are you able to test out the latest version of AutoWikifyPlugin?

comment:16 in reply to: ↑ 15 ; follow-up: Changed 2 years ago by redboltz

Replying to rjollos:

I'm pleasantly surprised to get a reply on such an old ticket :)

What version of Trac are you running? Are you able to test out the latest version of AutoWikifyPlugin?

I'm using TracLightning.
http://sourceforge.jp/projects/traclight/releases/53615

It is a Japanese translation version based on Trac 0.12.2, a kind of all-in-one package for Windows. It also includes the AutoWikify plugin, and my patch is included in this package; I only just realized that.
http://sourceforge.jp/ticket/browse.php?group_id=2810&tid=14661

For testing, I replaced the TracLightning version of AutoWikify with the trac-hacks trunk version.

It works correctly for English wiki pages, but it doesn't link automatically to Japanese wiki pages.

comment:17 in reply to: ↑ 16 ; follow-up: Changed 2 years ago by rjollos

Replying to redboltz:

It works correctly for English wiki pages, but it doesn't link automatically to Japanese wiki pages.

Just to be sure, is the space-delimited word issue the only problem? That is, if the name of the Japanese wiki page is surrounded by whitespace, does it link okay for you with the latest version of the plugin?

comment:18 in reply to: ↑ 17 ; follow-up: Changed 2 years ago by redboltz

Replying to rjollos:

Replying to redboltz:

It works correctly for English wiki pages, but it doesn't link automatically to Japanese wiki pages.

Just to be sure, is the space-delimited word issue the only problem? That is, if the name of the Japanese wiki page is surrounded by whitespace, does it link okay for you with the latest version of the plugin?

Ah, I understand what you mean now. I tested it just now: a space-delimited Japanese wiki page name is autowikified correctly.

comment:19 in reply to: ↑ 18 Changed 2 years ago by rjollos

Replying to redboltz:

Ah, I understand what you mean now. I tested it just now: a space-delimited Japanese wiki page name is autowikified correctly.

Okay, thanks a lot for testing. I'll make sure we have a solution within a week. If nothing else, I'll just add an option for specifying whether the word boundaries are applied. Better still, we might be able to have the locale determine this implicitly. Best of all, Japanese Trac developer jun66j5 will chime in and tell us what the best solution is ;)

comment:20 Changed 2 years ago by rjollos

  • Owner changed from athomas to rjollos
  • Status changed from new to assigned

comment:21 Changed 2 years ago by jun66j5

I worked on this in https://github.com/jun66j5/autowikifyplugin/tree/ticket2252/no-boundary-if-cjk-blocks.

If the leading or trailing character of a page name is a CJK character, it generates the regexp without \b on that side.
For details, please see the unit tests.

Leading   Trailing   regexp
non-CJK   non-CJK    \b{page-name}\b
CJK       non-CJK    {page-name}\b
non-CJK   CJK        \b{page-name}
CJK       CJK        {page-name}

I don't think it's the best solution; however, I think it works well in most cases.
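
A minimal sketch (not the plugin's actual code) of the approach described above: keep \b only on the sides of a page name that do not start or end with a CJK character. The CJK test here is a rough Unicode-range check, assumed for illustration rather than taken from the branch.

import re

def _is_cjk(ch):
    code = ord(ch)
    return (0x3040 <= code <= 0x30ff      # Hiragana and Katakana
            or 0x3400 <= code <= 0x9fff   # CJK Unified Ideographs (incl. ext. A)
            or 0xf900 <= code <= 0xfaff)  # CJK Compatibility Ideographs

def page_pattern(name):
    head = u'' if _is_cjk(name[0]) else u'\\b'
    tail = u'' if _is_cjk(name[-1]) else u'\\b'
    return head + re.escape(name) + tail

text = u'\u3053\u308c\u306f\u3042\u3042\u3042\u3042\u306e\u30da\u30fc\u30b8'  # これはああああのページ
print(re.search(page_pattern(u'\u3042\u3042\u3042\u3042'), text, re.UNICODE) is not None)  # True
print(re.search(page_pattern(u'aaaa'), u'aaaaa', re.UNICODE) is None)                      # True: \b kept for non-CJK names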

comment:22 Changed 2 years ago by rjollos

(In [11843]) Refs #2252: Refactored, in preparation for applying Jun's patch to support Japanese wiki page names.

comment:23 Changed 2 years ago by rjollos

Jun, thanks for the patch. I'm still trying to understand it completely. I gave you commit access in case you want to push the changes yourself, otherwise I'll get to it sometime this weekend.

comment:24 follow-up: Changed 2 years ago by jun66j5

Thanks, Ryan!

I would like to push it myself. Could you please grant me the permission?

comment:25 in reply to: ↑ 24 Changed 2 years ago by rjollos

Replying to jun66j5:

I would like to push it myself. Could you please grant me the permission?

Sure, I added you for w-access to the autowikifyplugin path :)

comment:26 Changed 2 years ago by jun66j5

  • Resolution set to fixed
  • Status changed from assigned to closed

(In [11904]) fixed #2252: autowikify works with CJK wiki name
