Opened 4 years ago

Closed 4 years ago

Last modified 3 years ago

#8266 closed defect (fixed)

If acronym contains a hyphen, it is not linked to correct page

Reported by: rjollos
Owned by: rjollos
Priority: normal
Component: AcronymsPlugin
Severity: normal
Keywords: unicode
Cc: hasienda,…
Trac Release: 0.11


As described in comment:3#857, the following example:

||XXX    || XXX Page     || XXXPage    || ||
||YYY-XXX|| YYY-XXX Page || YYY-XXXPage|| ||

Results in:

Attachments (1)

HyphenExample.png (8.2 KB) - added by rjollos 4 years ago.


Change History (9)

Changed 4 years ago by rjollos

comment:1 Changed 4 years ago by rjollos

  • Owner changed from athomas to rjollos
  • Status changed from new to assigned

comment:2 Changed 4 years ago by rjollos

I've traced this to the regular expression, which does not match hyphenated acronyms such as YYY-XXX. The expression needs to be modified:

valid_acronym = re.compile('^\w+$')
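
A minimal sketch of the mismatch, written for Python 3 and using the acronyms from the example table. Only the pattern comes from the plugin; the variable names `plain_ok` and `hyphen_ok` are illustrative.

```python
import re

# Pattern quoted above, written here as a raw string.
valid_acronym = re.compile(r'^\w+$')

# \w covers letters, digits and the underscore, so a hyphen breaks the match.
plain_ok = valid_acronym.match('XXX') is not None
hyphen_ok = valid_acronym.match('YYY-XXX') is not None

print(plain_ok, hyphen_ok)
```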

comment:3 follow-up: ↓ 5 Changed 4 years ago by rjollos

  • Cc hasienda added; anonymous removed

I've added the UNICODE flag so that acronyms with unicode characters classified as alphanumeric will be matched. An alternative would be to set the LOCALE flag, in which case characters classified as alphanumeric in the environment's locale would be matched. I'm not sure which is better.

hasienda, is this something you'd like to test out, since you have done a lot of work with locales?
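
As a rough illustration of what Unicode-aware matching changes, here is a sketch in Python 3. There, Unicode matching is already the default for text patterns, so `re.ASCII` stands in for the behaviour without the UNICODE flag. The sample string is made up for the example.

```python
import re

# For str patterns in Python 3, \w is Unicode-aware by default;
# re.ASCII restricts it to [a-zA-Z0-9_].
unicode_word = re.compile(r'^\w+$')
ascii_word = re.compile(r'^\w+$', re.ASCII)

sample = 'Größe'  # hypothetical acronym containing non-ASCII letters

unicode_match = unicode_word.match(sample) is not None
ascii_match = ascii_word.match(sample) is not None

print(unicode_match, ascii_match)
```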

comment:4 Changed 4 years ago by rjollos

(In [9585]) Refs #8266:

  • Some minor refactoring.
  • Set the UNICODE flag when compiling the regular expression used to match acronyms.

comment:5 in reply to: ↑ 3 Changed 4 years ago by hasienda

  • Keywords unicode added

Replying to rjollos:

> hasienda, is this something you'd like to test out, since you have done a lot of work with locales?

Will do and report back here; thank you for the hint.

comment:6 Changed 4 years ago by rjollos

  • Cc… added

Received a hint about this in #5938, and will submit a fix shortly.

comment:7 Changed 4 years ago by rjollos

  • Resolution set to fixed
  • Status changed from assigned to closed

(In [9662]) Use \S in the regular expression that extracts acronym definitions from the /wiki/acronym page. \S will match any non-whitespace character, whereas \w only matches alphanumeric characters and the underscore. Fixes #8266.
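
The difference between \w and \S can be sketched on the table rows from the ticket description. The plugin's actual expression is not shown in this ticket, so both patterns and the helper `extract_acronym` below are hypothetical stand-ins (Python 3).

```python
import re

# Hypothetical patterns for the first cell of a "||acronym||..." table row.
# The plugin's real expression is not shown in this ticket.
word_cell = re.compile(r'^\|\|\s*(\w+)\s*\|\|')
nonspace_cell = re.compile(r'^\|\|\s*(\S+?)\s*\|\|')

def extract_acronym(pattern, row):
    """Return the acronym in the first cell of row, or None if it does not match."""
    match = pattern.match(row)
    return match.group(1) if match else None

rows = [
    '||XXX    || XXX Page     || XXXPage    || ||',
    '||YYY-XXX|| YYY-XXX Page || YYY-XXXPage|| ||',
]

with_w = [extract_acronym(word_cell, row) for row in rows]
with_S = [extract_acronym(nonspace_cell, row) for row in rows]

print(with_w)
print(with_S)
```

With \w the hyphenated row is skipped entirely, while \S accepts any run of non-whitespace characters, including hyphens and non-ASCII letters.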

comment:8 Changed 3 years ago by hasienda

I'm going through the issues for this plugin again while preparing for an upcoming Trac application.

The regexp change successfully solved another issue for me: acronyms with Unicode characters such as German umlauts. Before coming to this ticket I had run my own experiments on this. The results confused me: the re.U flag on the r'^\w+$' expression did not produce the expected matches:

Python 2.6.6 (r266:84292, Dec 27 2010, 00:02:40) 
[GCC 4.4.5] on linux2
>>> import re
>>> RE = re.compile(r'^\w+$', re.U)
>>> RE.match('ä')
>>> RE.match('ö')
>>> RE.match('ü')
<_sre.SRE_Match object at 0xb7359d08>
>>> RE.match('ß')
>>> RE.match('Ä')
>>> RE.match('Ö')
>>> RE.match('Ü')

The re.L flag didn't change the matches at all, so the very general \S match is the best option I can see right now. It still troubles me that I may not understand those flags correctly...
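
One possible explanation, offered as an assumption rather than a confirmed diagnosis: in Python 2, unprefixed literals such as 'ü' are byte strings. If the terminal used UTF-8, each umlaut reached the regex engine as two bytes, and re.U classified each byte as a separate code point below 256. The sketch below imitates that in Python 3 by decoding the UTF-8 bytes as Latin-1. The helper `as_python2_bytes` is hypothetical.

```python
import re

word = re.compile(r'^\w+$')  # Unicode-aware by default for str in Python 3

def as_python2_bytes(text):
    """Mimic a UTF-8 byte string in which each byte is read as one code point."""
    return text.encode('utf-8').decode('latin-1')

letters = 'äöüßÄÖÜ'

# Proper text strings: every letter is a word character.
text_results = {ch: word.match(ch) is not None for ch in letters}

# Byte-wise view (the assumed Python 2 situation): 'ü' becomes 'Ã' + '¼',
# and '¼' counts as numeric, while 'ä' becomes 'Ã' + '¤', a currency sign.
byte_results = {ch: word.match(as_python2_bytes(ch)) is not None for ch in letters}

print(text_results)
print(byte_results)
```

Under that assumption only 'ü' matches in the byte-wise view, which agrees with the session above. In Python 2 itself, passing unicode objects (for example u'ä') to a pattern compiled with re.U should avoid the problem.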
