Opened 5 years ago

Last modified 5 years ago

#13534 closed enhancement

better separator handling and case-insensitive — at Version 13

Reported by: clemens Owned by: Ryan J Ollos
Priority: normal Component: TracKeywordsPlugin
Severity: normal Keywords:
Cc: Trac Release: 1.2

Description (last modified by Ryan J Ollos)

Hello

I propose and provide some moderate improvements to the trac_keywords.js file of the TracKeywordsPlugin.

Summary

  1. allow arbitrary keyword separators
  2. better handling for multiple spaces and separators
  3. keyword matching is case-insensitive
  4. avoid duplicate insertion of keyword section
  5. add some comments

Patch

Separators

The idea is not only to allow separation of keywords by SPACE and coma as previously, but also semi-colon, dash etc. I make use of the \W regex term to match all "non-word" characters as separators.

My changes also provide better handling in cases where multiple separators are used, for example bla,,,,blo.

Case-insensitive

The previous version did not consider a keyword if the case did not match, i.e. it was case-sensitive. One can discuss about it, but for me it is more useable and convenient if it behaves case-insensitive. The keywords BLA, bla, Bla and bLa should be treated as the same.

I am aware that other people may not share my opinion and might like to have case-sensitive. In this case it is easy to have a case-sensitive code variant. Look for my code comments about case-insensitive and toLowerCase and remove the corresponding code.

Avoid Duplicates

I added a simple check to avoid duplicate insertion of the keyword section. This is because I came across a duplicate when I used the browser "save"-feature. The saved HTML page ended up having with two keyword sections (one was static HTML, the other dynamic from JS). My simple check avoids this.

Test case

The following test case uses:

  • upper and lower case keywords
  • numbers and letters
  • different separators
  • multiple separators
  • trailing separators

power, Other A2, C3, double double more double, slave;radio,.+/:other     eol-----epa.....

Clemens

Change History (25)

Changed 5 years ago by clemens

Attachment: trac_keywords.js added

updated source code

Changed 5 years ago by clemens

Attachment: trac_keywords_20190214.diff added

patch file

comment:1 Changed 5 years ago by Jun Omae

Incorrect patch. At least, uses of \W break keyword with unicode characters. See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions.

comment:2 in reply to:  1 Changed 5 years ago by anonymous

Replying to Jun Omae:

Incorrect patch. At least, uses of \W break keyword with unicode characters.

Dear Jun

Can you please give me your test cases (especially those with unicode.

Is it possible or not to use Unicode in the trac.ini?
In other words will one be able to define unicode keywords for the TracKeywordsPlugin or not?

Indeed I was not aware about Unicode and the fact that for Javascript \W matches any non-word character. Equivalent to [^A-Za-z0-9_]. Unicode characters will appear as "non-word character".

I will think about a solution. My idea about \W was to match as separator. I may replace it with a dedicated collection of separators, something like [ ;,-/.:|]. I'll have to test this...

Clemens

comment:3 in reply to:  1 Changed 5 years ago by clemens

About Unicode

Jun pointed out that my above patch from 2019-02-14 does not work with unicode. The main "problem" is that for Javascript \W (i.e. non-word characters) matches also on all unicode characters.

I have reworked (and tested) my code and my new patch is not using \W but the following characters as keyword separators: [ .,;:|]

The TracKeywordsPlugin now handles cases properly if unicode characters appear in the keyword field.

The new test cases also include unicode:

  • upper and lower case keywords
  • numbers and letters
  • different separators
  • multiple separators
  • trailing separators
  • umlauts
  • unicode

power, Other A2|C3, Ωףۺ,😮, 😲, double double more double, Umläüts, slave;radio, .+/:other eol-----epa.....

Be aware that the TracKeywordsPlugin does not really support to declare keywords (in the trac.ini file) which contain special unicode characters. For example you cannot use the TracKeywordsPlugin to declare a "smily" 😲 as toggle field. However, this restriction is not due to my patch, it has been there also before. In fact I am not even sure if it is possible to use unicode in the trac.ini file.

Changed 5 years ago by clemens

Attachment: trac_keywords_03032019.js added

reworked source code for better handling of unicode

Changed 5 years ago by clemens

Attachment: trac_keywords_20190303.js added

reworked source code for better handling of unicode (correct file name)

Changed 5 years ago by clemens

Attachment: trac_keywords_20190303.diff added

reworked patch file for better handling of unicode

comment:4 Changed 5 years ago by Ryan J Ollos

Owner: set to Ryan J Ollos
Status: newaccepted

comment:5 Changed 5 years ago by Jun Omae

The patch has the following line:

    var newval = orig.replace(new RegExp('\\b' + w + '\\b'), '');

However, the replacement works unexpectedly.

> 'blah tracking trac-hacks trac hacks'.replace(/\btrac\b/,'')
< "blah tracking -hacks trac hacks"

I think the line should be like:

    var newval = $.grep(orig.split(/[;,\s]+/), function(v) { return v !== w }).join(' ');

Another thing, w should be escaped if RegExp is used.

Changed 5 years ago by clemens

Attachment: trac_keywords_20190307.diff added

repaired patch file (sorry my previous one was inversed)

comment:6 Changed 5 years ago by clemens

Hello Jun and Ryan

Thanks for looking at my patch. I have to aplogize, that I made a odd mistake: By mistake my patchfile attachment:trac_keywords_20190303.diff is inversed, please do not use. Instead please use these files:

This is the original code:

var newval = orig.replace(new RegExp('\\b' + w + '\\b'), '');

... this the patched code:

var newval = orig.replace(new RegExp('[ .,;:|]*' + w + '\\b','i'), '');

Sorry for mixing up things!


Some comments about your feedback...

  var newval = $.grep(orig.split(/[;,\s]+/), function(v) { return v !== w }).join(' ');

As far as I understand this "split-join" code it works as follows:

  1. split the text at pre-defined separators ;,\s
  2. remove the searched keyword by replacing any occurance with "nothing"
  3. re-assemble the text using SPACE as separator

This is rather clever. But I am not sure if we should change the keyword text globally. Who says that SPACE is the "right" separator? People may like to use their own separator characters. I like SPACE, but for example the TRAC ticket batch-modify feature inserts commas for keyword modification.
The old approach on the other side modifies the the keyword text only locally i.e. only as much as necessary. All other parts of the text remain untouched - no matter how confuse they are.

The minus/hyphen character - is not part of my proposed separator list [ .,;:|]. We may add it. My approach was to use only a very restricted set of characters, which leave no doubt that they are intendet to be keyword separators. I am not sure if it is recommended to use keywords with hyphen e.g. trac-hacks. To avoid difficulties with ticket queries, my personal habit is to use only single word keywords, avoid hyphenated words like "trac-hack" or even phrases like "version control". I may add some more test cases to the TracKeywordsPlugin.

If a keyword appears double or multiple times (like C3, double double more double in my test case), then my proposed plug-in code will remove only one after the other one, not all at once. I believe this is a feature. People may have a crazy reason for why to repeat the keyword. However, it does not make much sense to use double or multiple keywords - it may be a rare special case.

I am not an JavaScript expert, but I am not convinced that w need to be escaped in this line:

var newval = orig.replace(new RegExp('[ .,;:|]*' + w + '\\b','i'), '');

w is a JavaScript variable and not a string literal like '\\b'. Also in the original code (without my patch) w appears non-escaped and it works.

Clemens

comment:7 in reply to:  6 ; Changed 5 years ago by Jun Omae

Some comments about your feedback...

  var newval = $.grep(orig.split(/[;,\s]+/), function(v) { return v !== w }).join(' ');

As far as I understand this "split-join" code it works as follows:

  1. split the text at pre-defined separators ;,\s
  2. remove the searched keyword by replacing any occurance with "nothing"
  3. re-assemble the text using SPACE as separator

This is rather clever. But I am not sure if we should change the keyword text globally. Who says that SPACE is the "right" separator?

At least, Trac splits the keywords using [;,\s]+ and renders the words which are concatenated with space character at trac:source:/tags/trac-1.2.3/trac/ticket/web_ui.py@:1529,1531#L1521.

I am not an JavaScript expert, but I am not convinced that w need to be escaped in this line:

var newval = orig.replace(new RegExp('[ .,;:|]*' + w + '\\b','i'), '');

w is a JavaScript variable and not a string literal like '\\b'. Also in the original code (without my patch) w appears non-escaped and it works.

I don't consider \b should be used.

  1. Add the following words to [keywords] section in trac.ini.
    [keywords]
    trac =
    trac-hacks =
    hacks =
    
  2. Visit newticket page.
  3. Click trac-hacks in Add Keywords
    • keywords field should be trac-hacks
  4. Click hacks in Add Keywords
    • keywords field should be trac-hacks hacks but trac-.

Reason to escape the w variable:

  1. Add svn+ssh to [keywords] section.
  2. Visit newticket page.
  3. Click svn+ssh in Add Keywords
    • svn+ssh should be toggled but the keyword is repeatedly added.

Patch to escape:

  • trackeywords/htdocs/trac_keywords.js

    old new  
    77    var el = document.getElementById(field_id);
    88    var orig = el.value;
    99    // remove the keyword including white spaces and separator, match case-insensitive
    10     var newval = orig.replace(new RegExp('[ .,;:|]*' + w + '\\b','i'), '');
     10    var newval = orig.replace(new RegExp('[ .,;:|]*' + regexp_escape(w) + '\\b','i'), '');
    1111    var link = document.getElementById('trac-keyword-' + w);
    1212    if (orig != newval) { // remove tag.
    1313        if(link) link.className = '';
     
    1919    el.value = newval.replace(/^[ .,;:|]+|[ .,;:|]+$/, '');
    2020  }
    2121
     22  function regexp_escape(text) {
     23    // Escape characters except hyphon, comma, underscore, 0-9, A-Z and a-z
     24    return text.replace(/[^-,_0-9A-Za-z]/g, '\\$&');
     25  }
     26
    2227  var $fieldset = $('<fieldset id="trac-keywords" />');
    2328  $fieldset.append('<legend>Add Keywords</legend>')
    2429  $ul = $('<ul></ul>');

Changed 5 years ago by clemens

Attachment: trac_keywords_20190308.diff added

Changed 5 years ago by clemens

Attachment: trac_keywords_20190308.js added

comment:8 in reply to:  7 ; Changed 5 years ago by clemens

Hello Jun

Thanks for your explanation about the escape thing. I have integrated your regexp_escape() function. And it works fine.

I have also inserted your "split-join" code. Basically it works, but see my issue about case-insensitive below. We could remove the regexp_escape() function again.

I have changed the code to allow only [;,\s] as separators. We can negotiate about more separators. I am happy with those. Things like trac-hacks or svn+ssh work now.

We may explain for our users in the Wiki plug-in documentation,

  • that the plug-in accepts [;,\s] as separators,
  • that the plug-in re-assembles the text and replaces all separators with SPACE. This may be seen as feature not as problem.

There is one last thing which is very important for me: case-insensitive. Your "split-join" code is not case-insensitive. Can you please propose an advanced version, which (at least optionally) works case-insensitive?

  var newval = $.grep(orig.split(/[;,\s]+/), function(v) { return v !== w }).join(' ');

Clemens

comment:9 in reply to:  8 ; Changed 5 years ago by Jun Omae

Replying to clemens:

There is one last thing which is very important for me: case-insensitive. Your "split-join" code is not case-insensitive. Can you please propose an advanced version, which (at least optionally) works case-insensitive?

  var newval = $.grep(orig.split(/[;,\s]+/), function(v) { return v !== w }).join(' ');

Could you please try the following changes?

-    var newval = $.grep(orig.split(/[;,\s]+/), function(v) { return v !== w }).join(' ');
+    var filter = function(v) {
+        return v.length !== w.length || v.toLowerCase() !== w.toLowerCase();
+    };
+    var newval = $.grep(orig.split(/[;,\s]+/), filter).join(' ');

Changed 5 years ago by clemens

Attachment: trac_keywords_20190309.js added

works well now

Changed 5 years ago by clemens

Attachment: trac_keywords_20190309.diff added

works well now

Changed 5 years ago by clemens

removed regexp_escape function

Changed 5 years ago by clemens

Attachment: trac_keywords_20190309b.js added

removed regexp_escape function

comment:10 in reply to:  9 Changed 5 years ago by clemens

Replying to Jun Omae:

Could you please try the following changes?

-    var newval = $.grep(orig.split(/[;,\s]+/), function(v) { return v !== w }).join(' ');
+    var filter = function(v) {
+        return v.length !== w.length || v.toLowerCase() !== w.toLowerCase();
+    };
+    var newval = $.grep(orig.split(/[;,\s]+/), filter).join(' ');

Thanks, I tested your improved "split-join" code and it works very well! Keywords are now handled case-insensitive.

I removed the regexp_escape() function. We do not need it anymore.

  function regexp_escape(text) {
    // Escape special regex characters (except hyphen, comma, underscore, 0-9, A-Z and a-z)
    return text.replace(/[^-,_0-9A-Za-z]/g, '\\$&');
  }

Here is the new code:

Please find below the summary of improvement which we reached so far. In my opinion a major step forward:

  1. Whitespace, comma and semicolom i.e. [;,\s] are allowed as keyword separators.
    (Previously only SPACE worked well.)
  2. Keyword matching is case-insensitive now. This is a real advantage, because often users use their arbitary case stlye, especially for acronyms.
    (Previously "HTML" and "html" and "Html" were treated differently.)
  3. Better handling for multiple spaces and separators.
    (Previously things like bla, , blo were not treated well.)
  4. Keywords with special characters (unicode and umlauts) are supported now. Great improvement for many non-English users.
    (Previously the use of \b regex operator did not work well with special characters.)
  5. Avoid duplicate insertion of keyword section. This is useful if the ticket is saved to static HTML file.
    (Previously the "keyword" section appeared twice in a static HTML file.)
  6. More code comments in scope of good engineering practice.

There is one last thing, which might be desireable: It would be nice if the keyword list ("trac-keywords" fieldset) could auto-update if someone would manually edit the keyword field. However, I have no idea how to accomplish this and I am afraid that this feature would require quite much effort. I can pretty well live without it. Unless somebody has a brilliant idea I would propose to continue with the existing improvements...

comment:11 Changed 5 years ago by Ryan J Ollos

No need to attach the source file, just the diff is all we need.

comment:12 Changed 5 years ago by clemens

Hello

The new plug-in code is running on our local TRAC installation. It works well. Everybody is happy with what we have reached.

Is there anything more I can help to promote this ticket?

Clemens

comment:13 Changed 5 years ago by Ryan J Ollos

Description: modified (diff)
Note: See TracTickets for help on using tickets.