Opened 10 years ago
Closed 8 years ago
#12058 closed defect (fixed)
Bad email default regexp
Reported by: | Dirk Stöcker | Owned by: | Steffen Hoffmann |
---|---|---|---|
Priority: | normal | Component: | AccountManagerPlugin |
Severity: | normal | Keywords: | registration antirobot |
Cc: | Trac Release: | 0.11 |
Description
Default email regexp is:
(?i)^[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}$
it should be
(?i)^[A-Z0-9._%+-]+@(?:[A-Z0-9.-]+\.)+[A-Z]{2,6}$
or otherwise subdomains are not possible.
This still prevents Umlauts like äöü and so on. You can reach me e.g. with dstöcker.de domain :-)
Attachments (0)
Change History (15)
comment:2 Changed 10 years ago by
Regarding subdomain: a space at the end of the mail prevented matching, not the subdomain. Sorry for that. Still jun66j5's comment stays valid and other valid UTF-8 characters.
comment:3 follow-up: 4 Changed 10 years ago by
dstöcker.de
This is way off-standard, isn't it? Even if current standards allow it without question, where does this discussion end in UTF-8 land? Allow only German an Japanese chars, but no Russian ones? You're certainly able to do that. However I think, that an admin should open up such degrees of freedom only after very careful consideration.
Speaking of the subject, SPAM prevention, there are likely more UTF-8 spam email addresses than legitimate ones, not even to speak of the total of real users, that only own an email address with UTF-8 chars.
Anyway, Jun's change proposal seems valid, and I'll stick to it, if there is not much magic coming from UTF-8 wizards in time.
comment:4 follow-up: 5 Changed 10 years ago by
Replying to hasienda:
dstöcker.de
This is way off-standard, isn't it? Even if current standards allow it without question, where does this discussion end in UTF-8 land? Allow only German an Japanese chars, but no Russian ones? You're certainly able to do that. However I think, that an admin should open up such degrees of freedom only after very careful consideration.
Well, world moved on. Still it is a bad hack, as the domain is converted with Punycode to ASCII first ("xn--dstcker-c1a.de") before using. No full UTF-8 support for domains yet and probably never coming. And there are different versions of the rulesets how to do that.
Because of these troubles I did not yet really use my umlaut domains at all for serious work and also never have seen any real use elsewhere. :-) But I wanted to mention it.
comment:5 Changed 10 years ago by
comment:6 Changed 10 years ago by
Keywords: | registration antirobot added |
---|---|
Status: | new → accepted |
Summary: | Bad emaill regexp → Bad email default regexp |
Trac Release: | → 0.11 |
comment:8 follow-up: 11 Changed 10 years ago by
Would the changes from this ticket be useful for updating EMAIL_LOOKALIKE_PATTERN in Trac? That pattern is used for finding email addresses in wiki content and extracting email address from a list of recipients.
comment:9 follow-up: 10 Changed 10 years ago by
Maybe I did initially start my development from that pattern, and I searched for such similar occurrences but failed to find them.
Keep in mind that not even admins can change such built-ins easily (without cumbersome source patching). So the status-quo reached and used here might just be enough to go there, extend and enhance it to omake it fully standards-compliant, with UTF-8 that time.
comment:10 Changed 10 years ago by
Replying to hasienda:
Keep in mind that not even admins can change such built-ins easily
I hope it was clear that I meant to propose backporting to Trac ...
(without cumbersome source patching). So the status-quo reached and used here might just be enough to go there, extend and enhance it to omake it fully standards-compliant, with UTF-8 that time.
... which I take from this statement that you understood my intention. I'll wait to see if Jun has a comment about the usefulness of backporting your regex to replace EMAIL_LOOKALIKE_PATTERN
in the core, before opening a ticket at t.e.o.
comment:11 follow-up: 12 Changed 10 years ago by
Replying to rjollos:
Would the changes from this ticket be useful for updating EMAIL_LOOKALIKE_PATTERN in Trac? That pattern is used for finding email addresses in wiki content and extracting email address from a list of recipients.
I don't think that it's need to backport the updates to Trac core. EMAIL_LOOKALIKE_PATTERN
currently matches TLDs with more than 6 octets and internationalized TLDs.
$ ~/venv/trac/1.0.2/bin/python Python 2.5.6 (r256:88840, Oct 21 2014, 22:26:35) [GCC 4.6.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import re >>> from trac.notification import EMAIL_LOOKALIKE_PATTERN >>> re.match(EMAIL_LOOKALIKE_PATTERN, u'foo@example.community', re.UNICODE) <_sre.SRE_Match object at 0x7f93eef039f0> >>> re.match(EMAIL_LOOKALIKE_PATTERN, u'foo@example.xn--zckzah', re.UNICODE) <_sre.SRE_Match object at 0x1a58ed0>
However, I noticed another issue in EMAIL_LOOKALIKE_PATTERN
. The pattern is used \d
with re.UNICODE
. If a TLD has unicode digit characters, it wrongly matches. See http://trac.edgewall.org/demo-1.0/ticket/3692. That is a Trac issue and similar to t:#10678.
>>> re.match(EMAIL_LOOKALIKE_PATTERN, u'foo@example.c௩௪௫om', re.UNICODE) <_sre.SRE_Match object at 0x1a58f38>
Trac core should use [0-9]
rather than \d
in EMAIL_LOOKALIKE_PATTERN
.
- '[a-zA-Z](?:[-a-zA-Z\d]*[a-zA-Z\d])?' # TLD + '[a-zA-Z](?:[-a-zA-Z0-9]*[a-zA-Z0-9])?' # TLD
comment:12 Changed 10 years ago by
Replying to jun66j5:
I don't think that it's need to backport the updates to Trac core.
Your explanation is convincing.
comment:13 Changed 10 years ago by
I shouldn't have said changes in this ticket. I was thinking more broadly than just the TLD in considering differences in EMAIL_LOOKALIKE_PATTERN
.
EMAIL_LOOKALIKE_PATTERN
allows'
in the local part.>>> import re >>> from trac.notification import EMAIL_LOOKALIKE_PATTERN >>> email_regex = r'(?i)^[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z0-9-]{2,63}$' >>> re.match(email_regex, 'user%5@gmail.com') >>> re.match(EMAIL_LOOKALIKE_PATTERN, 'user%5@gmail.com') <_sre.SRE_Match object at 0x7f84a1313f38>
email_regexp
used here%
in local partIt appear even more special characters are allowed in the local part.>>> re.match(EMAIL_LOOKALIKE_PATTERN, 'user%5@gmail.com') >>> import re >>> from trac.notification import EMAIL_LOOKALIKE_PATTERN >>> email_regex = r'(?i)^[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z0-9-]{2,63}$' >>> re.match(email_regex, 'user\'@gmail.com') >>> re.match(EMAIL_LOOKALIKE_PATTERN, 'user\'@gmail.com') <_sre.SRE_Match object at 0x7f84a1313f38>
email_regexp
is simpler and more readable thanEMAIL_LOOKALIKE_PATTERN
IMO.
EMAIL_LOOKALIKE_PATTERN
in Trac should be robust enough that it wouldn't need to be redefined in AccountManagerPlugin when used for the same general purposes. It appears that neither EMAIL_LOOKALIKE_PATTERN
or email_regex
are currently sufficient to match all valid email addresses according to RFC 6531.
The rules for valid syntax appear complex though. It seems like there should be some utility in the email
or other package to handle this for us.
comment:14 Changed 10 years ago by
You're asking for a single pattern to be used inside Trac, and ultimately in Trac plugins too. Reasonable. You're targeting Trac 1.0.3 for this change? Make sure to put a note down here when we could try to an import like
from trac.util.text import EMAIL_PATTERN
I don't think, a pre-compiled pattern ('EMAIL*_RE') is required for most applications.
comment:15 Changed 8 years ago by
Resolution: | → fixed |
---|---|
Status: | accepted → closed |
It seems the former of patterns matches subdomains....
Another considering, seems that max length of TLDs is larger than 6, and digits and
-
characters are used in internationalized TLD. According to domain name syntax, each label of domain name is 1 to 63 octets.