-
-
Notifications
You must be signed in to change notification settings - Fork 16
make get_tld return only eTLDs in the publicsuffix list #19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -212,8 +212,8 @@ def test_get_sld_exceptions(self): | |
| def test_get_sld_no_wildcard(self): | ||
| psl = publicsuffix.PublicSuffixList() | ||
| # test completion when no wildcards should be processed | ||
| assert 'com.pg' == psl.get_sld('telinet.com.pg', wildcard=False) | ||
| expected = 'ap-southeast-1.elb.amazonaws.com' | ||
| assert 'pg' == psl.get_sld('telinet.com.pg', wildcard=False) | ||
| expected = 'amazonaws.com' | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i think this test was made when '*.amazonaws.com' was in the PSL -- so like Vadym's tests we should probably use a set list that isn't changing. In that case, same logic would happen for this. the wildcard goes away, and the item is processed as a suffix. I guess the readme isn't clear enough on this? anyhow, currently this is not wildcarded in the PSL so it looks wrong, but the test should be on a fixed example input.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I checked and it seems that |
||
| result = psl.get_sld('blah.ap-southeast-1.elb.amazonaws.com', wildcard=False) | ||
| assert expected == result | ||
|
|
||
|
|
@@ -226,20 +226,27 @@ def test_get_sld_top_convenience_function_is_the_same_as_PublicSuffixList_method | |
| def test_get_tld_returns_correct_tld_or_etld(self): | ||
| psl = publicsuffix.PublicSuffixList() | ||
| assert 'com' == psl.get_tld('com') | ||
| assert 'ck' == psl.get_tld('ck') | ||
| assert 'kobe.jp' == psl.get_tld('city.kobe.jp') | ||
| assert 'kobe.jp' == psl.get_tld('kobe.jp') | ||
| assert 'amazonaws.com' == psl.get_tld('amazonaws.com') | ||
| assert 'jp' == psl.get_tld('kobe.jp') | ||
| assert 'com' == psl.get_tld('amazonaws.com') | ||
|
Comment on lines
+231
to
+232
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Neither
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. reordered for clarity... *.kobe.jp was in the psl at some point, as are several subdomains of amazon.aws. the public suffix list does change regularly, so we have to take some care that we aren't dependent on creating tests that depend on something we don't control. the behavior should be independent of the actual PSL. Also, as we noted in Vadym's pull request, some of the unit tests on the mozilla site were faulty (didn't follow the stated algorithm).
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. our tests shouldn't be reliant on moving data. we should use a fixed test file.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I found that we already use static data for testing. https://github.com/nexB/python-publicsuffix2/blob/develop/src/publicsuffix2/public_suffix_list.dat |
||
| assert 'com.pg' == psl.get_tld('telinet.com.pg', wildcard=True) | ||
| assert 'pg' == psl.get_tld('telinet.com.pg', wildcard=False) | ||
| assert None == psl.get_tld('telinet.com.pg', wildcard=False) | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We can't use
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. not sure i agree here. certain companies remove the wildcards in the PSL so they can reduce the variety of output. if *.pg is in the list, which it usually is, then, when they remove the wildcard, pg will be in the list. It doesn't remove the entry itself, just the wildcard. we could argue that users shouldn't use the list this way (but they do), and i don't think we want to have a situation where a user puts in a legitimate domain and gets None for the tld. In this particular case, e.g., 'pg' is a real TLD, so it should never return None.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks to this, I now understand the semantics of |
||
| assert 'com.pg' == psl.get_tld('com.pg', wildcard=True) | ||
| assert 'pg' == psl.get_tld('com.pg', wildcard=False) | ||
| assert None == psl.get_tld('com.pg', wildcard=False) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we should never return none for a valid TLD. |
||
| assert 'co.uk' == psl.get_tld('telinet.co.uk', wildcard=False) | ||
| assert 'co.uk' == psl.get_tld('co.uk', wildcard=True) | ||
| assert 'co.uk' == psl.get_tld('co.uk', wildcard=False) | ||
| assert None == psl.get_tld('blah.local', strict=True) | ||
| assert None == psl.get_tld('blah.local', wildcard=False) | ||
| assert 'local' == psl.get_tld('blah.local') | ||
|
|
||
| # See #18 | ||
| assert 'git-pages.rit.edu' == psl.get_tld('git-pages.rit.edu') | ||
| assert 'edu' == psl.get_tld('rit.edu') | ||
| assert 'edu' == psl.get_tld('edu') | ||
| assert 'com' == psl.get_tld('cdn.fbsbx.com') | ||
|
|
||
| def test_get_tld_returns_correct_tld_or_etld_for_fqdn(self): | ||
| psl = publicsuffix.PublicSuffixList() | ||
| assert 'com' == psl.get_tld('www.foo.com.') | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't agree with this one. when no wildcards are used (or totally regardless of the PSL at all), pg is a TLD. and so the sld is going to be one level up, i.e., 'com.pg'. It's never the case that 'pg' is an SLD.