mirror of
https://github.com/zebrajr/ArchiveBox.git
synced 2025-12-07 12:21:30 +01:00
Incorrect hyphen placement in `URL_REGEX` was allowing it to match more characters than intended. In a regex character class, a literal hyphen must appear as the first or last character in the class (or be escaped); otherwise it is interpreted as the delimiter of a range of characters. The issue fixed here caused the range of characters in `[$-_]` to be treated as valid URL characters, instead of the intended set of three characters `[-_$]`.

The incorrect range interpretation inadvertently included most ASCII punctuation, most importantly the angle brackets, square brackets, and single quote that the expression uses to mark the end of a match. This caused the expression to match a URL that has a "hostname" portion beginning with one of the intended "stop parsing" characters. For example:

```
https://<b>www</b>.example.com/    # MATCHES but should not
https://[for example]              # MATCHES but should not
scheme='https://'                  # MATCHES, including final quote, but should not
```

Some test cases have been added to the `URL_REGEX` assert in archivebox.parsers to cover this possibility.
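The range-versus-literal behaviour described in the commit message can be checked in isolation with Python's `re` module. The sketch below is a minimal illustration, not the actual `URL_REGEX` from archivebox: the names `BUGGY_CLASS`, `FIXED_CLASS`, `STOP_CHARS`, and `matched_stop_chars` are hypothetical and exist only for this example. It tests the two character classes against the "stop parsing" characters named above.

```python
import re

# Hypothetical, simplified stand-ins for the two character classes discussed
# in the commit message; this is NOT the real URL_REGEX from archivebox.
BUGGY_CLASS = re.compile(r"[$-_]")  # hyphen in the middle: range '$' (0x24) to '_' (0x5F)
FIXED_CLASS = re.compile(r"[-_$]")  # hyphen first: three literal characters

# The characters the commit message says should end a match.
STOP_CHARS = ["<", ">", "[", "]", "'"]


def matched_stop_chars(pattern):
    """Return the stop characters that the given character class accepts."""
    return [ch for ch in STOP_CHARS if pattern.fullmatch(ch)]


def matched_ascii(pattern):
    """Return every ASCII character that the given character class accepts."""
    return [chr(c) for c in range(128) if pattern.fullmatch(chr(c))]


buggy_hits = matched_stop_chars(BUGGY_CLASS)
fixed_hits = matched_stop_chars(FIXED_CLASS)

print("buggy class accepts stop chars:", buggy_hits)
print("fixed class accepts stop chars:", fixed_hits)
print("fixed class accepts in total:  ", sorted(matched_ascii(FIXED_CLASS)))
```

Under these assumptions, the buggy class accepts all five stop characters because they fall between `$` and `_` in ASCII order, while the fixed class accepts only the three literals `$`, `-`, and `_`.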
||
|---|---|---|
| __init__.py | ||
| generic_html.py | ||
| generic_json.py | ||
| generic_rss.py | ||
| generic_txt.py | ||
| medium_rss.py | ||
| netscape_html.py | ||
| pinboard_rss.py | ||
| pocket_api.py | ||
| pocket_html.py | ||
| shaarli_rss.py | ||
| url_list.py | ||
| wallabag_atom.py | ||