Merge branch 'master' into v0.5.0

Nick Sweeting 2020-08-10 23:36:33 -04:00 committed by GitHub
commit 21c20985c4
119 changed files with 5151 additions and 2000 deletions

.dockerignore

@@ -1,6 +1,16 @@
output
__pycache__
.DS_Store
venv
.venv
data
._*
*.pyc
__pycache__/
.mypy_cache/
venv/
.venv/
.docker-venv/
*.egg-info/
build/
dist/
data/
output/

.flake8 (new file, +6 lines)

@@ -0,0 +1,6 @@
[flake8]
ignore = D100,D101,D102,D103,D104,D105,D202,D203,D205,D400,E131,E241,E252,E266,E272,E701,E731,W293,W503,W291,W391
select = F,E9,W
max-line-length = 130
max-complexity = 10
exclude = migrations,tests,node_modules,vendor,venv,.venv,.venv2,.docker-venv
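For context, the CI workflow added later in this commit runs flake8 against this config; a minimal local run (assuming flake8 is installed) looks like:

```bash
# run from the repo root; flake8 picks up the [flake8] section of .flake8 automatically
pip install flake8
flake8 archivebox --count --statistics
```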


@@ -1 +1,40 @@
Make sure check in with me first or confirm your desired features line up with our roadmap: https://github.com/pirate/ArchiveBox#roadmap
# Contribution Process
1. Confirm your desired features fit into our bigger project goals and roadmap: https://github.com/pirate/ArchiveBox#roadmap
2. Open an issue with your planned implementation to discuss
3. Check in with me before starting development to make sure your work won't conflict with or duplicate existing work
4. Set up your dev environment, make some changes, and test using the test input files
5. Commit, push, and submit a PR and wait for review feedback
6. Have patience, don't abandon your PR! We love contributors but we all have day jobs and don't always have time to respond to notifications instantly. If you want a faster response, ping @theSquashSH on Twitter or Patreon.
**Useful links:**
- https://github.com/pirate/ArchiveBox/issues
- https://github.com/pirate/ArchiveBox/pulls
- https://github.com/pirate/ArchiveBox/wiki/Roadmap
- https://github.com/pirate/ArchiveBox/wiki/Install#manual-setup
### Development Setup
```bash
git clone https://github.com/pirate/ArchiveBox
cd ArchiveBox
# Optionally create a virtualenv
pip install -r requirements.txt
pip install -e .
```
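The "Optionally create a virtualenv" step above is left implicit; a minimal sketch using the stdlib `venv` module (the `.venv/` name matches this repo's ignore files) would be:

```bash
# optional: isolate dependencies in a virtualenv before running the pip installs above
python3 -m venv .venv
source .venv/bin/activate    # on Windows: .venv\Scripts\activate
```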
### Running Tests
```bash
./bin/archive tests/*
# look for errors in stdout/stderr
# then confirm output html looks right
# if on >v0.4 run the django test suite:
archivebox manage test
```
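The GitHub Actions workflow added in this commit runs the packaged test suite with pytest, so the same check can be reproduced locally (mirroring the workflow's install and test steps):

```bash
# same commands as the "Install dependencies" / "Test built package with pytest" CI steps
python -m pip install pytest bottle
python -m pytest -s
```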
### Getting Help
Open issues on Github or contact me https://sweeting.me/#contact.

.github/FUNDING.yml (vendored, new file, +3 lines)

@@ -0,0 +1,3 @@
github: pirate
patreon: theSquashSH
custom: ["https://paypal.me/NicholasSweeting", "https://www.blockchain.com/eth/address/0x5D4c34D4a121Fe08d1dDB7969F07550f2dB9f471", "https://www.blockchain.com/btc/address/1HuxXriPE2Bbnag3jJrqa3bkNHrs297dYH"]


@@ -1,30 +1,41 @@
---
name: 🐞 Bug report
about: Create a report to help us improve
title: ''
labels: ''
title: 'Bugfix: ...'
labels: 'changes: bugfixes'
assignees: ''
---
(please fill out the following information, feel free to delete sections if they're not applicable)
<!--
Please fill out the following information,
feel free to delete sections if they're not applicable
or if long issue templates annoy you :)
-->
## Describe the bug
A description of what the bug is, what you expected to happen,
#### Describe the bug
<!--
A description of what the bug is,
what you expected to happen,
and any relevant context about the issue.
-->
## Steps to reproduce
#### Steps to reproduce
<!--
For example:
1. Ran ArchiveBox with the following config '...'
2. Saw this output during archiving '....'
3. UI didn't show the thing I was expecting '....'
-->
## Screenshots or log output
#### Screenshots or log output
<!--
If applicable, post any relevant screenshots or copy/pasted terminal output from ArchiveBox.
If you're reporting a parsing / importing error, **you must paste a copy of your redacted import file here**.
-->
## Software versions
#### Software versions
- OS: ([e.g. macOS 10.14] the operating system you're running ArchiveBox on)
- ArchiveBox version: (`git rev-parse HEAD | head -c7` [e.g. d798117] commit ID of the version you're running)


@@ -1,15 +1,16 @@
---
name: 📑 Documentation change
about: Submit a suggestion for the Wiki documentation
title: ''
title: 'Documentation: Improvement request ...'
labels: ''
assignees: ''
---
## Wiki Page URL
<!-- e.g. https://github.com/pirate/ArchiveBox/wiki/Configuration#use_color -->
## Suggested Edit
<!-- e.g. Please add more example usages, or please fix `xyz` typo to be `abc`. -->
...


@@ -1,38 +1,50 @@
---
name: 💡 Feature request
about: Suggest an idea for this project
title: ''
labels: ''
title: 'Feature Request: ...'
labels: 'changes: behavior,status: idea phase'
assignees: ''
---
(feel free to delete this template and write your own issue description if you don't find it helpful)
<!--
Please fill out the following information,
feel free to delete sections if they're not applicable
or if long issue templates annoy you :)
-->
## Type
- [ ] General Question or Disussion
- [ ] General question or discussion
- [ ] Propose a brand new feature
- [ ] Request modification of existing behavior or design
## What is the problem that your feature request solves
<!--
e.g. I need to be able to archive spanish and french subtitle files
from a particular <example.com> movie site that's going down soon.
-->
## Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
<!--
e.g. I specifically need a new archive method to look for multilingual subtitle files related to pages.
The bigger picture solution is the ability for custom user scripts to be run in a puppeteer context during archiving.
-->
## What hacks or alternative solutions have you tried to solve the problem?
A clear and concise description of any alternative solutions or features you've considered.
<!--
A clear and concise description of any alternative solutions,
workarounds, or other software you've considered using to fix the problem.
-->
## How badly do you want this new feature?
- [ ] It's an urgent deal-breaker, I cant live without it
- [ ] It's an urgent deal-breaker, I can't live without it
- [ ] It's important to add it in the near-mid term future
- [ ] It would be nice to have eventually
---
- [ ] I'm willing to contribute to development / fixing this issue
- [ ] I'm willing to contribute dev time / money to fix this issue
- [ ] I like ArchiveBox so far / would recommend it to a friend
- [ ] I've had a lot of difficulty getting ArchiveBox set up


@@ -0,0 +1,9 @@
---
name: 💬 Question, discussion, or support request
about: Start a discussion or ask a question about ArchiveBox
title: 'Question: ...'
labels: ''
assignees: ''
---

.github/workflows/test.yml (vendored, new file, +145 lines)

@@ -0,0 +1,145 @@
name: Test workflow
on: [push]
env:
MAX_LINE_LENGTH: 110
jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v1
with:
python-version: 3.8
architecture: x64
- name: Install flake8
run: |
pip install flake8
- name: Lint with flake8
run: |
# one pass for show-stopper syntax errors or undefined names
flake8 archivebox --count --show-source --statistics
# one pass for small stylistic things
flake8 archivebox --count --max-line-length="$MAX_LINE_LENGTH" --statistics
test:
runs-on: ${{ matrix.os }}
strategy:
matrix:
os: [ubuntu-latest, macos-latest]
python: [3.7, 3.8]
steps:
- uses: actions/checkout@v2
with:
fetch-depth: 1
- uses: actions/checkout@v2
with:
fetch-depth: 1
repository: "gildas-lormeau/SingleFile"
ref: "master"
path: "singlefile"
- name: Install npm requirements for singlefile
run: npm install --prefix singlefile/cli
- name: Give singlefile execution permissions
run: chmod +x singlefile/cli/single-file
- name: Set SINGLEFILE_BINARY
run: echo "::set-env name=SINGLEFILE_BINARY::$GITHUB_WORKSPACE/singlefile/cli/single-file"
- name: Set up Python ${{ matrix.python }}
uses: actions/setup-python@v1
with:
python-version: ${{ matrix.python }}
architecture: x64
- name: Get pip cache dir
id: pip-cache
run: |
echo "::set-output name=dir::$(pip cache dir)"
- name: Cache pip
uses: actions/cache@v2
id: cache-pip
with:
path: ${{ steps.pip-cache.outputs.dir }}
key: ${{ runner.os }}-${{ matrix.python }}-venv-${{ hashFiles('setup.py') }}
restore-keys: |
${{ runner.os }}-${{ matrix.python }}-venv-
- name: Use nodejs 14.7.0
uses: actions/setup-node@v1
with:
node-version: 14.7.0
- name: Debug
run: ls ./
- name: Install dependencies
run: |
python -m pip install .
python -m pip install pytest bottle
- name: Test built package with pytest
run: |
python -m pytest -s
docker-test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
with:
fetch-depth: 1
- uses: satackey/action-docker-layer-caching@v0.0.4
- name: Build image
run: |
docker build . -t archivebox
- name: Init data dir
run: |
mkdir data
docker run -v "$PWD"/data:/data archivebox init
- name: Run test server
run: |
sudo bash -c 'echo "127.0.0.1 www.test-nginx-1.local www.test-nginx-2.local" >> /etc/hosts'
docker run --name www-nginx -p 80:80 -d nginx
- name: Add link
run: |
docker run -v "$PWD"/data:/data --network host archivebox add http://www.test-nginx-1.local
- name: Add stdin link
run: |
echo "http://www.test-nginx-2.local" | docker run -i -v "$PWD"/data:/data archivebox add
- name: List links
run: |
docker run -v "$PWD"/data:/data archivebox list | grep -q "www.test-nginx-1.local" || { echo "The site 1 isn't in the list"; exit 1; }
docker run -v "$PWD"/data:/data archivebox list | grep -q "www.test-nginx-2.local" || { echo "The site 2 isn't in the list"; exit 1; }
- name: Start docker-compose stack
run: |
docker-compose run archivebox init
docker-compose up -d
sleep 5
curl --silent --location 'http://127.0.0.1:8000' | grep 'ArchiveBox'
curl --silent --location 'http://127.0.0.1:8000/static/admin/js/jquery.init.js' | grep 'django.jQuery'
- name: Check added urls show up in index
run: |
docker-compose run archivebox add 'http://example.com/#test_docker' --index-only
curl --silent --location 'http://127.0.0.1:8000' | grep 'http://example.com/#test_docker'
docker-compose down || true

.gitignore (vendored, 27 lines changed)

@@ -1,27 +1,16 @@
# OS cruft
.DS_Store
._*
# python
*.pyc
__pycache__/
.mypy_cache/
venv
.venv
archivebox/.venv
archivebox/venv
archivebox/docs/_build
# vim
.swp*
venv/
.venv/
.docker-venv/
# output artifacts
output
output/
data
data/
archivebox/output
archivebox/data
archivebox.egg-info/
*.egg-info/
build/
dist/
data/
output/

Dockerfile

@@ -1,71 +1,82 @@
# This Dockerfile for ArchiveBox installs the following in a container:
# - curl, wget, python3, youtube-dl, google-chrome-beta
# - ArchiveBox
# This is the Dockerfile for ArchiveBox; it includes the following major pieces:
# git, curl, wget, python3, youtube-dl, google-chrome-stable, ArchiveBox
# Usage:
# docker build github.com/pirate/ArchiveBox -t archivebox
# echo 'https://example.com' | docker run -i --mount type=bind,source=./data,target=/data archivebox /bin/archive
# docker run --mount type=bind,source=./data,target=/data archivebox /bin/archive 'https://example.com/some/rss/feed.xml'
# docker build . -t archivebox
# docker run -v "$PWD/data":/data archivebox init
# docker run -v "$PWD/data":/data archivebox add 'https://example.com'
# Documentation:
# https://github.com/pirate/ArchiveBox/wiki/Docker#docker
FROM node:11-slim
LABEL maintainer="Nick Sweeting <archivebox-git@sweeting.me>"
FROM python:3.8-slim-buster
RUN apt-get update \
&& apt-get install -yq --no-install-recommends \
git zlib1g-dev wget curl youtube-dl gnupg2 libgconf-2-4 python3 python3-pip \
&& rm -rf /var/lib/apt/lists/*
LABEL name="archivebox" \
maintainer="Nick Sweeting <archivebox-git@sweeting.me>" \
description="All-in-one personal internet archiving container"
# Install latest chrome package and fonts to support major charsets (Chinese, Japanese, Arabic, Hebrew, Thai and a few others)
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \
&& sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list' \
&& apt-get update \
&& apt-get install -y google-chrome-beta fonts-ipafont-gothic fonts-wqy-zenhei fonts-thai-tlwg fonts-kacst ttf-freefont \
--no-install-recommends \
&& rm -rf /var/lib/apt/lists/* \
&& rm -rf /src/*.deb
# It's a good idea to use dumb-init to help prevent zombie chrome processes.
ADD https://github.com/Yelp/dumb-init/releases/download/v1.2.0/dumb-init_1.2.0_amd64 /usr/local/bin/dumb-init
RUN chmod +x /usr/local/bin/dumb-init
# Uncomment to skip the chromium download when installing puppeteer. If you do,
# you'll need to launch puppeteer with:
# browser.launch({executablePath: 'google-chrome-beta'})
ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD true
# Install puppeteer so it's available in the container.
RUN npm i puppeteer
# Add user so we don't need --no-sandbox.
RUN groupadd -r pptruser && useradd -r -g pptruser -G audio,video pptruser \
&& mkdir -p /home/pptruser/Downloads \
&& chown -R pptruser:pptruser /home/pptruser \
&& chown -R pptruser:pptruser /node_modules
# Install the ArchiveBox repository and pip requirements
RUN git clone https://github.com/pirate/ArchiveBox /home/pptruser/app \
&& mkdir -p /data \
&& chown -R pptruser:pptruser /data \
&& ln -s /data /home/pptruser/app/archivebox/output \
&& ln -s /home/pptruser/app/bin/* /bin/ \
&& ln -s /home/pptruser/app/bin/archivebox /bin/archive \
&& chown -R pptruser:pptruser /home/pptruser/app/archivebox
# && pip3 install -r /home/pptruser/app/archivebox/requirements.txt
VOLUME /data
ENV LANG=C.UTF-8 \
ENV TZ=UTC \
LANGUAGE=en_US:en \
LC_ALL=C.UTF-8 \
LANG=C.UTF-8 \
PYTHONIOENCODING=UTF-8 \
CHROME_SANDBOX=False \
CHROME_BINARY=google-chrome-beta \
OUTPUT_DIR=/data
PYTHONUNBUFFERED=1 \
APT_KEY_DONT_WARN_ON_DANGEROUS_USAGE=1 \
CODE_PATH=/app \
VENV_PATH=/venv \
DATA_PATH=/data \
EXTRA_PATH=/extra
# First install CLI utils and base deps, then Chrome + Fonts + nodejs
RUN echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections \
&& apt-get update -qq \
&& apt-get install -qq -y --no-install-recommends \
apt-transport-https ca-certificates apt-utils gnupg gosu gnupg2 libgconf-2-4 zlib1g-dev \
dumb-init jq git wget curl youtube-dl ffmpeg \
&& curl -sSL "https://dl.google.com/linux/linux_signing_key.pub" | apt-key add - \
&& echo "deb https://dl.google.com/linux/chrome/deb/ stable main" > /etc/apt/sources.list.d/google-chrome.list \
&& curl -sL https://deb.nodesource.com/setup_14.x | bash - \
&& apt-get update -qq \
&& apt-get install -qq -y --no-install-recommends \
google-chrome-stable \
fontconfig \
fonts-ipafont-gothic \
fonts-wqy-zenhei \
fonts-thai-tlwg \
fonts-kacst \
fonts-symbola \
fonts-noto \
fonts-freefont-ttf \
nodejs \
unzip \
&& rm -rf /var/lib/apt/lists/*
# Download SingleFile and install its CLI dependencies so archivebox can find it
WORKDIR "$EXTRA_PATH"
RUN wget -qO - https://github.com/gildas-lormeau/SingleFile/archive/master.zip > SingleFile.zip \
&& unzip -q SingleFile.zip \
&& npm install --prefix SingleFile-master/cli --production > /dev/null 2>&1 \
&& chmod +x SingleFile-master/cli/single-file
# Run everything from here on out as non-privileged user
USER pptruser
WORKDIR /home/pptruser/app
RUN groupadd --system archivebox \
&& useradd --system --create-home --gid archivebox --groups audio,video archivebox
ENTRYPOINT ["dumb-init", "--"]
CMD ["/bin/archive"]
ADD . "$CODE_PATH"
WORKDIR "$CODE_PATH"
ENV PATH="${PATH}:$VENV_PATH/bin"
RUN python -m venv --clear --symlinks "$VENV_PATH" \
&& pip install --upgrade pip setuptools \
&& pip install -e .
VOLUME "$DATA_PATH"
WORKDIR "$DATA_PATH"
EXPOSE 8000
ENV IN_DOCKER=True \
CHROME_BINARY=google-chrome \
CHROME_SANDBOX=False \
SINGLEFILE_BINARY="$EXTRA_PATH/SingleFile-master/cli/single-file"
RUN env ALLOW_ROOT=True archivebox version
ENTRYPOINT ["dumb-init", "--", "/app/bin/docker_entrypoint.sh"]
CMD ["archivebox", "server", "0.0.0.0:8000"]

MANIFEST.in

@@ -1,8 +1,4 @@
include LICENSE
include README.md
include archivebox/VERSION
graft archivebox/themes
graft archivebox/themes/static
graft archivebox/themes/admin
graft archivebox/themes/default
graft archivebox/themes/default/static
graft archivebox/themes/legacy
graft archivebox/themes/legacy/static
recursive-include archivebox/themes *

Pipfile (26 lines changed)

@@ -3,26 +3,10 @@ name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true
[dev-packages]
ipdb = "*"
flake8 = "*"
mypy = "*"
django-stubs = "*"
setuptools = "*"
sphinx = "*"
recommonmark = "*"
sphinx-rtd-theme = "*"
[packages]
dataclasses = "*"
base32-crockford = "*"
django = "*"
django-extensions = "*"
youtube-dl = "*"
python-crontab = "*"
croniter = "*"
ipython = "*"
mypy-extensions = "*"
# see setup.py for package dependency list
"e1839a8" = {path = ".", editable = true}
[requires]
python_version = "3.7"
[dev-packages]
# see setup.py for dev package dependency list
"e1839a8" = {path = ".", extras = ["dev"], editable = true}

Pipfile.lock (generated, deleted, 644 lines removed)

@@ -1,644 +0,0 @@
{
"_meta": {
"hash": {
"sha256": "8ac4f9e5cd266406a861a283b321b9eee0ca469638f838e93467403ef2f0594d"
},
"pipfile-spec": 6,
"requires": {
"python_version": "3.7"
},
"sources": [
{
"name": "pypi",
"url": "https://pypi.org/simple",
"verify_ssl": true
}
]
},
"default": {
"appnope": {
"hashes": [
"sha256:5b26757dc6f79a3b7dc9fab95359328d5747fcb2409d331ea66d0272b90ab2a0",
"sha256:8b995ffe925347a2138d7ac0fe77155e4311a0ea6d6da4f5128fe4b3cbe5ed71"
],
"markers": "sys_platform == 'darwin'",
"version": "==0.1.0"
},
"backcall": {
"hashes": [
"sha256:38ecd85be2c1e78f77fd91700c76e14667dc21e2713b63876c0eb901196e01e4",
"sha256:bbbf4b1e5cd2bdb08f915895b51081c041bac22394fdfcfdfbe9f14b77c08bf2"
],
"version": "==0.1.0"
},
"base32-crockford": {
"hashes": [
"sha256:115f5bd32ae32b724035cb02eb65069a8824ea08c08851eb80c8b9f63443a969",
"sha256:295ef5ffbf6ed96b6e739ffd36be98fa7e90a206dd18c39acefb15777eedfe6e"
],
"index": "pypi",
"version": "==0.3.0"
},
"croniter": {
"hashes": [
"sha256:0d905dbe6f131a910fd3dde792f0129788cd2cb3a8048c5f7aaa212670b0cef2",
"sha256:538adeb3a7f7816c3cdec6db974c441620d764c25ff4ed0146ee7296b8a50590"
],
"index": "pypi",
"version": "==0.3.30"
},
"dataclasses": {
"hashes": [
"sha256:454a69d788c7fda44efd71e259be79577822f5e3f53f029a22d08004e951dc9f",
"sha256:6988bd2b895eef432d562370bb707d540f32f7360ab13da45340101bc2307d84"
],
"index": "pypi",
"version": "==0.6"
},
"decorator": {
"hashes": [
"sha256:86156361c50488b84a3f148056ea716ca587df2f0de1d34750d35c21312725de",
"sha256:f069f3a01830ca754ba5258fde2278454a0b5b79e0d7f5c13b3b97e57d4acff6"
],
"version": "==4.4.0"
},
"django": {
"hashes": [
"sha256:7c3543e4fb070d14e10926189a7fcf42ba919263b7473dceaefce34d54e8a119",
"sha256:a2814bffd1f007805b19194eb0b9a331933b82bd5da1c3ba3d7b7ba16e06dc4b"
],
"index": "pypi",
"version": "==2.2"
},
"django-extensions": {
"hashes": [
"sha256:109004f80b6f45ad1f56addaa59debca91d94aa0dc1cb19678b9364b4fe9b6f4",
"sha256:307766e5e6c1caffe76c5d99239d8115d14ae3f7cab2cd991fcffd763dad904b"
],
"index": "pypi",
"version": "==2.1.6"
},
"ipython": {
"hashes": [
"sha256:54c5a8aa1eadd269ac210b96923688ccf01ebb2d0f21c18c3c717909583579a8",
"sha256:e840810029224b56cd0d9e7719dc3b39cf84d577f8ac686547c8ba7a06eeab26"
],
"index": "pypi",
"version": "==7.5.0"
},
"ipython-genutils": {
"hashes": [
"sha256:72dd37233799e619666c9f639a9da83c34013a73e8bbc79a7a6348d93c61fab8",
"sha256:eb2e116e75ecef9d4d228fdc66af54269afa26ab4463042e33785b887c628ba8"
],
"version": "==0.2.0"
},
"jedi": {
"hashes": [
"sha256:2bb0603e3506f708e792c7f4ad8fc2a7a9d9c2d292a358fbbd58da531695595b",
"sha256:2c6bcd9545c7d6440951b12b44d373479bf18123a401a52025cf98563fbd826c"
],
"version": "==0.13.3"
},
"mypy-extensions": {
"hashes": [
"sha256:37e0e956f41369209a3d5f34580150bcacfabaa57b33a15c0b25f4b5725e0812",
"sha256:b16cabe759f55e3409a7d231ebd2841378fb0c27a5d1994719e340e4f429ac3e"
],
"index": "pypi",
"version": "==0.4.1"
},
"parso": {
"hashes": [
"sha256:17cc2d7a945eb42c3569d4564cdf49bde221bc2b552af3eca9c1aad517dcdd33",
"sha256:2e9574cb12e7112a87253e14e2c380ce312060269d04bd018478a3c92ea9a376"
],
"version": "==0.4.0"
},
"pexpect": {
"hashes": [
"sha256:2094eefdfcf37a1fdbfb9aa090862c1a4878e5c7e0e7e7088bdb511c558e5cd1",
"sha256:9e2c1fd0e6ee3a49b28f95d4b33bc389c89b20af6a1255906e90ff1262ce62eb"
],
"markers": "sys_platform != 'win32'",
"version": "==4.7.0"
},
"pickleshare": {
"hashes": [
"sha256:87683d47965c1da65cdacaf31c8441d12b8044cdec9aca500cd78fc2c683afca",
"sha256:9649af414d74d4df115d5d718f82acb59c9d418196b7b4290ed47a12ce62df56"
],
"version": "==0.7.5"
},
"prompt-toolkit": {
"hashes": [
"sha256:11adf3389a996a6d45cc277580d0d53e8a5afd281d0c9ec71b28e6f121463780",
"sha256:2519ad1d8038fd5fc8e770362237ad0364d16a7650fb5724af6997ed5515e3c1",
"sha256:977c6583ae813a37dc1c2e1b715892461fcbdaa57f6fc62f33a528c4886c8f55"
],
"version": "==2.0.9"
},
"ptyprocess": {
"hashes": [
"sha256:923f299cc5ad920c68f2bc0bc98b75b9f838b93b599941a6b63ddbc2476394c0",
"sha256:d7cc528d76e76342423ca640335bd3633420dc1366f258cb31d05e865ef5ca1f"
],
"version": "==0.6.0"
},
"pygments": {
"hashes": [
"sha256:5ffada19f6203563680669ee7f53b64dabbeb100eb51b61996085e99c03b284a",
"sha256:e8218dd399a61674745138520d0d4cf2621d7e032439341bc3f647bff125818d"
],
"version": "==2.3.1"
},
"python-crontab": {
"hashes": [
"sha256:91ce4b245ee5e5c117aa0b21b485bc43f2d80df854a36e922b707643f50d7923"
],
"index": "pypi",
"version": "==2.3.6"
},
"python-dateutil": {
"hashes": [
"sha256:7e6584c74aeed623791615e26efd690f29817a27c73085b78e4bad02493df2fb",
"sha256:c89805f6f4d64db21ed966fda138f8a5ed7a4fdbc1a8ee329ce1b74e3c74da9e"
],
"version": "==2.8.0"
},
"pytz": {
"hashes": [
"sha256:303879e36b721603cc54604edcac9d20401bdbe31e1e4fdee5b9f98d5d31dfda",
"sha256:d747dd3d23d77ef44c6a3526e274af6efeb0a6f1afd5a69ba4d5be4098c8e141"
],
"version": "==2019.1"
},
"six": {
"hashes": [
"sha256:3350809f0555b11f552448330d0b52d5f24c91a322ea4a15ef22629740f3761c",
"sha256:d16a0141ec1a18405cd4ce8b4613101da75da0e9a7aec5bdd4fa804d0e0eba73"
],
"version": "==1.12.0"
},
"sqlparse": {
"hashes": [
"sha256:40afe6b8d4b1117e7dff5504d7a8ce07d9a1b15aeeade8a2d10f130a834f8177",
"sha256:7c3dca29c022744e95b547e867cee89f4fce4373f3549ccd8797d8eb52cdb873"
],
"version": "==0.3.0"
},
"traitlets": {
"hashes": [
"sha256:9c4bd2d267b7153df9152698efb1050a5d84982d3384a37b2c1f7723ba3e7835",
"sha256:c6cb5e6f57c5a9bdaa40fa71ce7b4af30298fbab9ece9815b5d995ab6217c7d9"
],
"version": "==4.3.2"
},
"wcwidth": {
"hashes": [
"sha256:3df37372226d6e63e1b1e1eda15c594bca98a22d33a23832a90998faa96bc65e",
"sha256:f4ebe71925af7b40a864553f761ed559b43544f8f71746c2d756c7fe788ade7c"
],
"version": "==0.1.7"
},
"youtube-dl": {
"hashes": [
"sha256:46f6e30c673ba71de84748dad4c264d1b6fb30beebf1ef834846a651b4524a78",
"sha256:b20d110e1bed8d16f5771bb938ab6e5da67f08af62b599af65301cca290f2e15"
],
"index": "pypi",
"version": "==2019.4.24"
}
},
"develop": {
"alabaster": {
"hashes": [
"sha256:446438bdcca0e05bd45ea2de1668c1d9b032e1a9154c2c259092d77031ddd359",
"sha256:a661d72d58e6ea8a57f7a86e37d86716863ee5e92788398526d58b26a4e4dc02"
],
"version": "==0.7.12"
},
"appnope": {
"hashes": [
"sha256:5b26757dc6f79a3b7dc9fab95359328d5747fcb2409d331ea66d0272b90ab2a0",
"sha256:8b995ffe925347a2138d7ac0fe77155e4311a0ea6d6da4f5128fe4b3cbe5ed71"
],
"markers": "sys_platform == 'darwin'",
"version": "==0.1.0"
},
"babel": {
"hashes": [
"sha256:6778d85147d5d85345c14a26aada5e478ab04e39b078b0745ee6870c2b5cf669",
"sha256:8cba50f48c529ca3fa18cf81fa9403be176d374ac4d60738b839122dfaaa3d23"
],
"version": "==2.6.0"
},
"backcall": {
"hashes": [
"sha256:38ecd85be2c1e78f77fd91700c76e14667dc21e2713b63876c0eb901196e01e4",
"sha256:bbbf4b1e5cd2bdb08f915895b51081c041bac22394fdfcfdfbe9f14b77c08bf2"
],
"version": "==0.1.0"
},
"certifi": {
"hashes": [
"sha256:59b7658e26ca9c7339e00f8f4636cdfe59d34fa37b9b04f6f9e9926b3cece1a5",
"sha256:b26104d6835d1f5e49452a26eb2ff87fe7090b89dfcaee5ea2212697e1e1d7ae"
],
"version": "==2019.3.9"
},
"chardet": {
"hashes": [
"sha256:84ab92ed1c4d4f16916e05906b6b75a6c0fb5db821cc65e70cbd64a3e2a5eaae",
"sha256:fc323ffcaeaed0e0a02bf4d117757b98aed530d9ed4531e3e15460124c106691"
],
"version": "==3.0.4"
},
"commonmark": {
"hashes": [
"sha256:9f6dda7876b2bb88dd784440166f4bc8e56cb2b2551264051123bacb0b6c1d8a",
"sha256:abcbc854e0eae5deaf52ae5e328501b78b4a0758bf98ac8bb792fce993006084"
],
"version": "==0.8.1"
},
"decorator": {
"hashes": [
"sha256:86156361c50488b84a3f148056ea716ca587df2f0de1d34750d35c21312725de",
"sha256:f069f3a01830ca754ba5258fde2278454a0b5b79e0d7f5c13b3b97e57d4acff6"
],
"version": "==4.4.0"
},
"django-stubs": {
"hashes": [
"sha256:9c06a4b28fc8c18f6abee4f199f8ee29cb5cfcecf349e912ded31cb3526ea2b6",
"sha256:9ef230843a24b5d74f2ebd4c60f9bea09c21911bc119d0325e8bb47e2f495e70"
],
"index": "pypi",
"version": "==0.12.1"
},
"docutils": {
"hashes": [
"sha256:02aec4bd92ab067f6ff27a38a38a41173bf01bed8f89157768c1573f53e474a6",
"sha256:51e64ef2ebfb29cae1faa133b3710143496eca21c530f3f71424d77687764274",
"sha256:7a4bd47eaf6596e1295ecb11361139febe29b084a87bf005bf899f9a42edc3c6"
],
"version": "==0.14"
},
"entrypoints": {
"hashes": [
"sha256:589f874b313739ad35be6e0cd7efde2a4e9b6fea91edcc34e58ecbb8dbe56d19",
"sha256:c70dd71abe5a8c85e55e12c19bd91ccfeec11a6e99044204511f9ed547d48451"
],
"version": "==0.3"
},
"flake8": {
"hashes": [
"sha256:859996073f341f2670741b51ec1e67a01da142831aa1fdc6242dbf88dffbe661",
"sha256:a796a115208f5c03b18f332f7c11729812c8c3ded6c46319c59b53efd3819da8"
],
"index": "pypi",
"version": "==3.7.7"
},
"future": {
"hashes": [
"sha256:67045236dcfd6816dc439556d009594abf643e5eb48992e36beac09c2ca659b8"
],
"version": "==0.17.1"
},
"idna": {
"hashes": [
"sha256:c357b3f628cf53ae2c4c05627ecc484553142ca23264e593d327bcde5e9c3407",
"sha256:ea8b7f6188e6fa117537c3df7da9fc686d485087abf6ac197f9c46432f7e4a3c"
],
"version": "==2.8"
},
"imagesize": {
"hashes": [
"sha256:3f349de3eb99145973fefb7dbe38554414e5c30abd0c8e4b970a7c9d09f3a1d8",
"sha256:f3832918bc3c66617f92e35f5d70729187676313caa60c187eb0f28b8fe5e3b5"
],
"version": "==1.1.0"
},
"ipdb": {
"hashes": [
"sha256:dce2112557edfe759742ca2d0fee35c59c97b0cc7a05398b791079d78f1519ce"
],
"index": "pypi",
"version": "==0.12"
},
"ipython": {
"hashes": [
"sha256:54c5a8aa1eadd269ac210b96923688ccf01ebb2d0f21c18c3c717909583579a8",
"sha256:e840810029224b56cd0d9e7719dc3b39cf84d577f8ac686547c8ba7a06eeab26"
],
"index": "pypi",
"version": "==7.5.0"
},
"ipython-genutils": {
"hashes": [
"sha256:72dd37233799e619666c9f639a9da83c34013a73e8bbc79a7a6348d93c61fab8",
"sha256:eb2e116e75ecef9d4d228fdc66af54269afa26ab4463042e33785b887c628ba8"
],
"version": "==0.2.0"
},
"jedi": {
"hashes": [
"sha256:2bb0603e3506f708e792c7f4ad8fc2a7a9d9c2d292a358fbbd58da531695595b",
"sha256:2c6bcd9545c7d6440951b12b44d373479bf18123a401a52025cf98563fbd826c"
],
"version": "==0.13.3"
},
"jinja2": {
"hashes": [
"sha256:065c4f02ebe7f7cf559e49ee5a95fb800a9e4528727aec6f24402a5374c65013",
"sha256:14dd6caf1527abb21f08f86c784eac40853ba93edb79552aa1e4b8aef1b61c7b"
],
"version": "==2.10.1"
},
"markupsafe": {
"hashes": [
"sha256:00bc623926325b26bb9605ae9eae8a215691f33cae5df11ca5424f06f2d1f473",
"sha256:09027a7803a62ca78792ad89403b1b7a73a01c8cb65909cd876f7fcebd79b161",
"sha256:09c4b7f37d6c648cb13f9230d847adf22f8171b1ccc4d5682398e77f40309235",
"sha256:1027c282dad077d0bae18be6794e6b6b8c91d58ed8a8d89a89d59693b9131db5",
"sha256:24982cc2533820871eba85ba648cd53d8623687ff11cbb805be4ff7b4c971aff",
"sha256:29872e92839765e546828bb7754a68c418d927cd064fd4708fab9fe9c8bb116b",
"sha256:43a55c2930bbc139570ac2452adf3d70cdbb3cfe5912c71cdce1c2c6bbd9c5d1",
"sha256:46c99d2de99945ec5cb54f23c8cd5689f6d7177305ebff350a58ce5f8de1669e",
"sha256:500d4957e52ddc3351cabf489e79c91c17f6e0899158447047588650b5e69183",
"sha256:535f6fc4d397c1563d08b88e485c3496cf5784e927af890fb3c3aac7f933ec66",
"sha256:62fe6c95e3ec8a7fad637b7f3d372c15ec1caa01ab47926cfdf7a75b40e0eac1",
"sha256:6dd73240d2af64df90aa7c4e7481e23825ea70af4b4922f8ede5b9e35f78a3b1",
"sha256:717ba8fe3ae9cc0006d7c451f0bb265ee07739daf76355d06366154ee68d221e",
"sha256:79855e1c5b8da654cf486b830bd42c06e8780cea587384cf6545b7d9ac013a0b",
"sha256:7c1699dfe0cf8ff607dbdcc1e9b9af1755371f92a68f706051cc8c37d447c905",
"sha256:88e5fcfb52ee7b911e8bb6d6aa2fd21fbecc674eadd44118a9cc3863f938e735",
"sha256:8defac2f2ccd6805ebf65f5eeb132adcf2ab57aa11fdf4c0dd5169a004710e7d",
"sha256:98c7086708b163d425c67c7a91bad6e466bb99d797aa64f965e9d25c12111a5e",
"sha256:9add70b36c5666a2ed02b43b335fe19002ee5235efd4b8a89bfcf9005bebac0d",
"sha256:9bf40443012702a1d2070043cb6291650a0841ece432556f784f004937f0f32c",
"sha256:ade5e387d2ad0d7ebf59146cc00c8044acbd863725f887353a10df825fc8ae21",
"sha256:b00c1de48212e4cc9603895652c5c410df699856a2853135b3967591e4beebc2",
"sha256:b1282f8c00509d99fef04d8ba936b156d419be841854fe901d8ae224c59f0be5",
"sha256:b2051432115498d3562c084a49bba65d97cf251f5a331c64a12ee7e04dacc51b",
"sha256:ba59edeaa2fc6114428f1637ffff42da1e311e29382d81b339c1817d37ec93c6",
"sha256:c8716a48d94b06bb3b2524c2b77e055fb313aeb4ea620c8dd03a105574ba704f",
"sha256:cd5df75523866410809ca100dc9681e301e3c27567cf498077e8551b6d20e42f",
"sha256:e249096428b3ae81b08327a63a485ad0878de3fb939049038579ac0ef61e17e7"
],
"version": "==1.1.1"
},
"mccabe": {
"hashes": [
"sha256:ab8a6258860da4b6677da4bd2fe5dc2c659cff31b3ee4f7f5d64e79735b80d42",
"sha256:dd8d182285a0fe56bace7f45b5e7d1a6ebcbf524e8f3bd87eb0f125271b8831f"
],
"version": "==0.6.1"
},
"mypy": {
"hashes": [
"sha256:2afe51527b1f6cdc4a5f34fc90473109b22bf7f21086ba3e9451857cf11489e6",
"sha256:56a16df3e0abb145d8accd5dbb70eba6c4bd26e2f89042b491faa78c9635d1e2",
"sha256:5764f10d27b2e93c84f70af5778941b8f4aa1379b2430f85c827e0f5464e8714",
"sha256:5bbc86374f04a3aa817622f98e40375ccb28c4836f36b66706cf3c6ccce86eda",
"sha256:6a9343089f6377e71e20ca734cd8e7ac25d36478a9df580efabfe9059819bf82",
"sha256:6c9851bc4a23dc1d854d3f5dfd5f20a016f8da86bcdbb42687879bb5f86434b0",
"sha256:b8e85956af3fcf043d6f87c91cbe8705073fc67029ba6e22d3468bfee42c4823",
"sha256:b9a0af8fae490306bc112229000aa0c2ccc837b49d29a5c42e088c132a2334dd",
"sha256:bbf643528e2a55df2c1587008d6e3bda5c0445f1240dfa85129af22ae16d7a9a",
"sha256:c46ab3438bd21511db0f2c612d89d8344154c0c9494afc7fbc932de514cf8d15",
"sha256:f7a83d6bd805855ef83ec605eb01ab4fa42bcef254b13631e451cbb44914a9b0"
],
"index": "pypi",
"version": "==0.701"
},
"mypy-extensions": {
"hashes": [
"sha256:37e0e956f41369209a3d5f34580150bcacfabaa57b33a15c0b25f4b5725e0812",
"sha256:b16cabe759f55e3409a7d231ebd2841378fb0c27a5d1994719e340e4f429ac3e"
],
"index": "pypi",
"version": "==0.4.1"
},
"packaging": {
"hashes": [
"sha256:0c98a5d0be38ed775798ece1b9727178c4469d9c3b4ada66e8e6b7849f8732af",
"sha256:9e1cbf8c12b1f1ce0bb5344b8d7ecf66a6f8a6e91bcb0c84593ed6d3ab5c4ab3"
],
"version": "==19.0"
},
"parso": {
"hashes": [
"sha256:17cc2d7a945eb42c3569d4564cdf49bde221bc2b552af3eca9c1aad517dcdd33",
"sha256:2e9574cb12e7112a87253e14e2c380ce312060269d04bd018478a3c92ea9a376"
],
"version": "==0.4.0"
},
"pexpect": {
"hashes": [
"sha256:2094eefdfcf37a1fdbfb9aa090862c1a4878e5c7e0e7e7088bdb511c558e5cd1",
"sha256:9e2c1fd0e6ee3a49b28f95d4b33bc389c89b20af6a1255906e90ff1262ce62eb"
],
"markers": "sys_platform != 'win32'",
"version": "==4.7.0"
},
"pickleshare": {
"hashes": [
"sha256:87683d47965c1da65cdacaf31c8441d12b8044cdec9aca500cd78fc2c683afca",
"sha256:9649af414d74d4df115d5d718f82acb59c9d418196b7b4290ed47a12ce62df56"
],
"version": "==0.7.5"
},
"prompt-toolkit": {
"hashes": [
"sha256:11adf3389a996a6d45cc277580d0d53e8a5afd281d0c9ec71b28e6f121463780",
"sha256:2519ad1d8038fd5fc8e770362237ad0364d16a7650fb5724af6997ed5515e3c1",
"sha256:977c6583ae813a37dc1c2e1b715892461fcbdaa57f6fc62f33a528c4886c8f55"
],
"version": "==2.0.9"
},
"ptyprocess": {
"hashes": [
"sha256:923f299cc5ad920c68f2bc0bc98b75b9f838b93b599941a6b63ddbc2476394c0",
"sha256:d7cc528d76e76342423ca640335bd3633420dc1366f258cb31d05e865ef5ca1f"
],
"version": "==0.6.0"
},
"pycodestyle": {
"hashes": [
"sha256:95a2219d12372f05704562a14ec30bc76b05a5b297b21a5dfe3f6fac3491ae56",
"sha256:e40a936c9a450ad81df37f549d676d127b1b66000a6c500caa2b085bc0ca976c"
],
"version": "==2.5.0"
},
"pyflakes": {
"hashes": [
"sha256:17dbeb2e3f4d772725c777fabc446d5634d1038f234e77343108ce445ea69ce0",
"sha256:d976835886f8c5b31d47970ed689944a0262b5f3afa00a5a7b4dc81e5449f8a2"
],
"version": "==2.1.1"
},
"pygments": {
"hashes": [
"sha256:5ffada19f6203563680669ee7f53b64dabbeb100eb51b61996085e99c03b284a",
"sha256:e8218dd399a61674745138520d0d4cf2621d7e032439341bc3f647bff125818d"
],
"version": "==2.3.1"
},
"pyparsing": {
"hashes": [
"sha256:1873c03321fc118f4e9746baf201ff990ceb915f433f23b395f5580d1840cb2a",
"sha256:9b6323ef4ab914af344ba97510e966d64ba91055d6b9afa6b30799340e89cc03"
],
"version": "==2.4.0"
},
"pytz": {
"hashes": [
"sha256:303879e36b721603cc54604edcac9d20401bdbe31e1e4fdee5b9f98d5d31dfda",
"sha256:d747dd3d23d77ef44c6a3526e274af6efeb0a6f1afd5a69ba4d5be4098c8e141"
],
"version": "==2019.1"
},
"recommonmark": {
"hashes": [
"sha256:a520b8d25071a51ae23a27cf6252f2fe387f51bdc913390d83b2b50617f5bb48",
"sha256:c85228b9b7aea7157662520e74b4e8791c5eacd375332ec68381b52bf10165be"
],
"index": "pypi",
"version": "==0.5.0"
},
"requests": {
"hashes": [
"sha256:502a824f31acdacb3a35b6690b5fbf0bc41d63a24a45c4004352b0242707598e",
"sha256:7bf2a778576d825600030a110f3c0e3e8edc51dfaafe1c146e39a2027784957b"
],
"version": "==2.21.0"
},
"six": {
"hashes": [
"sha256:3350809f0555b11f552448330d0b52d5f24c91a322ea4a15ef22629740f3761c",
"sha256:d16a0141ec1a18405cd4ce8b4613101da75da0e9a7aec5bdd4fa804d0e0eba73"
],
"version": "==1.12.0"
},
"snowballstemmer": {
"hashes": [
"sha256:919f26a68b2c17a7634da993d91339e288964f93c274f1343e3bbbe2096e1128",
"sha256:9f3bcd3c401c3e862ec0ebe6d2c069ebc012ce142cce209c098ccb5b09136e89"
],
"version": "==1.2.1"
},
"sphinx": {
"hashes": [
"sha256:423280646fb37944dd3c85c58fb92a20d745793a9f6c511f59da82fa97cd404b",
"sha256:de930f42600a4fef993587633984cc5027dedba2464bcf00ddace26b40f8d9ce"
],
"index": "pypi",
"version": "==2.0.1"
},
"sphinx-rtd-theme": {
"hashes": [
"sha256:00cf895504a7895ee433807c62094cf1e95f065843bf3acd17037c3e9a2becd4",
"sha256:728607e34d60456d736cc7991fd236afb828b21b82f956c5ea75f94c8414040a"
],
"index": "pypi",
"version": "==0.4.3"
},
"sphinxcontrib-applehelp": {
"hashes": [
"sha256:edaa0ab2b2bc74403149cb0209d6775c96de797dfd5b5e2a71981309efab3897",
"sha256:fb8dee85af95e5c30c91f10e7eb3c8967308518e0f7488a2828ef7bc191d0d5d"
],
"version": "==1.0.1"
},
"sphinxcontrib-devhelp": {
"hashes": [
"sha256:6c64b077937330a9128a4da74586e8c2130262f014689b4b89e2d08ee7294a34",
"sha256:9512ecb00a2b0821a146736b39f7aeb90759834b07e81e8cc23a9c70bacb9981"
],
"version": "==1.0.1"
},
"sphinxcontrib-htmlhelp": {
"hashes": [
"sha256:4670f99f8951bd78cd4ad2ab962f798f5618b17675c35c5ac3b2132a14ea8422",
"sha256:d4fd39a65a625c9df86d7fa8a2d9f3cd8299a3a4b15db63b50aac9e161d8eff7"
],
"version": "==1.0.2"
},
"sphinxcontrib-jsmath": {
"hashes": [
"sha256:2ec2eaebfb78f3f2078e73666b1415417a116cc848b72e5172e596c871103178",
"sha256:a9925e4a4587247ed2191a22df5f6970656cb8ca2bd6284309578f2153e0c4b8"
],
"version": "==1.0.1"
},
"sphinxcontrib-qthelp": {
"hashes": [
"sha256:513049b93031beb1f57d4daea74068a4feb77aa5630f856fcff2e50de14e9a20",
"sha256:79465ce11ae5694ff165becda529a600c754f4bc459778778c7017374d4d406f"
],
"version": "==1.0.2"
},
"sphinxcontrib-serializinghtml": {
"hashes": [
"sha256:c0efb33f8052c04fd7a26c0a07f1678e8512e0faec19f4aa8f2473a8b81d5227",
"sha256:db6615af393650bf1151a6cd39120c29abaf93cc60db8c48eb2dddbfdc3a9768"
],
"version": "==1.1.3"
},
"traitlets": {
"hashes": [
"sha256:9c4bd2d267b7153df9152698efb1050a5d84982d3384a37b2c1f7723ba3e7835",
"sha256:c6cb5e6f57c5a9bdaa40fa71ce7b4af30298fbab9ece9815b5d995ab6217c7d9"
],
"version": "==4.3.2"
},
"typed-ast": {
"hashes": [
"sha256:04894d268ba6eab7e093d43107869ad49e7b5ef40d1a94243ea49b352061b200",
"sha256:16616ece19daddc586e499a3d2f560302c11f122b9c692bc216e821ae32aa0d0",
"sha256:252fdae740964b2d3cdfb3f84dcb4d6247a48a6abe2579e8029ab3be3cdc026c",
"sha256:2af80a373af123d0b9f44941a46df67ef0ff7a60f95872412a145f4500a7fc99",
"sha256:2c88d0a913229a06282b285f42a31e063c3bf9071ff65c5ea4c12acb6977c6a7",
"sha256:2ea99c029ebd4b5a308d915cc7fb95b8e1201d60b065450d5d26deb65d3f2bc1",
"sha256:3d2e3ab175fc097d2a51c7a0d3fda442f35ebcc93bb1d7bd9b95ad893e44c04d",
"sha256:4766dd695548a15ee766927bf883fb90c6ac8321be5a60c141f18628fb7f8da8",
"sha256:56b6978798502ef66625a2e0f80cf923da64e328da8bbe16c1ff928c70c873de",
"sha256:5cddb6f8bce14325b2863f9d5ac5c51e07b71b462361fd815d1d7706d3a9d682",
"sha256:644ee788222d81555af543b70a1098f2025db38eaa99226f3a75a6854924d4db",
"sha256:64cf762049fc4775efe6b27161467e76d0ba145862802a65eefc8879086fc6f8",
"sha256:68c362848d9fb71d3c3e5f43c09974a0ae319144634e7a47db62f0f2a54a7fa7",
"sha256:6c1f3c6f6635e611d58e467bf4371883568f0de9ccc4606f17048142dec14a1f",
"sha256:b213d4a02eec4ddf622f4d2fbc539f062af3788d1f332f028a2e19c42da53f15",
"sha256:bb27d4e7805a7de0e35bd0cb1411bc85f807968b2b0539597a49a23b00a622ae",
"sha256:c9d414512eaa417aadae7758bc118868cd2396b0e6138c1dd4fda96679c079d3",
"sha256:f0937165d1e25477b01081c4763d2d9cdc3b18af69cb259dd4f640c9b900fe5e",
"sha256:fb96a6e2c11059ecf84e6741a319f93f683e440e341d4489c9b161eca251cf2a",
"sha256:fc71d2d6ae56a091a8d94f33ec9d0f2001d1cb1db423d8b4355debfe9ce689b7"
],
"version": "==1.3.4"
},
"typing-extensions": {
"hashes": [
"sha256:07b2c978670896022a43c4b915df8958bec4a6b84add7f2c87b2b728bda3ba64",
"sha256:f3f0e67e1d42de47b5c67c32c9b26641642e9170fe7e292991793705cd5fef7c",
"sha256:fb2cd053238d33a8ec939190f30cfd736c00653a85a2919415cecf7dc3d9da71"
],
"version": "==3.7.2"
},
"urllib3": {
"hashes": [
"sha256:4c291ca23bbb55c76518905869ef34bdd5f0e46af7afe6861e8375643ffee1a0",
"sha256:9a247273df709c4fedb38c711e44292304f73f39ab01beda9f6b9fc375669ac3"
],
"version": "==1.24.2"
},
"wcwidth": {
"hashes": [
"sha256:3df37372226d6e63e1b1e1eda15c594bca98a22d33a23832a90998faa96bc65e",
"sha256:f4ebe71925af7b40a864553f761ed559b43544f8f71746c2d756c7fe788ade7c"
],
"version": "==0.1.7"
}
}
}

README.md (159 lines changed)

@@ -3,7 +3,7 @@
<h1>ArchiveBox<br/><sub>The open-source self-hosted web archive.</sub></h1>
▶️ <a href="https://github.com/pirate/ArchiveBox/wiki/Quickstart">Quickstart</a> |
<a href="https://archive.sweeting.me">Demo</a> |
<a href="https://archivebox.zervice.io/">Demo</a> |
<a href="https://github.com/pirate/ArchiveBox">Github</a> |
<a href="https://github.com/pirate/ArchiveBox/wiki">Documentation</a> |
<a href="#background--motivation">Info & Motivation</a> |
@@ -14,35 +14,41 @@
"Your own personal internet archive" (网站存档 / 爬虫)
</pre>
<a href="http://webchat.freenode.net?channels=ArchiveBox&uio=d4"><img src="https://img.shields.io/badge/Community_chat-IRC-%2328A745.svg"/></a>
<!--<a href="http://webchat.freenode.net?channels=ArchiveBox&uio=d4"><img src="https://img.shields.io/badge/Community_chat-IRC-%2328A745.svg"/></a>-->
<a href="https://github.com/pirate/ArchiveBox/blob/master/LICENSE"><img src="https://img.shields.io/badge/Open_source-MIT-green.svg?logo=git&logoColor=green"/></a>
<a href="https://github.com/pirate/ArchiveBox/commits/dev"><img src="https://img.shields.io/github/last-commit/pirate/ArchiveBox.svg?logo=Sublime+Text&logoColor=green&label=Active"/></a>
<a href="https://github.com/pirate/ArchiveBox"><img src="https://img.shields.io/github/stars/pirate/ArchiveBox.svg?logo=github&label=Stars&logoColor=blue"/></a>
<a href="https://test.pypi.org/project/archivebox/"><img src="https://img.shields.io/badge/Python-%3E%3D3.5-yellow.svg?logo=python&logoColor=yellow"/></a>
<a href="https://test.pypi.org/project/archivebox/"><img src="https://img.shields.io/badge/Python-%3E%3D3.7-yellow.svg?logo=python&logoColor=yellow"/></a>
<a href="https://github.com/pirate/ArchiveBox/wiki/Install#dependencies"><img src="https://img.shields.io/badge/Chromium-%3E%3D59-orange.svg?logo=Google+Chrome&logoColor=orange"/></a>
<a href="https://hub.docker.com/r/nikisweeting/archivebox"><img src="https://img.shields.io/badge/Docker-all%20platforms-lightblue.svg?logo=docker&logoColor=lightblue"/></a>
<hr/>
</div>
**ArchiveBox takes a list of website URLs you want to archive, and creates a local, static, browsable HTML clone of the content from those websites (it saves HTML, JS, media files, PDFs, images and more).**
You can use it to preserve access to websites you care about by storing them locally offline. ArchiveBox imports lists of URLs, renders the pages in a headless, autheticated, user-scriptable browser, and then archives the content in multiple redundant common formats (HTML, PDF, PNG, WARC) that will last long after the originals disappear off the internet. It automatically extracts assets and media from pages and saves them in easily-accessible folders, with out-of-the-box support for extracting git repositories, audio, video, subtitles, images, PDFs, and more.
You can use it to preserve access to websites you care about by storing them locally offline. ArchiveBox imports lists of URLs, renders the pages in a headless, authenticated, user-scriptable browser, and then archives the content in multiple redundant common formats (HTML, PDF, PNG, WARC) that will last long after the originals disappear off the internet. It automatically extracts assets and media from pages and saves them in easily-accessible folders, with out-of-the-box support for extracting git repositories, audio, video, subtitles, images, PDFs, and more.
#### How does it work?
```bash
echo 'http://example.com' | ./archive
mkdir data && cd data
archivebox init
archivebox add 'https://example.com'
archivebox add 'https://getpocket.com/users/USERNAME/feed/all' --depth=1
archivebox server
```
After installing the dependencies, just pipe some new links into the `./archive` command to start your archive.
ArchiveBox is written in Python 3.5 and uses wget, Chrome headless, youtube-dl, pywb, and other common unix tools to save each page you add in multiple redundant formats. It doesn't require a constantly running server or backend, just open the generated `output/index.html` in a browser to view the archive. It can import and export links as JSON (among other formats), so it's easy to script or hook up to other APIs. If you run it on a schedule and import from browser history or bookmarks regularly, you can sleep soundly knowing that the slice of the internet you care about will be automatically preserved in multiple, durable long-term formats that will be accessible for decades (or longer).
After installing archivebox, just pass some new links to the `archivebox add` command to start your collection.
ArchiveBox is written in Python 3.7 and uses wget, Chrome headless, youtube-dl, pywb, and other common UNIX tools to save each page you add in multiple redundant formats. It doesn't require a constantly running server or backend (though it does include an optional one), just open the generated `data/index.html` in a browser to view the archive or run `archivebox server` to use the interactive Web UI. It can import and export links as JSON (among other formats), so it's easy to script or hook up to other APIs. If you run it on a schedule and import from browser history or bookmarks regularly, you can sleep soundly knowing that the slice of the internet you care about will be automatically preserved in multiple, durable long-term formats that will be accessible for decades (or longer).
<div align="center">
<img src="https://i.imgur.com/3tBL7PU.png" width="30%" alt="CLI Screenshot" align="top">
<img src="https://i.imgur.com/viklZNG.png" width="30%" alt="Desktop index screenshot" align="top">
<img src="https://i.imgur.com/RefWsXB.jpg" width="30%" alt="Desktop details page Screenshot"/><br/>
<img src="https://i.imgur.com/3tBL7PU.png" width="22%" alt="CLI Screenshot" align="top">
<img src="https://i.imgur.com/viklZNG.png" width="22%" alt="Desktop index screenshot" align="top">
<img src="https://i.imgur.com/RefWsXB.jpg" width="22%" alt="Desktop details page Screenshot"/>
<img src="https://i.imgur.com/M6HhzVx.png" width="22%" alt="Desktop details page Screenshot"/><br/>
<sup><a href="https://archive.sweeting.me/">Demo</a> | <a href="https://github.com/pirate/ArchiveBox/wiki/Usage">Usage</a> | <a href="#screenshots">Screenshots</a></sup>
<br/>
<sub>. . . . . . . . . . . . . . . . . . . . . . . . . . . .</sub>
@@ -50,26 +56,56 @@ ArchiveBox is written in Python 3.5 and uses wget, Chrome headless, youtube-dl,
## Quickstart
ArchiveBox has [3 main dependencies](https://github.com/pirate/ArchiveBox/wiki/Install#dependencies) beyond `python3`: `wget`, `chromium`, and `youtube-dl`.
ArchiveBox is written in `python3.7` and has [3 main binary dependencies](https://github.com/pirate/ArchiveBox/wiki/Install#dependencies): `wget`, `chromium`, and `youtube-dl`.
To get started, you can [install them manually](https://github.com/pirate/ArchiveBox/wiki/Install) using your system's package manager, use the [automated helper script](https://github.com/pirate/ArchiveBox/wiki/Quickstart), or use the official [Docker](https://github.com/pirate/ArchiveBox/wiki/Docker) container. All three dependencies are optional if [disabled](https://github.com/pirate/ArchiveBox/wiki/Configuration#archive-method-toggles) in settings.
```bash
# 1. Install dependencies (use apt on ubuntu, brew on mac, or pkg on BSD)
apt install python3 python3-pip git curl wget youtube-dl chromium-browser
# 2. Download ArchiveBox
git clone https://github.com/pirate/ArchiveBox.git && cd ArchiveBox
# 3. Add your first links to your archive
echo 'https://example.com' | ./archive # pass URLs to archive via stdin
./archive https://getpocket.com/users/example/feed/all # or import an RSS/JSON/XML/TXT feed
# Docker
mkdir data && cd data
docker run -v $PWD:/data nikisweeting/archivebox init
docker run -v $PWD:/data nikisweeting/archivebox add 'https://example.com'
docker run -v $PWD:/data -it nikisweeting/archivebox manage createsuperuser
docker run -v $PWD:/data -p 8000:8000 nikisweeting/archivebox server 0.0.0.0:8000
open http://127.0.0.1:8000
```
One you've added your first links, open `output/index.html` in a browser to view the archive. [DEMO: archive.sweeting.me](https://archive.sweeting.me)
For more information, see the [full Quickstart guide](https://github.com/pirate/ArchiveBox/wiki/Quickstart), [Usage](https://github.com/pirate/ArchiveBox/wiki/Usage), and [Configuration](https://github.com/pirate/ArchiveBox/wiki/Configuration) docs.
```bash
# Docker Compose
# first download: https://github.com/pirate/ArchiveBox/blob/master/docker-compose.yml
docker-compose run archivebox init
docker-compose run archivebox add 'https://example.com'
docker-compose run archivebox manage createsuperuser
docker-compose up
open http://127.0.0.1:8000
```
*(`pip install archivebox` will be available in the near future, follow our [Roadmap](https://github.com/pirate/ArchiveBox/wiki/Roadmap) for progress)*
```bash
# Bare Metal
# Use apt on Ubuntu/Debian, brew on mac, or pkg on BSD
apt install python3 python3-pip git curl wget youtube-dl chromium-browser
pip install archivebox # install archivebox
mkdir data && cd data # (doesn't have to be called data)
archivebox init
archivebox add 'https://example.com' # add URLs via args or stdin
# or import an RSS/JSON/XML/TXT feed/list of links
archivebox add https://getpocket.com/users/USERNAME/feed/all --depth=1
```
Once you've added your first links, open `data/index.html` in a browser to view the static archive.
You can also start it as a server with a full web UI to manage your links:
```bash
archivebox manage createsuperuser
archivebox server
```
You can visit `http://127.0.0.1:8000` in your browser to access it.
[DEMO: archivebox.zervice.io/](https://archivebox.zervice.io)
For more information, see the [full Quickstart guide](https://github.com/pirate/ArchiveBox/wiki/Quickstart), [Usage](https://github.com/pirate/ArchiveBox/wiki/Usage), and [Configuration](https://github.com/pirate/ArchiveBox/wiki/Configuration) docs.
---
@@ -87,17 +123,18 @@ complex, finicky websites in at least a few high-quality, long-term data formats
ArchiveBox imports a list of URLs from stdin, remote URL, or file, then adds the pages to a local archive folder using wget to create a browsable HTML clone, youtube-dl to extract media, and a full instance of Chrome headless for PDF, Screenshot, and DOM dumps, and more...
Running `./archive` adds only new, unique links into `output/` on each run. Because it will ignore duplicates and only archive each link the first time you add it, you can schedule it to [run on a timer](https://github.com/pirate/ArchiveBox/wiki/Scheduled-Archiving) and re-import all your feeds multiple times a day. It will run quickly even if the feeds are large, because it's only archiving the newest links since the last run. For each link, it runs through all the archive methods. Methods that fail will save `None` and be automatically retried on the next run, methods that succeed save their output into the data folder and are never retried/overwritten by subsequent runs. Support for saving multiple snapshots of each site over time will be [added soon](https://github.com/pirate/ArchiveBox/issues/179) (along with the ability to view diffs of the changes between runs).
Running `archivebox add` adds only new, unique links into your collection on each run. Because it will ignore duplicates and only archive each link the first time you add it, you can schedule it to [run on a timer](https://github.com/pirate/ArchiveBox/wiki/Scheduled-Archiving) and re-import all your feeds multiple times a day. It will run quickly even if the feeds are large, because it's only archiving the newest links since the last run. For each link, it runs through all the archive methods. Methods that fail will save `None` and be automatically retried on the next run, methods that succeed save their output into the data folder and are never retried/overwritten by subsequent runs. Support for saving multiple snapshots of each site over time will be [added soon](https://github.com/pirate/ArchiveBox/issues/179) (along with the ability to view diffs of the changes between runs).
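Since re-imports are idempotent, scheduling can be as simple as a cron job that re-adds the same feed; a minimal sketch of a crontab entry (assuming `archivebox` is on the cron user's PATH and the collection lives in `/home/user/data`) might be:

```bash
# hypothetical crontab line: re-import a feed every day at 01:00
0 1 * * * cd /home/user/data && archivebox add 'https://getpocket.com/users/USERNAME/feed/all' --depth=1
```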
All the archived links are stored by date bookmarked in `output/archive/<timestamp>`, and everything is indexed nicely with JSON & HTML files. The intent is for all the content to be viewable with common software in 50 - 100 years without needing to run ArchiveBox in a VM.
All the archived links are stored by date bookmarked in `./archive/<timestamp>`, and everything is indexed nicely with JSON & HTML files. The intent is for all the content to be viewable with common software in 50 - 100 years without needing to run ArchiveBox in a VM.
#### Can import links from many formats:
```bash
echo 'http://example.com' | ./archive
./archive ~/Downloads/firefox_bookmarks_export.html
./archive https://example.com/some/rss/feed.xml
echo 'http://example.com' | archivebox add
archivebox add ~/Downloads/firefox_bookmarks_export.html --depth=1
archivebox add https://example.com/some/rss/feed.xml --depth=1
```
- <img src="https://nicksweeting.com/images/bookmarks.png" height="22px"/> Browser history or bookmarks exports (Chrome, Firefox, Safari, IE, Opera, and more)
- <img src="https://nicksweeting.com/images/rss.svg" height="22px"/> RSS, XML, JSON, CSV, SQL, HTML, Markdown, TXT, or any other text-based format
- <img src="https://getpocket.com/favicon.ico" height="22px"/> Pocket, Pinboard, Instapaper, Shaarli, Delicious, Reddit Saved Posts, Wallabag, Unmark.it, OneTab, and more
@@ -107,7 +144,7 @@ See the [Usage: CLI](https://github.com/pirate/ArchiveBox/wiki/Usage#CLI-Usage)
#### Saves lots of useful stuff for each imported link:
```bash
ls output/archive/<timestamp>/
ls ./archive/<timestamp>/
```
- **Index:** `index.html` & `index.json` HTML and JSON index files containing metadata and details
@@ -121,7 +158,7 @@ See the [Usage: CLI](https://github.com/pirate/ArchiveBox/wiki/Usage#CLI-Usage)
- **URL to Archive.org:** `archive.org.txt` A link to the saved site on archive.org
- **Audio & Video:** `media/` all audio/video files + playlists, including subtitles & metadata with youtube-dl
- **Source Code:** `git/` clone of any repository found on github, bitbucket, or gitlab links
- *More coming soon! See the [Roadmap](https://github.com/pirate/ArchiveBox/wiki/Roadmap)...*
- _More coming soon! See the [Roadmap](https://github.com/pirate/ArchiveBox/wiki/Roadmap)..._
It does everything out-of-the-box by default, but you can disable or tweak [individual archive methods](https://github.com/pirate/ArchiveBox/wiki/Configuration) via environment variables or config file.
@@ -135,7 +172,7 @@ If you're importing URLs with secret tokens in them (e.g Google Docs, CodiMD not
- **Doesn't require a constantly-running server**, proxy, or native app
- Easy to set up **[scheduled importing](https://github.com/pirate/ArchiveBox/wiki/Scheduled-Archiving) from multiple sources**
- Uses common, **durable, [long-term formats](#saves-lots-of-useful-stuff-for-each-imported-link)** like HTML, JSON, PDF, PNG, and WARC
- **Suitable for paywalled / [authenticated content](https://github.com/pirate/ArchiveBox/wiki/Configuration#chrome_user_data_dir)** (can use your cookies)
- ~~**Suitable for paywalled / [authenticated content](https://github.com/pirate/ArchiveBox/wiki/Configuration#chrome_user_data_dir)** (can use your cookies)~~ (do not do this until v0.5 is released with some security fixes)
- Can [**run scripts during archiving**](https://github.com/pirate/ArchiveBox/issues/51) to [scroll pages](https://github.com/pirate/ArchiveBox/issues/80), [close modals](https://github.com/pirate/ArchiveBox/issues/175), expand comment threads, etc.
- Can also [**mirror content to 3rd-party archiving services**](https://github.com/pirate/ArchiveBox/wiki/Configuration#submit_archive_dot_org) automatically for redundancy
@@ -155,7 +192,6 @@ archive internet content enables to you save the stuff you care most about befor
The balance between the permanence and ephemeral nature of content on the internet is part of what makes it beautiful.
I don't think everything should be preserved in an automated fashion, making all content permanent and never removable, but I do think people should be able to decide for themselves and effectively archive specific content that they care about.
## Comparison to Other Projects
▶ **Check out our [community page](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community) for an index of web archiving initiatives and projects.**
@@ -164,13 +200,11 @@ I don't think everything should be preserved in an automated fashion, making all
#### User Interface & Intended Purpose
ArchiveBox differentiates itself from [similar projects](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community#Web-Archiving-Projects) by being a simple, one-shot CLI inferface for users to ingest built feeds of URLs over extended periods, as opposed to being a backend service that ingests individual, manually-submitted URLs from a web UI.
An alternative tool [pywb](https://github.com/webrecorder/pywb) allows you to run a browser through an always-running archiving proxy which records the traffic to WARC files. ArchiveBox intends to support this style of live proxy-archiving using `pywb` in the future, but for now it only ingests lists of links at a time via browser history, bookmarks, RSS, etc.
ArchiveBox differentiates itself from [similar projects](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community#Web-Archiving-Projects) by being a simple, one-shot CLI interface for users to ingest bulk feeds of URLs over extended periods, as opposed to being a backend service that ingests individual, manually-submitted URLs from a web UI. However, we also have the option to add URLs via a web interface through our Django frontend.
#### Private Local Archives vs Centralized Public Archives
Unlike crawler software that starts from a seed URL and works outwards, or public tools like Archive.org designed for users to manually submit links from the public internet, ArchiveBox tries to be a set-and-forget archiver suitable for archiving your entire browsing history, RSS feeds, or bookmarks, including private/authenticated content that you wouldn't otherwise share with a centralized service. Also by having each user store their own content locally, we can save much larger portions of everyone's browsing history than a shared centralized service would be able to handle.
Unlike crawler software that starts from a seed URL and works outwards, or public tools like Archive.org designed for users to manually submit links from the public internet, ArchiveBox tries to be a set-and-forget archiver suitable for archiving your entire browsing history, RSS feeds, or bookmarks, ~~including private/authenticated content that you wouldn't otherwise share with a centralized service~~ (do not do this until v0.5 is released with some security fixes). Also by having each user store their own content locally, we can save much larger portions of everyone's browsing history than a shared centralized service would be able to handle.
#### Storage Requirements
@@ -178,21 +212,21 @@ Because ArchiveBox is designed to ingest a firehose of browser history and bookm
## Learn more
▶ **Join out our [community chat](http://webchat.freenode.net?channels=ArchiveBox&uio=d4) hosted on IRC freenode.net:`#ArchiveBox`!**
<!-- **Join out our [community chat](http://webchat.freenode.net?channels=ArchiveBox&uio=d4) hosted on IRC freenode.net:`#ArchiveBox`!**-->
Whether you want learn which organizations are the big players in the web archiving space, want to find a specific open source tool for your web archiving need, or just want to see where archivists hang out online, our Community Wiki page serves as an index of the broader web archiving community. Check it out to learn about some of the coolest web archiving projects and communities on the web!
Whether you want to learn which organizations are the big players in the web archiving space, want to find a specific open-source tool for your web archiving need, or just want to see where archivists hang out online, our Community Wiki page serves as an index of the broader web archiving community. Check it out to learn about some of the coolest web archiving projects and communities on the web!
<img src="https://i.imgur.com/0ZOmOvN.png" width="14%" align="right"/>
- [Community Wiki](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community)
+ [The Master Lists](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community#The-Master-Lists)
*Community-maintained indexes of archiving tools and institutions.*
+ [Web Archiving Software](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community#Web-Archiving-Projects)
*Open source tools and projects in the internet archiving space.*
+ [Reading List](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community#Reading-List)
*Articles, posts, and blogs relevant to ArchiveBox and web archiving in general.*
+ [Communities](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community#Communities)
*A collection of the most active internet archiving communities and initiatives.*
- [The Master Lists](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community#The-Master-Lists)
_Community-maintained indexes of archiving tools and institutions._
- [Web Archiving Software](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community#Web-Archiving-Projects)
_Open source tools and projects in the internet archiving space._
- [Reading List](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community#Reading-List)
_Articles, posts, and blogs relevant to ArchiveBox and web archiving in general._
- [Communities](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community#Communities)
_A collection of the most active internet archiving communities and initiatives._
- Check out the ArchiveBox [Roadmap](https://github.com/pirate/ArchiveBox/wiki/Roadmap) and [Changelog](https://github.com/pirate/ArchiveBox/wiki/Changelog)
- Learn why archiving the internet is important by reading the "[On the Importance of Web Archiving](https://parameters.ssrc.org/2018/09/on-the-importance-of-web-archiving/)" blog post.
- Or reach out to me for questions and comments via [@theSquashSH](https://twitter.com/thesquashSH) on Twitter.
@ -208,6 +242,7 @@ We use the [Github wiki system](https://github.com/pirate/ArchiveBox/wiki) and [
You can also access the docs locally by looking in the [`ArchiveBox/docs/`](https://github.com/pirate/ArchiveBox/wiki/Home) folder.
You can build the docs by running:
```bash
cd ArchiveBox
pipenv install --dev
@ -245,40 +280,22 @@ make html
---
# Screenshots
<div align="center">
<img src="https://i.imgur.com/biVfFYr.png" width="18%" alt="CLI Screenshot" align="top">
<img src="https://i.imgur.com/viklZNG.png" width="40%" alt="Desktop index screenshot" align="top">
<img src="https://i.imgur.com/wnpdAVM.jpg" width="30%" alt="Desktop details page Screenshot" align="top">
<img src="https://i.imgur.com/mW2dITg.png" width="8%" alt="Mobile details page screenshot" align="top">
</div>
---
<div align="center">
<br/><br/>
<img src="https://raw.githubusercontent.com/Monadical-SAS/redux-time/HEAD/examples/static/jeremy.jpg" height="40px"/>
<br/>
<sub><i>This project is maintained mostly in <a href="https://nicksweeting.com/blog#About">my spare time</a> with the help from generous contributors.</i></sub>
<sub><i>This project is maintained mostly in <a href="https://nicksweeting.com/blog#About">my spare time</a> with the help from generous contributors and Monadical.com.</i></sub>
<br/><br/>
Contributor Spotlight:<br/><br/>
<a href="https://sourcerer.io/fame/pirate/pirate/ArchiveBox/links/0"><img src="https://sourcerer.io/fame/pirate/pirate/ArchiveBox/images/0"></a>
<a href="https://sourcerer.io/fame/pirate/pirate/ArchiveBox/links/1"><img src="https://sourcerer.io/fame/pirate/pirate/ArchiveBox/images/1"></a>
<a href="https://sourcerer.io/fame/pirate/pirate/ArchiveBox/links/2"><img src="https://sourcerer.io/fame/pirate/pirate/ArchiveBox/images/2"></a>
<a href="https://sourcerer.io/fame/pirate/pirate/ArchiveBox/links/3"><img src="https://sourcerer.io/fame/pirate/pirate/ArchiveBox/images/3"></a>
<a href="https://sourcerer.io/fame/pirate/pirate/ArchiveBox/links/4"><img src="https://sourcerer.io/fame/pirate/pirate/ArchiveBox/images/4"></a>
<a href="https://sourcerer.io/fame/pirate/pirate/ArchiveBox/links/5"><img src="https://sourcerer.io/fame/pirate/pirate/ArchiveBox/images/5"></a>
<br/>
<a href="https://github.com/sponsors/pirate">Sponsor us on Github</a>
<br>
<br>
<a href="https://www.patreon.com/theSquashSH"><img src="https://img.shields.io/badge/Donate_to_support_development-via_Patreon-%23DD5D76.svg?style=flat"/></a>
<br/>
<br/>
<a href="https://twitter.com/thesquashSH"><img src="https://img.shields.io/badge/Tweet-%40theSquashSH-blue.svg?style=flat"/></a>
<a href="https://github.com/pirate/ArchiveBox"><img src="https://img.shields.io/github/stars/pirate/ArchiveBox.svg?style=flat&label=Star+on+Github"/></a>
<a href="http://webchat.freenode.net?channels=ArchiveBox&uio=d4"><img src="https://img.shields.io/badge/Community_chat-IRC-%2328A745.svg"/></a>
<br/><br/>
View File
@ -1 +1 @@
theme: jekyll-theme-merlot
theme: jekyll-theme-minimal
View File
@ -1,4 +1,6 @@
[flake8]
ignore = D100,D101,D102,D103,D104,D105,D202,D203,D205,D400,E127,E131,E241,E252,E266,E272,E701,E731,W293,W503
select = F,E9
exclude = migrations,util_scripts,node_modules,venv
ignore = D100,D101,D102,D103,D104,D105,D202,D203,D205,D400,E131,E241,E252,E266,E272,E701,E731,W293,W503,W291,W391
select = F,E9,W
max-line-length = 130
max-complexity = 10
exclude = migrations,tests,node_modules,vendor,static,venv,.venv,.venv2,.docker-venv
View File
@ -1 +1 @@
0.4.0
0.4.13
View File
@ -1,6 +1 @@
__package__ = 'archivebox'
from . import core
from . import cli
from .main import *
View File
@ -3,13 +3,9 @@
__package__ = 'archivebox'
import sys
from .cli import archivebox
def main():
archivebox.main(args=sys.argv[1:], stdin=sys.stdin)
from .cli import main
if __name__ == '__main__':
archivebox.main(args=sys.argv[1:], stdin=sys.stdin)
main(args=sys.argv[1:], stdin=sys.stdin)
View File
@ -1,8 +1,14 @@
__package__ = 'archivebox.cli'
__command__ = 'archivebox'
import os
import sys
import argparse
from typing import Optional, Dict, List, IO
from ..config import OUTPUT_DIR
from typing import Dict, List, Optional, IO
from importlib import import_module
CLI_DIR = os.path.dirname(os.path.abspath(__file__))
@ -24,6 +30,7 @@ is_valid_cli_module = lambda module, subcommand: (
and module.__command__.split(' ')[-1] == subcommand
)
def list_subcommands() -> Dict[str, str]:
"""find and import all valid archivebox_<subcommand>.py files in CLI_DIR"""
@ -57,6 +64,69 @@ def run_subcommand(subcommand: str,
SUBCOMMANDS = list_subcommands()
class NotProvided:
pass
def main(args: Optional[List[str]]=NotProvided, stdin: Optional[IO]=NotProvided, pwd: Optional[str]=None) -> None:
args = sys.argv[1:] if args is NotProvided else args
stdin = sys.stdin if stdin is NotProvided else stdin
subcommands = list_subcommands()
parser = argparse.ArgumentParser(
prog=__command__,
description='ArchiveBox: The self-hosted internet archive',
add_help=False,
)
group = parser.add_mutually_exclusive_group()
group.add_argument(
'--help', '-h',
action='store_true',
help=subcommands['help'],
)
group.add_argument(
'--version',
action='store_true',
help=subcommands['version'],
)
group.add_argument(
"subcommand",
type=str,
help= "The name of the subcommand to run",
nargs='?',
choices=subcommands.keys(),
default=None,
)
parser.add_argument(
"subcommand_args",
help="Arguments for the subcommand",
nargs=argparse.REMAINDER,
)
command = parser.parse_args(args or ())
if command.help or command.subcommand is None:
command.subcommand = 'help'
elif command.version:
command.subcommand = 'version'
if command.subcommand not in ('help', 'version', 'status'):
from ..logging_util import log_cli_command
log_cli_command(
subcommand=command.subcommand,
subcommand_args=command.subcommand_args,
stdin=stdin,
pwd=pwd or OUTPUT_DIR
)
run_subcommand(
subcommand=command.subcommand,
subcommand_args=command.subcommand_args,
stdin=stdin,
pwd=pwd or OUTPUT_DIR,
)
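# Why the NotProvided sentinel above instead of a plain default: passing
# stdin=None must mean "no stdin at all", which an ordinary stdin=sys.stdin
# default could not distinguish from "argument omitted". A sketch
# (the `example` function is hypothetical):
def example(stdin: Optional[IO]=NotProvided) -> None:
    if stdin is NotProvided:
        stdin = sys.stdin   # argument omitted -> fall back to the real stdin
    elif stdin is None:
        pass                # caller explicitly opted out of stdin handling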
__all__ = (
'SUBCOMMANDS',
'list_subcommands',
View File
@ -1,63 +0,0 @@
#!/usr/bin/env python3
# archivebox [command]
__package__ = 'archivebox.cli'
__command__ = 'archivebox'
import sys
import argparse
from typing import Optional, List, IO
from . import list_subcommands, run_subcommand
from ..config import OUTPUT_DIR
def main(args: Optional[List[str]]=None, stdin: Optional[IO]=None, pwd: Optional[str]=None) -> None:
subcommands = list_subcommands()
parser = argparse.ArgumentParser(
prog=__command__,
description='ArchiveBox: The self-hosted internet archive',
add_help=False,
)
group = parser.add_mutually_exclusive_group()
group.add_argument(
'--help', '-h',
action='store_true',
help=subcommands['help'],
)
group.add_argument(
'--version',
action='store_true',
help=subcommands['version'],
)
group.add_argument(
"subcommand",
type=str,
help= "The name of the subcommand to run",
nargs='?',
choices=subcommands.keys(),
default=None,
)
parser.add_argument(
"subcommand_args",
help="Arguments for the subcommand",
nargs=argparse.REMAINDER,
)
command = parser.parse_args(args or ())
if command.help or command.subcommand is None:
command.subcommand = 'help'
if command.version:
command.subcommand = 'version'
run_subcommand(
subcommand=command.subcommand,
subcommand_args=command.subcommand_args,
stdin=stdin,
pwd=pwd or OUTPUT_DIR,
)
if __name__ == '__main__':
main(args=sys.argv[1:], stdin=sys.stdin)
View File
@ -8,9 +8,10 @@ import argparse
from typing import List, Optional, IO
from ..main import add, docstring
from ..main import add
from ..util import docstring
from ..config import OUTPUT_DIR, ONLY_NEW
from .logging import SmartFormatter, accept_stdin
from ..logging_util import SmartFormatter, accept_stdin, stderr
@docstring(add.__doc__)
@ -33,23 +34,39 @@ def main(args: Optional[List[str]]=None, stdin: Optional[IO]=None, pwd: Optional
help="Add the links to the main index without archiving them",
)
parser.add_argument(
'import_path',
nargs='?',
'urls',
nargs='*',
type=str,
default=None,
help=(
'URL or path to local file containing a list of links to import. e.g.:\n'
'URLs or paths to archive e.g.:\n'
' https://getpocket.com/users/USERNAME/feed/all\n'
' https://example.com/some/rss/feed.xml\n'
' https://example.com\n'
' ~/Downloads/firefox_bookmarks_export.html\n'
' ~/Desktop/sites_list.csv\n'
)
)
parser.add_argument(
"--depth",
action="store",
default=0,
choices=[0, 1],
type=int,
help="Recursively archive all linked pages up to this many hops away"
)
command = parser.parse_args(args or ())
import_str = accept_stdin(stdin)
urls = command.urls
stdin_urls = accept_stdin(stdin)
if (stdin_urls and urls) or (not stdin and not urls):
stderr(
'[X] You must pass URLs/paths to add via stdin or CLI arguments.\n',
color='red',
)
raise SystemExit(2)
add(
import_str=import_str,
import_path=command.import_path,
urls=stdin_urls or urls,
depth=command.depth,
update_all=command.update_all,
index_only=command.index_only,
out_dir=pwd or OUTPUT_DIR,
@ -63,12 +80,6 @@ if __name__ == '__main__':
# TODO: Implement these
#
# parser.add_argument(
# '--depth', #'-d',
# type=int,
# help='Recursively archive all linked pages up to this many hops away',
# default=0,
# )
# parser.add_argument(
# '--mirror', #'-m',
# action='store_true',
# help='Archive an entire site (finding all linked pages below it on the same domain)',
View File
@ -8,9 +8,10 @@ import argparse
from typing import Optional, List, IO
from ..main import config, docstring
from ..main import config
from ..util import docstring
from ..config import OUTPUT_DIR
from .logging import SmartFormatter, accept_stdin
from ..logging_util import SmartFormatter, accept_stdin
@docstring(config.__doc__)
View File
@ -8,9 +8,10 @@ import argparse
from typing import Optional, List, IO
from ..main import help, docstring
from ..main import help
from ..util import docstring
from ..config import OUTPUT_DIR
from .logging import SmartFormatter, reject_stdin
from ..logging_util import SmartFormatter, reject_stdin
@docstring(help.__doc__)
View File
@ -8,9 +8,10 @@ import argparse
from typing import Optional, List, IO
from ..main import init, docstring
from ..main import init
from ..util import docstring
from ..config import OUTPUT_DIR
from .logging import SmartFormatter, reject_stdin
from ..logging_util import SmartFormatter, reject_stdin
@docstring(init.__doc__)
View File
@ -8,7 +8,8 @@ import argparse
from typing import Optional, List, IO
from ..main import list_all, docstring
from ..main import list_all
from ..util import docstring
from ..config import OUTPUT_DIR
from ..index import (
get_indexed_folders,
@ -22,7 +23,7 @@ from ..index import (
get_corrupted_folders,
get_unrecognized_folders,
)
from .logging import SmartFormatter, accept_stdin
from ..logging_util import SmartFormatter, accept_stdin
@docstring(list_all.__doc__)

View File

@ -7,7 +7,8 @@ import sys
from typing import Optional, List, IO
from ..main import manage, docstring
from ..main import manage
from ..util import docstring
from ..config import OUTPUT_DIR
View File
@ -0,0 +1,62 @@
#!/usr/bin/env python3
__package__ = 'archivebox.cli'
__command__ = 'archivebox oneshot'
import sys
import argparse
from pathlib import Path
from typing import List, Optional, IO
from ..main import oneshot
from ..util import docstring
from ..config import OUTPUT_DIR
from ..logging_util import SmartFormatter, accept_stdin, stderr
@docstring(oneshot.__doc__)
def main(args: Optional[List[str]]=None, stdin: Optional[IO]=None, pwd: Optional[str]=None) -> None:
parser = argparse.ArgumentParser(
prog=__command__,
description=oneshot.__doc__,
add_help=True,
formatter_class=SmartFormatter,
)
parser.add_argument(
'url',
type=str,
default=None,
help=(
'URLs or paths to archive e.g.:\n'
' https://getpocket.com/users/USERNAME/feed/all\n'
' https://example.com/some/rss/feed.xml\n'
' https://example.com\n'
' ~/Downloads/firefox_bookmarks_export.html\n'
' ~/Desktop/sites_list.csv\n'
)
)
parser.add_argument(
'--out-dir',
type=str,
default=OUTPUT_DIR,
help= "Path to save the single archive folder to, e.g. ./example.com_archive"
)
command = parser.parse_args(args or ())
url = command.url
stdin_url = accept_stdin(stdin)
if (stdin_url and url) or (not stdin and not url):
stderr(
'[X] You must pass a URL/path to add via stdin or CLI arguments.\n',
color='red',
)
raise SystemExit(2)
oneshot(
url=stdin_url or url,
out_dir=str(Path(command.out_dir).absolute()),
)
if __name__ == '__main__':
main(args=sys.argv[1:], stdin=sys.stdin)
View File
@ -8,9 +8,10 @@ import argparse
from typing import Optional, List, IO
from ..main import remove, docstring
from ..main import remove
from ..util import docstring
from ..config import OUTPUT_DIR
from .logging import SmartFormatter, accept_stdin
from ..logging_util import SmartFormatter, accept_stdin
@docstring(remove.__doc__)
View File
@ -8,9 +8,10 @@ import argparse
from typing import Optional, List, IO
from ..main import schedule, docstring
from ..main import schedule
from ..util import docstring
from ..config import OUTPUT_DIR
from .logging import SmartFormatter, reject_stdin
from ..logging_util import SmartFormatter, reject_stdin
@docstring(schedule.__doc__)
View File
@ -8,9 +8,10 @@ import argparse
from typing import Optional, List, IO
from ..main import server, docstring
from ..main import server
from ..util import docstring
from ..config import OUTPUT_DIR
from .logging import SmartFormatter, reject_stdin
from ..logging_util import SmartFormatter, reject_stdin
@docstring(server.__doc__)
@ -38,6 +39,11 @@ def main(args: Optional[List[str]]=None, stdin: Optional[IO]=None, pwd: Optional
action='store_true',
help='Enable DEBUG=True mode with more verbose errors',
)
parser.add_argument(
'--init',
action='store_true',
help='Run archivebox init before starting the server',
)
command = parser.parse_args(args or ())
reject_stdin(__command__, stdin)
@ -45,6 +51,7 @@ def main(args: Optional[List[str]]=None, stdin: Optional[IO]=None, pwd: Optional
runserver_args=command.runserver_args,
reload=command.reload,
debug=command.debug,
init=command.init,
out_dir=pwd or OUTPUT_DIR,
)
View File
@ -8,9 +8,10 @@ import argparse
from typing import Optional, List, IO
from ..main import shell, docstring
from ..main import shell
from ..util import docstring
from ..config import OUTPUT_DIR
from .logging import SmartFormatter, reject_stdin
from ..logging_util import SmartFormatter, reject_stdin
@docstring(shell.__doc__)
View File
@ -1,30 +1,31 @@
#!/usr/bin/env python3
__package__ = 'archivebox.cli'
__command__ = 'archivebox info'
__command__ = 'archivebox status'
import sys
import argparse
from typing import Optional, List, IO
from ..main import info, docstring
from ..main import status
from ..util import docstring
from ..config import OUTPUT_DIR
from .logging import SmartFormatter, reject_stdin
from ..logging_util import SmartFormatter, reject_stdin
@docstring(info.__doc__)
@docstring(status.__doc__)
def main(args: Optional[List[str]]=None, stdin: Optional[IO]=None, pwd: Optional[str]=None) -> None:
parser = argparse.ArgumentParser(
prog=__command__,
description=info.__doc__,
description=status.__doc__,
add_help=True,
formatter_class=SmartFormatter,
)
parser.parse_args(args or ())
reject_stdin(__command__, stdin)
info(out_dir=pwd or OUTPUT_DIR)
status(out_dir=pwd or OUTPUT_DIR)
if __name__ == '__main__':
View File
@ -8,7 +8,8 @@ import argparse
from typing import List, Optional, IO
from ..main import update, docstring
from ..main import update
from ..util import docstring
from ..config import OUTPUT_DIR
from ..index import (
get_indexed_folders,
@ -22,7 +23,7 @@ from ..index import (
get_corrupted_folders,
get_unrecognized_folders,
)
from .logging import SmartFormatter, accept_stdin
from ..logging_util import SmartFormatter, accept_stdin
@docstring(update.__doc__)
View File
@ -8,9 +8,10 @@ import argparse
from typing import Optional, List, IO
from ..main import version, docstring
from ..main import version
from ..util import docstring
from ..config import OUTPUT_DIR
from .logging import SmartFormatter, reject_stdin
from ..logging_util import SmartFormatter, reject_stdin
@docstring(version.__doc__)
View File
@ -198,7 +198,7 @@ class TestRemove(unittest.TestCase):
def test_remove_regex(self):
with output_hidden():
archivebox_remove.main(['--yes', '--delete', '--filter-type=regex', 'http(s)?:\/\/(.+\.)?(example\d\.com)'])
archivebox_remove.main(['--yes', '--delete', '--filter-type=regex', r'http(s)?:\/\/(.+\.)?(example\d\.com)'])
all_links = load_main_index(out_dir=OUTPUT_DIR)
assert len(all_links) == 4
View File
@ -9,9 +9,11 @@ import getpass
import shutil
from hashlib import md5
from pathlib import Path
from typing import Optional, Type, Tuple, Dict
from subprocess import run, PIPE, DEVNULL
from configparser import ConfigParser
from collections import defaultdict
from .stubs import (
SimpleConfigValueDict,
@ -21,6 +23,14 @@ from .stubs import (
ConfigDefaultDict,
)
# precedence order for config:
# 1. cli args
# 2. shell environment vars
# 3. config file
# 4. defaults
# env USE_COLOR=false archivebox add '...'
# env SHOW_PROGRESS=1 archivebox add '...'
# ******************************************************************************
# Documentation: https://github.com/pirate/ArchiveBox/wiki/Configuration
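# The precedence order above, reduced to a sketch (an illustrative helper,
# not the actual load_config_val implementation further down):
def resolve_config_value(key, cli_args, config_file, defaults):
    if key in cli_args:                  # 1. CLI args win
        return cli_args[key]
    if key in os.environ:                # 2. then shell environment vars
        return os.environ[key]
    if key in config_file:               # 3. then the ArchiveBox.conf file
        return config_file[key]
    return defaults.get(key)             # 4. finally the built-in default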
@ -35,6 +45,8 @@ CONFIG_DEFAULTS: Dict[str, ConfigDefaultDict] = {
'IS_TTY': {'type': bool, 'default': lambda _: sys.stdout.isatty()},
'USE_COLOR': {'type': bool, 'default': lambda c: c['IS_TTY']},
'SHOW_PROGRESS': {'type': bool, 'default': lambda c: c['IS_TTY']},
'IN_DOCKER': {'type': bool, 'default': False},
# TODO: 'SHOW_HINTS': {'type: bool, 'default': True},
},
'GENERAL_CONFIG': {
@ -44,21 +56,33 @@ CONFIG_DEFAULTS: Dict[str, ConfigDefaultDict] = {
'TIMEOUT': {'type': int, 'default': 60},
'MEDIA_TIMEOUT': {'type': int, 'default': 3600},
'OUTPUT_PERMISSIONS': {'type': str, 'default': '755'},
'FOOTER_INFO': {'type': str, 'default': 'Content is hosted for personal archiving purposes only. Contact server owner for any takedown requests.'},
'RESTRICT_FILE_NAMES': {'type': str, 'default': 'windows'},
'URL_BLACKLIST': {'type': str, 'default': None},
},
'SERVER_CONFIG': {
'SECRET_KEY': {'type': str, 'default': None},
'ALLOWED_HOSTS': {'type': str, 'default': '*'},
'DEBUG': {'type': bool, 'default': False},
'PUBLIC_INDEX': {'type': bool, 'default': True},
'PUBLIC_SNAPSHOTS': {'type': bool, 'default': True},
'FOOTER_INFO': {'type': str, 'default': 'Content is hosted for personal archiving purposes only. Contact server owner for any takedown requests.'},
'ACTIVE_THEME': {'type': str, 'default': 'default'},
},
'ARCHIVE_METHOD_TOGGLES': {
'SAVE_TITLE': {'type': bool, 'default': True, 'aliases': ('FETCH_TITLE',)},
'SAVE_FAVICON': {'type': bool, 'default': True, 'aliases': ('FETCH_FAVICON',)},
'SAVE_WGET': {'type': bool, 'default': True, 'aliases': ('FETCH_WGET',)},
'SAVE_WGET_REQUISITES': {'type': bool, 'default': True, 'aliases': ('FETCH_WGET_REQUISITES',)},
'SAVE_SINGLEFILE': {'type': bool, 'default': True, 'aliases': ('FETCH_SINGLEFILE',)},
'SAVE_PDF': {'type': bool, 'default': True, 'aliases': ('FETCH_PDF',)},
'SAVE_SCREENSHOT': {'type': bool, 'default': True, 'aliases': ('FETCH_SCREENSHOT',)},
'SAVE_DOM': {'type': bool, 'default': True, 'aliases': ('FETCH_DOM',)},
'SAVE_WARC': {'type': bool, 'default': True, 'aliases': ('FETCH_WARC',)},
'SAVE_GIT': {'type': bool, 'default': True, 'aliases': ('FETCH_GIT',)},
'SAVE_MEDIA': {'type': bool, 'default': True, 'aliases': ('FETCH_MEDIA',)},
'SAVE_PLAYLISTS': {'type': bool, 'default': True, 'aliases': ('FETCH_PLAYLISTS',)},
'SAVE_ARCHIVE_DOT_ORG': {'type': bool, 'default': True, 'aliases': ('SUBMIT_ARCHIVE_DOT_ORG',)},
},
@ -67,6 +91,7 @@ CONFIG_DEFAULTS: Dict[str, ConfigDefaultDict] = {
'GIT_DOMAINS': {'type': str, 'default': 'github.com,bitbucket.org,gitlab.com'},
'CHECK_SSL_VALIDITY': {'type': bool, 'default': True},
'CURL_USER_AGENT': {'type': str, 'default': 'ArchiveBox/{VERSION} (+https://github.com/pirate/ArchiveBox/) curl/{CURL_VERSION}'},
'WGET_USER_AGENT': {'type': str, 'default': 'ArchiveBox/{VERSION} (+https://github.com/pirate/ArchiveBox/) wget/{WGET_VERSION}'},
'CHROME_USER_AGENT': {'type': str, 'default': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36'},
@ -75,11 +100,13 @@ CONFIG_DEFAULTS: Dict[str, ConfigDefaultDict] = {
'CHROME_HEADLESS': {'type': bool, 'default': True},
'CHROME_SANDBOX': {'type': bool, 'default': True},
},
'DEPENDENCY_CONFIG': {
'USE_CURL': {'type': bool, 'default': True},
'USE_WGET': {'type': bool, 'default': True},
'USE_SINGLEFILE': {'type': bool, 'default': True},
'USE_GIT': {'type': bool, 'default': True},
'USE_CHROME': {'type': bool, 'default': True},
'USE_YOUTUBEDL': {'type': bool, 'default': True},
@ -87,6 +114,7 @@ CONFIG_DEFAULTS: Dict[str, ConfigDefaultDict] = {
'CURL_BINARY': {'type': str, 'default': 'curl'},
'GIT_BINARY': {'type': str, 'default': 'git'},
'WGET_BINARY': {'type': str, 'default': 'wget'},
'SINGLEFILE_BINARY': {'type': str, 'default': 'single-file'},
'YOUTUBEDL_BINARY': {'type': str, 'default': 'youtube-dl'},
'CHROME_BINARY': {'type': str, 'default': None},
},
@ -119,8 +147,20 @@ DEFAULT_CLI_COLORS = {
}
ANSI = {k: '' for k in DEFAULT_CLI_COLORS.keys()}
COLOR_DICT = defaultdict(lambda: [(0, 0, 0), (0, 0, 0)], {
'00': [(0, 0, 0), (0, 0, 0)],
'30': [(0, 0, 0), (0, 0, 0)],
'31': [(255, 0, 0), (128, 0, 0)],
'32': [(0, 200, 0), (0, 128, 0)],
'33': [(255, 255, 0), (128, 128, 0)],
'34': [(0, 0, 255), (0, 0, 128)],
'35': [(255, 0, 255), (128, 0, 128)],
'36': [(0, 255, 255), (0, 128, 128)],
'37': [(255, 255, 255), (255, 255, 255)],
})
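# Sketch: how a table like COLOR_DICT can turn an ANSI SGR color code into
# CSS hex values (assuming index 0 is the normal variant, index 1 the dim one):
def ansi_code_to_hex(code: str, dim: bool=False) -> str:
    rgb = COLOR_DICT[code][1 if dim else 0]
    return '#{:02x}{:02x}{:02x}'.format(*rgb)
# ansi_code_to_hex('31')       -> '#ff0000' (red)
# ansi_code_to_hex('31', True) -> '#800000' (dark red)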
STATICFILE_EXTENSIONS = {
# 99.999% of the time, URLs ending in these extentions are static files
# 99.999% of the time, URLs ending in these extensions are static files
# that can be downloaded as-is, not html pages that need to be rendered
'gif', 'jpeg', 'jpg', 'png', 'tif', 'tiff', 'wbmp', 'ico', 'jng', 'bmp',
'svg', 'svgz', 'webp', 'ps', 'eps', 'ai',
@ -137,7 +177,7 @@ STATICFILE_EXTENSIONS = {
# pl pm, prc pdb, rar, rpm, sea, sit, tcl tk, der, pem, crt, xpi, xspf,
# ra, mng, asx, asf, 3gpp, 3gp, mid, midi, kar, jad, wml, htc, mml
# Thse are always treated as pages, not as static files, never add them:
# These are always treated as pages, not as static files, never add them:
# html, htm, shtml, xhtml, xml, aspx, php, cgi
}
@ -195,13 +235,14 @@ DERIVED_CONFIG_DEFAULTS: ConfigDefaultDict = {
'PYTHON_BINARY': {'default': lambda c: sys.executable},
'PYTHON_ENCODING': {'default': lambda c: sys.stdout.encoding.upper()},
'PYTHON_VERSION': {'default': lambda c: '{}.{}'.format(sys.version_info.major, sys.version_info.minor)},
'PYTHON_VERSION': {'default': lambda c: '{}.{}.{}'.format(*sys.version_info[:3])},
'DJANGO_BINARY': {'default': lambda c: django.__file__.replace('__init__.py', 'bin/django-admin.py')},
'DJANGO_VERSION': {'default': lambda c: '{}.{}.{} {} ({})'.format(*django.VERSION)},
'USE_CURL': {'default': lambda c: c['USE_CURL'] and (c['SAVE_FAVICON'] or c['SAVE_ARCHIVE_DOT_ORG'])},
'USE_CURL': {'default': lambda c: c['USE_CURL'] and (c['SAVE_FAVICON'] or c['SAVE_TITLE'] or c['SAVE_ARCHIVE_DOT_ORG'])},
'CURL_VERSION': {'default': lambda c: bin_version(c['CURL_BINARY']) if c['USE_CURL'] else None},
'CURL_USER_AGENT': {'default': lambda c: c['CURL_USER_AGENT'].format(**c)},
'SAVE_FAVICON': {'default': lambda c: c['USE_CURL'] and c['SAVE_FAVICON']},
'SAVE_ARCHIVE_DOT_ORG': {'default': lambda c: c['USE_CURL'] and c['SAVE_ARCHIVE_DOT_ORG']},
@ -212,6 +253,9 @@ DERIVED_CONFIG_DEFAULTS: ConfigDefaultDict = {
'SAVE_WGET': {'default': lambda c: c['USE_WGET'] and c['SAVE_WGET']},
'SAVE_WARC': {'default': lambda c: c['USE_WGET'] and c['SAVE_WARC']},
'USE_SINGLEFILE': {'default': lambda c: c['USE_SINGLEFILE'] and c['SAVE_SINGLEFILE']},
'SINGLEFILE_VERSION': {'default': lambda c: bin_version(c['SINGLEFILE_BINARY']) if c['USE_SINGLEFILE'] else None},
'USE_GIT': {'default': lambda c: c['USE_GIT'] and c['SAVE_GIT']},
'GIT_VERSION': {'default': lambda c: bin_version(c['GIT_BINARY']) if c['USE_GIT'] else None},
'SAVE_GIT': {'default': lambda c: c['USE_GIT'] and c['SAVE_GIT']},
@ -219,13 +263,15 @@ DERIVED_CONFIG_DEFAULTS: ConfigDefaultDict = {
'USE_YOUTUBEDL': {'default': lambda c: c['USE_YOUTUBEDL'] and c['SAVE_MEDIA']},
'YOUTUBEDL_VERSION': {'default': lambda c: bin_version(c['YOUTUBEDL_BINARY']) if c['USE_YOUTUBEDL'] else None},
'SAVE_MEDIA': {'default': lambda c: c['USE_YOUTUBEDL'] and c['SAVE_MEDIA']},
'SAVE_PLAYLISTS': {'default': lambda c: c['SAVE_PLAYLISTS'] and c['SAVE_MEDIA']},
'USE_CHROME': {'default': lambda c: c['USE_CHROME'] and (c['SAVE_PDF'] or c['SAVE_SCREENSHOT'] or c['SAVE_DOM'])},
'USE_CHROME': {'default': lambda c: c['USE_CHROME'] and (c['SAVE_PDF'] or c['SAVE_SCREENSHOT'] or c['SAVE_DOM'] or c['SAVE_SINGLEFILE'])},
'CHROME_BINARY': {'default': lambda c: c['CHROME_BINARY'] if c['CHROME_BINARY'] else find_chrome_binary()},
'CHROME_VERSION': {'default': lambda c: bin_version(c['CHROME_BINARY']) if c['USE_CHROME'] else None},
'SAVE_PDF': {'default': lambda c: c['USE_CHROME'] and c['SAVE_PDF']},
'SAVE_SCREENSHOT': {'default': lambda c: c['USE_CHROME'] and c['SAVE_SCREENSHOT']},
'SAVE_DOM': {'default': lambda c: c['USE_CHROME'] and c['SAVE_DOM']},
'SAVE_SINGLEFILE': {'default': lambda c: c['USE_CHROME'] and c['USE_SINGLEFILE']},
'DEPENDENCIES': {'default': lambda c: get_dependency_info(c)},
'CODE_LOCATIONS': {'default': lambda c: get_code_locations(c)},
@ -245,6 +291,8 @@ def load_config_val(key: str,
config: Optional[ConfigDict]=None,
env_vars: Optional[os._Environ]=None,
config_file_vars: Optional[Dict[str, str]]=None) -> ConfigValue:
"""parse bool, int, and str key=value pairs from env"""
config_keys_to_check = (key, *(aliases or ()))
for key in config_keys_to_check:
@ -284,6 +332,7 @@ def load_config_val(key: str,
raise Exception('Config values can only be str, bool, or int')
def load_config_file(out_dir: str=None) -> Optional[Dict[str, str]]:
"""load the ini-formatted config file from OUTPUT_DIR/Archivebox.conf"""
@ -304,53 +353,67 @@ def load_config_file(out_dir: str=None) -> Optional[Dict[str, str]]:
return config_file_vars
return None
def write_config_file(config: Dict[str, str], out_dir: str=None) -> ConfigDict:
"""load the ini-formatted config file from OUTPUT_DIR/Archivebox.conf"""
from ..system import atomic_write
out_dir = out_dir or os.path.abspath(os.getenv('OUTPUT_DIR', '.'))
config_path = os.path.join(out_dir, CONFIG_FILENAME)
if not os.path.exists(config_path):
with open(config_path, 'w+') as f:
f.write(CONFIG_HEADER)
if not config:
return {}
if not os.path.exists(config_path):
atomic_write(config_path, CONFIG_HEADER)
config_file = ConfigParser()
config_file.optionxform = str
config_file.read(config_path)
with open(config_path, 'r') as old:
atomic_write(f'{config_path}.bak', old.read())
find_section = lambda key: [name for name, opts in CONFIG_DEFAULTS.items() if key in opts][0]
with open(f'{config_path}.old', 'w+') as old:
with open(config_path, 'r') as new:
old.write(new.read())
with open(config_path, 'w+') as f:
# Set up sections in empty config file
for key, val in config.items():
section = find_section(key)
if section in config_file:
existing_config = dict(config_file[section])
else:
existing_config = {}
config_file[section] = {**existing_config, key: val}
config_file.write(f)
# always make sure there's a SECRET_KEY defined for Django
existing_secret_key = None
if 'SERVER_CONFIG' in config_file and 'SECRET_KEY' in config_file['SERVER_CONFIG']:
existing_secret_key = config_file['SERVER_CONFIG']['SECRET_KEY']
if (not existing_secret_key) or ('not a valid secret' in existing_secret_key):
from django.utils.crypto import get_random_string
chars = 'abcdefghijklmnopqrstuvwxyz0123456789-_+!.'
random_secret_key = get_random_string(50, chars)
if 'SERVER_CONFIG' in config_file:
config_file['SERVER_CONFIG']['SECRET_KEY'] = random_secret_key
else:
config_file['SERVER_CONFIG'] = {'SECRET_KEY': random_secret_key}
with open(config_path, 'w+') as new:
config_file.write(new)
try:
# validate the config by attempting to re-parse it
CONFIG = load_all_config()
return {
key.upper(): CONFIG.get(key.upper())
for key in config.keys()
}
except:
with open(f'{config_path}.old', 'r') as old:
with open(config_path, 'w+') as new:
new.write(old.read())
# something went horribly wrong, revert to the previous version
with open(f'{config_path}.bak', 'r') as old:
atomic_write(config_path, old.read())
if os.path.exists(f'{config_path}.old'):
os.remove(f'{config_path}.old')
if os.path.exists(f'{config_path}.bak'):
os.remove(f'{config_path}.bak')
return {}
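# The backup/validate/rollback dance above, reduced to its essential shape
# (a sketch; atomic_write is the helper imported from ..system above, and
# `validate` stands in for re-parsing the config with load_all_config):
def safe_rewrite(path: str, new_text: str, validate) -> None:
    with open(path, 'r') as f:
        atomic_write(f'{path}.bak', f.read())   # 1. keep a restorable copy
    atomic_write(path, new_text)                # 2. write the new version
    try:
        validate(path)                          # 3. e.g. re-parse the config
    except Exception:
        with open(f'{path}.bak', 'r') as old:
            atomic_write(path, old.read())      # 4. on failure, roll back
        raise
    finally:
        if os.path.exists(f'{path}.bak'):
            os.remove(f'{path}.bak')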
@ -438,8 +501,10 @@ def bin_path(binary: Optional[str]) -> Optional[str]:
return shutil.which(os.path.expanduser(binary)) or binary
def bin_hash(binary: Optional[str]) -> Optional[str]:
if binary is None:
return None
abs_path = bin_path(binary)
if abs_path is None:
if abs_path is None or not Path(abs_path).exists():
return None
file_hash = md5()
@ -457,6 +522,7 @@ def find_chrome_binary() -> Optional[str]:
'chromium-browser',
'chromium',
'/Applications/Chromium.app/Contents/MacOS/Chromium',
'chrome',
'google-chrome',
'/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
'google-chrome-stable',
@ -483,6 +549,7 @@ def find_chrome_data_dir() -> Optional[str]:
'~/.config/chromium',
'~/Library/Application Support/Chromium',
'~/AppData/Local/Chromium/User Data',
'~/.config/chrome',
'~/.config/google-chrome',
'~/Library/Application Support/Google/Chrome',
'~/AppData/Local/Google/Chrome/User Data',
@ -615,6 +682,13 @@ def get_dependency_info(config: ConfigDict) -> ConfigValue:
'enabled': config['USE_WGET'],
'is_valid': bool(config['WGET_VERSION']),
},
'SINGLEFILE_BINARY': {
'path': bin_path(config['SINGLEFILE_BINARY']),
'version': config['SINGLEFILE_VERSION'],
'hash': bin_hash(config['SINGLEFILE_BINARY']),
'enabled': config['USE_SINGLEFILE'],
'is_valid': bool(config['SINGLEFILE_VERSION']),
},
'GIT_BINARY': {
'path': bin_path(config['GIT_BINARY']),
'version': config['GIT_VERSION'],
@ -664,6 +738,9 @@ def load_all_config():
CONFIG = load_all_config()
globals().update(CONFIG)
# Timezone set as UTC
os.environ["TZ"] = 'UTC'
############################## Importable Checkers #############################
@ -676,7 +753,7 @@ def check_system_config(config: ConfigDict=CONFIG) -> None:
raise SystemExit(2)
### Check Python environment
if float(config['PYTHON_VERSION']) < 3.6:
if sys.version_info[:3] < (3, 6, 0):
stderr(f'[X] Python version is not new enough: {config["PYTHON_VERSION"]} (>3.6 is required)', color='red')
stderr(' See https://github.com/pirate/ArchiveBox/wiki/Troubleshooting#python for help upgrading your Python installation.')
raise SystemExit(2)
@ -705,9 +782,16 @@ def check_system_config(config: ConfigDict=CONFIG) -> None:
stderr(' CHROME_USER_DATA_DIR="{}"'.format(config['CHROME_USER_DATA_DIR'].split('/Default')[0]))
raise SystemExit(2)
def dependency_additional_info(dependency: str) -> str:
if dependency == "SINGLEFILE_BINARY":
return "Please follow the installation instructions at https://github.com/gildas-lormeau/SingleFile/tree/master/cli and set SINGLEFILE_BINARY or set USE_SINGLEFILE=false"
return ""
def check_dependencies(config: ConfigDict=CONFIG, show_help: bool=True) -> None:
invalid = [
'{}: {} ({})'.format(name, info['path'] or 'unable to find binary', info['version'] or 'unable to detect version')
'{}: {} ({}). {}'.format(name, info['path'] or 'unable to find binary', info['version'] or 'unable to detect version',
dependency_additional_info(name))
for name, info in config['DEPENDENCIES'].items()
if info['enabled'] and not info['is_valid']
]
@ -726,7 +810,7 @@ def check_dependencies(config: ConfigDict=CONFIG, show_help: bool=True) -> None:
stderr()
stderr(f'[!] Warning: TIMEOUT is set too low! (currently set to TIMEOUT={config["TIMEOUT"]} seconds)', color='red')
stderr(' You must allow *at least* 5 seconds for indexing and archive methods to run successfully.')
stderr(' (Setting it to somewhere between 30 and 300 seconds is recommended)')
stderr(' (Setting it to somewhere between 30 and 3000 seconds is recommended)')
stderr()
stderr(' If you want to make ArchiveBox run faster, disable specific archive methods instead:')
stderr(' https://github.com/pirate/ArchiveBox/wiki/Configuration#archive-method-toggles')
@ -756,14 +840,14 @@ def check_data_folder(out_dir: Optional[str]=None, config: ConfigDict=CONFIG) ->
json_index_exists = os.path.exists(os.path.join(output_dir, JSON_INDEX_FILENAME))
if not json_index_exists:
stderr('[X] No archive main index was found in current directory.', color='red')
stderr(f' {output_dir}')
stderr('[X] No archivebox index found in the current directory.', color='red')
stderr(f' {output_dir}', color='lightyellow')
stderr()
stderr(' Are you running archivebox in the right folder?')
stderr(' {lightred}Hint{reset}: Are you running archivebox in the right folder?'.format(**config['ANSI']))
stderr(' cd path/to/your/archive/folder')
stderr(' archivebox [command]')
stderr()
stderr(' To create a new archive collection or import existing data in this folder, run:')
stderr(' {lightred}Hint{reset}: To create a new archive collection or import existing data in this folder, run:'.format(**config['ANSI']))
stderr(' archivebox init')
raise SystemExit(2)
@ -785,9 +869,15 @@ def check_data_folder(out_dir: Optional[str]=None, config: ConfigDict=CONFIG) ->
stderr(' archivebox init')
raise SystemExit(3)
sources_dir = os.path.join(output_dir, SOURCES_DIR_NAME)
if not os.path.exists(sources_dir):
os.makedirs(sources_dir)
def setup_django(out_dir: str=None, check_db=False, config: ConfigDict=CONFIG) -> None:
check_system_config()
output_dir = out_dir or config['OUTPUT_DIR']
assert isinstance(output_dir, str) and isinstance(config['PYTHON_DIR'], str)
@ -806,4 +896,4 @@ def setup_django(out_dir: str=None, check_db=False, config: ConfigDict=CONFIG) -
except KeyboardInterrupt:
raise SystemExit(2)
check_system_config()
os.umask(0o777 - int(OUTPUT_PERMISSIONS, base=8)) # noqa: F821
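# e.g. with the default OUTPUT_PERMISSIONS='755' the line above computes
# os.umask(0o777 - 0o755) == os.umask(0o022), clearing the group/other
# write bits on everything the archiver creates from here on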
View File
@ -12,9 +12,24 @@ class BaseConfig(TypedDict):
pass
class ConfigDict(BaseConfig, total=False):
"""
# Regenerate by pasting this quine into `archivebox shell` 🥚
from archivebox.config import ConfigDict, CONFIG_DEFAULTS
print('class ConfigDict(BaseConfig, total=False):')
print(' ' + '"'*3 + ConfigDict.__doc__ + '"'*3)
for section, configs in CONFIG_DEFAULTS.items():
for key, attrs in configs.items():
Type, default = attrs['type'], attrs['default']
if default is None:
print(f' {key}: Optional[{Type.__name__}]')
else:
print(f' {key}: {Type.__name__}')
print()
"""
IS_TTY: bool
USE_COLOR: bool
SHOW_PROGRESS: bool
IN_DOCKER: bool
OUTPUT_DIR: str
CONFIG_FILE: str
@ -22,9 +37,16 @@ class ConfigDict(BaseConfig, total=False):
TIMEOUT: int
MEDIA_TIMEOUT: int
OUTPUT_PERMISSIONS: str
FOOTER_INFO: str
URL_BLACKLIST: Optional[str]
SECRET_KEY: str
ALLOWED_HOSTS: str
DEBUG: bool
PUBLIC_INDEX: bool
PUBLIC_SNAPSHOTS: bool
FOOTER_INFO: str
ACTIVE_THEME: str
SAVE_TITLE: bool
SAVE_FAVICON: bool
SAVE_WGET: bool
@ -32,14 +54,17 @@ class ConfigDict(BaseConfig, total=False):
SAVE_PDF: bool
SAVE_SCREENSHOT: bool
SAVE_DOM: bool
SAVE_SINGLEFILE: bool
SAVE_WARC: bool
SAVE_GIT: bool
SAVE_MEDIA: bool
SAVE_PLAYLISTS: bool
SAVE_ARCHIVE_DOT_ORG: bool
RESOLUTION: str
GIT_DOMAINS: str
CHECK_SSL_VALIDITY: bool
CURL_USER_AGENT: str
WGET_USER_AGENT: str
CHROME_USER_AGENT: str
COOKIES_FILE: Optional[str]
@ -52,12 +77,14 @@ class ConfigDict(BaseConfig, total=False):
USE_GIT: bool
USE_CHROME: bool
USE_YOUTUBEDL: bool
USE_SINGLEFILE: bool
CURL_BINARY: Optional[str]
GIT_BINARY: Optional[str]
WGET_BINARY: Optional[str]
YOUTUBEDL_BINARY: Optional[str]
CHROME_BINARY: Optional[str]
SINGLEFILE_BINARY: Optional[str]
TERM_WIDTH: Callable[[], int]
USER: str
View File
@ -1,17 +1,202 @@
__package__ = 'archivebox.core'
from io import StringIO
from contextlib import redirect_stdout
from pathlib import Path
from django.contrib import admin
from django.urls import path
from django.utils.html import format_html
from django.utils.safestring import mark_safe
from django.shortcuts import render, redirect
from django.contrib.auth import get_user_model
from core.models import Snapshot
from core.forms import AddLinkForm
from util import htmldecode, urldecode, ansi_to_html
from logging_util import printable_filesize
from main import add, remove
from config import OUTPUT_DIR
from extractors import archive_links
# TODO: https://stackoverflow.com/questions/40760880/add-custom-button-to-django-admin-panel
def update_snapshots(modeladmin, request, queryset):
archive_links([
snapshot.as_link()
for snapshot in queryset
], out_dir=OUTPUT_DIR)
update_snapshots.short_description = "Archive"
def update_titles(modeladmin, request, queryset):
archive_links([
snapshot.as_link()
for snapshot in queryset
], overwrite=True, methods=('title',), out_dir=OUTPUT_DIR)
update_titles.short_description = "Pull title"
def overwrite_snapshots(modeladmin, request, queryset):
archive_links([
snapshot.as_link()
for snapshot in queryset
], overwrite=True, out_dir=OUTPUT_DIR)
overwrite_snapshots.short_description = "Re-archive (overwrite)"
def verify_snapshots(modeladmin, request, queryset):
for snapshot in queryset:
print(snapshot.timestamp, snapshot.url, snapshot.is_archived, snapshot.archive_size, len(snapshot.history))
verify_snapshots.short_description = "Check"
def delete_snapshots(modeladmin, request, queryset):
remove(links=[snapshot.as_link() for snapshot in queryset], yes=True, delete=True, out_dir=OUTPUT_DIR)
delete_snapshots.short_description = "Delete"
class SnapshotAdmin(admin.ModelAdmin):
list_display = ('timestamp', 'short_url', 'title', 'is_archived', 'num_outputs', 'added', 'updated', 'url_hash')
readonly_fields = ('num_outputs', 'is_archived', 'added', 'updated', 'bookmarked')
fields = ('url', 'timestamp', 'title', 'tags', *readonly_fields)
list_display = ('added', 'title_str', 'url_str', 'files', 'size')
sort_fields = ('title_str', 'url_str', 'added')
readonly_fields = ('id', 'url', 'timestamp', 'num_outputs', 'is_archived', 'url_hash', 'added', 'updated')
search_fields = ('url', 'timestamp', 'title', 'tags')
fields = ('title', 'tags', *readonly_fields)
list_filter = ('added', 'updated', 'tags')
ordering = ['-added']
actions = [delete_snapshots, overwrite_snapshots, update_snapshots, update_titles, verify_snapshots]
actions_template = 'admin/actions_as_select.html'
def short_url(self, obj):
return obj.url[:64]
def id_str(self, obj):
return format_html(
'<code style="font-size: 10px">{}</code>',
obj.url_hash[:8],
)
def updated(self, obj):
return obj.isoformat()
def title_str(self, obj):
canon = obj.as_link().canonical_outputs()
tags = ''.join(
format_html('<span>{}</span>', tag.strip())
for tag in obj.tags.split(',')
) if obj.tags else ''
return format_html(
'<a href="/{}">'
'<img src="/{}/{}" class="favicon" onerror="this.remove()">'
'</a>'
'<a href="/{}/{}">'
'<b class="status-{}">{}</b>'
'</a>',
obj.archive_path,
obj.archive_path, canon['favicon_path'],
obj.archive_path, canon['wget_path'] or '',
'fetched' if obj.latest_title or obj.title else 'pending',
urldecode(htmldecode(obj.latest_title or obj.title or ''))[:128] or 'Pending...'
) + mark_safe(f'<span class="tags">{tags}</span>')
def files(self, obj):
link = obj.as_link()
canon = link.canonical_outputs()
out_dir = Path(link.link_dir)
link_tuple = lambda link, method: (link.archive_path, canon[method] or '', canon[method] and (out_dir / (canon[method] or 'notdone')).exists())
return format_html(
'<span class="files-icons" style="font-size: 1.2em; opacity: 0.8">'
'<a href="/{}/{}/" class="exists-{}" title="Wget clone">🌐 </a> '
'<a href="/{}/{}" class="exists-{}" title="PDF">📄</a> '
'<a href="/{}/{}" class="exists-{}" title="Screenshot">🖥 </a> '
'<a href="/{}/{}" class="exists-{}" title="HTML dump">🅷 </a> '
'<a href="/{}/{}/" class="exists-{}" title="WARC">🆆 </a> '
'<a href="/{}/{}" class="exists-{}" title="SingleFile">&#128476; </a>'
'<a href="/{}/{}/" class="exists-{}" title="Media files">📼 </a> '
'<a href="/{}/{}/" class="exists-{}" title="Git repos">📦 </a> '
'<a href="{}" class="exists-{}" title="Archive.org snapshot">🏛 </a> '
'</span>',
*link_tuple(link, 'wget_path'),
*link_tuple(link, 'pdf_path'),
*link_tuple(link, 'screenshot_path'),
*link_tuple(link, 'dom_path'),
*link_tuple(link, 'warc_path')[:2], any((out_dir / canon['warc_path']).glob('*.warc.gz')),
*link_tuple(link, 'singlefile_path'),
*link_tuple(link, 'media_path')[:2], any((out_dir / canon['media_path']).glob('*')),
*link_tuple(link, 'git_path')[:2], any((out_dir / canon['git_path']).glob('*')),
canon['archive_org_path'], (out_dir / 'archive.org.txt').exists(),
)
def size(self, obj):
return format_html(
'<a href="/{}" title="View all files">{}</a>',
obj.archive_path,
printable_filesize(obj.archive_size) if obj.archive_size else 'pending',
)
def url_str(self, obj):
return format_html(
'<a href="{}">{}</a>',
obj.url,
obj.url.split('://www.', 1)[-1].split('://', 1)[-1][:64],
)
id_str.short_description = 'ID'
title_str.short_description = 'Title'
url_str.short_description = 'Original URL'
id_str.admin_order_field = 'id'
title_str.admin_order_field = 'title'
url_str.admin_order_field = 'url'
class ArchiveBoxAdmin(admin.AdminSite):
site_header = 'ArchiveBox'
index_title = 'Links'
site_title = 'Index'
def get_urls(self):
return [
path('core/snapshot/add/', self.add_view, name='Add'),
] + super().get_urls()
def add_view(self, request):
if not request.user.is_authenticated:
return redirect(f'/admin/login/?next={request.path}')
request.current_app = self.name
context = {
**self.each_context(request),
'title': 'Add URLs',
}
if request.method == 'GET':
context['form'] = AddLinkForm()
elif request.method == 'POST':
form = AddLinkForm(request.POST)
if form.is_valid():
url = form.cleaned_data["url"]
print(f'[+] Adding URL: {url}')
depth = 0 if form.cleaned_data["depth"] == "0" else 1
input_kwargs = {
"urls": url,
"depth": depth,
"update_all": False,
"out_dir": OUTPUT_DIR,
}
add_stdout = StringIO()
with redirect_stdout(add_stdout):
add(**input_kwargs)
print(add_stdout.getvalue())
context.update({
"stdout": ansi_to_html(add_stdout.getvalue().strip()),
"form": AddLinkForm()
})
else:
context["form"] = form
return render(template_name='add_links.html', request=request, context=context)
admin.site = ArchiveBoxAdmin()
admin.site.register(get_user_model())
admin.site.register(Snapshot, SnapshotAdmin)
admin.site.disable_action('delete_selected')
14
archivebox/core/forms.py Normal file
View File
@ -0,0 +1,14 @@
__package__ = 'archivebox.core'
from django import forms
from ..util import URL_REGEX
CHOICES = (
('0', 'depth = 0 (archive just these URLs)'),
('1', 'depth = 1 (archive these URLs and all URLs one hop away)'),
)
class AddLinkForm(forms.Form):
url = forms.RegexField(label="URLs (one per line)", regex=URL_REGEX, min_length='6', strip=True, widget=forms.Textarea, required=True)
depth = forms.ChoiceField(label="Archive depth", choices=CHOICES, widget=forms.RadioSelect, initial='0')
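# Quick illustration of how the form validates input (hypothetical values,
# run inside a configured Django context such as `archivebox shell`):
form = AddLinkForm(data={'url': 'https://example.com', 'depth': '0'})
assert form.is_valid()
form = AddLinkForm(data={'url': 'not-a-url', 'depth': '0'})
assert not form.is_valid()   # URL_REGEX finds no http(s):// URL in the input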
View File
@ -0,0 +1,18 @@
# Generated by Django 3.0.7 on 2020-06-25 15:21
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', '0001_initial'),
]
operations = [
migrations.AlterField(
model_name='snapshot',
name='timestamp',
field=models.CharField(default=None, max_length=32, null=True),
),
]
View File
@ -0,0 +1,38 @@
# Generated by Django 3.0.7 on 2020-06-30 10:34
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', '0002_auto_20200625_1521'),
]
operations = [
migrations.AlterField(
model_name='snapshot',
name='added',
field=models.DateTimeField(auto_now_add=True, db_index=True),
),
migrations.AlterField(
model_name='snapshot',
name='tags',
field=models.CharField(db_index=True, default=None, max_length=256, null=True),
),
migrations.AlterField(
model_name='snapshot',
name='timestamp',
field=models.CharField(db_index=True, default=None, max_length=32, null=True),
),
migrations.AlterField(
model_name='snapshot',
name='title',
field=models.CharField(db_index=True, default=None, max_length=128, null=True),
),
migrations.AlterField(
model_name='snapshot',
name='updated',
field=models.DateTimeField(db_index=True, default=None, null=True),
),
]
View File
@ -0,0 +1,19 @@
# Generated by Django 3.0.7 on 2020-07-13 15:52
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', '0003_auto_20200630_1034'),
]
operations = [
migrations.AlterField(
model_name='snapshot',
name='timestamp',
field=models.CharField(db_index=True, default=None, max_length=32, unique=True),
preserve_default=False,
),
]
View File
@ -0,0 +1,28 @@
# Generated by Django 3.0.7 on 2020-07-28 03:26
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('core', '0004_auto_20200713_1552'),
]
operations = [
migrations.AlterField(
model_name='snapshot',
name='tags',
field=models.CharField(blank=True, db_index=True, max_length=256, null=True),
),
migrations.AlterField(
model_name='snapshot',
name='title',
field=models.CharField(blank=True, db_index=True, max_length=128, null=True),
),
migrations.AlterField(
model_name='snapshot',
name='updated',
field=models.DateTimeField(blank=True, db_index=True, null=True),
),
]
View File
@ -3,6 +3,7 @@ __package__ = 'archivebox.core'
import uuid
from django.db import models
from django.utils.functional import cached_property
from ..util import parse_date
from ..index.schema import Link
@ -12,22 +13,24 @@ class Snapshot(models.Model):
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
url = models.URLField(unique=True)
timestamp = models.CharField(unique=True, max_length=32, null=True, default=None)
timestamp = models.CharField(max_length=32, unique=True, db_index=True)
title = models.CharField(max_length=128, null=True, default=None)
tags = models.CharField(max_length=256, null=True, default=None)
title = models.CharField(max_length=128, null=True, blank=True, db_index=True)
tags = models.CharField(max_length=256, null=True, blank=True, db_index=True)
created = models.DateTimeField(auto_now_add=True)
updated = models.DateTimeField(null=True, default=None)
added = models.DateTimeField(auto_now_add=True, db_index=True)
updated = models.DateTimeField(null=True, blank=True, db_index=True)
# bookmarked = models.DateTimeField()
keys = ('url', 'timestamp', 'title', 'tags', 'updated')
def __repr__(self) -> str:
return f'[{self.timestamp}] {self.url[:64]} ({self.title[:64]})'
title = self.title or '-'
return f'[{self.timestamp}] {self.url[:64]} ({title[:64]})'
def __str__(self) -> str:
return f'[{self.timestamp}] {self.url[:64]} ({self.title[:64]})'
title = self.title or '-'
return f'[{self.timestamp}] {self.url[:64]} ({title[:64]})'
@classmethod
def from_json(cls, info: dict):
@ -44,30 +47,52 @@ class Snapshot(models.Model):
def as_link(self) -> Link:
return Link.from_json(self.as_json())
@property
@cached_property
def bookmarked(self):
return parse_date(self.timestamp)
@property
@cached_property
def is_archived(self):
return self.as_link().is_archived
@property
@cached_property
def num_outputs(self):
return self.as_link().num_outputs
@property
@cached_property
def url_hash(self):
return self.as_link().url_hash
@property
@cached_property
def base_url(self):
return self.as_link().base_url
@property
@cached_property
def link_dir(self):
return self.as_link().link_dir
@cached_property
def archive_path(self):
return self.as_link().archive_path
@cached_property
def archive_size(self):
return self.as_link().archive_size
@cached_property
def history(self):
from ..index import load_link_details
return load_link_details(self.as_link()).history
@cached_property
def latest_title(self):
if ('title' in self.history
and self.history['title']
and (self.history['title'][-1].status == 'succeeded')
and self.history['title'][-1].output.strip()):
return self.history['title'][-1].output.strip()
return None
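# Why these switched from @property to @cached_property: as_link() rebuilds a
# Link (and history reads link details from disk), so the result is now
# computed once per instance and memoized in __dict__ instead of recomputed
# on every attribute access. A sketch with a hypothetical class:
class Example:
    @cached_property
    def expensive(self):
        print('computing...')
        return 42

e = Example()
e.expensive   # prints 'computing...' once, returns 42
e.expensive   # served from e.__dict__, no recompute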
class SnapshotResult(models.Model):
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
View File
@ -2,10 +2,7 @@ __package__ = 'archivebox.core'
import os
import sys
SECRET_KEY = '---------------- not a valid secret key ! ----------------'
DEBUG = os.getenv('DEBUG', 'False').lower() == 'true'
ALLOWED_HOSTS = ['*']
from django.utils.crypto import get_random_string
IS_PUBLIC = True # whether archive data requires logging in to view
@ -14,20 +11,29 @@ OUTPUT_DIR = os.path.abspath(os.getenv('OUTPUT_DIR', os.curdir))
ARCHIVE_DIR = os.path.join(OUTPUT_DIR, 'archive')
DATABASE_FILE = os.path.join(OUTPUT_DIR, 'index.sqlite3')
ACTIVE_THEME = 'default'
from ..config import ( # noqa: F401
DEBUG,
SECRET_KEY,
ALLOWED_HOSTS,
PYTHON_DIR,
ACTIVE_THEME,
SQL_INDEX_FILENAME,
OUTPUT_DIR,
)
ALLOWED_HOSTS = ALLOWED_HOSTS.split(',')
IS_SHELL = 'shell' in sys.argv[:3] or 'shell_plus' in sys.argv[:3]
APPEND_SLASH = True
SECRET_KEY = SECRET_KEY or get_random_string(50, 'abcdefghijklmnopqrstuvwxyz0123456789-_+!.')
INSTALLED_APPS = [
'django.contrib.auth',
'django.contrib.contenttypes',
'django.contrib.sessions',
# 'django.contrib.sites',
'django.contrib.messages',
'django.contrib.admin',
'django.contrib.staticfiles',
'django.contrib.admin',
'core',
@ -42,17 +48,17 @@ MIDDLEWARE = [
'django.middleware.csrf.CsrfViewMiddleware',
'django.contrib.auth.middleware.AuthenticationMiddleware',
'django.contrib.messages.middleware.MessageMiddleware',
# 'django.middleware.clickjacking.XFrameOptionsMiddleware',
]
ROOT_URLCONF = 'core.urls'
APPEND_SLASH = True
TEMPLATES = [
{
'BACKEND': 'django.template.backends.django.DjangoTemplates',
'DIRS': [
os.path.join(REPO_DIR, 'themes', ACTIVE_THEME),
os.path.join(REPO_DIR, 'themes', 'default'),
os.path.join(REPO_DIR, 'themes'),
os.path.join(PYTHON_DIR, 'themes', ACTIVE_THEME),
os.path.join(PYTHON_DIR, 'themes', 'default'),
os.path.join(PYTHON_DIR, 'themes'),
],
'APP_DIRS': True,
'OPTIONS': {
@ -71,7 +77,7 @@ WSGI_APPLICATION = 'core.wsgi.application'
DATABASES = {
'default': {
'ENGINE': 'django.db.backends.sqlite3',
'NAME': DATABASE_FILE,
'NAME': os.path.join(OUTPUT_DIR, SQL_INDEX_FILENAME),
}
}
@ -106,25 +112,23 @@ SHELL_PLUS_PRINT_SQL = False
IPYTHON_ARGUMENTS = ['--no-confirm-exit', '--no-banner']
IPYTHON_KERNEL_DISPLAY_NAME = 'ArchiveBox Django Shell'
if IS_SHELL:
os.environ['PYTHONSTARTUP'] = os.path.join(REPO_DIR, 'core', 'welcome_message.py')
os.environ['PYTHONSTARTUP'] = os.path.join(PYTHON_DIR, 'core', 'welcome_message.py')
LANGUAGE_CODE = 'en-us'
TIME_ZONE = 'UTC'
USE_I18N = True
USE_L10N = True
USE_I18N = False
USE_L10N = False
USE_TZ = False
DATETIME_FORMAT = 'Y-m-d g:iA'
SHORT_DATETIME_FORMAT = 'Y-m-d h:iA'
EMAIL_BACKEND = 'django.core.mail.backends.console.EmailBackend'
STATIC_URL = '/static/'
STATICFILES_DIRS = [
os.path.join(REPO_DIR, 'themes', ACTIVE_THEME, 'static'),
os.path.join(REPO_DIR, 'themes', 'default', 'static'),
os.path.join(REPO_DIR, 'themes', 'static'),
os.path.join(PYTHON_DIR, 'themes', ACTIVE_THEME, 'static'),
os.path.join(PYTHON_DIR, 'themes', 'default', 'static'),
]
SERVE_STATIC = True
View File
@ -1,3 +1,3 @@
from django.test import TestCase
#from django.test import TestCase
# Create your tests here.
View File
@ -3,28 +3,32 @@ from django.contrib import admin
from django.urls import path, include
from django.views import static
from django.conf import settings
from django.contrib.staticfiles import views
from django.views.generic.base import RedirectView
from core.views import MainIndex, AddLinks, LinkDetails
from core.views import MainIndex, OldIndex, LinkDetails
admin.site.site_header = 'ArchiveBox Admin'
admin.site.index_title = 'Archive Administration'
# print('DEBUG', settings.DEBUG)
urlpatterns = [
path('index.html', RedirectView.as_view(url='/')),
path('index.json', static.serve, {'document_root': settings.OUTPUT_DIR, 'path': 'index.json'}),
path('robots.txt', static.serve, {'document_root': settings.OUTPUT_DIR, 'path': 'robots.txt'}),
path('favicon.ico', static.serve, {'document_root': settings.OUTPUT_DIR, 'path': 'favicon.ico'}),
path('docs/', RedirectView.as_view(url='https://github.com/pirate/ArchiveBox/wiki'), name='Docs'),
path('archive/', RedirectView.as_view(url='/')),
path('archive/<path:path>', LinkDetails.as_view(), name='LinkAssets'),
path('add/', AddLinks.as_view(), name='AddLinks'),
path('add/', RedirectView.as_view(url='/admin/core/snapshot/add/')),
path('accounts/login/', RedirectView.as_view(url='/admin/login/')),
path('accounts/logout/', RedirectView.as_view(url='/admin/logout/')),
path('static/<path>', views.serve),
path('accounts/', include('django.contrib.auth.urls')),
path('admin/', admin.site.urls),
path('old.html', OldIndex.as_view(), name='OldHome'),
path('index.html', RedirectView.as_view(url='/')),
path('index.json', static.serve, {'document_root': settings.OUTPUT_DIR, 'path': 'index.json'}),
path('', MainIndex.as_view(), name='Home'),
]
View File
@ -8,7 +8,13 @@ from django.views import View, static
from core.models import Snapshot
from ..index import load_main_index, load_main_index_meta
from ..config import OUTPUT_DIR, VERSION, FOOTER_INFO
from ..config import (
OUTPUT_DIR,
VERSION,
FOOTER_INFO,
PUBLIC_INDEX,
PUBLIC_SNAPSHOTS,
)
from ..util import base_url
@ -16,6 +22,21 @@ class MainIndex(View):
template = 'main_index.html'
def get(self, request):
if request.user.is_authenticated:
return redirect('/admin/core/snapshot/')
if PUBLIC_INDEX:
return redirect('OldHome')
return redirect(f'/admin/login/?next={request.path}')
class OldIndex(View):
template = 'main_index.html'
def get(self, request):
if PUBLIC_INDEX or request.user.is_authenticated:
all_links = load_main_index(out_dir=OUTPUT_DIR)
meta_info = load_main_index_meta(out_dir=OUTPUT_DIR)
@ -29,23 +50,7 @@ class MainIndex(View):
return render(template_name=self.template, request=request, context=context)
class AddLinks(View):
template = 'add_links.html'
def get(self, request):
context = {}
return render(template_name=self.template, request=request, context=context)
def post(self, request):
import_path = request.POST['url']
# TODO: add the links to the index here using archivebox.main.add
print(f'Adding URL: {import_path}')
return render(template_name=self.template, request=request, context={})
return redirect(f'/admin/login/?next={request.path}')
class LinkDetails(View):
@ -54,6 +59,9 @@ class LinkDetails(View):
if '/' not in path:
return redirect(f'{path}/index.html')
if not request.user.is_authenticated and not PUBLIC_SNAPSHOTS:
return redirect(f'/admin/login/?next={request.path}')
try:
slug, archivefile = path.split('/', 1)
except (IndexError, ValueError):
@ -64,7 +72,10 @@ class LinkDetails(View):
# slug is a timestamp
by_ts = {page.timestamp: page for page in all_pages}
try:
return static.serve(request, archivefile, by_ts[slug].link_dir, show_indexes=True)
# print('SERVING STATICFILE', by_ts[slug].link_dir, request.path, path)
response = static.serve(request, archivefile, document_root=by_ts[slug].link_dir, show_indexes=True)
response["Link"] = f'<{by_ts[slug].url}>; rel="canonical"'
return response
except KeyError:
pass
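
Served snapshot files now carry a canonical `Link` header pointing back at the original URL. A hedged check (the timestamp is illustrative, and it assumes `PUBLIC_SNAPSHOTS` is enabled and the snapshot exists on disk):

```python
from django.test import Client

client = Client()
resp = client.get('/archive/1478739709/index.html')
if resp.status_code == 200:
    # e.g. '<http://www.benstopford.com/...>; rel="canonical"'
    print(resp['Link'])
```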

View File

@ -1,6 +1,5 @@
from cli.logging import log_shell_welcome_msg
from archivebox.logging_util import log_shell_welcome_msg
if __name__ == '__main__':
from main import *
log_shell_welcome_msg()

View File

@ -2,7 +2,7 @@ __package__ = 'archivebox.extractors'
import os
from typing import Optional
from typing import Optional, List, Iterable
from datetime import datetime
from ..index.schema import Link
@ -12,7 +12,10 @@ from ..index import (
patch_main_index,
)
from ..util import enforce_types
from ..cli.logging import (
from ..logging_util import (
log_archiving_started,
log_archiving_paused,
log_archiving_finished,
log_link_archiving_started,
log_link_archiving_finished,
log_archive_method_started,
@ -22,6 +25,7 @@ from ..cli.logging import (
from .title import should_save_title, save_title
from .favicon import should_save_favicon, save_favicon
from .wget import should_save_wget, save_wget
from .singlefile import should_save_singlefile, save_singlefile
from .pdf import should_save_pdf, save_pdf
from .screenshot import should_save_screenshot, save_screenshot
from .dom import should_save_dom, save_dom
@ -29,22 +33,38 @@ from .git import should_save_git, save_git
from .media import should_save_media, save_media
from .archive_org import should_save_archive_dot_org, save_archive_dot_org
@enforce_types
def archive_link(link: Link, overwrite: bool=False, out_dir: Optional[str]=None) -> Link:
"""download the DOM, PDF, and a screenshot into a folder named after the link's timestamp"""
ARCHIVE_METHODS = (
def get_default_archive_methods():
return [
('title', should_save_title, save_title),
('favicon', should_save_favicon, save_favicon),
('wget', should_save_wget, save_wget),
('singlefile', should_save_singlefile, save_singlefile),
('pdf', should_save_pdf, save_pdf),
('screenshot', should_save_screenshot, save_screenshot),
('dom', should_save_dom, save_dom),
('git', should_save_git, save_git),
('media', should_save_media, save_media),
('archive_org', should_save_archive_dot_org, save_archive_dot_org),
)
]
@enforce_types
def ignore_methods(to_ignore: List[str]):
ARCHIVE_METHODS = get_default_archive_methods()
methods = filter(lambda x: x[0] not in to_ignore, ARCHIVE_METHODS)
methods = map(lambda x: x[1], methods)
return list(methods)
@enforce_types
def archive_link(link: Link, overwrite: bool=False, methods: Optional[Iterable[str]]=None, out_dir: Optional[str]=None, skip_index: bool=False) -> Link:
"""download the DOM, PDF, and a screenshot into a folder named after the link's timestamp"""
ARCHIVE_METHODS = get_default_archive_methods()
if methods is not None:
ARCHIVE_METHODS = [
method for method in ARCHIVE_METHODS
if method[1] in methods
]
out_dir = out_dir or link.link_dir
try:
@ -53,6 +73,7 @@ def archive_link(link: Link, overwrite: bool=False, out_dir: Optional[str]=None)
os.makedirs(out_dir)
link = load_link_details(link, out_dir=out_dir)
write_link_details(link, out_dir=out_dir, skip_sql_index=skip_index)
log_link_archiving_started(link, out_dir, is_new)
link = link.overwrite(updated=datetime.now())
stats = {'skipped': 0, 'succeeded': 0, 'failed': 0}
@ -81,7 +102,15 @@ def archive_link(link: Link, overwrite: bool=False, out_dir: Optional[str]=None)
# print(' ', stats)
write_link_details(link, out_dir=link.link_dir)
try:
latest_title = link.history['title'][-1].output.strip()
if latest_title and len(latest_title) >= len(link.title or ''):
link = link.overwrite(title=latest_title)
except Exception:
pass
write_link_details(link, out_dir=out_dir, skip_sql_index=skip_index)
if not skip_index:
patch_main_index(link)
# # If any changes were made, update the main links index json and html
@ -103,3 +132,25 @@ def archive_link(link: Link, overwrite: bool=False, out_dir: Optional[str]=None)
raise
return link
@enforce_types
def archive_links(links: List[Link], overwrite: bool=False, methods: Optional[Iterable[str]]=None, out_dir: Optional[str]=None) -> List[Link]:
if not links:
return []
log_archiving_started(len(links))
idx: int = 0
link: Link = links[0]
try:
for idx, link in enumerate(links):
archive_link(link, overwrite=overwrite, methods=methods, out_dir=link.link_dir)
except KeyboardInterrupt:
log_archiving_paused(len(links), idx, link.timestamp)
raise SystemExit(0)
except BaseException:
print()
raise
log_archiving_finished(len(links))
return links
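
A hedged sketch of the new extractor-selection API defined above; the `Link` constructor arguments assume the v0.4 dataclass signature and the URL/timestamp are illustrative:

```python
from archivebox.extractors import archive_link, archive_links, ignore_methods
from archivebox.index.schema import Link

link = Link(timestamp='1478739709', url='https://example.com',
            title=None, tags=None, sources=[])

# Run every default method except youtube-dl media downloads.
# ignore_methods() returns the matching should_save_* functions, which is
# exactly what archive_link's `methods` filter compares against:
archive_link(link, methods=ignore_methods(['media']))

# Or archive a whole batch, resumable thanks to the KeyboardInterrupt
# handling in archive_links() above:
archive_links([link], overwrite=False)
```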

View File

@ -6,20 +6,20 @@ from typing import Optional, List, Dict, Tuple
from collections import defaultdict
from ..index.schema import Link, ArchiveResult, ArchiveOutput, ArchiveError
from ..system import run, PIPE, DEVNULL, chmod_file
from ..system import run, chmod_file
from ..util import (
enforce_types,
is_static_file,
)
from ..config import (
VERSION,
TIMEOUT,
CHECK_SSL_VALIDITY,
SAVE_ARCHIVE_DOT_ORG,
CURL_BINARY,
CURL_VERSION,
CHECK_SSL_VALIDITY
CURL_USER_AGENT,
)
from ..cli.logging import TimedProgress
from ..logging_util import TimedProgress
@ -45,17 +45,19 @@ def save_archive_dot_org(link: Link, out_dir: Optional[str]=None, timeout: int=T
submit_url = 'https://web.archive.org/save/{}'.format(link.url)
cmd = [
CURL_BINARY,
'--silent',
'--location',
'--head',
'--user-agent', 'ArchiveBox/{} (+https://github.com/pirate/ArchiveBox/)'.format(VERSION), # be nice to the Archive.org people and show them where all this ArchiveBox traffic is coming from
'--compressed',
'--max-time', str(timeout),
*(['--user-agent', '{}'.format(CURL_USER_AGENT)] if CURL_USER_AGENT else []),
*([] if CHECK_SSL_VALIDITY else ['--insecure']),
submit_url,
]
status = 'succeeded'
timer = TimedProgress(timeout, prefix=' ')
try:
result = run(cmd, stdout=PIPE, stderr=DEVNULL, cwd=out_dir, timeout=timeout)
result = run(cmd, cwd=out_dir, timeout=timeout)
content_location, errors = parse_archive_dot_org_response(result.stdout)
if content_location:
archive_org_url = 'https://web.archive.org{}'.format(content_location[0])
@ -105,7 +107,7 @@ def parse_archive_dot_org_response(response: bytes) -> Tuple[List[str], List[str
headers[name.lower().strip()].append(val.strip())
# Get successful archive url in "content-location" header or any errors
content_location = headers['content-location']
content_location = headers.get('content-location', headers['location'])
errors = headers['x-archive-wayback-runtime-error']
return content_location, errors
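
An illustration of the new fallback: when Archive.org omits the `content-location` header, the plain `location` header is used instead (the header blob below is made up for demonstration):

```python
from archivebox.extractors.archive_org import parse_archive_dot_org_response

response = b'location: /web/20200810123456/https://example.com/\r\n'
content_location, errors = parse_archive_dot_org_response(response)
# content_location == ['/web/20200810123456/https://example.com/']
# errors == []
```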

View File

@ -5,7 +5,7 @@ import os
from typing import Optional
from ..index.schema import Link, ArchiveResult, ArchiveOutput, ArchiveError
from ..system import run, PIPE, chmod_file
from ..system import run, chmod_file, atomic_write
from ..util import (
enforce_types,
is_static_file,
@ -16,7 +16,7 @@ from ..config import (
SAVE_DOM,
CHROME_VERSION,
)
from ..cli.logging import TimedProgress
from ..logging_util import TimedProgress
@ -46,8 +46,8 @@ def save_dom(link: Link, out_dir: Optional[str]=None, timeout: int=TIMEOUT) -> A
status = 'succeeded'
timer = TimedProgress(timeout, prefix=' ')
try:
with open(output_path, 'w+') as f:
result = run(cmd, stdout=f, stderr=PIPE, cwd=out_dir, timeout=timeout)
result = run(cmd, cwd=out_dir, timeout=timeout)
atomic_write(output_path, result.stdout)
if result.returncode:
hints = result.stderr.decode()
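
The switch from shell redirection (`stdout=f`) to `atomic_write()` avoids leaving a truncated `output.html` behind if the process dies mid-write. A minimal sketch of the write-then-rename pattern this presumably relies on (not the actual `system.atomic_write` implementation):

```python
import os
import tempfile

def atomic_write_sketch(path: str, contents: bytes) -> None:
    """Write to a temp file in the same directory, then rename into place,
    so readers can never observe a half-written file."""
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or '.')
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(contents)
        os.replace(tmp_path, path)  # atomic on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```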

View File

@ -5,7 +5,7 @@ import os
from typing import Optional
from ..index.schema import Link, ArchiveResult, ArchiveOutput
from ..system import chmod_file, run, PIPE
from ..system import chmod_file, run
from ..util import enforce_types, domain
from ..config import (
TIMEOUT,
@ -13,8 +13,9 @@ from ..config import (
CURL_BINARY,
CURL_VERSION,
CHECK_SSL_VALIDITY,
CURL_USER_AGENT,
)
from ..cli.logging import TimedProgress
from ..logging_util import TimedProgress
@enforce_types
@ -33,17 +34,21 @@ def save_favicon(link: Link, out_dir: Optional[str]=None, timeout: int=TIMEOUT)
output: ArchiveOutput = 'favicon.ico'
cmd = [
CURL_BINARY,
'--silent',
'--max-time', str(timeout),
'--location',
'--compressed',
'--output', str(output),
*(['--user-agent', '{}'.format(CURL_USER_AGENT)] if CURL_USER_AGENT else []),
*([] if CHECK_SSL_VALIDITY else ['--insecure']),
'https://www.google.com/s2/favicons?domain={}'.format(domain(link.url)),
]
status = 'succeeded'
status = 'pending'
timer = TimedProgress(timeout, prefix=' ')
try:
run(cmd, stdout=PIPE, stderr=PIPE, cwd=out_dir, timeout=timeout)
run(cmd, cwd=out_dir, timeout=timeout)
chmod_file(output, cwd=out_dir)
status = 'succeeded'
except Exception as err:
status = 'failed'
output = err

View File

@ -5,7 +5,7 @@ import os
from typing import Optional
from ..index.schema import Link, ArchiveResult, ArchiveOutput, ArchiveError
from ..system import run, PIPE, chmod_file
from ..system import run, chmod_file
from ..util import (
enforce_types,
is_static_file,
@ -22,7 +22,7 @@ from ..config import (
GIT_DOMAINS,
CHECK_SSL_VALIDITY
)
from ..cli.logging import TimedProgress
from ..logging_util import TimedProgress
@ -56,7 +56,6 @@ def save_git(link: Link, out_dir: Optional[str]=None, timeout: int=TIMEOUT) -> A
cmd = [
GIT_BINARY,
'clone',
'--mirror',
'--recursive',
*([] if CHECK_SSL_VALIDITY else ['-c', 'http.sslVerify=false']),
without_query(without_fragment(link.url)),
@ -64,8 +63,7 @@ def save_git(link: Link, out_dir: Optional[str]=None, timeout: int=TIMEOUT) -> A
status = 'succeeded'
timer = TimedProgress(timeout, prefix=' ')
try:
result = run(cmd, stdout=PIPE, stderr=PIPE, cwd=output_path, timeout=timeout + 1)
result = run(cmd, cwd=output_path, timeout=timeout + 1)
if result.returncode == 128:
# ignore failed re-download when the folder already exists
pass

View File

@ -5,7 +5,7 @@ import os
from typing import Optional
from ..index.schema import Link, ArchiveResult, ArchiveOutput, ArchiveError
from ..system import run, PIPE, chmod_file
from ..system import run, chmod_file
from ..util import (
enforce_types,
is_static_file,
@ -13,11 +13,12 @@ from ..util import (
from ..config import (
MEDIA_TIMEOUT,
SAVE_MEDIA,
SAVE_PLAYLISTS,
YOUTUBEDL_BINARY,
YOUTUBEDL_VERSION,
CHECK_SSL_VALIDITY
)
from ..cli.logging import TimedProgress
from ..logging_util import TimedProgress
@enforce_types
@ -45,7 +46,6 @@ def save_media(link: Link, out_dir: Optional[str]=None, timeout: int=MEDIA_TIMEO
'--write-description',
'--write-info-json',
'--write-annotations',
'--yes-playlist',
'--write-thumbnail',
'--no-call-home',
'--no-check-certificate',
@ -59,13 +59,14 @@ def save_media(link: Link, out_dir: Optional[str]=None, timeout: int=MEDIA_TIMEO
'--audio-quality', '320K',
'--embed-thumbnail',
'--add-metadata',
*(['--yes-playlist'] if SAVE_PLAYLISTS else []),
*([] if CHECK_SSL_VALIDITY else ['--no-check-certificate']),
link.url,
]
status = 'succeeded'
timer = TimedProgress(timeout, prefix=' ')
try:
result = run(cmd, stdout=PIPE, stderr=PIPE, cwd=output_path, timeout=timeout + 1)
result = run(cmd, cwd=output_path, timeout=timeout + 1)
chmod_file(output, cwd=out_dir)
if result.returncode:
if (b'ERROR: Unsupported URL' in result.stderr

View File

@ -5,7 +5,7 @@ import os
from typing import Optional
from ..index.schema import Link, ArchiveResult, ArchiveOutput, ArchiveError
from ..system import run, PIPE, chmod_file
from ..system import run, chmod_file
from ..util import (
enforce_types,
is_static_file,
@ -16,7 +16,7 @@ from ..config import (
SAVE_PDF,
CHROME_VERSION,
)
from ..cli.logging import TimedProgress
from ..logging_util import TimedProgress
@enforce_types
@ -45,7 +45,7 @@ def save_pdf(link: Link, out_dir: Optional[str]=None, timeout: int=TIMEOUT) -> A
status = 'succeeded'
timer = TimedProgress(timeout, prefix=' ')
try:
result = run(cmd, stdout=PIPE, stderr=PIPE, cwd=out_dir, timeout=timeout)
result = run(cmd, cwd=out_dir, timeout=timeout)
if result.returncode:
hints = (result.stderr or result.stdout).decode()
@ -58,6 +58,7 @@ def save_pdf(link: Link, out_dir: Optional[str]=None, timeout: int=TIMEOUT) -> A
finally:
timer.end()
return ArchiveResult(
cmd=cmd,
pwd=out_dir,

View File

@ -5,7 +5,7 @@ import os
from typing import Optional
from ..index.schema import Link, ArchiveResult, ArchiveOutput, ArchiveError
from ..system import run, PIPE, chmod_file
from ..system import run, chmod_file
from ..util import (
enforce_types,
is_static_file,
@ -16,7 +16,7 @@ from ..config import (
SAVE_SCREENSHOT,
CHROME_VERSION,
)
from ..cli.logging import TimedProgress
from ..logging_util import TimedProgress
@ -45,7 +45,7 @@ def save_screenshot(link: Link, out_dir: Optional[str]=None, timeout: int=TIMEOU
status = 'succeeded'
timer = TimedProgress(timeout, prefix=' ')
try:
result = run(cmd, stdout=PIPE, stderr=PIPE, cwd=out_dir, timeout=timeout)
result = run(cmd, cwd=out_dir, timeout=timeout)
if result.returncode:
hints = (result.stderr or result.stdout).decode()

View File

@ -0,0 +1,87 @@
__package__ = 'archivebox.extractors'
from pathlib import Path
from typing import Optional
import json
from ..index.schema import Link, ArchiveResult, ArchiveError
from ..system import run, chmod_file
from ..util import (
enforce_types,
is_static_file,
chrome_args,
)
from ..config import (
TIMEOUT,
SAVE_SINGLEFILE,
SINGLEFILE_BINARY,
SINGLEFILE_VERSION,
CHROME_BINARY,
)
from ..logging_util import TimedProgress
@enforce_types
def should_save_singlefile(link: Link, out_dir: Optional[str]=None) -> bool:
out_dir = out_dir or link.link_dir
if is_static_file(link.url):
return False
output = Path(out_dir or link.link_dir) / 'singlefile.html'
return SAVE_SINGLEFILE and (not output.exists())
@enforce_types
def save_singlefile(link: Link, out_dir: Optional[str]=None, timeout: int=TIMEOUT) -> ArchiveResult:
"""download full site using single-file"""
out_dir = out_dir or link.link_dir
output = str(Path(out_dir).absolute() / "singlefile.html")
browser_args = chrome_args(TIMEOUT=0)
# SingleFile CLI Docs: https://github.com/gildas-lormeau/SingleFile/tree/master/cli
cmd = [
SINGLEFILE_BINARY,
'--browser-executable-path={}'.format(CHROME_BINARY),
'--browser-args="{}"'.format(json.dumps(browser_args[1:])),
link.url,
output
]
status = 'succeeded'
timer = TimedProgress(timeout, prefix=' ')
try:
result = run(cmd, cwd=out_dir, timeout=timeout)
# grab the last few lines of stdout/stderr to surface as error hints
# (this block was adapted from the wget extractor's output parsing)
output_tail = [
line.strip()
for line in (result.stdout + result.stderr).decode().rsplit('\n', 3)[-3:]
if line.strip()
]
hints = (
'Got single-file response code: {}.'.format(result.returncode),
*output_tail,
)
# Check for common failure cases
if (result.returncode > 0):
raise ArchiveError('SingleFile was not able to archive the page', hints)
chmod_file(output)
except Exception as err:
status = 'failed'
output = err
finally:
timer.end()
return ArchiveResult(
cmd=cmd,
pwd=out_dir,
cmd_version=SINGLEFILE_VERSION,
output=output,
status=status,
**timer.stats,
)
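
A hedged usage sketch for the new single-file extractor (it requires the `single-file` CLI from npm plus a Chromium binary, and the `Link` values below are illustrative):

```python
from archivebox.extractors.singlefile import should_save_singlefile, save_singlefile
from archivebox.index.schema import Link

link = Link(timestamp='1478739709', url='https://example.com',
            title=None, tags=None, sources=[])

if should_save_singlefile(link, out_dir='/tmp/snapshot'):
    result = save_singlefile(link, out_dir='/tmp/snapshot', timeout=60)
    # on success: status == 'succeeded', output is the path to singlefile.html
    print(result.status, result.output)
```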

View File

@ -12,11 +12,14 @@ from ..util import (
)
from ..config import (
TIMEOUT,
CHECK_SSL_VALIDITY,
SAVE_TITLE,
CURL_BINARY,
CURL_VERSION,
CURL_USER_AGENT,
setup_django,
)
from ..cli.logging import TimedProgress
from ..logging_util import TimedProgress
HTML_TITLE_REGEX = re.compile(
@ -41,13 +44,19 @@ def should_save_title(link: Link, out_dir: Optional[str]=None) -> bool:
def save_title(link: Link, out_dir: Optional[str]=None, timeout: int=TIMEOUT) -> ArchiveResult:
"""try to guess the page's title from its content"""
setup_django(out_dir=out_dir)
from core.models import Snapshot
output: ArchiveOutput = None
cmd = [
CURL_BINARY,
'--silent',
'--max-time', str(timeout),
'--location',
'--compressed',
*(['--user-agent', '{}'.format(CURL_USER_AGENT)] if CURL_USER_AGENT else []),
*([] if CHECK_SSL_VALIDITY else ['--insecure']),
link.url,
'|',
'grep',
'<title',
]
status = 'succeeded'
timer = TimedProgress(timeout, prefix=' ')
@ -55,7 +64,10 @@ def save_title(link: Link, out_dir: Optional[str]=None, timeout: int=TIMEOUT) ->
html = download_url(link.url, timeout=timeout)
match = re.search(HTML_TITLE_REGEX, html)
output = htmldecode(match.group(1).strip()) if match else None
if not output:
if output:
if not link.title or len(output) >= len(link.title):
Snapshot.objects.filter(url=link.url, timestamp=link.timestamp).update(title=output)
else:
raise ArchiveError('Unable to detect page title')
except Exception as err:
status = 'failed'
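
The new title-update rule is "longest title wins": a freshly scraped title only replaces the stored one when it is at least as long. A hedged sketch of that rule (the `HTML_TITLE_REGEX` body is elided in the hunk above, so a simplified pattern stands in here):

```python
import re

# simplified stand-in for the real HTML_TITLE_REGEX
TITLE_RE_SKETCH = re.compile(r'<title[^>]*>([^<]+)</title>', re.IGNORECASE)

def pick_title(existing: str, html: str) -> str:
    match = TITLE_RE_SKETCH.search(html)
    scraped = match.group(1).strip() if match else None
    if scraped and (not existing or len(scraped) >= len(existing)):
        return scraped
    return existing

assert pick_title('Old', '<html><title>Longer New Title</title></html>') == 'Longer New Title'
```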

View File

@ -7,7 +7,7 @@ from typing import Optional
from datetime import datetime
from ..index.schema import Link, ArchiveResult, ArchiveOutput, ArchiveError
from ..system import run, PIPE
from ..system import run, chmod_file
from ..util import (
enforce_types,
is_static_file,
@ -24,13 +24,14 @@ from ..config import (
SAVE_WARC,
WGET_BINARY,
WGET_VERSION,
RESTRICT_FILE_NAMES,
CHECK_SSL_VALIDITY,
SAVE_WGET_REQUISITES,
WGET_AUTO_COMPRESSION,
WGET_USER_AGENT,
COOKIES_FILE,
)
from ..cli.logging import TimedProgress
from ..logging_util import TimedProgress
@enforce_types
@ -66,21 +67,22 @@ def save_wget(link: Link, out_dir: Optional[str]=None, timeout: int=TIMEOUT) ->
'--span-hosts',
'--no-parent',
'-e', 'robots=off',
'--restrict-file-names=windows',
'--timeout={}'.format(timeout),
*([] if SAVE_WARC else ['--timestamping']),
*(['--restrict-file-names={}'.format(RESTRICT_FILE_NAMES)] if RESTRICT_FILE_NAMES else []),
*(['--warc-file={}'.format(warc_path)] if SAVE_WARC else []),
*(['--page-requisites'] if SAVE_WGET_REQUISITES else []),
*(['--user-agent={}'.format(WGET_USER_AGENT)] if WGET_USER_AGENT else []),
*(['--load-cookies', COOKIES_FILE] if COOKIES_FILE else []),
*(['--compression=auto'] if WGET_AUTO_COMPRESSION else []),
*([] if SAVE_WARC else ['--timestamping']),
*([] if CHECK_SSL_VALIDITY else ['--no-check-certificate', '--no-hsts']),
link.url,
]
status = 'succeeded'
timer = TimedProgress(timeout, prefix=' ')
try:
result = run(cmd, stdout=PIPE, stderr=PIPE, cwd=out_dir, timeout=timeout)
result = run(cmd, cwd=out_dir, timeout=timeout)
output = wget_output_path(link)
# parse out number of files downloaded from last line of stderr:
@ -95,22 +97,21 @@ def save_wget(link: Link, out_dir: Optional[str]=None, timeout: int=TIMEOUT) ->
if 'Downloaded:' in output_tail[-1]
else 0
)
# Check for common failure cases
if result.returncode > 0 and files_downloaded < 1:
hints = (
'Got wget response code: {}.'.format(result.returncode),
*output_tail,
)
# Check for common failure cases
if (result.returncode > 0 and files_downloaded < 1) or output is None:
if b'403: Forbidden' in result.stderr:
raise ArchiveError('403 Forbidden (try changing WGET_USER_AGENT)', hints)
if b'404: Not Found' in result.stderr:
raise ArchiveError('404 Not Found', hints)
if b'ERROR 500: Internal Server Error' in result.stderr:
raise ArchiveError('500 Internal Server Error', hints)
raise ArchiveError('Got an error from the server', hints)
# chmod_file(output, cwd=out_dir)
raise ArchiveError('Wget failed or got an error from the server', hints)
chmod_file(output, cwd=out_dir)
except Exception as err:
status = 'failed'
output = err
@ -134,7 +135,6 @@ def wget_output_path(link: Link) -> Optional[str]:
See docs on wget --adjust-extension (-E)
"""
if is_static_file(link.url):
return without_scheme(without_fragment(link.url))
@ -172,10 +172,9 @@ def wget_output_path(link: Link) -> Optional[str]:
full_path = without_fragment(without_query(path(link.url))).strip('/')
search_dir = os.path.join(
link.link_dir,
domain(link.url),
domain(link.url).replace(":", "+"),
urldecode(full_path),
)
for _ in range(4):
if os.path.exists(search_dir):
if os.path.isdir(search_dir):
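
Why the `':'` to `'+'` substitution above matters: with `--restrict-file-names=windows`, wget writes a host with a port like `example.com:8080` to disk as `example.com+8080`, so the index lookup has to search the substituted directory name:

```python
# illustrative: the on-disk directory wget creates for a URL with a port
assert 'example.com:8080'.replace(':', '+') == 'example.com+8080'
```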

View File

@ -26,15 +26,16 @@ from ..config import (
URL_BLACKLIST_PTN,
ANSI,
stderr,
OUTPUT_PERMISSIONS
)
from ..cli.logging import (
from ..logging_util import (
TimedProgress,
log_indexing_process_started,
log_indexing_process_finished,
log_indexing_started,
log_indexing_finished,
log_parsing_started,
log_parsing_finished,
log_deduping_finished,
)
from .schema import Link, ArchiveResult
@ -51,6 +52,7 @@ from .json import (
from .sql import (
write_sql_main_index,
parse_sql_main_index,
write_sql_link_details,
)
### Link filtering and checking
@ -231,6 +233,8 @@ def write_main_index(links: List[Link], out_dir: str=OUTPUT_DIR, finished: bool=
with timed_index_update(os.path.join(out_dir, SQL_INDEX_FILENAME)):
write_sql_main_index(links, out_dir=out_dir)
os.chmod(os.path.join(out_dir, SQL_INDEX_FILENAME), int(OUTPUT_PERMISSIONS, base=8)) # set here because we don't write it with atomic writes
with timed_index_update(os.path.join(out_dir, JSON_INDEX_FILENAME)):
write_json_main_index(links, out_dir=out_dir)
@ -267,20 +271,29 @@ def load_main_index_meta(out_dir: str=OUTPUT_DIR) -> Optional[dict]:
return None
@enforce_types
def import_new_links(existing_links: List[Link],
import_path: str,
out_dir: str=OUTPUT_DIR) -> Tuple[List[Link], List[Link]]:
def parse_links_from_source(source_path: str) -> List[Link]:
from ..parsers import parse_links
new_links: List[Link] = []
# parse and validate the import file
log_parsing_started(import_path)
raw_links, parser_name = parse_links(import_path)
raw_links, parser_name = parse_links(source_path)
new_links = validate_links(raw_links)
if parser_name:
num_parsed = len(raw_links)
log_parsing_finished(num_parsed, parser_name)
return new_links
@enforce_types
def dedupe_links(existing_links: List[Link],
new_links: List[Link]) -> Tuple[List[Link], List[Link]]:
# merge existing links in out_dir and new links
all_links = validate_links(existing_links + new_links)
all_link_urls = {link.url for link in existing_links}
@ -290,10 +303,11 @@ def import_new_links(existing_links: List[Link],
if link.url not in all_link_urls
]
if parser_name:
num_parsed = len(raw_links)
num_new_links = len(all_links) - len(existing_links)
log_parsing_finished(num_parsed, num_new_links, parser_name)
all_links_deduped = {link.url: link for link in all_links}
for i in range(len(new_links)):
if new_links[i].url in all_links_deduped.keys():
new_links[i] = all_links_deduped[new_links[i].url]
log_deduping_finished(len(new_links))
return all_links, new_links
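
The old monolithic `import_new_links()` is now two composable steps, parsing and deduping. A hedged sketch of the pipeline (paths illustrative; `parse_links_from_source` reads a previously saved source file):

```python
from archivebox.index import load_main_index, parse_links_from_source, dedupe_links

existing = list(load_main_index(out_dir='/path/to/collection'))
new = parse_links_from_source('/path/to/collection/sources/2020-08-10-import.txt')

# merges, validates, and swaps each new link for its deduped counterpart:
all_links, new_links = dedupe_links(existing, new)
```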
@ -325,7 +339,8 @@ def patch_main_index(link: Link, out_dir: str=OUTPUT_DIR) -> None:
# Patch HTML main index
html_path = os.path.join(out_dir, 'index.html')
with open(html_path, 'r') as f:
html = f.read().split('\n')
html = f.read().splitlines()
for idx, line in enumerate(html):
if title and ('<span data-title-for="{}"'.format(link.url) in line):
html[idx] = '<span>{}</span>'.format(title)
@ -333,17 +348,19 @@ def patch_main_index(link: Link, out_dir: str=OUTPUT_DIR) -> None:
html[idx] = '<span>{}</span>'.format(successful)
break
atomic_write('\n'.join(html), html_path)
atomic_write(html_path, '\n'.join(html))
### Link Details Index
@enforce_types
def write_link_details(link: Link, out_dir: Optional[str]=None) -> None:
def write_link_details(link: Link, out_dir: Optional[str]=None, skip_sql_index: bool=False) -> None:
out_dir = out_dir or link.link_dir
write_json_link_details(link, out_dir=out_dir)
write_html_link_details(link, out_dir=out_dir)
if not skip_sql_index:
write_sql_link_details(link)
@enforce_types
@ -512,6 +529,14 @@ def get_unrecognized_folders(links, out_dir: str=OUTPUT_DIR) -> Dict[str, Option
link = None
try:
link = parse_json_link_details(entry.path)
except KeyError:
# Try to fix index
if index_exists:
try:
# Last attempt to repair the detail index
link_guessed = parse_json_link_details(entry.path, guess=True)
write_json_link_details(link_guessed, out_dir=entry.path)
link = parse_json_link_details(entry.path)
except Exception:
pass
@ -538,7 +563,7 @@ def is_valid(link: Link) -> bool:
return False
if dir_exists and index_exists:
try:
parsed_link = parse_json_link_details(link.link_dir)
parsed_link = parse_json_link_details(link.link_dir, guess=True)
return link.url == parsed_link.url
except Exception:
pass
@ -569,7 +594,10 @@ def fix_invalid_folder_locations(out_dir: str=OUTPUT_DIR) -> Tuple[List[str], Li
for entry in os.scandir(os.path.join(out_dir, ARCHIVE_DIR_NAME)):
if entry.is_dir(follow_symlinks=True):
if os.path.exists(os.path.join(entry.path, 'index.json')):
try:
link = parse_json_link_details(entry.path)
except KeyError:
link = None
if not link:
continue

View File

@ -41,7 +41,7 @@ TITLE_LOADING_MSG = 'Not yet archived...'
def parse_html_main_index(out_dir: str=OUTPUT_DIR) -> Iterator[str]:
"""parse an archive index html file and return the list of urls"""
index_path = os.path.join(out_dir, HTML_INDEX_FILENAME)
index_path = join(out_dir, HTML_INDEX_FILENAME)
if os.path.exists(index_path):
with open(index_path, 'r', encoding='utf-8') as f:
for line in f:
@ -58,7 +58,7 @@ def write_html_main_index(links: List[Link], out_dir: str=OUTPUT_DIR, finished:
copy_and_overwrite(join(TEMPLATES_DIR, STATIC_DIR_NAME), join(out_dir, STATIC_DIR_NAME))
rendered_html = main_index_template(links, finished=finished)
atomic_write(rendered_html, join(out_dir, HTML_INDEX_FILENAME))
atomic_write(join(out_dir, HTML_INDEX_FILENAME), rendered_html)
@enforce_types
@ -90,7 +90,7 @@ def main_index_row_template(link: Link) -> str:
**link._asdict(extended=True),
# before pages are finished archiving, show loading msg instead of title
'title': (
'title': htmlencode(
link.title
or (link.base_url if link.is_archived else TITLE_LOADING_MSG)
),
@ -116,7 +116,7 @@ def write_html_link_details(link: Link, out_dir: Optional[str]=None) -> None:
out_dir = out_dir or link.link_dir
rendered_html = link_details_template(link)
atomic_write(rendered_html, join(out_dir, HTML_INDEX_FILENAME))
atomic_write(join(out_dir, HTML_INDEX_FILENAME), rendered_html)
@enforce_types
@ -129,15 +129,15 @@ def link_details_template(link: Link) -> str:
return render_legacy_template(LINK_DETAILS_TEMPLATE, {
**link_info,
**link_info['canonical'],
'title': (
'title': htmlencode(
link.title
or (link.base_url if link.is_archived else TITLE_LOADING_MSG)
),
'url_str': htmlencode(urldecode(link.base_url)),
'archive_url': urlencode(
wget_output_path(link)
or (link.domain if link.is_archived else 'about:blank')
),
or (link.domain if link.is_archived else '')
) or 'about:blank',
'extension': link.extension or 'html',
'tags': link.tags or 'untagged',
'status': 'archived' if link.is_archived else 'not yet archived',

View File

@ -3,6 +3,7 @@ __package__ = 'archivebox.index'
import os
import sys
import json as pyjson
from pathlib import Path
from datetime import datetime
from typing import List, Optional, Iterator, Any
@ -18,6 +19,7 @@ from ..config import (
DEPENDENCIES,
JSON_INDEX_FILENAME,
ARCHIVE_DIR_NAME,
ANSI
)
@ -37,7 +39,6 @@ MAIN_INDEX_HEADER = {
},
}
### Main Links Index
@enforce_types
@ -49,8 +50,19 @@ def parse_json_main_index(out_dir: str=OUTPUT_DIR) -> Iterator[Link]:
with open(index_path, 'r', encoding='utf-8') as f:
links = pyjson.load(f)['links']
for link_json in links:
try:
yield Link.from_json(link_json)
except KeyError:
try:
detail_index_path = Path(OUTPUT_DIR) / ARCHIVE_DIR_NAME / link_json['timestamp']
yield parse_json_link_details(str(detail_index_path))
except KeyError:
# as a last effort, try to guess the missing values out of existing ones
try:
yield Link.from_json(link_json, guess=True)
except KeyError:
print(" {lightyellow}! Failed to load the index.json from {}".format(detail_index_path, **ANSI))
continue
return ()
@enforce_types
@ -74,7 +86,7 @@ def write_json_main_index(links: List[Link], out_dir: str=OUTPUT_DIR) -> None:
'last_run_cmd': sys.argv,
'links': links,
}
atomic_write(main_index_json, os.path.join(out_dir, JSON_INDEX_FILENAME))
atomic_write(os.path.join(out_dir, JSON_INDEX_FILENAME), main_index_json)
### Link Details Index
@ -85,19 +97,18 @@ def write_json_link_details(link: Link, out_dir: Optional[str]=None) -> None:
out_dir = out_dir or link.link_dir
path = os.path.join(out_dir, JSON_INDEX_FILENAME)
atomic_write(link._asdict(extended=True), path)
atomic_write(path, link._asdict(extended=True))
@enforce_types
def parse_json_link_details(out_dir: str) -> Optional[Link]:
def parse_json_link_details(out_dir: str, guess: Optional[bool]=False) -> Optional[Link]:
"""load the json link index from a given directory"""
existing_index = os.path.join(out_dir, JSON_INDEX_FILENAME)
if os.path.exists(existing_index):
with open(existing_index, 'r', encoding='utf-8') as f:
try:
link_json = pyjson.load(f)
return Link.from_json(link_json)
return Link.from_json(link_json, guess)
except pyjson.JSONDecodeError:
pass
return None
@ -110,7 +121,10 @@ def parse_json_links_details(out_dir: str) -> Iterator[Link]:
for entry in os.scandir(os.path.join(out_dir, ARCHIVE_DIR_NAME)):
if entry.is_dir(follow_symlinks=True):
if os.path.exists(os.path.join(entry.path, 'index.json')):
try:
link = parse_json_link_details(entry.path)
except KeyError:
link = None
if link:
yield link
@ -149,5 +163,3 @@ class ExtendedEncoder(pyjson.JSONEncoder):
def to_json(obj: Any, indent: Optional[int]=4, sort_keys: bool=True, cls=ExtendedEncoder) -> str:
return pyjson.dumps(obj, indent=indent, sort_keys=sort_keys, cls=ExtendedEncoder)
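
`to_json()` leans on `ExtendedEncoder` to serialize datetimes and other non-JSON-native types instead of raising `TypeError`; a quick illustration (the exact string representation depends on `ExtendedEncoder`):

```python
from datetime import datetime
from archivebox.index.json import to_json

print(to_json({'url': 'https://example.com',
               'updated': datetime(2020, 8, 10, 23, 36)}))
```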

View File

@ -1,14 +1,19 @@
__package__ = 'archivebox.index'
import os
from pathlib import Path
from datetime import datetime
from datetime import datetime, timedelta
from typing import List, Dict, Any, Optional, Union
from dataclasses import dataclass, asdict, field, fields
from ..system import get_dir_size
from ..config import OUTPUT_DIR, ARCHIVE_DIR_NAME
class ArchiveError(Exception):
def __init__(self, message, hints=None):
super().__init__(message)
@ -49,7 +54,15 @@ class ArchiveResult:
assert self.output
@classmethod
def from_json(cls, json_info):
def guess_ts(_cls, dict_info):
from ..util import parse_date
parsed_timestamp = parse_date(dict_info["timestamp"])
start_ts = parsed_timestamp
end_ts = parsed_timestamp + timedelta(seconds=int(dict_info["duration"]))
return start_ts, end_ts
@classmethod
def from_json(cls, json_info, guess=False):
from ..util import parse_date
info = {
@ -57,8 +70,25 @@ class ArchiveResult:
for key, val in json_info.items()
if key in cls.field_names()
}
if guess:
keys = info.keys()
if "start_ts" not in keys:
info["start_ts"], info["end_ts"] = cls.guess_ts(json_info)
else:
info['start_ts'] = parse_date(info['start_ts'])
info['end_ts'] = parse_date(info['end_ts'])
if "pwd" not in keys:
info["pwd"] = str(Path(OUTPUT_DIR) / ARCHIVE_DIR_NAME / json_info["timestamp"])
if "cmd_version" not in keys:
info["cmd_version"] = "Undefined"
if "cmd" not in keys:
info["cmd"] = []
else:
info['start_ts'] = parse_date(info['start_ts'])
info['end_ts'] = parse_date(info['end_ts'])
info['cmd_version'] = info.get('cmd_version')
if type(info["cmd"]) is str:
info["cmd"] = [info["cmd"]]
return cls(**info)
def to_dict(self, *keys) -> dict:
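
The `guess=True` path above lets legacy detail indexes that predate newer fields still load. A hedged illustration (the required field set is assumed from the dataclass in this hunk, and the record values are made up):

```python
from archivebox.index.schema import ArchiveResult

# a legacy record missing start_ts/end_ts/pwd/cmd/cmd_version:
legacy = {
    'status': 'succeeded',
    'output': 'output.pdf',
    'timestamp': '1478739709',   # consumed by guess_ts()
    'duration': 4,               # seconds, consumed by guess_ts()
}
result = ArchiveResult.from_json(legacy, guess=True)
# start_ts/end_ts are derived from timestamp + duration,
# pwd defaults to OUTPUT_DIR/archive/<timestamp>,
# cmd_version becomes 'Undefined' and cmd becomes []
```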
@ -95,6 +125,7 @@ class Link:
updated: Optional[datetime] = None
schema: str = 'Link'
def __str__(self) -> str:
return f'[{self.timestamp}] {self.base_url} "{self.title}"'
@ -178,7 +209,7 @@ class Link:
return info
@classmethod
def from_json(cls, json_info):
def from_json(cls, json_info, guess=False):
from ..util import parse_date
info = {
@ -196,7 +227,7 @@ class Link:
cast_history[method] = []
for json_result in method_history:
assert isinstance(json_result, dict), 'Items in Link["history"][method] must be dicts'
cast_result = ArchiveResult.from_json(json_result)
cast_result = ArchiveResult.from_json(json_result, guess)
cast_history[method].append(cast_result)
info['history'] = cast_history
@ -226,6 +257,13 @@ class Link:
from ..config import ARCHIVE_DIR_NAME
return '{}/{}'.format(ARCHIVE_DIR_NAME, self.timestamp)
@property
def archive_size(self) -> float:
try:
return get_dir_size(self.archive_path)[0]
except Exception:
return 0
### URL Helpers
@property
def url_hash(self):
@ -267,7 +305,16 @@ class Link:
@property
def bookmarked_date(self) -> Optional[str]:
from ..util import ts_to_date
return ts_to_date(self.timestamp) if self.timestamp else None
max_ts = (datetime.now() + timedelta(days=30)).timestamp()
if self.timestamp and self.timestamp.replace('.', '').isdigit():
if 0 < float(self.timestamp) < max_ts:
return ts_to_date(datetime.fromtimestamp(float(self.timestamp)))
else:
return str(self.timestamp)
return None
@property
def updated_date(self) -> Optional[str]:
@ -318,6 +365,7 @@ class Link:
'screenshot.png',
'output.html',
'media',
'singlefile.html'
)
return any(
@ -329,7 +377,7 @@ class Link:
"""get the latest output that each archive method produced for link"""
ARCHIVE_METHODS = (
'title', 'favicon', 'wget', 'warc', 'pdf',
'title', 'favicon', 'wget', 'warc', 'singlefile', 'pdf',
'screenshot', 'dom', 'git', 'media', 'archive_org',
)
latest: Dict[str, ArchiveOutput] = {}
@ -345,7 +393,6 @@ class Link:
latest[archive_method] = history[0].output
else:
latest[archive_method] = None
return latest
@ -359,6 +406,7 @@ class Link:
'google_favicon_path': 'https://www.google.com/s2/favicons?domain={}'.format(self.domain),
'wget_path': wget_output_path(self),
'warc_path': 'warc',
'singlefile_path': 'singlefile.html',
'pdf_path': 'output.pdf',
'screenshot_path': 'screenshot.png',
'dom_path': 'output.html',
@ -378,7 +426,7 @@ class Link:
'pdf_path': static_path,
'screenshot_path': static_path,
'dom_path': static_path,
'singlefile_path': static_path,
})
return canonical

View File

@ -20,31 +20,38 @@ def parse_sql_main_index(out_dir: str=OUTPUT_DIR) -> Iterator[Link]:
for page in Snapshot.objects.all()
)
@enforce_types
def remove_from_sql_main_index(links: List[Link], out_dir: str=OUTPUT_DIR) -> None:
setup_django(out_dir, check_db=True)
from core.models import Snapshot
from django.db import transaction
with transaction.atomic():
for link in links:
Snapshot.objects.filter(url=link.url).delete()
@enforce_types
def write_sql_main_index(links: List[Link], out_dir: str=OUTPUT_DIR) -> None:
setup_django(out_dir, check_db=True)
from core.models import Snapshot
from django.db import transaction
all_urls = {link.url: link for link in links}
all_ts = {link.timestamp: link for link in links}
with transaction.atomic():
for link in links:
info = {k: v for k, v in link._asdict().items() if k in Snapshot.keys}
Snapshot.objects.update_or_create(url=link.url, defaults=info)
@enforce_types
def write_sql_link_details(link: Link, out_dir: str=OUTPUT_DIR) -> None:
setup_django(out_dir, check_db=True)
from core.models import Snapshot
from django.db import transaction
with transaction.atomic():
for snapshot in Snapshot.objects.all():
if snapshot.timestamp in all_ts:
info = {k: v for k, v in all_urls.pop(snapshot.url)._asdict().items() if k in Snapshot.keys}
snapshot.delete()
Snapshot.objects.create(**info)
if snapshot.url in all_urls:
info = {k: v for k, v in all_urls.pop(snapshot.url)._asdict().items() if k in Snapshot.keys}
snapshot.delete()
Snapshot.objects.create(**info)
else:
snapshot.delete()
for url, link in all_urls.items():
info = {k: v for k, v in link._asdict().items() if k in Snapshot.keys}
Snapshot.objects.update_or_create(url=url, defaults=info)
snap = Snapshot.objects.get(url=link.url, timestamp=link.timestamp)
snap.title = link.title
snap.tags = link.tags
snap.save()
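
The rewritten `write_sql_main_index()` reconciles the Snapshot table against the in-memory link list: rows whose timestamp matches are rewritten with the link's current info, stale rows are deleted, and any remaining links are upserted by URL. Hedged usage (path illustrative):

```python
from archivebox.index.sql import write_sql_main_index

# all_links: List[Link] loaded from the main index beforehand
write_sql_main_index(all_links, out_dir='/path/to/collection')
```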

View File

@ -1,30 +1,32 @@
__package__ = 'archivebox.cli'
__package__ = 'archivebox'
import re
import os
import sys
import time
import argparse
from multiprocessing import Process
from datetime import datetime
from dataclasses import dataclass
from multiprocessing import Process
from typing import Optional, List, Dict, Union, IO
from typing import Optional, List, Dict, Union, IO, TYPE_CHECKING
from ..index.schema import Link, ArchiveResult
from ..index.json import to_json
from ..index.csv import links_to_csv
from ..util import enforce_types
from ..config import (
if TYPE_CHECKING:
from .index.schema import Link, ArchiveResult
from .util import enforce_types
from .config import (
ConfigDict,
PYTHON_ENCODING,
ANSI,
OUTPUT_DIR,
IS_TTY,
SHOW_PROGRESS,
TERM_WIDTH,
OUTPUT_DIR,
SOURCES_DIR_NAME,
HTML_INDEX_FILENAME,
stderr,
)
@dataclass
class RuntimeStats:
"""mutable stats counter for logging archiving timing info to CLI output"""
@ -66,6 +68,7 @@ def reject_stdin(caller: str, stdin: Optional[IO]=sys.stdin) -> None:
stderr()
raise SystemExit(1)
def accept_stdin(stdin: Optional[IO]=sys.stdin) -> Optional[str]:
"""accept any standard input and return it as a string or None"""
if not stdin:
@ -80,7 +83,9 @@ class TimedProgress:
"""Show a progress bar and measure elapsed time until .end() is called"""
def __init__(self, seconds, prefix=''):
if SHOW_PROGRESS:
from .config import SHOW_PROGRESS
self.SHOW_PROGRESS = SHOW_PROGRESS
if self.SHOW_PROGRESS:
self.p = Process(target=progress_bar, args=(seconds, prefix))
self.p.start()
@ -91,28 +96,39 @@ class TimedProgress:
end_ts = datetime.now()
self.stats['end_ts'] = end_ts
if SHOW_PROGRESS:
# protect from double termination
#if p is None or not hasattr(p, 'kill'):
# return
if self.p is not None:
if self.SHOW_PROGRESS:
# terminate if we havent already terminated
self.p.terminate()
self.p.join()
self.p.close()
self.p = None
sys.stdout.write('\r{}{}\r'.format((' ' * TERM_WIDTH()), ANSI['reset'])) # clear whole terminal line
# clear whole terminal line
try:
sys.stdout.write('\r{}{}\r'.format((' ' * TERM_WIDTH()), ANSI['reset']))
except (IOError, BrokenPipeError):
# ignore when the parent proc has stopped listening to our stdout
pass
@enforce_types
def progress_bar(seconds: int, prefix: str='') -> None:
"""show timer in the form of progress bar, with percentage and seconds remaining"""
chunk = '█' if sys.stdout.encoding == 'UTF-8' else '#'
chunks = TERM_WIDTH() - len(prefix) - 20 # number of progress chunks to show (aka max bar width)
chunk = '█' if PYTHON_ENCODING == 'UTF-8' else '#'
last_width = TERM_WIDTH()
chunks = last_width - len(prefix) - 20 # number of progress chunks to show (aka max bar width)
try:
for s in range(seconds * chunks):
chunks = TERM_WIDTH() - len(prefix) - 20
max_width = TERM_WIDTH()
if max_width < last_width:
# when the terminal size is shrunk, we have to write a newline
# otherwise the progress bar will keep wrapping incorrectly
sys.stdout.write('\r\n')
sys.stdout.flush()
chunks = max_width - len(prefix) - 20
progress = s / chunks / seconds * 100
bar_width = round(progress/(100/chunks))
last_width = max_width
# ████████████████████ 0.9% (1/60sec)
sys.stdout.write('\r{0}{1}{2}{3} {4}% ({5}/{6}sec)'.format(
@ -138,27 +154,51 @@ def progress_bar(seconds: int, prefix: str='') -> None:
seconds,
))
sys.stdout.flush()
except KeyboardInterrupt:
except (KeyboardInterrupt, BrokenPipeError):
print()
pass
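
A hedged usage sketch of `TimedProgress` as hardened above (now resilient to shrunken terminals and broken pipes); `time.sleep` stands in for any long-running call:

```python
import time
from archivebox.logging_util import TimedProgress

timer = TimedProgress(10, prefix='      ')
try:
    time.sleep(2)  # stand-in for the real work being timed
finally:
    timer.end()    # stops the bar and records end_ts

print(timer.stats)  # e.g. {'start_ts': datetime(...), 'end_ts': datetime(...)}
```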
def log_cli_command(subcommand: str, subcommand_args: List[str], stdin: Optional[str], pwd: str):
from .config import VERSION, ANSI
cmd = ' '.join(('archivebox', subcommand, *subcommand_args))
stdin_hint = ' < /dev/stdin' if not stdin.isatty() else ''
stderr('{black}[i] [{now}] ArchiveBox v{VERSION}: {cmd}{stdin_hint}{reset}'.format(
now=datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
VERSION=VERSION,
cmd=cmd,
stdin_hint=stdin_hint,
**ANSI,
))
stderr('{black} > {pwd}{reset}'.format(pwd=pwd, **ANSI))
stderr()
### Parsing Stage
def log_parsing_started(source_file: str):
start_ts = datetime.now()
_LAST_RUN_STATS.parse_start_ts = start_ts
print('\n{green}[*] [{}] Parsing new links from output/sources/{}...{reset}'.format(
start_ts.strftime('%Y-%m-%d %H:%M:%S'),
source_file.rsplit('/', 1)[-1],
def log_importing_started(urls: Union[str, List[str]], depth: int, index_only: bool):
_LAST_RUN_STATS.parse_start_ts = datetime.now()
print('{green}[+] [{}] Adding {} links to index (crawl depth={}){}...{reset}'.format(
_LAST_RUN_STATS.parse_start_ts.strftime('%Y-%m-%d %H:%M:%S'),
len(urls) if isinstance(urls, list) else len(urls.split('\n')),
depth,
' (index only)' if index_only else '',
**ANSI,
))
def log_parsing_finished(num_parsed: int, num_new_links: int, parser_name: str):
end_ts = datetime.now()
_LAST_RUN_STATS.parse_end_ts = end_ts
print(' > Parsed {} links as {} ({} new links added)'.format(num_parsed, parser_name, num_new_links))
def log_source_saved(source_file: str):
print(' > Saved verbatim input to {}/{}'.format(SOURCES_DIR_NAME, source_file.rsplit('/', 1)[-1]))
def log_parsing_finished(num_parsed: int, parser_name: str):
_LAST_RUN_STATS.parse_end_ts = datetime.now()
print(' > Parsed {} URLs from input ({})'.format(num_parsed, parser_name))
def log_deduping_finished(num_new_links: int):
print(' > Found {} new URLs not already in index'.format(num_new_links))
def log_crawl_started(new_links):
print('{lightred}[*] Starting crawl of {} sites 1 hop out from starting point{reset}'.format(len(new_links), **ANSI))
### Indexing Stage
@ -166,20 +206,23 @@ def log_indexing_process_started(num_links: int):
start_ts = datetime.now()
_LAST_RUN_STATS.index_start_ts = start_ts
print()
print('{green}[*] [{}] Writing {} links to main index...{reset}'.format(
print('{black}[*] [{}] Writing {} links to main index...{reset}'.format(
start_ts.strftime('%Y-%m-%d %H:%M:%S'),
num_links,
**ANSI,
))
def log_indexing_process_finished():
end_ts = datetime.now()
_LAST_RUN_STATS.index_end_ts = end_ts
def log_indexing_started(out_path: str):
if IS_TTY:
sys.stdout.write(f' > {out_path}')
def log_indexing_finished(out_path: str):
print(f'\r√ {out_path}')
@ -198,7 +241,7 @@ def log_archiving_started(num_links: int, resume: Optional[float]=None):
**ANSI,
))
else:
print('{green}[▶] [{}] Updating content for {} matching pages in archive...{reset}'.format(
print('{green}[▶] [{}] Collecting content for {} Snapshots in archive...{reset}'.format(
start_ts.strftime('%Y-%m-%d %H:%M:%S'),
num_links,
**ANSI,
@ -216,8 +259,8 @@ def log_archiving_paused(num_links: int, idx: int, timestamp: str):
total=num_links,
))
print()
print(' To view your archive, open:')
print(' {}/index.html'.format(OUTPUT_DIR))
print(' {lightred}Hint:{reset} To view your archive index, open:'.format(**ANSI))
print(' {}/{}'.format(OUTPUT_DIR, HTML_INDEX_FILENAME))
print(' Continue archiving where you left off by running:')
print(' archivebox update --resume={}'.format(timestamp))
@ -227,9 +270,9 @@ def log_archiving_finished(num_links: int):
assert _LAST_RUN_STATS.archiving_start_ts is not None
seconds = end_ts.timestamp() - _LAST_RUN_STATS.archiving_start_ts.timestamp()
if seconds > 60:
duration = '{0:.2f} min'.format(seconds / 60, 2)
duration = '{0:.2f} min'.format(seconds / 60)
else:
duration = '{0:.2f} sec'.format(seconds, 2)
duration = '{0:.2f} sec'.format(seconds)
print()
print('{}[√] [{}] Update of {} pages complete ({}){}'.format(
@ -243,13 +286,13 @@ def log_archiving_finished(num_links: int):
print(' - {} links updated'.format(_LAST_RUN_STATS.succeeded))
print(' - {} links had errors'.format(_LAST_RUN_STATS.failed))
print()
print(' To view your archive, open:')
print(' {}/index.html'.format(OUTPUT_DIR))
print(' {lightred}Hint:{reset} To view your archive index, open:'.format(**ANSI))
print(' {}/{}'.format(OUTPUT_DIR, HTML_INDEX_FILENAME))
print(' Or run the built-in webserver:')
print(' archivebox server')
def log_link_archiving_started(link: Link, link_dir: str, is_new: bool):
def log_link_archiving_started(link: "Link", link_dir: str, is_new: bool):
# [*] [2019-03-22 13:46:45] "Log Structured Merge Trees - ben stopford"
# http://www.benstopford.com/2015/02/14/log-structured-merge-trees/
# > output/archive/1478739709
@ -267,7 +310,7 @@ def log_link_archiving_started(link: Link, link_dir: str, is_new: bool):
pretty_path(link_dir),
))
def log_link_archiving_finished(link: Link, link_dir: str, is_new: bool, stats: dict):
def log_link_archiving_finished(link: "Link", link_dir: str, is_new: bool, stats: dict):
total = sum(stats.values())
if stats['failed'] > 0 :
@ -282,7 +325,7 @@ def log_archive_method_started(method: str):
print(' > {}'.format(method))
def log_archive_method_finished(result: ArchiveResult):
def log_archive_method_finished(result: "ArchiveResult"):
"""quote the argument with whitespace in a command so the user can
copy-paste the outputted string directly to run the cmd
"""
@ -331,6 +374,7 @@ def log_list_started(filter_patterns: Optional[List[str]], filter_type: str):
print(' {}'.format(' '.join(filter_patterns or ())))
def log_list_finished(links):
from .index.csv import links_to_csv
print()
print('---------------------------------------------------------------------------------------------------')
print(links_to_csv(links, cols=['timestamp', 'is_archived', 'num_outputs', 'url'], header=True, ljust=16, separator=' | '))
@ -338,7 +382,7 @@ def log_list_finished(links):
print()
def log_removal_started(links: List[Link], yes: bool, delete: bool):
def log_removal_started(links: List["Link"], yes: bool, delete: bool):
print('{lightyellow}[i] Found {} matching URLs to remove.{reset}'.format(len(links), **ANSI))
if delete:
file_counts = [link.num_outputs for link in links if os.path.exists(link.link_dir)]
@ -348,8 +392,8 @@ def log_removal_started(links: List[Link], yes: bool, delete: bool):
)
else:
print(
f' Matching links will be de-listed from the main index, but their archived content folders will remain in place on disk.\n'
f' (Pass --delete if you also want to permanently delete the data folders)'
' Matching links will be de-listed from the main index, but their archived content folders will remain in place on disk.\n'
' (Pass --delete if you also want to permanently delete the data folders)'
)
if not yes:
@ -376,7 +420,7 @@ def log_removal_finished(all_links: int, to_keep: int):
def log_shell_welcome_msg():
from . import list_subcommands
from .cli import list_subcommands
print('{green}# ArchiveBox Imports{reset}'.format(**ANSI))
print('{green}from archivebox.core.models import Snapshot, User{reset}'.format(**ANSI))
@ -412,13 +456,15 @@ def printable_filesize(num_bytes: Union[int, float]) -> str:
@enforce_types
def printable_folders(folders: Dict[str, Optional[Link]],
def printable_folders(folders: Dict[str, Optional["Link"]],
json: bool=False,
csv: Optional[str]=None) -> str:
if json:
from .index.json import to_json
return to_json(folders.values(), indent=4, sort_keys=True)
elif csv:
from .index.csv import links_to_csv
return links_to_csv(folders.values(), cols=csv.split(','), header=True)
return '\n'.join(f'{folder} {link}' for folder, link in folders.items())
@ -472,6 +518,7 @@ def printable_folder_status(name: str, folder: Dict) -> str:
@enforce_types
def printable_dependency_version(name: str, dependency: Dict) -> str:
version = None
if dependency['enabled']:
if dependency['is_valid']:
color, symbol, note, version = 'green', '√', 'valid', ''

View File

@ -4,8 +4,7 @@ import os
import sys
import shutil
from typing import Dict, List, Optional, Iterable, IO
from typing import Dict, List, Optional, Iterable, IO, Union
from crontab import CronTab, CronSlices
from .cli import (
@ -17,16 +16,17 @@ from .cli import (
archive_cmds,
)
from .parsers import (
save_stdin_to_sources,
save_file_to_sources,
save_text_as_source,
save_file_as_source,
parse_links_memory,
)
from .index.schema import Link
from .util import enforce_types, docstring
from .util import enforce_types # type: ignore
from .system import get_dir_size, dedupe_cron_jobs, CRON_COMMENT
from .index import (
links_after_timestamp,
load_main_index,
import_new_links,
parse_links_from_source,
dedupe_links,
write_main_index,
link_matches_filter,
get_indexed_folders,
@ -49,14 +49,16 @@ from .index.sql import (
parse_sql_main_index,
get_admins,
apply_migrations,
remove_from_sql_main_index,
)
from .index.html import parse_html_main_index
from .extractors import archive_link
from .extractors import archive_links, archive_link, ignore_methods
from .config import (
stderr,
ConfigDict,
ANSI,
IS_TTY,
IN_DOCKER,
USER,
ARCHIVEBOX_BINARY,
ONLY_NEW,
@ -88,11 +90,11 @@ from .config import (
USER_CONFIG,
get_real_name,
)
from .cli.logging import (
from .logging_util import (
TERM_WIDTH,
TimedProgress,
log_archiving_started,
log_archiving_paused,
log_archiving_finished,
log_importing_started,
log_crawl_started,
log_removal_started,
log_removal_finished,
log_list_started,
@ -161,7 +163,7 @@ def help(out_dir: str=OUTPUT_DIR) -> None:
{lightred}Example Use:{reset}
mkdir my-archive; cd my-archive/
archivebox init
archivebox info
archivebox status
archivebox add https://example.com/some/page
archivebox add --depth=1 ~/Downloads/bookmarks_export.html
@ -177,6 +179,10 @@ def help(out_dir: str=OUTPUT_DIR) -> None:
else:
print('{green}Welcome to ArchiveBox v{}!{reset}'.format(VERSION, **ANSI))
print()
if IN_DOCKER:
print('When using Docker, you need to mount a volume to use as your data dir:')
print(' docker run -v /some/path:/data archivebox ...')
print()
print('To import an existing archive (from a previous version of ArchiveBox):')
print(' 1. cd into your data dir OUTPUT_DIR (usually ArchiveBox/output) and run:')
print(' 2. archivebox init')
@ -241,7 +247,6 @@ def run(subcommand: str,
def init(force: bool=False, out_dir: str=OUTPUT_DIR) -> None:
"""Initialize a new ArchiveBox collection in the current directory"""
os.makedirs(out_dir, exist_ok=True)
is_empty = not len(set(os.listdir(out_dir)) - ALLOWED_IN_OUTPUT_DIR)
existing_index = os.path.exists(os.path.join(out_dir, JSON_INDEX_FILENAME))
@ -291,15 +296,14 @@ def init(force: bool=False, out_dir: str=OUTPUT_DIR) -> None:
print('\n{green}[+] Building main SQL index and running migrations...{reset}'.format(**ANSI))
setup_django(out_dir, check_db=False)
from django.conf import settings
assert settings.DATABASE_FILE == os.path.join(out_dir, SQL_INDEX_FILENAME)
print(f'{settings.DATABASE_FILE}')
DATABASE_FILE = os.path.join(out_dir, SQL_INDEX_FILENAME)
print(f'{DATABASE_FILE}')
print()
for migration_line in apply_migrations(out_dir):
print(f' {migration_line}')
assert os.path.exists(settings.DATABASE_FILE)
assert os.path.exists(DATABASE_FILE)
# from django.contrib.auth.models import User
# if IS_TTY and not User.objects.filter(is_superuser=True).exists():
@ -364,7 +368,7 @@ def init(force: bool=False, out_dir: str=OUTPUT_DIR) -> None:
print(' X ' + '\n X '.join(f'{folder} {link}' for folder, link in invalid_folders.items()))
print()
print(' {lightred}Hint:{reset} For more information about the link data directories that were skipped, run:'.format(**ANSI))
print(' archivebox info')
print(' archivebox status')
print(' archivebox list --status=invalid')
@ -376,27 +380,31 @@ def init(force: bool=False, out_dir: str=OUTPUT_DIR) -> None:
else:
print('{green}[√] Done. A new ArchiveBox collection was initialized ({} links).{reset}'.format(len(all_links), **ANSI))
print()
print(' To view your archive index, open:')
print(' {}'.format(os.path.join(out_dir, HTML_INDEX_FILENAME)))
print(' {lightred}Hint:{reset} To view your archive index, run:'.format(**ANSI))
print(' archivebox server # then visit http://127.0.0.1:8000')
print()
print(' To add new links, you can run:')
print(" archivebox add 'https://example.com'")
print(" archivebox add ~/some/path/or/url/to/list_of_links.txt")
print()
print(' For more usage and examples, run:')
print(' archivebox help')
@enforce_types
def info(out_dir: str=OUTPUT_DIR) -> None:
def status(out_dir: str=OUTPUT_DIR) -> None:
"""Print out some info and statistics about the archive collection"""
check_data_folder(out_dir=out_dir)
print('{green}[*] Scanning archive collection main index...{reset}'.format(**ANSI))
print(f' {out_dir}/*')
from core.models import Snapshot
from django.contrib.auth import get_user_model
User = get_user_model()
print('{green}[*] Scanning archive main index...{reset}'.format(**ANSI))
print(ANSI['lightyellow'], f' {out_dir}/*', ANSI['reset'])
num_bytes, num_dirs, num_files = get_dir_size(out_dir, recursive=False, pattern='index.')
size = printable_filesize(num_bytes)
print(f' Size: {size} across {num_files} files')
print(f' Index size: {size} across {num_files} files')
print()
links = list(load_main_index(out_dir=out_dir))
@ -404,33 +412,23 @@ def info(out_dir: str=OUTPUT_DIR) -> None:
num_sql_links = sum(1 for link in parse_sql_main_index(out_dir=out_dir))
num_html_links = sum(1 for url in parse_html_main_index(out_dir=out_dir))
num_link_details = sum(1 for link in parse_json_links_details(out_dir=out_dir))
users = get_admins().values_list('username', flat=True)
print(f' > JSON Main Index: {num_json_links} links'.ljust(36), f'(found in {JSON_INDEX_FILENAME})')
print(f' > SQL Main Index: {num_sql_links} links'.ljust(36), f'(found in {SQL_INDEX_FILENAME})')
print(f' > HTML Main Index: {num_html_links} links'.ljust(36), f'(found in {HTML_INDEX_FILENAME})')
print(f' > JSON Link Details: {num_link_details} links'.ljust(36), f'(found in {ARCHIVE_DIR_NAME}/*/index.json)')
print(f' > Admin: {len(users)} users {", ".join(users)}'.ljust(36), f'(found in {SQL_INDEX_FILENAME})')
if num_html_links != len(links) or num_sql_links != len(links):
print()
print(' {lightred}Hint:{reset} You can fix index count differences automatically by running:'.format(**ANSI))
print(' archivebox init')
if not users:
print()
print(' {lightred}Hint:{reset} You can create an admin user by running:'.format(**ANSI))
print(' archivebox manage createsuperuser')
print()
print('{green}[*] Scanning archive collection link data directories...{reset}'.format(**ANSI))
print(f' {ARCHIVE_DIR}/*')
print('{green}[*] Scanning archive data directories...{reset}'.format(**ANSI))
print(ANSI['lightyellow'], f' {ARCHIVE_DIR}/*', ANSI['reset'])
num_bytes, num_dirs, num_files = get_dir_size(ARCHIVE_DIR)
size = printable_filesize(num_bytes)
print(f' Size: {size} across {num_files} files in {num_dirs} directories')
print()
print(ANSI['black'])
num_indexed = len(get_indexed_folders(links, out_dir=out_dir))
num_archived = len(get_archived_folders(links, out_dir=out_dir))
num_unarchived = len(get_unarchived_folders(links, out_dir=out_dir))
@ -455,82 +453,115 @@ def info(out_dir: str=OUTPUT_DIR) -> None:
print(f' > corrupted: {len(corrupted)}'.ljust(36), f'({get_corrupted_folders.__doc__})')
print(f' > unrecognized: {len(unrecognized)}'.ljust(36), f'({get_unrecognized_folders.__doc__})')
print(ANSI['reset'])
if num_indexed:
print()
print(' {lightred}Hint:{reset} You can list link data directories by status like so:'.format(**ANSI))
print(' archivebox list --status=<status> (e.g. indexed, corrupted, archived, etc.)')
if orphaned:
print()
print(' {lightred}Hint:{reset} To automatically import orphaned data directories into the main index, run:'.format(**ANSI))
print(' archivebox init')
if num_invalid:
print()
print(' {lightred}Hint:{reset} You may need to manually remove or fix some invalid data directories, afterwards make sure to run:'.format(**ANSI))
print(' archivebox init')
print()
print('{green}[*] Scanning recent archive changes and user logins:{reset}'.format(**ANSI))
print(ANSI['lightyellow'], f' {LOGS_DIR}/*', ANSI['reset'])
users = get_admins().values_list('username', flat=True)
print(f' UI users {len(users)}: {", ".join(users)}')
last_login = User.objects.order_by('last_login').last()
if last_login:
print(f' Last UI login: {last_login.username} @ {str(last_login.last_login)[:16]}')
last_updated = Snapshot.objects.order_by('updated').last()
print(f' Last changes: {str(last_updated.updated)[:16]}')
if not users:
print()
print(' {lightred}Hint:{reset} You can create an admin user by running:'.format(**ANSI))
print(' archivebox manage createsuperuser')
print()
for snapshot in Snapshot.objects.order_by('-updated')[:10]:
if not snapshot.updated:
continue
print(
ANSI['black'],
(
f' > {str(snapshot.updated)[:16]} '
f'[{snapshot.num_outputs} {("X", "√")[snapshot.is_archived]} {printable_filesize(snapshot.archive_size)}] '
f'"{snapshot.title}": {snapshot.url}'
)[:TERM_WIDTH()],
ANSI['reset'],
)
print(ANSI['black'], ' ...', ANSI['reset'])
@enforce_types
def add(import_str: Optional[str]=None,
import_path: Optional[str]=None,
def oneshot(url: str, out_dir: str=OUTPUT_DIR):
"""
Create a single URL archive folder with an index.json and index.html, and all the archive method outputs.
You can run this to archive single pages without needing to create a whole collection with archivebox init.
"""
oneshot_link, _ = parse_links_memory([url])
if len(oneshot_link) > 1:
stderr(
'[X] You should pass a single url to the oneshot command',
color='red'
)
raise SystemExit(2)
methods = ignore_methods(['title'])
archive_link(oneshot_link[0], out_dir=out_dir, methods=methods, skip_index=True)
return oneshot_link
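
From Python this can be invoked directly (presumably also exposed on the CLI as `archivebox oneshot <url>`); a hedged example with illustrative values:

```python
from archivebox.main import oneshot

# archives a single page into out_dir, skipping the title method and the
# main index entirely (no `archivebox init` collection required):
oneshot('https://example.com', out_dir='/tmp/one-page')
```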
@enforce_types
def add(urls: Union[str, List[str]],
depth: int=0,
update_all: bool=not ONLY_NEW,
index_only: bool=False,
out_dir: str=OUTPUT_DIR) -> List[Link]:
"""Add a new URL or list of URLs to your archive"""
assert depth in (0, 1), 'Depth must be 0 or 1 (depth >1 is not supported yet)'
# Load list of links from the existing index
check_data_folder(out_dir=out_dir)
if import_str and import_path:
stderr(
'[X] You should pass either an import path as an argument, '
'or pass a list of links via stdin, but not both.\n',
color='red',
)
raise SystemExit(2)
elif import_str:
import_path = save_stdin_to_sources(import_str, out_dir=out_dir)
else:
import_path = save_file_to_sources(import_path, out_dir=out_dir)
check_dependencies()
# Step 1: Load list of links from the existing index
# merge in and dedupe new links from import_path
all_links: List[Link] = []
new_links: List[Link] = []
all_links = load_main_index(out_dir=out_dir)
if import_path:
all_links, new_links = import_new_links(all_links, import_path, out_dir=out_dir)
# Step 2: Write updated index with deduped old and new links back to disk
write_main_index(links=all_links, out_dir=out_dir)
log_importing_started(urls=urls, depth=depth, index_only=index_only)
if isinstance(urls, str):
# save verbatim stdin to sources
write_ahead_log = save_text_as_source(urls, filename='{ts}-import.txt', out_dir=out_dir)
elif isinstance(urls, list):
# save verbatim args to sources
write_ahead_log = save_text_as_source('\n'.join(urls), filename='{ts}-import.txt', out_dir=out_dir)
new_links += parse_links_from_source(write_ahead_log)
# If we're going one level deeper, download each link and look for more links
new_links_depth = []
if new_links and depth == 1:
log_crawl_started(new_links)
for new_link in new_links:
downloaded_file = save_file_as_source(new_link.url, filename='{ts}-crawl-{basename}.txt', out_dir=out_dir)
new_links_depth += parse_links_from_source(downloaded_file)
all_links, new_links = dedupe_links(all_links, new_links + new_links_depth)
write_main_index(links=all_links, out_dir=out_dir, finished=not new_links)
if index_only:
return all_links
# Step 3: Run the archive methods for each link
links = all_links if update_all else new_links
log_archiving_started(len(links))
idx: int = 0
link: Link = None # type: ignore
try:
for idx, link in enumerate(links):
archive_link(link, out_dir=link.link_dir)
except KeyboardInterrupt:
log_archiving_paused(len(links), idx, link.timestamp if link else '0')
raise SystemExit(0)
except:
print()
raise
log_archiving_finished(len(links))
# Run the archive methods for each link
to_archive = all_links if update_all else new_links
archive_links(to_archive, out_dir=out_dir)
# Step 4: Re-write links index with updated titles, icons, and resources
if to_archive:
all_links = load_main_index(out_dir=out_dir)
write_main_index(links=list(all_links), out_dir=out_dir, finished=True)
return all_links
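For reference, a hedged sketch of how the reworked `add()` API might be called now that it accepts either a verbatim newline-separated string (e.g. stdin) or a list of URLs:

```python
from archivebox.main import add  # assumed import path

add('https://example.com\nhttps://example.org')  # string form, as read from stdin
add(['https://example.com/feed.rss'], depth=1)   # list form, crawl one level deeper
add(['https://example.com'], index_only=True)    # write the index, skip archiving
```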
@ -539,6 +570,7 @@ def add(import_str: Optional[str]=None,
def remove(filter_str: Optional[str]=None,
filter_patterns: Optional[List[str]]=None,
filter_type: str='exact',
links: Optional[List[Link]]=None,
after: Optional[float]=None,
before: Optional[float]=None,
yes: bool=False,
@ -548,6 +580,7 @@ def remove(filter_str: Optional[str]=None,
check_data_folder(out_dir=out_dir)
if links is None:
if filter_str and filter_patterns:
stderr(
'[X] You should pass either a pattern as an argument, '
@ -581,6 +614,7 @@ def remove(filter_str: Optional[str]=None,
finally:
timer.end()
if not len(links):
log_removal_finished(0, 0)
raise SystemExit(1)
@ -592,20 +626,26 @@ def remove(filter_str: Optional[str]=None,
timer = TimedProgress(360, prefix=' ')
try:
to_keep = []
to_delete = []
all_links = load_main_index(out_dir=out_dir)
for link in all_links:
should_remove = (
(after is not None and float(link.timestamp) < after)
or (before is not None and float(link.timestamp) > before)
or link_matches_filter(link, filter_patterns, filter_type)
or link_matches_filter(link, filter_patterns or [], filter_type)
or link in links
)
if not should_remove:
to_keep.append(link)
elif should_remove and delete:
if should_remove:
to_delete.append(link)
if delete:
shutil.rmtree(link.link_dir, ignore_errors=True)
else:
to_keep.append(link)
finally:
timer.end()
remove_from_sql_main_index(links=to_delete, out_dir=out_dir)
write_main_index(links=to_keep, out_dir=out_dir, finished=True)
log_removal_finished(len(all_links), len(to_keep))
@ -625,8 +665,8 @@ def update(resume: Optional[float]=None,
out_dir: str=OUTPUT_DIR) -> List[Link]:
"""Import any new links from subscriptions and retry any previously failed/skipped links"""
check_dependencies()
check_data_folder(out_dir=out_dir)
check_dependencies()
# Step 1: Load list of links from the existing index
# merge in and dedupe new links from import_path
@ -655,23 +695,8 @@ def update(resume: Optional[float]=None,
return all_links
# Step 3: Run the archive methods for each link
links = new_links if only_new else all_links
log_archiving_started(len(links), resume)
idx: int = 0
link: Link = None # type: ignore
try:
for idx, link in enumerate(links_after_timestamp(links, resume)):
archive_link(link, overwrite=overwrite, out_dir=link.link_dir)
except KeyboardInterrupt:
log_archiving_paused(len(links), idx, link.timestamp if link else '0')
raise SystemExit(0)
except:
print()
raise
log_archiving_finished(len(links))
to_archive = new_links if only_new else all_links
archive_links(to_archive, overwrite=overwrite, out_dir=out_dir)
# Step 4: Re-write links index with updated titles, icons, and resources
all_links = load_main_index(out_dir=out_dir)
@ -860,7 +885,7 @@ def config(config_options_str: Optional[str]=None,
print(' {}'.format(printable_config(side_effect_changes, prefix=' ')))
if failed_options:
stderr()
stderr('[X] These options failed to set:', color='red')
stderr('[X] These options failed to set (check for typos):', color='red')
stderr(' {}'.format('\n '.join(failed_options)))
raise SystemExit(bool(failed_options))
elif reset:
@ -974,7 +999,7 @@ def schedule(add: bool=False,
if total_runs > 60 and not quiet:
stderr()
stderr('{lightyellow}[!] With the current cron config, ArchiveBox is estimated to run >{} times per year.{reset}'.format(total_runs, **ANSI))
stderr(f' Congrats on being an enthusiastic internet archiver! 👌')
stderr(' Congrats on being an enthusiastic internet archiver! 👌')
stderr()
stderr(' Make sure you have enough storage space available to hold all the data.')
stderr(' Using a compressed/deduped filesystem like ZFS is recommended if you plan on archiving a lot.')
@ -985,32 +1010,50 @@ def schedule(add: bool=False,
def server(runserver_args: Optional[List[str]]=None,
reload: bool=False,
debug: bool=False,
init: bool=False,
out_dir: str=OUTPUT_DIR) -> None:
"""Run the ArchiveBox HTTP server"""
runserver_args = runserver_args or []
if init:
run_subcommand('init', stdin=None, pwd=out_dir)
# setup config for django runserver
from . import config
config.SHOW_PROGRESS = False
config.DEBUG = config.DEBUG or debug
check_data_folder(out_dir=out_dir)
if debug:
os.environ['DEBUG'] = 'True'
else:
runserver_args.append('--insecure')
setup_django(out_dir)
from django.core.management import call_command
from django.contrib.auth.models import User
if IS_TTY and not User.objects.filter(is_superuser=True).exists():
admin_user = User.objects.filter(is_superuser=True).order_by('date_joined').only('username').last()
print('{green}[+] Starting ArchiveBox webserver...{reset}'.format(**ANSI))
if admin_user:
print("{lightred}[i] The admin username is:{lightblue} {}{reset}".format(admin_user.username, **ANSI))
else:
print('{lightyellow}[!] No admin users exist yet, you will not be able to edit links in the UI.{reset}'.format(**ANSI))
print()
print(' To create an admin user, run:')
print(' archivebox manage createsuperuser')
print()
print('{green}[+] Starting ArchiveBox webserver...{reset}'.format(**ANSI))
# fallback to serving staticfiles insecurely with django when DEBUG=False
if not config.DEBUG:
runserver_args.append('--insecure') # TODO: serve statics w/ nginx instead
# toggle autoreloading when archivebox code changes (it's on by default)
if not reload:
runserver_args.append('--noreload')
config.SHOW_PROGRESS = False
config.DEBUG = config.DEBUG or debug
call_command("runserver", *runserver_args)
@ -1019,10 +1062,14 @@ def manage(args: Optional[List[str]]=None, out_dir: str=OUTPUT_DIR) -> None:
"""Run an ArchiveBox Django management command"""
check_data_folder(out_dir=out_dir)
setup_django(out_dir)
from django.core.management import execute_from_command_line
if (args and "createsuperuser" in args) and (IN_DOCKER and not IS_TTY):
stderr('[!] Warning: you need to pass -it to use interactive commands in docker', color='lightyellow')
stderr(' docker run -it archivebox manage {}'.format(' '.join(args or ['...'])), color='lightyellow')
stderr()
execute_from_command_line([f'{ARCHIVEBOX_BINARY} manage', *(args or ['help'])])
@ -1035,3 +1082,4 @@ def shell(out_dir: str=OUTPUT_DIR) -> None:
setup_django(OUTPUT_DIR)
from django.core.management import call_command
call_command("shell_plus")

View File

@ -3,6 +3,21 @@ import os
import sys
if __name__ == '__main__':
# if you're a developer working on archivebox, still prefer the archivebox
# versions of ./manage.py commands whenever possible. When that's not possible
# (e.g. makemigrations), you can comment out this check temporarily
if not ('makemigrations' in sys.argv or 'migrate' in sys.argv):
print("[X] Don't run ./manage.py directly, use the archivebox CLI instead e.g.:")
print(' archivebox manage createsuperuser')
print()
print(' Hint: Use these archivebox commands instead of the ./manage.py equivalents:')
print(' archivebox init (migrates the database to the latest version)')
print(' archivebox server (runs the Django web server)')
print(' archivebox shell (opens an IPython Django shell with all models imported)')
print(' archivebox manage [cmd] (any other management commands)')
raise SystemExit(2)
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'core.settings')
try:
from django.core.management import execute_from_command_line

View File

@ -9,27 +9,26 @@ __package__ = 'archivebox.parsers'
import re
import os
from io import StringIO
from typing import Tuple, List
from typing import IO, Tuple, List
from datetime import datetime
from ..index.schema import Link
from ..system import atomic_write
from ..config import (
ANSI,
OUTPUT_DIR,
SOURCES_DIR_NAME,
TIMEOUT,
check_data_folder,
)
from ..util import (
basename,
domain,
download_url,
enforce_types,
URL_REGEX,
)
from ..cli.logging import pretty_path, TimedProgress
from ..index.schema import Link
from ..logging_util import TimedProgress, log_source_saved
from .pocket_html import parse_pocket_html_export
from .pinboard_rss import parse_pinboard_rss_export
from .shaarli_rss import parse_shaarli_rss_export
@ -39,14 +38,6 @@ from .generic_rss import parse_generic_rss_export
from .generic_json import parse_generic_json_export
from .generic_txt import parse_generic_txt_export
@enforce_types
def parse_links(source_file: str) -> Tuple[List[Link], str]:
"""parse a list of URLs with their metadata from an
RSS feed, bookmarks export, or text file
"""
check_url_parsing_invariants()
PARSERS = (
# Specialized parsers
('Pocket HTML', parse_pocket_html_export),
@ -62,11 +53,50 @@ def parse_links(source_file: str) -> Tuple[List[Link], str]:
# Fallback parser
('Plain Text', parse_generic_txt_export),
)
@enforce_types
def parse_links_memory(urls: List[str]):
"""
parse a list of URLs without touching the filesystem
"""
check_url_parsing_invariants()
timer = TimedProgress(TIMEOUT * 4)
#urls = list(map(lambda x: x + "\n", urls))
file = StringIO()
file.writelines(urls)
file.name = "io_string"
output = _parse(file, timer)
if output is not None:
return output
timer.end()
return [], 'Failed to parse'
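An illustration (not from the diff) of the in-memory file trick `parse_links_memory` relies on: the parser funcs expect a file-like object with a `.name` attribute (recorded in `Link.sources`), so a `StringIO` is given a fake name before being handed to `_parse()`:

```python
from io import StringIO

buf = StringIO()
buf.writelines(['https://example.com\n'])  # writelines() does not add newlines itself
buf.name = 'io_string'                     # fake filename for Link.sources
buf.seek(0)
print(buf.read())  # behaves like an open text file as far as the parsers care
```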
@enforce_types
def parse_links(source_file: str) -> Tuple[List[Link], str]:
"""parse a list of URLs with their metadata from an
RSS feed, bookmarks export, or text file
"""
check_url_parsing_invariants()
timer = TimedProgress(TIMEOUT * 4)
with open(source_file, 'r', encoding='utf-8') as file:
output = _parse(file, timer)
if output is not None:
return output
timer.end()
return [], 'Failed to parse'
def _parse(to_parse: IO[str], timer) -> Tuple[List[Link], str]:
for parser_name, parser_func in PARSERS:
try:
links = list(parser_func(file))
links = list(parser_func(to_parse))
if links:
timer.end()
return links, parser_name
@ -78,41 +108,24 @@ def parse_links(source_file: str) -> Tuple[List[Link], str]:
# print('[!] Parser {} failed: {} {}'.format(parser_name, err.__class__.__name__, err))
# raise
timer.end()
return [], 'Failed to parse'
@enforce_types
def save_stdin_to_sources(raw_text: str, out_dir: str=OUTPUT_DIR) -> str:
check_data_folder(out_dir=out_dir)
sources_dir = os.path.join(out_dir, SOURCES_DIR_NAME)
if not os.path.exists(sources_dir):
os.makedirs(sources_dir)
def save_text_as_source(raw_text: str, filename: str='{ts}-stdin.txt', out_dir: str=OUTPUT_DIR) -> str:
ts = str(datetime.now().timestamp()).split('.', 1)[0]
source_path = os.path.join(sources_dir, '{}-{}.txt'.format('stdin', ts))
atomic_write(raw_text, source_path)
source_path = os.path.join(out_dir, SOURCES_DIR_NAME, filename.format(ts=ts))
atomic_write(source_path, raw_text)
log_source_saved(source_file=source_path)
return source_path
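A quick sketch of the new filename templating used by these helpers (timestamp value illustrative):

```python
ts = '1596500000'  # illustrative unix timestamp
print('{ts}-import.txt'.format(ts=ts))
# -> 1596500000-import.txt
print('{ts}-crawl-{basename}.txt'.format(ts=ts, basename='feed.rss'))
# -> 1596500000-crawl-feed.rss.txt
```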
@enforce_types
def save_file_to_sources(path: str, timeout: int=TIMEOUT, out_dir: str=OUTPUT_DIR) -> str:
def save_file_as_source(path: str, timeout: int=TIMEOUT, filename: str='{ts}-{basename}.txt', out_dir: str=OUTPUT_DIR) -> str:
"""download a given url's content into output/sources/domain-<timestamp>.txt"""
check_data_folder(out_dir=out_dir)
sources_dir = os.path.join(out_dir, SOURCES_DIR_NAME)
if not os.path.exists(sources_dir):
os.makedirs(sources_dir)
ts = str(datetime.now().timestamp()).split('.', 1)[0]
source_path = os.path.join(sources_dir, '{}-{}.txt'.format(basename(path), ts))
source_path = os.path.join(OUTPUT_DIR, SOURCES_DIR_NAME, filename.format(basename=basename(path), ts=ts))
if any(path.startswith(s) for s in ('http://', 'https://', 'ftp://')):
source_path = os.path.join(sources_dir, '{}-{}.txt'.format(domain(path), ts))
# Source is a URL that needs to be downloaded
print('{}[*] [{}] Downloading {}{}'.format(
ANSI['green'],
datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
@ -134,12 +147,13 @@ def save_file_to_sources(path: str, timeout: int=TIMEOUT, out_dir: str=OUTPUT_DI
raise SystemExit(1)
else:
# Source is a path to a local file on the filesystem
with open(path, 'r') as f:
raw_source_text = f.read()
atomic_write(raw_source_text, source_path)
atomic_write(source_path, raw_source_text)
print(' > {}'.format(pretty_path(source_path)))
log_source_saved(source_file=source_path)
return source_path

View File

@ -5,6 +5,7 @@ import re
from typing import IO, Iterable
from datetime import datetime
from pathlib import Path
from ..index.schema import Link
from ..util import (
@ -13,14 +14,40 @@ from ..util import (
URL_REGEX
)
@enforce_types
def parse_generic_txt_export(text_file: IO[str]) -> Iterable[Link]:
"""Parse raw links from each line in a text file"""
text_file.seek(0)
for line in text_file.readlines():
urls = re.findall(URL_REGEX, line) if line.strip() else ()
for url in urls: # type: ignore
if not line.strip():
continue
# if the line is a local file path that resolves, then we can archive it
if Path(line).exists():
yield Link(
url=line,
timestamp=str(datetime.now().timestamp()),
title=None,
tags=None,
sources=[text_file.name],
)
# otherwise look for anything that looks like a URL in the line
for url in re.findall(URL_REGEX, line):
yield Link(
url=htmldecode(url),
timestamp=str(datetime.now().timestamp()),
title=None,
tags=None,
sources=[text_file.name],
)
# look inside the URL for any sub-urls, e.g. for archive.org links
# https://web.archive.org/web/20200531203453/https://www.reddit.com/r/socialism/comments/gu24ke/nypd_officers_claim_they_are_protecting_the_rule/fsfq0sw/
# -> https://www.reddit.com/r/socialism/comments/gu24ke/nypd_officers_claim_they_are_protecting_the_rule/fsfq0sw/
for url in re.findall(URL_REGEX, line[1:]):
yield Link(
url=htmldecode(url),
timestamp=str(datetime.now().timestamp()),
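To see why scanning `line[1:]` surfaces nested sub-URLs, consider this illustration (using a simplified stand-in for the real `URL_REGEX`): the greedy match starting at the outer scheme swallows the inner URL, and dropping the first character breaks the outer match so the regex re-anchors on the embedded `https://`:

```python
import re

URL_REGEX = re.compile(r'https?://[^\s<>"]+')  # simplified stand-in, not the real regex

line = 'https://web.archive.org/web/2020/https://www.reddit.com/r/example/'
print(URL_REGEX.findall(line))
# ['https://web.archive.org/web/2020/https://www.reddit.com/r/example/']
print(URL_REGEX.findall(line[1:]))
# ['https://www.reddit.com/r/example/']
```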

View File

@ -4,96 +4,60 @@ __package__ = 'archivebox'
import os
import shutil
import json as pyjson
from json import dump
from pathlib import Path
from typing import Optional, Union, Set, Tuple
from subprocess import run as subprocess_run
from crontab import CronTab
from subprocess import (
Popen,
PIPE,
DEVNULL,
CompletedProcess,
TimeoutExpired,
CalledProcessError,
)
from atomicwrites import atomic_write as lib_atomic_write
from .util import enforce_types, ExtendedEncoder
from .config import OUTPUT_PERMISSIONS
def run(*popenargs, input=None, capture_output=False, timeout=None, check=False, **kwargs):
def run(*args, input=None, capture_output=True, text=False, **kwargs):
"""Patched of subprocess.run to fix blocking io making timeout=innefective"""
if input is not None:
if 'stdin' in kwargs:
raise ValueError('stdin and input arguments may not both be used.')
kwargs['stdin'] = PIPE
if capture_output:
if ('stdout' in kwargs) or ('stderr' in kwargs):
raise ValueError('stdout and stderr arguments may not be used '
'with capture_output.')
kwargs['stdout'] = PIPE
kwargs['stderr'] = PIPE
with Popen(*popenargs, **kwargs) as process:
try:
stdout, stderr = process.communicate(input, timeout=timeout)
except TimeoutExpired:
process.kill()
try:
stdout, stderr = process.communicate(input, timeout=2)
except:
pass
raise TimeoutExpired(popenargs[0][0], timeout)
except BaseException:
process.kill()
# We don't call process.wait() as .__exit__ does that for us.
raise
retcode = process.poll()
if check and retcode:
raise CalledProcessError(retcode, process.args,
output=stdout, stderr=stderr)
return CompletedProcess(process.args, retcode, stdout, stderr)
def atomic_write(contents: Union[dict, str, bytes], path: str) -> None:
"""Safe atomic write to filesystem by writing to temp file + atomic rename"""
try:
tmp_file = '{}.tmp'.format(path)
if isinstance(contents, bytes):
args = {'mode': 'wb+'}
else:
args = {'mode': 'w+', 'encoding': 'utf-8'}
with open(tmp_file, **args) as f:
if isinstance(contents, dict):
pyjson.dump(contents, f, indent=4, sort_keys=True, cls=ExtendedEncoder)
else:
f.write(contents)
os.fsync(f.fileno())
os.rename(tmp_file, path)
chmod_file(path)
finally:
if os.path.exists(tmp_file):
os.remove(tmp_file)
return subprocess_run(*args, input=input, capture_output=capture_output, text=text, **kwargs)
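Usage sketch of the simplified `run()` wrapper above: it behaves like `subprocess.run` except output is captured by default:

```python
from archivebox.system import run  # assumed import path

result = run(['echo', 'hello'], timeout=5)
print(result.returncode, result.stdout)  # 0 b'hello\n' (bytes unless text=True)
```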
@enforce_types
def chmod_file(path: str, cwd: str='.', permissions: str=OUTPUT_PERMISSIONS, timeout: int=30) -> None:
def atomic_write(path: Union[Path, str], contents: Union[dict, str, bytes], overwrite: bool=True) -> None:
"""Safe atomic write to filesystem by writing to temp file + atomic rename"""
mode = 'wb+' if isinstance(contents, bytes) else 'w'
# print('\n> Atomic Write:', mode, path, len(contents), f'overwrite={overwrite}')
with lib_atomic_write(path, mode=mode, overwrite=overwrite) as f:
if isinstance(contents, dict):
dump(contents, f, indent=4, sort_keys=True, cls=ExtendedEncoder)
elif isinstance(contents, (bytes, str)):
f.write(contents)
os.chmod(path, int(OUTPUT_PERMISSIONS, base=8))
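Note the flipped argument order versus the old helper (`atomic_write(contents, path)` becomes `atomic_write(path, contents)`). A hedged sketch of the three content types the new signature accepts:

```python
from archivebox.system import atomic_write  # assumed import path

atomic_write('index.json', {'version': '0.5.0'})   # dict  -> JSON-dumped with ExtendedEncoder
atomic_write('notes.txt', 'plain text contents')   # str   -> written as text
atomic_write('blob.bin', b'\x00\x01')              # bytes -> written in binary mode
```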
@enforce_types
def chmod_file(path: str, cwd: str='.', permissions: str=OUTPUT_PERMISSIONS) -> None:
"""chmod -R <permissions> <cwd>/<path>"""
if not os.path.exists(os.path.join(cwd, path)):
root = Path(cwd) / path
if not root.exists():
raise Exception('Failed to chmod: {} does not exist (did the previous step fail?)'.format(path))
chmod_result = run(['chmod', '-R', permissions, path], cwd=cwd, stdout=DEVNULL, stderr=PIPE, timeout=timeout)
if chmod_result.returncode == 1:
print(' ', chmod_result.stderr.decode())
raise Exception('Failed to chmod {}/{}'.format(cwd, path))
if not root.is_dir():
os.chmod(root, int(OUTPUT_PERMISSIONS, base=8))
else:
for subpath in Path(path).glob('**/*'):
os.chmod(subpath, int(OUTPUT_PERMISSIONS, base=8))
@enforce_types
@ -104,7 +68,8 @@ def copy_and_overwrite(from_path: str, to_path: str):
shutil.copytree(from_path, to_path)
else:
with open(from_path, 'rb') as src:
atomic_write(src.read(), to_path)
contents = src.read()
atomic_write(to_path, contents)
@enforce_types
@ -132,6 +97,7 @@ def get_dir_size(path: str, recursive: bool=True, pattern: Optional[str]=None) -
CRON_COMMENT = 'archivebox_schedule'
@enforce_types
def dedupe_cron_jobs(cron: CronTab) -> CronTab:
deduped: Set[Tuple[str, str]] = set()

View File

@ -0,0 +1 @@
actions_as_select

View File

@ -2,7 +2,7 @@
{% get_current_language as LANGUAGE_CODE %}{% get_current_language_bidi as LANGUAGE_BIDI %}
<html lang="{{ LANGUAGE_CODE|default:"en-us" }}" {% if LANGUAGE_BIDI %}dir="rtl"{% endif %}>
<head>
<title>{% block title %}{% endblock %}</title>
<title>{% block title %}{% endblock %} | ArchiveBox</title>
<link rel="stylesheet" type="text/css" href="{% block stylesheet %}{% static "admin/css/base.css" %}{% endblock %}">
{% block extrastyle %}{% endblock %}
{% if LANGUAGE_BIDI %}<link rel="stylesheet" type="text/css" href="{% block stylesheet_rtl %}{% static "admin/css/rtl.css" %}{% endblock %}">{% endif %}
@ -13,12 +13,61 @@
{% if LANGUAGE_BIDI %}<link rel="stylesheet" type="text/css" href="{% static "admin/css/responsive_rtl.css" %}">{% endif %}
{% endblock %}
{% block blockbots %}<meta name="robots" content="NONE,NOARCHIVE">{% endblock %}
<link rel="stylesheet" type="text/css" href="{% static "admin.css" %}">
</head>
{% load i18n %}
<body class="{% if is_popup %}popup {% endif %}{% block bodyclass %}{% endblock %}"
data-admin-utc-offset="{% now "Z" %}">
<style nonce="{{nonce}}">
/* Loading Progress Bar */
#progress {
position: absolute;
z-index: 1000;
top: 0px;
left: -6px;
width: 2%;
opacity: 1;
height: 2px;
background: #1a1a1a;
border-radius: 1px;
transition: width 4s ease-out, opacity 400ms linear;
}
@-moz-keyframes bugfix { from { padding-right: 1px ; } to { padding-right: 0; } }
</style>
<script>
// Page Loading Bar
window.loadStart = function(distance) {
var distance = distance || 0;
// only add progress bar if not already present
if (django.jQuery("#loading-bar").length == 0) {
django.jQuery("body").add("<div id=\"loading-bar\"></div>");
}
if (django.jQuery("#progress").length === 0) {
django.jQuery("body").append(django.jQuery("<div></div>").attr("id", "progress"));
let last_distance = (distance || (30 + (Math.random() * 30)))
django.jQuery("#progress").width(last_distance + "%");
setInterval(function() {
last_distance += Math.random()
django.jQuery("#progress").width(last_distance + "%");
}, 1000)
}
};
window.loadFinish = function() {
django.jQuery("#progress").width("101%").delay(200).fadeOut(400, function() {
django.jQuery(this).remove();
});
};
window.loadStart();
window.addEventListener('beforeunload', function() {window.loadStart(27)});
document.addEventListener('DOMContentLoaded', function() {window.loadFinish()});
</script>
<!-- Container -->
<div id="container">
@ -26,14 +75,22 @@
<!-- Header -->
<div id="header">
<div id="branding">
{% block branding %}{% endblock %}
<h1 id="site-name">
<a href="{% url 'Home' %}">
<img src="{% static 'archive.png' %}" id="logo">
ArchiveBox
</a>
</h1>
</div>
{% block usertools %}
{% if has_permission %}
<div id="user-tools">
<a href="/add/">Add Links</a> /
<a href="/">Main Index</a> /
<a href="https://github.com/pirate/ArchiveBox/wiki">Docs</a>
<a href="{% url 'admin:Add' %}">Add </a> /
<a href="{% url 'Home' %}">Snapshots</a> /
<a href="/admin/auth/user/">Users</a> /
<a href="{% url 'OldHome' %}">Old UI</a> /
<a href="{% url 'Docs' %}">Docs</a>
&nbsp; &nbsp;
{% block welcome-msg %}
{% trans 'User' %}
@ -76,7 +133,7 @@
<!-- Content -->
<div id="content" class="{% block coltype %}colM{% endblock %}">
{% block pretitle %}{% endblock %}
{% block content_title %}{% if title %}<h1>{{ title }}</h1>{% endif %}{% endblock %}
{% block content_title %}{# {% if title %}<h1>{{ title }}</h1>{% endif %} #}{% endblock %}
{% block content %}
{% block object-tools %}{% endblock %}
{{ content }}
@ -90,5 +147,42 @@
</div>
<!-- END Container -->
<script>
(function ($) {
$.fn.reverse = [].reverse;
function fix_actions() {
var container = $('div.actions');
if (container.find('option').length < 10) {
container.find('label, button').hide();
var buttons = $('<div></div>')
.prependTo(container)
.css('display', 'inline')
.addClass('class', 'action-buttons');
container.find('option:gt(0)').reverse().each(function () {
const name = this.value
$('<button>')
.appendTo(buttons)
.attr('name', this.value)
.addClass('button')
.text(this.text)
.click(function () {
container.find('select')
.find(':selected').attr('selected', '').end()
.find('[value=' + this.name + ']').attr('selected', 'selected');
$('#changelist-form button[name="index"]').click();
document.querySelector('#logo').outerHTML = '<div class="loader"></div>'
});
});
}
};
$(function () {
fix_actions();
});
})(django.jQuery);
</script>
</body>
</html>

View File

@ -11,7 +11,7 @@
{% block usertools %}
<br/>
<a href="/">Back to Main Index</a>
<a href="{% url 'Home' %}">Back to Main Index</a>
{% endblock %}
{% block nav-global %}{% endblock %}

View File

@ -1,209 +1,100 @@
{% load static %}
{% extends "admin/index.html" %}
{% load i18n %}
<!DOCTYPE html>
<html lang="en">
<head>
<title>Archived Sites</title>
<meta charset="utf-8" name="viewport" content="width=device-width, initial-scale=1">
{% block breadcrumbs %}
<div class="breadcrumbs">
<a href="{% url 'admin:index' %}">{% trans 'Home' %}</a>
{% if title %} &rsaquo; {{ title }}{% endif %}
</div>
{% endblock %}
{% block content %}
<style>
html, body {
.dashboard #content {
width: 100%;
height: 100%;
margin-right: 0px;
margin-left: 0px;
}
#submit {
border: 1px solid rgba(0,0,0,0.2);
padding: 10px;
border-radius: 4px;
background-color: #f5dd5d;
color: #333;
font-size: 18px;
font-weight: 200;
text-align: center;
margin: 0px;
padding: 0px;
font-family: "Gill Sans", Helvetica, sans-serif;
font-weight: 800;
}
.header-top small {
font-weight: 200;
color: #efefef;
#add-form button[role=submit]:hover {
background-color: #e5cd4d;
}
.header-top {
#add-form label {
display: block;
font-size: 16px;
}
#add-form textarea {
width: 100%;
height: auto;
min-height: 40px;
margin: 0px;
text-align: center;
color: white;
font-size: calc(11px + 0.84vw);
font-weight: 200;
padding: 4px 4px;
border-bottom: 3px solid #aa1e55;
background-color: #aa1e55;
min-height: 300px;
}
input[type=search] {
width: 22vw;
#delay-warning div {
border: 1px solid red;
border-radius: 4px;
border: 1px solid #aeaeae;
padding: 3px 5px;
margin: 10px;
padding: 10px;
font-size: 15px;
background-color: #F5DD5D;
}
.nav > div {
min-height: 30px;
}
.header-top a {
text-decoration: none;
color: rgba(0,0,0,0.6);
}
.header-top a:hover {
text-decoration: none;
color: rgba(0,0,0,0.9);
}
.header-top .col-lg-4 {
text-align: center;
padding-top: 4px;
padding-bottom: 4px;
}
.header-archivebox img {
display: inline-block;
margin-right: 3px;
height: 30px;
margin-left: 12px;
margin-top: -4px;
margin-bottom: 2px;
}
.header-archivebox img:hover {
opacity: 0.5;
}
#table-bookmarks_length, #table-bookmarks_filter {
padding-top: 12px;
opacity: 0.8;
padding-left: 24px;
padding-right: 22px;
margin-bottom: -16px;
}
table {
padding: 6px;
width: 100%;
}
table thead th {
font-weight: 400;
}
table tr {
height: 35px;
}
tbody tr:nth-child(odd) {
background-color: #ffebeb !important;
}
table tr td {
white-space: nowrap;
overflow: hidden;
/*padding-bottom: 0.4em;*/
/*padding-top: 0.4em;*/
padding-left: 2px;
text-align: center;
}
table tr td a {
text-decoration: none;
}
table tr td img, table tr td object {
display: inline-block;
margin: auto;
height: 24px;
width: 24px;
padding: 0px;
padding-right: 5px;
vertical-align: middle;
margin-left: 4px;
}
#table-bookmarks {
width: 100%;
overflow-y: scroll;
table-layout: fixed;
}
.dataTables_wrapper {
background-color: #fafafa;
}
table tr a span[data-archived~=False] {
opacity: 0.4;
}
.files-spinner {
height: 15px;
width: auto;
opacity: 0.5;
vertical-align: -2px;
}
.in-progress {
display: none;
}
body[data-status~=finished] .files-spinner {
display: none;
}
/*body[data-status~=running] .in-progress {
display: inline-block;
}*/
tr td a.favicon img {
padding-left: 6px;
padding-right: 12px;
vertical-align: -4px;
}
tr td a.title {
font-size: 1.4em;
text-decoration:none;
color:black;
}
tr td a.title small {
background-color: #efefef;
#stdout {
background-color: #ded;
padding: 10px 10px;
border-radius: 4px;
float:right
}
input[type=search]::-webkit-search-cancel-button {
-webkit-appearance: searchfield-cancel-button;
}
.title-col {
text-align: left;
}
.title-col a {
color: black;
white-space: normal;
}
</style>
<link rel="stylesheet" href="{% static 'bootstrap.min.css' %}">
<link rel="stylesheet" href="{% static 'jquery.dataTables.min.css' %}"/>
<script src="{% static 'jquery.min.js' %}"></script>
<script src="{% static 'jquery.dataTables.min.js' %}"></script>
<script>
document.addEventListener('error', function(e) {
e.target.style.opacity = 0;
}, true)
jQuery(document).ready(function() {
jQuery('#table-bookmarks').DataTable({
stateSave: true, // save state (filtered input, number of entries shown, etc) in localStorage
dom: '<lf<t>ip>', // how to show the table and its helpers (filter, etc) in the DOM
order: [[0, 'desc']],
iDisplayLength: 100,
});
});
</script>
</head>
<body data-status="finished">
<header>
<div class="header-top container-fluid">
<div class="row nav">
<div class="col-sm-2">
<a href="/" class="header-archivebox" title="Last updated: {{updated}}">
<img src="{% static 'archive.png' %}" alt="Logo"/>
ArchiveBox: Add
</a>
</div>
<div class="col-sm-10" style="text-align: right">
<a href="/">Main Index</a> &nbsp; | &nbsp;
<a href="/admin/">Admin</a> &nbsp; | &nbsp;
<a href="https://github.com/pirate/ArchiveBox/wiki">Docs</a>
</div>
</div>
</div>
</header>
<center>
<div style="max-width: 550px; margin: auto; float: none">
<br/><br/>
<form action="?" method="POST">{% csrf_token %}
Add new links...<br/>
<input type="text" name="url" placeholder="URL of page or feed..."/><br/>
<button role="submit">Add</button>
</form>
{% if stdout %}
<h1>Add new URLs to your archive: results</h1>
<pre id="stdout">
{{ stdout | safe }}
<br/><br/>
</pre>
<br/>
<center>
<a href="/add" id="submit">&nbsp; Add more URLs </a>
</center>
{% else %}
<form id="add-form" action="?" method="POST" class="p-form">{% csrf_token %}
<h1>Add new URLs to your archive</h1>
<br/>
{{ form.as_p }}
<center>
<button role="submit" id="submit">&nbsp; Add URLs and archive </button>
</center>
</form>
<br/><br/><br/>
<center id="delay-warning" style="display: none">
<b><i>This page will be unresponsive until the process is completely finished.</i></b>
<br/><br/>
<div>
Warning: it may take several minutes to finish adding!<br/>
<br/>
Progress will be displayed in the <code>archivebox server</code> stdout,<br/>
and on this page once the archiving process completes.<br/>
<br/>
<small>(it's safe to leave this page, adding will continue in the background)</small>
</div>
</center>
<script>
document.getElementById('add-form').addEventListener('submit', function(event) {
setTimeout(function() {
document.getElementById('add-form').innerHTML = '<center><h3>Adding URLs to index and running archive methods...</h3><br/><div class="loader"></div><br/>(see terminal for progress)</center>'
document.getElementById('delay-warning').style.display = 'block'
}, 200)
return true
})
</script>
{% endif %}
</div>
{% endblock %}
</body>
</html>
{% block sidebar %}{% endblock %}

View File

@ -6,6 +6,37 @@
<title>Archived Sites</title>
<meta charset="utf-8" name="viewport" content="width=device-width, initial-scale=1">
<style>
:root {
--bg-main: #efefef;
--accent-1: #aa1e55;
--accent-2: #ffebeb;
--accent-3: #efefef;
--text-1: #1c1c1c;
--text-2: #eaeaea;
--text-main: #1a1a1a;
--font-main: "Gill Sans", Helvetica, sans-serif;
}
/* Dark Mode (WIP) */
/*
@media (prefers-color-scheme: dark) {
:root {
--accent-2: hsl(160, 100%, 96%);
--text-1: #eaeaea;
--text-2: #1a1a1a;
--bg-main: #101010;
}
#table-bookmarks_wrapper,
#table-bookmarks_wrapper img,
tbody td:nth-child(3),
tbody td:nth-child(3) span,
footer {
filter: invert(100%);
}
}*/
html, body {
width: 100%;
height: 100%;
@ -14,11 +45,12 @@
text-align: center;
margin: 0px;
padding: 0px;
font-family: "Gill Sans", Helvetica, sans-serif;
font-family: var(--font-main);
}
.header-top small {
font-weight: 200;
color: #efefef;
color: var(--accent-3);
}
.header-top {
@ -31,8 +63,8 @@
font-size: calc(11px + 0.84vw);
font-weight: 200;
padding: 4px 4px;
border-bottom: 3px solid #aa1e55;
background-color: #aa1e55;
border-bottom: 3px solid var(--accent-1);
background-color: var(--accent-1);
}
input[type=search] {
width: 22vw;
@ -86,7 +118,7 @@
height: 35px;
}
tbody tr:nth-child(odd) {
background-color: #ffebeb !important;
background-color: var(--accent-2) !important;
}
table tr td {
white-space: nowrap;
@ -146,7 +178,7 @@
color:black;
}
tr td a.title small {
background-color: #efefef;
background-color: var(--accent-3);
border-radius: 4px;
float:right
}
@ -190,7 +222,7 @@
</div>
<div class="col-sm-10" style="text-align: right">
<a href="/add/">Add Links</a> &nbsp; | &nbsp;
<a href="/admin/core/page/">Admin</a> &nbsp; | &nbsp;
<a href="/admin/core/snapshot/">Admin</a> &nbsp; | &nbsp;
<a href="https://github.com/pirate/ArchiveBox/wiki">Docs</a>
</div>
</div>
@ -216,7 +248,7 @@
<a href="archive/{{link.timestamp}}/index.html"><img src="{% static 'spinner.gif' %}" class="link-favicon" decoding="async"></a>
{% endif %}
<a href="archive/{{link.timestamp}}/{{link.canonical_outputs.wget_path}}" title="{{link.title}}">
<span data-title-for="{{link.url}}" data-archived="{{link.is_archived}}">{{link.title}}</span>
<span data-title-for="{{link.url}}" data-archived="{{link.is_archived}}">{{link.title|default:'Loading...'}}</span>
<small style="float:right">{{link.tags|default:''}}</small>
</a>
</td>

View File

@ -0,0 +1,224 @@
#logo {
height: 30px;
vertical-align: -6px;
padding-right: 5px;
}
#site-name:hover a {
opacity: 0.9;
}
#site-name .loader {
height: 25px;
width: 25px;
display: inline-block;
border-width: 3px;
vertical-align: -3px;
margin-right: 5px;
margin-top: 2px;
}
#branding h1, #branding h1 a:link, #branding h1 a:visited {
color: mintcream;
}
#header {
background: #aa1e55;
padding: 6px 14px;
}
#content {
padding: 8px 8px;
}
#user-tools {
font-size: 13px;
}
div.breadcrumbs {
background: #772948;
color: #f5dd5d;
padding: 6px 15px;
}
body.model-snapshot.change-list div.breadcrumbs,
body.model-snapshot.change-list #content .object-tools {
display: none;
}
.module h2, .module caption, .inline-group h2 {
background: #772948;
}
#content .object-tools {
margin-top: -35px;
margin-right: -10px;
float: right;
}
#content .object-tools a:link, #content .object-tools a:visited {
border-radius: 0px;
background-color: #f5dd5d;
color: #333;
font-size: 12px;
font-weight: 800;
}
#content .object-tools a.addlink {
background-blend-mode: difference;
}
#content #changelist #toolbar {
padding: 0px;
background: none;
margin-bottom: 10px;
border-top: 0px;
border-bottom: 0px;
}
#content #changelist #toolbar form input[type="submit"] {
border-color: #aa1e55;
}
#content #changelist-filter li.selected a {
color: #aa1e55;
}
/*#content #changelist .actions {
position: fixed;
bottom: 0px;
z-index: 800;
}*/
#content #changelist .actions {
float: right;
margin-top: -34px;
padding: 0px;
background: none;
margin-right: 0px;
}
#content #changelist .actions .button {
border-radius: 2px;
background-color: #f5dd5d;
color: #333;
font-size: 12px;
font-weight: 800;
margin-right: 4px;
box-shadow: 4px 4px 4px rgba(0,0,0,0.02);
border: 1px solid rgba(0,0,0,0.08);
}
#content #changelist .actions .button:hover {
border: 1px solid rgba(0,0,0,0.2);
opacity: 0.9;
}
#content #changelist .actions .button[name=verify_snapshots], #content #changelist .actions .button[name=update_titles] {
background-color: #dedede;
color: #333;
}
#content #changelist .actions .button[name=update_snapshots] {
background-color:lightseagreen;
color: #333;
}
#content #changelist .actions .button[name=overwrite_snapshots] {
background-color: #ffaa31;
color: #333;
}
#content #changelist .actions .button[name=delete_snapshots] {
background-color: #f91f74;
color: rgb(255 248 252 / 64%);
}
#content #changelist-filter h2 {
border-radius: 4px 4px 0px 0px;
}
@media (min-width: 767px) {
#content #changelist-filter {
top: 35px;
width: 110px;
margin-bottom: 35px;
}
.change-list .filtered .results,
.change-list .filtered .paginator,
.filtered #toolbar,
.filtered div.xfull {
margin-right: 115px;
}
}
@media (max-width: 1127px) {
#content #changelist .actions {
position: fixed;
bottom: 6px;
left: 10px;
float: left;
z-index: 1000;
}
}
#content a img.favicon {
height: 20px;
width: 20px;
vertical-align: -5px;
padding-right: 6px;
}
#content td, #content th {
vertical-align: middle;
padding: 4px;
}
#content #changelist table input {
vertical-align: -2px;
}
#content thead th .text a {
padding: 8px 4px;
}
#content th.field-added, #content td.field-updated {
word-break: break-word;
min-width: 128px;
white-space: normal;
}
#content th.field-title_str {
min-width: 300px;
}
#content td.field-files {
white-space: nowrap;
}
#content td.field-files .exists-True {
opacity: 1;
}
#content td.field-files .exists-False {
opacity: 0.1;
filter: grayscale(100%);
}
#content td.field-size {
white-space: nowrap;
}
#content td.field-url_str {
word-break: break-all;
min-width: 200px;
}
#content tr b.status-pending {
font-weight: 200;
opacity: 0.6;
}
.loader {
border: 16px solid #f3f3f3; /* Light grey */
border-top: 16px solid #3498db; /* Blue */
border-radius: 50%;
width: 30px;
height: 30px;
box-sizing: border-box;
animation: spin 2s linear infinite;
}
@keyframes spin {
0% { transform: rotate(0deg); }
100% { transform: rotate(360deg); }
}

View File

[Six binary image assets changed in this diff (17 KiB, 1.6 KiB, 158 B, 201 B, 157 B, 11 KiB); before/after image previews omitted.]

View File

@ -79,6 +79,7 @@
.card {
overflow: hidden;
box-shadow: 2px 3px 14px 0px rgba(0,0,0,0.02);
margin-top: 10px;
}
.card h4 {
font-size: 1.4vw;
@ -335,6 +336,18 @@
</div>
</div>
</div>
<div class="col-lg-2">
<div class="card">
<iframe class="card-img-top" src="$singlefile_path" sandbox="allow-same-origin allow-scripts allow-forms" scrolling="no"></iframe>
<div class="card-body">
<a href="$singlefile_path" style="float:right" title="Open in new tab..." target="_blank" rel="noopener">
<img src="../../static/external.png" class="external"/>
</a>
<a href="$singlefile_path" target="preview"><h4 class="card-title">SingleFile</h4></a>
<p class="card-text">archive/singlefile.html</p>
</div>
</div>
</div>
<div class="col-lg-2">
<div class="card">
<iframe class="card-img-top pdf-frame" src="$pdf_path" scrolling="no"></iframe>
@ -359,18 +372,6 @@
</div>
</div>
</div>
<div class="col-lg-2">
<div class="card">
<iframe class="card-img-top" src="$url" sandbox="allow-same-origin allow-scripts allow-forms" scrolling="no"></iframe>
<div class="card-body">
<a href="$url" style="float:right" title="Open in new tab..." target="_blank" rel="noopener">
<img src="../../static/external.png" class="external"/>
</a>
<a href="$url" target="preview"><h4 class="card-title">Original</h4></a>
<p class="card-text">$domain</p>
</div>
</div>
</div>
<div class="col-lg-2">
<div class="card">
<iframe class="card-img-top" src="$archive_org_path" sandbox="allow-same-origin allow-scripts allow-forms" scrolling="no"></iframe>
@ -383,6 +384,18 @@
</div>
</div>
</div>
<div class="col-lg-2">
<div class="card">
<iframe class="card-img-top" src="$url" sandbox="allow-same-origin allow-scripts allow-forms" scrolling="no"></iframe>
<div class="card-body">
<a href="$url" style="float:right" title="Open in new tab..." target="_blank" rel="noopener">
<img src="../../static/external.png" class="external"/>
</a>
<a href="$url" target="preview"><h4 class="card-title">Original</h4></a>
<p class="card-text">$domain</p>
</div>
</div>
</div>
</div>
</div>
</header>

View File

@ -1,26 +1,27 @@
__package__ = 'archivebox'
import re
import ssl
import json as pyjson
from typing import List, Optional, Any
from inspect import signature
from functools import wraps
from hashlib import sha256
from urllib.request import Request, urlopen
from urllib.parse import urlparse, quote, unquote
from html import escape, unescape
from datetime import datetime
from dateparser import parse as dateparser
import requests
from base32_crockford import encode as base32_encode # type: ignore
import json as pyjson
from w3lib.encoding import html_body_declared_encoding, http_content_type_encoding
from .config import (
TIMEOUT,
STATICFILE_EXTENSIONS,
CHECK_SSL_VALIDITY,
WGET_USER_AGENT,
CHROME_OPTIONS,
)
try:
import chardet
detect_encoding = lambda rawdata: chardet.detect(rawdata)["encoding"]
except ImportError:
detect_encoding = lambda rawdata: "utf-8"
### Parsing Helpers
@ -42,7 +43,6 @@ base_url = lambda url: without_scheme(url) # uniq base url used to dedupe links
without_www = lambda url: url.replace('://www.', '://', 1)
without_trailing_slash = lambda url: url[:-1] if url[-1] == '/' else url.replace('/?', '?')
hashurl = lambda url: base32_encode(int(sha256(base_url(url).encode('utf-8')).hexdigest(), 16))[:20]
is_static_file = lambda url: extension(url).lower() in STATICFILE_EXTENSIONS # TODO: the proper way is with MIME type detection, not using extension
urlencode = lambda s: s and quote(s, encoding='utf-8', errors='replace')
urldecode = lambda s: s and unquote(s)
@ -63,6 +63,13 @@ URL_REGEX = re.compile(
re.IGNORECASE,
)
COLOR_REGEX = re.compile(r'\[(?P<arg_1>\d+)(;(?P<arg_2>\d+)(;(?P<arg_3>\d+))?)?m')
def is_static_file(url: str):
# TODO: the proper way is with MIME type detection + ext, not only extension
from .config import STATICFILE_EXTENSIONS
return extension(url).lower() in STATICFILE_EXTENSIONS
def enforce_types(func):
"""
@ -140,74 +147,38 @@ def parse_date(date: Any) -> Optional[datetime]:
date = str(date)
if isinstance(date, str):
if date.replace('.', '').isdigit():
# this is a brittle attempt at unix timestamp parsing (which is
# notoriously hard to do). It may lead to dates being off by
# anything from hours to decades, depending on which app, OS,
# and system time configuration was used for the original timestamp
# more info: https://github.com/pirate/ArchiveBox/issues/119
# Note: always always always store the original timestamp string
# somewhere independently of the parsed datetime, so that later
# bugs don't repeatedly misparse and rewrite increasingly worse dates.
# the correct date can always be re-derived from the timestamp str
timestamp = float(date)
EARLIEST_POSSIBLE = 473403600.0 # 1985
LATEST_POSSIBLE = 1735707600.0 # 2025
if EARLIEST_POSSIBLE < timestamp < LATEST_POSSIBLE:
# number is seconds
return datetime.fromtimestamp(timestamp)
elif EARLIEST_POSSIBLE * 1000 < timestamp < LATEST_POSSIBLE * 1000:
# number is milliseconds
return datetime.fromtimestamp(timestamp / 1000)
elif EARLIEST_POSSIBLE * 1000*1000 < timestamp < LATEST_POSSIBLE * 1000*1000:
# number is microseconds
return datetime.fromtimestamp(timestamp / (1000*1000))
else:
# continue to the end and raise a parsing failed error.
# we don't want to even attempt parsing timestamp strings that
# aren't within these ranges
pass
if '-' in date:
# 2019-04-07T05:44:39.227520
try:
return datetime.fromisoformat(date)
except Exception:
pass
try:
return datetime.strptime(date, '%Y-%m-%d %H:%M')
except Exception:
pass
return dateparser(date)
raise ValueError('Tried to parse invalid date! {}'.format(date))
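Worked examples of the magnitude heuristic above (timestamp values illustrative):

```python
from datetime import datetime

datetime.fromtimestamp(1588257600)                  # ~2020-04-30, seconds range
datetime.fromtimestamp(1588257600000 / 1000)        # same instant, given in milliseconds
datetime.fromtimestamp(1588257600000000 / 1000000)  # same instant, given in microseconds
```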
@enforce_types
def download_url(url: str, timeout: int=TIMEOUT) -> str:
def download_url(url: str, timeout: int=None) -> str:
"""Download the contents of a remote url and return the text"""
from .config import TIMEOUT, CHECK_SSL_VALIDITY, WGET_USER_AGENT
timeout = timeout or TIMEOUT
response = requests.get(
url,
headers={'User-Agent': WGET_USER_AGENT},
verify=CHECK_SSL_VALIDITY,
timeout=timeout,
)
req = Request(url, headers={'User-Agent': WGET_USER_AGENT})
content_type = response.headers.get('Content-Type', '')
encoding = http_content_type_encoding(content_type) or html_body_declared_encoding(response.text)
if CHECK_SSL_VALIDITY:
resp = urlopen(req, timeout=timeout)
else:
insecure = ssl._create_unverified_context()
resp = urlopen(req, timeout=timeout, context=insecure)
if encoding is not None:
response.encoding = encoding
encoding = resp.headers.get_content_charset() or 'utf-8' # type: ignore
return resp.read().decode(encoding)
return response.text
@enforce_types
def chrome_args(**options) -> List[str]:
"""helper to build up a chrome shell command with arguments"""
from .config import CHROME_OPTIONS
options = {**CHROME_OPTIONS, **options}
cmd_args = [options['CHROME_BINARY']]
@ -216,8 +187,16 @@ def chrome_args(**options) -> List[str]:
cmd_args += ('--headless',)
if not options['CHROME_SANDBOX']:
# dont use GPU or sandbox when running inside docker container
cmd_args += ('--no-sandbox', '--disable-gpu')
# assume this means we are running inside a docker container
# in docker, GPU support is limited, sandboxing is unnecessary,
# and SHM is limited to 64MB by default (which is too low to be usable).
cmd_args += (
'--no-sandbox',
'--disable-gpu',
'--disable-dev-shm-usage',
'--disable-software-rasterizer',
)
if not options['CHECK_SSL_VALIDITY']:
cmd_args += ('--disable-web-security', '--ignore-certificate-errors')
@ -236,6 +215,46 @@ def chrome_args(**options) -> List[str]:
return cmd_args
def ansi_to_html(text):
"""
Based on: https://stackoverflow.com/questions/19212665/python-converting-ansi-color-codes-to-html
"""
from .config import COLOR_DICT
TEMPLATE = '<span style="color: rgb{}"><br>'
text = text.replace('[m', '</span>')
def single_sub(match):
argsdict = match.groupdict()
if argsdict['arg_3'] is None:
if argsdict['arg_2'] is None:
_, color = 0, argsdict['arg_1']
else:
_, color = argsdict['arg_1'], argsdict['arg_2']
else:
_, color = argsdict['arg_3'], argsdict['arg_2']
return TEMPLATE.format(COLOR_DICT[color][0])
return COLOR_REGEX.sub(single_sub, text)
class AttributeDict(dict):
"""Helper to allow accessing dict values via Example.key or Example['key']"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
# Recursively convert nested dicts to AttributeDicts (optional):
# for key, val in self.items():
# if isinstance(val, dict) and type(val) is not AttributeDict:
# self[key] = AttributeDict(val)
def __getattr__(self, attr: str) -> Any:
return dict.__getitem__(self, attr)
def __setattr__(self, attr: str, value: Any) -> None:
return dict.__setitem__(self, attr, value)
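Usage sketch of the `AttributeDict` helper defined above:

```python
d = AttributeDict({'status': 'succeeded', 'output': 'index.html'})
assert d.status == d['status'] == 'succeeded'
d.cmd_version = '1.0'             # attribute writes land in the dict too
assert d['cmd_version'] == '1.0'
```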
class ExtendedEncoder(pyjson.JSONEncoder):
"""

7
bin/archive Executable file
View File

@ -0,0 +1,7 @@
#!/bin/sh
echo "[X] This method of running ArchiveBox is deprecated as of >= v0.4."
echo " You should 'pip install archivebox' and use the installed 'archivebox' binary instead."
echo " For more info, see the Quickstart section of the README.md:"
echo " https://github.com/pirate/ArchiveBox#Quickstart"
exit 2

View File

@ -1 +0,0 @@
../archivebox/__main__.py

34
bin/docker_entrypoint.sh Executable file
View File

@ -0,0 +1,34 @@
#!/usr/bin/env bash
# Autodetect UID,GID of host user based on ownership of files in the data volume
DATA_DIR="${DATA_DIR:-/data}"
ARCHIVEBOX_USER="${ARCHIVEBOX_USER:-archivebox}"
USID=$(stat --format="%u" "$DATA_DIR")
GRID=$(stat --format="%g" "$DATA_DIR")
# If user is not root, modify the archivebox user+files to have the same uid,gid
if [[ "$USID" != 0 && "$GRID" != 0 ]]; then
usermod -u "$USID" "$ARCHIVEBOX_USER"
groupmod -g "$GRID" "$ARCHIVEBOX_USER"
chown -R "$USID":"$GRID" "/home/$ARCHIVEBOX_USER"
chown "$USID":"$GRID" "$DATA_DIR"
chown "$USID":"$GRID" "$DATA_DIR/*" > /dev/null 2>&1 || true
fi
# Run commands as the new archivebox user in Docker.
# Any files touched will have the same uid & gid
# inside Docker and outside on the host machine.
if [[ "$1" == /* || "$1" == "echo" || "$1" == "archivebox" ]]; then
# arg 1 is a binary, execute it verbatim
# e.g. "archivebox init"
# "/bin/bash"
# "echo"
gosu "$ARCHIVEBOX_USER" bash -c "$*"
else
# no command given, assume args were meant to be passed to archivebox cmd
# e.g. "add https://example.com"
# "manage createsupseruser"
# "server 0.0.0.0:8000"
gosu "$ARCHIVEBOX_USER" bash -c "archivebox $*"
fi

View File

@ -35,3 +35,19 @@ if [[ "$1" == "--firefox" ]]; then
echo "Firefox history exported to:"
echo " output/sources/firefox_history.json"
fi
if [[ "$1" == "--safari" ]]; then
# Safari
if [[ -e "$2" ]]; then
cp "$2" "$REPO_DIR/output/sources/safari_history.db.tmp"
else
default="~/Library/Safari/History.db"
echo "Defaulting to history db: $default"
echo "Optionally specify the path to a different sqlite history database as the 2nd argument."
cp "$default" "$REPO_DIR/output/sources/safari_history.db.tmp"
fi
sqlite3 "$REPO_DIR/output/sources/safari_history.db.tmp" "select url from history_items" > "$REPO_DIR/output/sources/safari_history.json"
rm "$REPO_DIR"/output/sources/safari_history.db.*
echo "Safari history exported to:"
echo " output/sources/safari_history.json"
fi

23
bin/lint.sh Executable file
View File

@ -0,0 +1,23 @@
#!/usr/bin/env bash
### Bash Environment Setup
# http://redsymbol.net/articles/unofficial-bash-strict-mode/
# https://www.gnu.org/software/bash/manual/html_node/The-Set-Builtin.html
# set -o xtrace
set -o errexit
set -o errtrace
set -o nounset
set -o pipefail
IFS=$'\n'
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && cd .. && pwd )"
source "$DIR/.venv/bin/activate"
echo "[*] Running flake8..."
flake8 archivebox && echo "√ No errors found."
echo
echo "[*] Running mypy..."
echo "(skipping for now, run 'mypy archivebox' to run it manually)"

80
bin/release.sh Executable file
View File

@ -0,0 +1,80 @@
#!/usr/bin/env bash
### Bash Environment Setup
# http://redsymbol.net/articles/unofficial-bash-strict-mode/
# https://www.gnu.org/software/bash/manual/html_node/The-Set-Builtin.html
# set -o xtrace
set -o errexit
set -o errtrace
set -o nounset
set -o pipefail
IFS=$'\n'
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && cd .. && pwd )"
VERSION_FILE="$DIR/archivebox/VERSION"
function bump_semver {
echo "$1" | awk -F. '{$NF = $NF + 1;} 1' | sed 's/ /./g'
}
source "$DIR/.venv/bin/activate"
cd "$DIR"
OLD_VERSION="$(cat "$VERSION_FILE")"
NEW_VERSION="$(bump_semver "$OLD_VERSION")"
echo "[*] Fetching latest docs version"
cd "$DIR/docs"
git pull
cd "$DIR"
echo "[+] Building docs"
sphinx-apidoc -o docs archivebox
cd "$DIR/docs"
make html
cd "$DIR"
if [ -z "$(git status --porcelain)" ] && [[ "$(git branch --show-current)" == "master" ]]; then
git pull
else
echo "[X] Commit your changes and make sure git is checked out on clean master."
exit 4
fi
echo "[*] Bumping VERSION from $OLD_VERSION to $NEW_VERSION"
echo "$NEW_VERSION" > "$VERSION_FILE"
git add "$VERSION_FILE"
git commit -m "$NEW_VERSION release"
git tag -a "v$NEW_VERSION" -m "v$NEW_VERSION"
git push origin master
git push origin --tags
echo "[*] Cleaning up build dirs"
cd "$DIR"
rm -Rf build dist
echo "[+] Building sdist and bdist_wheel"
python3 setup.py sdist bdist_wheel
echo "[^] Uploading to test.pypi.org"
python3 -m twine upload --repository testpypi dist/*
echo "[^] Uploading to pypi.org"
python3 -m twine upload --repository pypi dist/*
echo "[+] Building docker image"
docker build . -t archivebox \
-t archivebox:latest \
-t archivebox:$NEW_VERSION \
-t docker.io/nikisweeting/archivebox:latest \
-t docker.io/nikisweeting/archivebox:$NEW_VERSION \
-t docker.pkg.github.com/pirate/archivebox/archivebox:latest \
-t docker.pkg.github.com/pirate/archivebox/archivebox:$NEW_VERSION
echo "[^] Uploading docker image"
# docker login --username=nikisweeting
# docker login docker.pkg.github.com --username=pirate
docker push docker.io/nikisweeting/archivebox
docker push docker.pkg.github.com/pirate/archivebox/archivebox
echo "[√] Done. Published version v$NEW_VERSION"

17
bin/test.sh Executable file
View File

@ -0,0 +1,17 @@
#!/usr/bin/env bash
### Bash Environment Setup
# http://redsymbol.net/articles/unofficial-bash-strict-mode/
# https://www.gnu.org/software/bash/manual/html_node/The-Set-Builtin.html
# set -o xtrace
set -o errexit
set -o errtrace
set -o nounset
set -o pipefail
IFS=$'\n'
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && cd .. && pwd )"
source "$DIR/.venv/bin/activate"
pytest

View File

@ -1,32 +1,75 @@
# This docker-compose config for ArchiveBox runs the following containers:
# - ArchiveBox (it creates the initial archive, then sleeps forever to allow commands to be run with exec to add links)
# - nginx webserver running on https://127.0.0.1:8098
# Usage:
# docker-compose up -d
# echo "https://example.com" | docker-compose exec -T archivebox /bin/archive
# docker-compose exec archivebox /bin/archive https://example.com/some/feed.rss
# docker-compose run archivebox init
# echo "https://example.com" | docker-compose run archivebox archivebox add
# docker-compose run archivebox add --depth=1 https://example.com/some/feed.rss
# docker-compose run archivebox config --set PUBLIC_INDEX=True
# Documentation:
# https://github.com/pirate/ArchiveBox/wiki/Docker#docker-compose
version: '3'
version: '3.7'
services:
archivebox:
build: .
# build: .
image: nikisweeting/archivebox:latest
command: server 0.0.0.0:8000
stdin_open: true
tty: true
# env_file: path/to/your/ArchiveBox.conf
ports:
- 8000:8000
environment:
- USE_COLOR=False
- USE_COLOR=True
- SHOW_PROGRESS=False
volumes:
- ./data:/data
command: bash -c 'echo "https://github.com/pirate/ArchiveBox" | /bin/archive; tail -f /dev/null'
nginx:
image: 'nginx'
ports:
- '8098:80'
volumes:
- ./etc/nginx/nginx.conf:/etc/nginx/nginx.conf
- ./data:/var/www
# Optional Addons: tweak these examples as needed for your specific use case
# Example: Run scheduled imports in a docker instead of using cron on the
# host machine, add tasks and see more info with archivebox schedule --help
# scheduler:
# image: nikisweeting/archivebox:latest
# command: schedule --foreground
# environment:
# - USE_COLOR=True
# - SHOW_PROGRESS=False
# volumes:
# - ./data:/data
# Example: Put Nginx in front of the ArchiveBox server for SSL termination
# nginx:
# image: nginx:alpine
# ports:
# - 443:443
# - 80:80
# volumes:
# - ./etc/nginx/nginx.conf:/etc/nginx/nginx.conf
# - ./data:/var/www
# Example: run all your ArchiveBox traffic through a WireGuard VPN tunnel
# wireguard:
# image: linuxserver/wireguard
# network_mode: 'service:archivebox'
# cap_add:
# - NET_ADMIN
# - SYS_MODULE
# sysctls:
# - net.ipv4.conf.all.rp_filter=2
# - net.ipv4.conf.all.src_valid_mark=1
# volumes:
# - /lib/modules:/lib/modules
# - ./wireguard.conf:/config/wg0.conf:ro
# Example: Run PYWB in parallel and auto-import WARCs from ArchiveBox
# pywb:
# image: webrecorder/pywb:latest
# entrypoint: /bin/sh 'wb-manager add default /archivebox/archive/*/warc/*.warc.gz; wayback --proxy;'
# environment:
# - INIT_COLLECTION=archivebox
# ports:
# - 8080:8080
# volumes:
# ./data:/archivebox
# ./data/wayback:/webarchive

Some files were not shown because too many files have changed in this diff.