Merge branch 'master' into v0.5.0

@@ -1,6 +1,16 @@
output
__pycache__
.DS_Store
venv
.venv
data
._*
*.pyc
__pycache__/
.mypy_cache/

venv/
.venv/
.docker-venv/

*.egg-info/
build/
dist/

data/
output/

.flake8 (new file, 6 additions)
@@ -0,0 +1,6 @@
[flake8]
ignore = D100,D101,D102,D103,D104,D105,D202,D203,D205,D400,E131,E241,E252,E266,E272,E701,E731,W293,W503,W291,W391
select = F,E9,W
max-line-length = 130
max-complexity = 10
exclude = migrations,tests,node_modules,vendor,venv,.venv,.venv2,.docker-venv
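For reference, a rough sketch of how this lint config might be exercised locally; the two-pass invocation mirrors the lint job in `.github/workflows/test.yml` added later in this commit, and the `archivebox` package path is taken from that workflow:

```bash
# install the linter used by CI
pip install flake8

# pass 1: show-stopper problems only (syntax errors, undefined names)
flake8 archivebox --count --show-source --statistics

# pass 2: stylistic checks (CI sets MAX_LINE_LENGTH=110; the .flake8 default above is 130)
flake8 archivebox --count --max-line-length=110 --statistics
```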
.github/CONTRIBUTING.md (vendored, 41 changes)
@@ -1 +1,40 @@
Make sure check in with me first or confirm your desired features line up with our roadmap: https://github.com/pirate/ArchiveBox#roadmap
# Contribution Process

1. Confirm your desired features fit into our bigger project goals roadmap: https://github.com/pirate/ArchiveBox#roadmap
2. Open an issue with your planned implementation to discuss
3. Check in with me before starting development to make sure your work won't conflict with or duplicate existing work
4. Setup your dev environment, make some changes, and test using the test input files
5. Commit, push, and submit a PR and wait for review feedback
6. Have patience, don't abandon your PR! We love contributors but we all have day jobs and don't always have time to respond to notifications instantly. If you want a faster response, ping @theSquashSH on twitter or Patreon.

**Useful links:**

- https://github.com/pirate/ArchiveBox/issues
- https://github.com/pirate/ArchiveBox/pulls
- https://github.com/pirate/ArchiveBox/wiki/Roadmap
- https://github.com/pirate/ArchiveBox/wiki/Install#manual-setup

### Development Setup

```bash
git clone https://github.com/pirate/ArchiveBox
cd ArchiveBox
# Optionally create a virtualenv
pip install -r requirements.txt
pip install -e .
```

### Running Tests

```bash
./bin/archive tests/*
# look for errors in stdout/stderr
# then confirm output html looks right

# if on >v0.4 run the django test suite:
archivebox manage test
```
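The CI workflow added later in this commit also runs the packaged test suite with pytest; a sketch of the equivalent local commands, assuming the editable install from the Development Setup step above:

```bash
# install the same test tooling the GitHub Actions workflow installs
python -m pip install pytest bottle

# run the package's tests, printing archiving output as it goes
python -m pytest -s
```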

### Getting Help

Open issues on Github or contact me https://sweeting.me/#contact.
.github/FUNDING.yml (vendored, new file, 3 additions)
@@ -0,0 +1,3 @@
github: pirate
patreon: theSquashSH
custom: ["https://paypal.me/NicholasSweeting", "https://www.blockchain.com/eth/address/0x5D4c34D4a121Fe08d1dDB7969F07550f2dB9f471", "https://www.blockchain.com/btc/address/1HuxXriPE2Bbnag3jJrqa3bkNHrs297dYH"]
.github/ISSUE_TEMPLATE/bug_report.md (vendored, 29 changes)
@@ -1,30 +1,41 @@
---
name: 🐞 Bug report
about: Create a report to help us improve
title: ''
labels: ''
title: 'Bugfix: ...'
labels: 'changes: bugfixes'
assignees: ''

---

(please fill out the following information, feel free to delete sections if they're not applicable)
<!--
Please fill out the following information,
feel free to delete sections if they're not applicable
or if long issue templates annoy you :)
-->

## Describe the bug
A description of what the bug is, what you expected to happen,
#### Describe the bug
<!--
A description of what the bug is,
what you expected to happen,
and any relevant context about issue.
-->

## Steps to reproduce

#### Steps to reproduce
<!--
For example:
1. Ran ArchiveBox with the following config '...'
2. Saw this output during archiving '....'
3. UI didn't show the thing I was expecting '....'
-->

## Screenshots or log output
#### Screenshots or log output

<!--
If applicable, post any relevant screenshots or copy/pasted terminal output from ArchiveBox.
If you're reporting a parsing / importing error, **you must paste a copy of your redacted import file here**.
-->

## Software versions
#### Software versions

- OS: ([e.g. macOS 10.14] the operating system you're running ArchiveBox on)
- ArchiveBox version: (`git rev-parse HEAD | head -c7` [e.g. d798117] commit ID of the version you're running)

@@ -1,15 +1,16 @@
---
name: 📑 Documentation change
about: Submit a suggestion for the Wiki documentation
title: ''
title: 'Documentation: Improvement request ...'
labels: ''
assignees: ''

---

## Wiki Page URL
<!-- e.g. https://github.com/pirate/ArchiveBox/wiki/Configuration#use_color -->


## Suggested Edit
<!-- e.g. Please add more example usages, or please fix `xyz` typo to be `abc`. -->

...
.github/ISSUE_TEMPLATE/feature_request.md (vendored, 28 changes)
@@ -1,38 +1,50 @@
---
name: 💡 Feature request
about: Suggest an idea for this project
title: ''
labels: ''
title: 'Feature Request: ...'
labels: 'changes: behavior,status: idea phase'
assignees: ''

---

(feel free to delete this template and write your own issue description if you don't find it helpful)
<!--
Please fill out the following information,
feel free to delete sections if they're not applicable
or if long issue templates annoy you :)
-->

## Type

- [ ] General Question or Disussion
- [ ] General question or discussion
- [ ] Propose a brand new feature
- [ ] Request modification of existing behavior or design

## What is the problem that your feature request solves
<!--
e.g. I need to be able to archive spanish and french subtitle files
from a particular <example.com> movie site that's going down soon.
-->

## Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
e.g. I specifically need a new archive method to look for multilingual subtitle files related to pages.
<!--
e.g. I specifically need a new archive method to look for multilingual subtitle files related to pages.
The bigger picture solution is the ability for custom user scripts to be run in a puppeteer context during archiving.
-->

## What hacks or alternative solutions have you tried to solve the problem?
A clear and concise description of any alternative solutions or features you've considered.
<!--
A clear and concise description of any alternative solutions,
workarounds, or other software you've considered using to fix the problem.
-->

## How badly do you want this new feature?

- [ ] It's an urgent deal-breaker, I cant live without it
- [ ] It's an urgent deal-breaker, I can't live without it
- [ ] It's important to add it in the near-mid term future
- [ ] It would be nice to have eventually

---

- [ ] I'm willing to contribute to development / fixing this issue
- [ ] I'm willing to contribute dev time / money to fix this issue
- [ ] I like ArchiveBox so far / would recommend it to a friend
- [ ] I've had a lot of difficulty getting ArchiveBox set up
.github/ISSUE_TEMPLATE/question_or_discussion.md (vendored, new file, 9 additions)
@@ -0,0 +1,9 @@
---
name: 💬 Question, discussion, or support request
about: Start a discussion or ask a question about ArchiveBox
title: 'Question: ...'
labels: ''
assignees: ''

---

.github/workflows/test.yml (vendored, new file, 145 additions)
@@ -0,0 +1,145 @@
name: Test workflow
|
||||
on: [push]
|
||||
|
||||
env:
|
||||
MAX_LINE_LENGTH: 110
|
||||
|
||||
jobs:
|
||||
lint:
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- uses: actions/checkout@v2
|
||||
|
||||
- name: Set up Python
|
||||
uses: actions/setup-python@v1
|
||||
with:
|
||||
python-version: 3.8
|
||||
architecture: x64
|
||||
|
||||
- name: Install flake8
|
||||
run: |
|
||||
pip install flake8
|
||||
|
||||
- name: Lint with flake8
|
||||
run: |
|
||||
# one pass for show-stopper syntax errors or undefined names
|
||||
flake8 archivebox --count --show-source --statistics
|
||||
# one pass for small stylistic things
|
||||
flake8 archivebox --count --max-line-length="$MAX_LINE_LENGTH" --statistics
|
||||
|
||||
test:
|
||||
runs-on: ${{ matrix.os }}
|
||||
|
||||
strategy:
|
||||
matrix:
|
||||
os: [ubuntu-latest, macos-latest]
|
||||
python: [3.7, 3.8]
|
||||
|
||||
steps:
|
||||
- uses: actions/checkout@v2
|
||||
with:
|
||||
fetch-depth: 1
|
||||
|
||||
- uses: actions/checkout@v2
|
||||
with:
|
||||
fetch-depth: 1
|
||||
repository: "gildas-lormeau/SingleFile"
|
||||
ref: "master"
|
||||
path: "singlefile"
|
||||
|
||||
- name: Install npm requirements for singlefile
|
||||
run: npm install --prefix singlefile/cli
|
||||
|
||||
- name: Give singlefile execution permissions
|
||||
run: chmod +x singlefile/cli/single-file
|
||||
|
||||
- name: Set SINGLEFILE_BINARY
|
||||
run: echo "::set-env name=SINGLEFILE_BINARY::$GITHUB_WORKSPACE/singlefile/cli/single-file"
|
||||
|
||||
- name: Set up Python ${{ matrix.python }}
|
||||
uses: actions/setup-python@v1
|
||||
with:
|
||||
python-version: ${{ matrix.python }}
|
||||
architecture: x64
|
||||
|
||||
- name: Get pip cache dir
|
||||
id: pip-cache
|
||||
run: |
|
||||
echo "::set-output name=dir::$(pip cache dir)"
|
||||
|
||||
- name: Cache pip
|
||||
uses: actions/cache@v2
|
||||
id: cache-pip
|
||||
with:
|
||||
path: ${{ steps.pip-cache.outputs.dir }}
|
||||
key: ${{ runner.os }}-${{ matrix.python }}-venv-${{ hashFiles('setup.py') }}
|
||||
restore-keys: |
|
||||
${{ runner.os }}-${{ matrix.python }}-venv-
|
||||
|
||||
- name: Use nodejs 14.7.0
|
||||
uses: actions/setup-node@v1
|
||||
with:
|
||||
node-version: 14.7.0
|
||||
|
||||
- name: Debug
|
||||
run: ls ./
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
python -m pip install .
|
||||
python -m pip install pytest bottle
|
||||
|
||||
- name: Test built package with pytest
|
||||
run: |
|
||||
python -m pytest -s
|
||||
|
||||
docker-test:
|
||||
runs-on: ubuntu-latest
|
||||
|
||||
steps:
|
||||
- uses: actions/checkout@v2
|
||||
with:
|
||||
fetch-depth: 1
|
||||
|
||||
- uses: satackey/action-docker-layer-caching@v0.0.4
|
||||
|
||||
- name: Build image
|
||||
run: |
|
||||
docker build . -t archivebox
|
||||
|
||||
- name: Init data dir
|
||||
run: |
|
||||
mkdir data
|
||||
docker run -v "$PWD"/data:/data archivebox init
|
||||
|
||||
- name: Run test server
|
||||
run: |
|
||||
sudo bash -c 'echo "127.0.0.1 www.test-nginx-1.local www.test-nginx-2.local" >> /etc/hosts'
|
||||
docker run --name www-nginx -p 80:80 -d nginx
|
||||
|
||||
- name: Add link
|
||||
run: |
|
||||
docker run -v "$PWD"/data:/data --network host archivebox add http://www.test-nginx-1.local
|
||||
|
||||
- name: Add stdin link
|
||||
run: |
|
||||
echo "http://www.test-nginx-2.local" | docker run -i -v "$PWD"/data:/data archivebox add
|
||||
|
||||
- name: List links
|
||||
run: |
|
||||
docker run -v "$PWD"/data:/data archivebox list | grep -q "www.test-nginx-1.local" || { echo "The site 1 isn't in the list"; exit 1; }
|
||||
docker run -v "$PWD"/data:/data archivebox list | grep -q "www.test-nginx-2.local" || { echo "The site 2 isn't in the list"; exit 1; }
|
||||
|
||||
- name: Start docker-compose stack
|
||||
run: |
|
||||
docker-compose run archivebox init
|
||||
docker-compose up -d
|
||||
sleep 5
|
||||
curl --silent --location 'http://127.0.0.1:8000' | grep 'ArchiveBox'
|
||||
curl --silent --location 'http://127.0.0.1:8000/static/admin/js/jquery.init.js' | grep 'django.jQuery'
|
||||
|
||||
- name: Check added urls show up in index
|
||||
run: |
|
||||
docker-compose run archivebox add 'http://example.com/#test_docker' --index-only
|
||||
curl --silent --location 'http://127.0.0.1:8000' | grep 'http://example.com/#test_docker'
|
||||
docker-compose down || true
|
||||
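For anyone debugging the `docker-test` job, a condensed sketch of running the same steps locally, using only commands that already appear in the workflow above (the nginx hostname is the one the job writes to /etc/hosts):

```bash
# build the image and initialize a fresh collection, as in the workflow
docker build . -t archivebox
mkdir data && docker run -v "$PWD"/data:/data archivebox init

# serve a local test page under the hostname the job expects
sudo bash -c 'echo "127.0.0.1 www.test-nginx-1.local" >> /etc/hosts'
docker run --name www-nginx -p 80:80 -d nginx

# archive the link and confirm it shows up in the index
docker run -v "$PWD"/data:/data --network host archivebox add http://www.test-nginx-1.local
docker run -v "$PWD"/data:/data archivebox list | grep -q "www.test-nginx-1.local" || echo "site missing from list"
```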
.gitignore (vendored, 27 changes)
@@ -1,27 +1,16 @@
# OS cruft
.DS_Store
._*

# python
*.pyc
__pycache__/
.mypy_cache/
venv
.venv
archivebox/.venv
archivebox/venv
archivebox/docs/_build

# vim
.swp*
venv/
.venv/
.docker-venv/

# output artifacts
output
output/
data
data/
archivebox/output
archivebox/data

archivebox.egg-info/
*.egg-info/
build/
dist/

data/
output/
Dockerfile (129 changes)
@@ -1,71 +1,82 @@
# This Dockerfile for ArchiveBox installs the following in a container:
|
||||
# - curl, wget, python3, youtube-dl, google-chrome-beta
|
||||
# - ArchiveBox
|
||||
# This is the Dockerfile for ArchiveBox, it includes the following major pieces:
|
||||
# git, curl, wget, python3, youtube-dl, google-chrome-stable, ArchiveBox
|
||||
# Usage:
|
||||
# docker build github.com/pirate/ArchiveBox -t archivebox
|
||||
# echo 'https://example.com' | docker run -i --mount type=bind,source=./data,target=/data archivebox /bin/archive
|
||||
# docker run --mount type=bind,source=./data,target=/data archivebox /bin/archive 'https://example.com/some/rss/feed.xml'
|
||||
# docker build . -t archivebox
|
||||
# docker run -v "$PWD/data":/data archivebox init
|
||||
# docker run -v "$PWD/data":/data archivebox add 'https://example.com'
|
||||
# Documentation:
|
||||
# https://github.com/pirate/ArchiveBox/wiki/Docker#docker
|
||||
|
||||
FROM node:11-slim
|
||||
LABEL maintainer="Nick Sweeting <archivebox-git@sweeting.me>"
|
||||
FROM python:3.8-slim-buster
|
||||
|
||||
RUN apt-get update \
|
||||
&& apt-get install -yq --no-install-recommends \
|
||||
git zlib1g-dev wget curl youtube-dl gnupg2 libgconf-2-4 python3 python3-pip \
|
||||
&& rm -rf /var/lib/apt/lists/*
|
||||
LABEL name="archivebox" \
|
||||
maintainer="Nick Sweeting <archivebox-git@sweeting.me>" \
|
||||
description="All-in-one personal internet archiving container"
|
||||
|
||||
# Install latest chrome package and fonts to support major charsets (Chinese, Japanese, Arabic, Hebrew, Thai and a few others)
|
||||
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \
|
||||
&& sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list' \
|
||||
&& apt-get update \
|
||||
&& apt-get install -y google-chrome-beta fonts-ipafont-gothic fonts-wqy-zenhei fonts-thai-tlwg fonts-kacst ttf-freefont \
|
||||
--no-install-recommends \
|
||||
&& rm -rf /var/lib/apt/lists/* \
|
||||
&& rm -rf /src/*.deb
|
||||
|
||||
# It's a good idea to use dumb-init to help prevent zombie chrome processes.
|
||||
ADD https://github.com/Yelp/dumb-init/releases/download/v1.2.0/dumb-init_1.2.0_amd64 /usr/local/bin/dumb-init
|
||||
RUN chmod +x /usr/local/bin/dumb-init
|
||||
|
||||
# Uncomment to skip the chromium download when installing puppeteer. If you do,
|
||||
# you'll need to launch puppeteer with:
|
||||
# browser.launch({executablePath: 'google-chrome-beta'})
|
||||
ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD true
|
||||
|
||||
# Install puppeteer so it's available in the container.
|
||||
RUN npm i puppeteer
|
||||
|
||||
# Add user so we don't need --no-sandbox.
|
||||
RUN groupadd -r pptruser && useradd -r -g pptruser -G audio,video pptruser \
|
||||
&& mkdir -p /home/pptruser/Downloads \
|
||||
&& chown -R pptruser:pptruser /home/pptruser \
|
||||
&& chown -R pptruser:pptruser /node_modules
|
||||
|
||||
# Install the ArchiveBox repository and pip requirements
|
||||
RUN git clone https://github.com/pirate/ArchiveBox /home/pptruser/app \
|
||||
&& mkdir -p /data \
|
||||
&& chown -R pptruser:pptruser /data \
|
||||
&& ln -s /data /home/pptruser/app/archivebox/output \
|
||||
&& ln -s /home/pptruser/app/bin/* /bin/ \
|
||||
&& ln -s /home/pptruser/app/bin/archivebox /bin/archive \
|
||||
&& chown -R pptruser:pptruser /home/pptruser/app/archivebox
|
||||
# && pip3 install -r /home/pptruser/app/archivebox/requirements.txt
|
||||
|
||||
VOLUME /data
|
||||
|
||||
ENV LANG=C.UTF-8 \
|
||||
ENV TZ=UTC \
|
||||
LANGUAGE=en_US:en \
|
||||
LC_ALL=C.UTF-8 \
|
||||
LANG=C.UTF-8 \
|
||||
PYTHONIOENCODING=UTF-8 \
|
||||
CHROME_SANDBOX=False \
|
||||
CHROME_BINARY=google-chrome-beta \
|
||||
OUTPUT_DIR=/data
|
||||
PYTHONUNBUFFERED=1 \
|
||||
APT_KEY_DONT_WARN_ON_DANGEROUS_USAGE=1 \
|
||||
CODE_PATH=/app \
|
||||
VENV_PATH=/venv \
|
||||
DATA_PATH=/data \
|
||||
EXTRA_PATH=/extra
|
||||
|
||||
# First install CLI utils and base deps, then Chrome + Fonts + nodejs
|
||||
RUN echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections \
|
||||
&& apt-get update -qq \
|
||||
&& apt-get install -qq -y --no-install-recommends \
|
||||
apt-transport-https ca-certificates apt-utils gnupg gosu gnupg2 libgconf-2-4 zlib1g-dev \
|
||||
dumb-init jq git wget curl youtube-dl ffmpeg \
|
||||
&& curl -sSL "https://dl.google.com/linux/linux_signing_key.pub" | apt-key add - \
|
||||
&& echo "deb https://dl.google.com/linux/chrome/deb/ stable main" > /etc/apt/sources.list.d/google-chrome.list \
|
||||
&& curl -sL https://deb.nodesource.com/setup_14.x | bash - \
|
||||
&& apt-get update -qq \
|
||||
&& apt-get install -qq -y --no-install-recommends \
|
||||
google-chrome-stable \
|
||||
fontconfig \
|
||||
fonts-ipafont-gothic \
|
||||
fonts-wqy-zenhei \
|
||||
fonts-thai-tlwg \
|
||||
fonts-kacst \
|
||||
fonts-symbola \
|
||||
fonts-noto \
|
||||
fonts-freefont-ttf \
|
||||
nodejs \
|
||||
unzip \
|
||||
&& rm -rf /var/lib/apt/lists/*
|
||||
|
||||
# Clone singlefile and move it to the /bin folder so archivebox can find it
|
||||
|
||||
WORKDIR "$EXTRA_PATH"
|
||||
RUN wget -qO - https://github.com/gildas-lormeau/SingleFile/archive/master.zip > SingleFile.zip \
|
||||
&& unzip -q SingleFile.zip \
|
||||
&& npm install --prefix SingleFile-master/cli --production > /dev/null 2>&1 \
|
||||
&& chmod +x SingleFile-master/cli/single-file
|
||||
|
||||
# Run everything from here on out as non-privileged user
|
||||
USER pptruser
|
||||
WORKDIR /home/pptruser/app
|
||||
RUN groupadd --system archivebox \
|
||||
&& useradd --system --create-home --gid archivebox --groups audio,video archivebox
|
||||
|
||||
ENTRYPOINT ["dumb-init", "--"]
|
||||
CMD ["/bin/archive"]
|
||||
ADD . "$CODE_PATH"
|
||||
WORKDIR "$CODE_PATH"
|
||||
ENV PATH="${PATH}:$VENV_PATH/bin"
|
||||
RUN python -m venv --clear --symlinks "$VENV_PATH" \
|
||||
&& pip install --upgrade pip setuptools \
|
||||
&& pip install -e .
|
||||
|
||||
VOLUME "$DATA_PATH"
|
||||
WORKDIR "$DATA_PATH"
|
||||
EXPOSE 8000
|
||||
ENV IN_DOCKER=True \
|
||||
CHROME_BINARY=google-chrome \
|
||||
CHROME_SANDBOX=False \
|
||||
SINGLEFILE_BINARY="$EXTRA_PATH/SingleFile-master/cli/single-file"
|
||||
|
||||
RUN env ALLOW_ROOT=True archivebox version
|
||||
|
||||
ENTRYPOINT ["dumb-init", "--", "/app/bin/docker_entrypoint.sh"]
|
||||
CMD ["archivebox", "server", "0.0.0.0:8000"]
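Expanding the usage comments at the top of the new Dockerfile into a runnable sketch (the server invocation is an assumption that simply mirrors the image's `EXPOSE 8000` and `CMD` above):

```bash
# build the image from the repo root and create a collection
docker build . -t archivebox
mkdir data
docker run -v "$PWD/data":/data archivebox init

# add a URL, then serve the web UI on the exposed port
docker run -v "$PWD/data":/data archivebox add 'https://example.com'
docker run -v "$PWD/data":/data -p 8000:8000 archivebox server 0.0.0.0:8000
```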

MANIFEST.in (10 changes)
@@ -1,8 +1,4 @@
include LICENSE
include README.md
include archivebox/VERSION
graft archivebox/themes
graft archivebox/themes/static
graft archivebox/themes/admin
graft archivebox/themes/default
graft archivebox/themes/default/static
graft archivebox/themes/legacy
graft archivebox/themes/legacy/static
recursive-include archivebox/themes *
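One way to sanity-check this packaging change is to build an sdist and confirm the theme files are included; a sketch, assuming the `build` package is available (an equivalent `setup.py sdist` invocation works too, and the tarball name is assumed from the package name):

```bash
pip install build
python -m build --sdist

# the recursive-include rule should pull the whole themes directory into the sdist
tar -tzf dist/archivebox-*.tar.gz | grep 'archivebox/themes' | head
```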
Pipfile (26 changes)
@@ -3,26 +3,10 @@ name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[dev-packages]
ipdb = "*"
flake8 = "*"
mypy = "*"
django-stubs = "*"
setuptools = "*"
sphinx = "*"
recommonmark = "*"
sphinx-rtd-theme = "*"

[packages]
dataclasses = "*"
base32-crockford = "*"
django = "*"
django-extensions = "*"
youtube-dl = "*"
python-crontab = "*"
croniter = "*"
ipython = "*"
mypy-extensions = "*"
# see setup.py for package dependency list
"e1839a8" = {path = ".", editable = true}

[requires]
python_version = "3.7"
[dev-packages]
# see setup.py for dev package dependency list
"e1839a8" = {path = ".", extras = ["dev"], editable = true}
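Since the dependency lists now live in setup.py, the dev tooling is presumably installed through the package's `dev` extra rather than listed in the Pipfile; a sketch of both install paths:

```bash
# plain editable install (runtime deps only, from setup.py)
pip install -e .

# editable install plus the dev extra the Pipfile above references
pip install -e '.[dev]'

# or let pipenv drive the same editable install
pipenv install --dev
```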
Pipfile.lock (generated, 644 deletions)
@@ -1,644 +0,0 @@
{
|
||||
"_meta": {
|
||||
"hash": {
|
||||
"sha256": "8ac4f9e5cd266406a861a283b321b9eee0ca469638f838e93467403ef2f0594d"
|
||||
},
|
||||
"pipfile-spec": 6,
|
||||
"requires": {
|
||||
"python_version": "3.7"
|
||||
},
|
||||
"sources": [
|
||||
{
|
||||
"name": "pypi",
|
||||
"url": "https://pypi.org/simple",
|
||||
"verify_ssl": true
|
||||
}
|
||||
]
|
||||
},
|
||||
"default": {
|
||||
"appnope": {
|
||||
"hashes": [
|
||||
"sha256:5b26757dc6f79a3b7dc9fab95359328d5747fcb2409d331ea66d0272b90ab2a0",
|
||||
"sha256:8b995ffe925347a2138d7ac0fe77155e4311a0ea6d6da4f5128fe4b3cbe5ed71"
|
||||
],
|
||||
"markers": "sys_platform == 'darwin'",
|
||||
"version": "==0.1.0"
|
||||
},
|
||||
"backcall": {
|
||||
"hashes": [
|
||||
"sha256:38ecd85be2c1e78f77fd91700c76e14667dc21e2713b63876c0eb901196e01e4",
|
||||
"sha256:bbbf4b1e5cd2bdb08f915895b51081c041bac22394fdfcfdfbe9f14b77c08bf2"
|
||||
],
|
||||
"version": "==0.1.0"
|
||||
},
|
||||
"base32-crockford": {
|
||||
"hashes": [
|
||||
"sha256:115f5bd32ae32b724035cb02eb65069a8824ea08c08851eb80c8b9f63443a969",
|
||||
"sha256:295ef5ffbf6ed96b6e739ffd36be98fa7e90a206dd18c39acefb15777eedfe6e"
|
||||
],
|
||||
"index": "pypi",
|
||||
"version": "==0.3.0"
|
||||
},
|
||||
"croniter": {
|
||||
"hashes": [
|
||||
"sha256:0d905dbe6f131a910fd3dde792f0129788cd2cb3a8048c5f7aaa212670b0cef2",
|
||||
"sha256:538adeb3a7f7816c3cdec6db974c441620d764c25ff4ed0146ee7296b8a50590"
|
||||
],
|
||||
"index": "pypi",
|
||||
"version": "==0.3.30"
|
||||
},
|
||||
"dataclasses": {
|
||||
"hashes": [
|
||||
"sha256:454a69d788c7fda44efd71e259be79577822f5e3f53f029a22d08004e951dc9f",
|
||||
"sha256:6988bd2b895eef432d562370bb707d540f32f7360ab13da45340101bc2307d84"
|
||||
],
|
||||
"index": "pypi",
|
||||
"version": "==0.6"
|
||||
},
|
||||
"decorator": {
|
||||
"hashes": [
|
||||
"sha256:86156361c50488b84a3f148056ea716ca587df2f0de1d34750d35c21312725de",
|
||||
"sha256:f069f3a01830ca754ba5258fde2278454a0b5b79e0d7f5c13b3b97e57d4acff6"
|
||||
],
|
||||
"version": "==4.4.0"
|
||||
},
|
||||
"django": {
|
||||
"hashes": [
|
||||
"sha256:7c3543e4fb070d14e10926189a7fcf42ba919263b7473dceaefce34d54e8a119",
|
||||
"sha256:a2814bffd1f007805b19194eb0b9a331933b82bd5da1c3ba3d7b7ba16e06dc4b"
|
||||
],
|
||||
"index": "pypi",
|
||||
"version": "==2.2"
|
||||
},
|
||||
"django-extensions": {
|
||||
"hashes": [
|
||||
"sha256:109004f80b6f45ad1f56addaa59debca91d94aa0dc1cb19678b9364b4fe9b6f4",
|
||||
"sha256:307766e5e6c1caffe76c5d99239d8115d14ae3f7cab2cd991fcffd763dad904b"
|
||||
],
|
||||
"index": "pypi",
|
||||
"version": "==2.1.6"
|
||||
},
|
||||
"ipython": {
|
||||
"hashes": [
|
||||
"sha256:54c5a8aa1eadd269ac210b96923688ccf01ebb2d0f21c18c3c717909583579a8",
|
||||
"sha256:e840810029224b56cd0d9e7719dc3b39cf84d577f8ac686547c8ba7a06eeab26"
|
||||
],
|
||||
"index": "pypi",
|
||||
"version": "==7.5.0"
|
||||
},
|
||||
"ipython-genutils": {
|
||||
"hashes": [
|
||||
"sha256:72dd37233799e619666c9f639a9da83c34013a73e8bbc79a7a6348d93c61fab8",
|
||||
"sha256:eb2e116e75ecef9d4d228fdc66af54269afa26ab4463042e33785b887c628ba8"
|
||||
],
|
||||
"version": "==0.2.0"
|
||||
},
|
||||
"jedi": {
|
||||
"hashes": [
|
||||
"sha256:2bb0603e3506f708e792c7f4ad8fc2a7a9d9c2d292a358fbbd58da531695595b",
|
||||
"sha256:2c6bcd9545c7d6440951b12b44d373479bf18123a401a52025cf98563fbd826c"
|
||||
],
|
||||
"version": "==0.13.3"
|
||||
},
|
||||
"mypy-extensions": {
|
||||
"hashes": [
|
||||
"sha256:37e0e956f41369209a3d5f34580150bcacfabaa57b33a15c0b25f4b5725e0812",
|
||||
"sha256:b16cabe759f55e3409a7d231ebd2841378fb0c27a5d1994719e340e4f429ac3e"
|
||||
],
|
||||
"index": "pypi",
|
||||
"version": "==0.4.1"
|
||||
},
|
||||
"parso": {
|
||||
"hashes": [
|
||||
"sha256:17cc2d7a945eb42c3569d4564cdf49bde221bc2b552af3eca9c1aad517dcdd33",
|
||||
"sha256:2e9574cb12e7112a87253e14e2c380ce312060269d04bd018478a3c92ea9a376"
|
||||
],
|
||||
"version": "==0.4.0"
|
||||
},
|
||||
"pexpect": {
|
||||
"hashes": [
|
||||
"sha256:2094eefdfcf37a1fdbfb9aa090862c1a4878e5c7e0e7e7088bdb511c558e5cd1",
|
||||
"sha256:9e2c1fd0e6ee3a49b28f95d4b33bc389c89b20af6a1255906e90ff1262ce62eb"
|
||||
],
|
||||
"markers": "sys_platform != 'win32'",
|
||||
"version": "==4.7.0"
|
||||
},
|
||||
"pickleshare": {
|
||||
"hashes": [
|
||||
"sha256:87683d47965c1da65cdacaf31c8441d12b8044cdec9aca500cd78fc2c683afca",
|
||||
"sha256:9649af414d74d4df115d5d718f82acb59c9d418196b7b4290ed47a12ce62df56"
|
||||
],
|
||||
"version": "==0.7.5"
|
||||
},
|
||||
"prompt-toolkit": {
|
||||
"hashes": [
|
||||
"sha256:11adf3389a996a6d45cc277580d0d53e8a5afd281d0c9ec71b28e6f121463780",
|
||||
"sha256:2519ad1d8038fd5fc8e770362237ad0364d16a7650fb5724af6997ed5515e3c1",
|
||||
"sha256:977c6583ae813a37dc1c2e1b715892461fcbdaa57f6fc62f33a528c4886c8f55"
|
||||
],
|
||||
"version": "==2.0.9"
|
||||
},
|
||||
"ptyprocess": {
|
||||
"hashes": [
|
||||
"sha256:923f299cc5ad920c68f2bc0bc98b75b9f838b93b599941a6b63ddbc2476394c0",
|
||||
"sha256:d7cc528d76e76342423ca640335bd3633420dc1366f258cb31d05e865ef5ca1f"
|
||||
],
|
||||
"version": "==0.6.0"
|
||||
},
|
||||
"pygments": {
|
||||
"hashes": [
|
||||
"sha256:5ffada19f6203563680669ee7f53b64dabbeb100eb51b61996085e99c03b284a",
|
||||
"sha256:e8218dd399a61674745138520d0d4cf2621d7e032439341bc3f647bff125818d"
|
||||
],
|
||||
"version": "==2.3.1"
|
||||
},
|
||||
"python-crontab": {
|
||||
"hashes": [
|
||||
"sha256:91ce4b245ee5e5c117aa0b21b485bc43f2d80df854a36e922b707643f50d7923"
|
||||
],
|
||||
"index": "pypi",
|
||||
"version": "==2.3.6"
|
||||
},
|
||||
"python-dateutil": {
|
||||
"hashes": [
|
||||
"sha256:7e6584c74aeed623791615e26efd690f29817a27c73085b78e4bad02493df2fb",
|
||||
"sha256:c89805f6f4d64db21ed966fda138f8a5ed7a4fdbc1a8ee329ce1b74e3c74da9e"
|
||||
],
|
||||
"version": "==2.8.0"
|
||||
},
|
||||
"pytz": {
|
||||
"hashes": [
|
||||
"sha256:303879e36b721603cc54604edcac9d20401bdbe31e1e4fdee5b9f98d5d31dfda",
|
||||
"sha256:d747dd3d23d77ef44c6a3526e274af6efeb0a6f1afd5a69ba4d5be4098c8e141"
|
||||
],
|
||||
"version": "==2019.1"
|
||||
},
|
||||
"six": {
|
||||
"hashes": [
|
||||
"sha256:3350809f0555b11f552448330d0b52d5f24c91a322ea4a15ef22629740f3761c",
|
||||
"sha256:d16a0141ec1a18405cd4ce8b4613101da75da0e9a7aec5bdd4fa804d0e0eba73"
|
||||
],
|
||||
"version": "==1.12.0"
|
||||
},
|
||||
"sqlparse": {
|
||||
"hashes": [
|
||||
"sha256:40afe6b8d4b1117e7dff5504d7a8ce07d9a1b15aeeade8a2d10f130a834f8177",
|
||||
"sha256:7c3dca29c022744e95b547e867cee89f4fce4373f3549ccd8797d8eb52cdb873"
|
||||
],
|
||||
"version": "==0.3.0"
|
||||
},
|
||||
"traitlets": {
|
||||
"hashes": [
|
||||
"sha256:9c4bd2d267b7153df9152698efb1050a5d84982d3384a37b2c1f7723ba3e7835",
|
||||
"sha256:c6cb5e6f57c5a9bdaa40fa71ce7b4af30298fbab9ece9815b5d995ab6217c7d9"
|
||||
],
|
||||
"version": "==4.3.2"
|
||||
},
|
||||
"wcwidth": {
|
||||
"hashes": [
|
||||
"sha256:3df37372226d6e63e1b1e1eda15c594bca98a22d33a23832a90998faa96bc65e",
|
||||
"sha256:f4ebe71925af7b40a864553f761ed559b43544f8f71746c2d756c7fe788ade7c"
|
||||
],
|
||||
"version": "==0.1.7"
|
||||
},
|
||||
"youtube-dl": {
|
||||
"hashes": [
|
||||
"sha256:46f6e30c673ba71de84748dad4c264d1b6fb30beebf1ef834846a651b4524a78",
|
||||
"sha256:b20d110e1bed8d16f5771bb938ab6e5da67f08af62b599af65301cca290f2e15"
|
||||
],
|
||||
"index": "pypi",
|
||||
"version": "==2019.4.24"
|
||||
}
|
||||
},
|
||||
"develop": {
|
||||
"alabaster": {
|
||||
"hashes": [
|
||||
"sha256:446438bdcca0e05bd45ea2de1668c1d9b032e1a9154c2c259092d77031ddd359",
|
||||
"sha256:a661d72d58e6ea8a57f7a86e37d86716863ee5e92788398526d58b26a4e4dc02"
|
||||
],
|
||||
"version": "==0.7.12"
|
||||
},
|
||||
"appnope": {
|
||||
"hashes": [
|
||||
"sha256:5b26757dc6f79a3b7dc9fab95359328d5747fcb2409d331ea66d0272b90ab2a0",
|
||||
"sha256:8b995ffe925347a2138d7ac0fe77155e4311a0ea6d6da4f5128fe4b3cbe5ed71"
|
||||
],
|
||||
"markers": "sys_platform == 'darwin'",
|
||||
"version": "==0.1.0"
|
||||
},
|
||||
"babel": {
|
||||
"hashes": [
|
||||
"sha256:6778d85147d5d85345c14a26aada5e478ab04e39b078b0745ee6870c2b5cf669",
|
||||
"sha256:8cba50f48c529ca3fa18cf81fa9403be176d374ac4d60738b839122dfaaa3d23"
|
||||
],
|
||||
"version": "==2.6.0"
|
||||
},
|
||||
"backcall": {
|
||||
"hashes": [
|
||||
"sha256:38ecd85be2c1e78f77fd91700c76e14667dc21e2713b63876c0eb901196e01e4",
|
||||
"sha256:bbbf4b1e5cd2bdb08f915895b51081c041bac22394fdfcfdfbe9f14b77c08bf2"
|
||||
],
|
||||
"version": "==0.1.0"
|
||||
},
|
||||
"certifi": {
|
||||
"hashes": [
|
||||
"sha256:59b7658e26ca9c7339e00f8f4636cdfe59d34fa37b9b04f6f9e9926b3cece1a5",
|
||||
"sha256:b26104d6835d1f5e49452a26eb2ff87fe7090b89dfcaee5ea2212697e1e1d7ae"
|
||||
],
|
||||
"version": "==2019.3.9"
|
||||
},
|
||||
"chardet": {
|
||||
"hashes": [
|
||||
"sha256:84ab92ed1c4d4f16916e05906b6b75a6c0fb5db821cc65e70cbd64a3e2a5eaae",
|
||||
"sha256:fc323ffcaeaed0e0a02bf4d117757b98aed530d9ed4531e3e15460124c106691"
|
||||
],
|
||||
"version": "==3.0.4"
|
||||
},
|
||||
"commonmark": {
|
||||
"hashes": [
|
||||
"sha256:9f6dda7876b2bb88dd784440166f4bc8e56cb2b2551264051123bacb0b6c1d8a",
|
||||
"sha256:abcbc854e0eae5deaf52ae5e328501b78b4a0758bf98ac8bb792fce993006084"
|
||||
],
|
||||
"version": "==0.8.1"
|
||||
},
|
||||
"decorator": {
|
||||
"hashes": [
|
||||
"sha256:86156361c50488b84a3f148056ea716ca587df2f0de1d34750d35c21312725de",
|
||||
"sha256:f069f3a01830ca754ba5258fde2278454a0b5b79e0d7f5c13b3b97e57d4acff6"
|
||||
],
|
||||
"version": "==4.4.0"
|
||||
},
|
||||
"django-stubs": {
|
||||
"hashes": [
|
||||
"sha256:9c06a4b28fc8c18f6abee4f199f8ee29cb5cfcecf349e912ded31cb3526ea2b6",
|
||||
"sha256:9ef230843a24b5d74f2ebd4c60f9bea09c21911bc119d0325e8bb47e2f495e70"
|
||||
],
|
||||
"index": "pypi",
|
||||
"version": "==0.12.1"
|
||||
},
|
||||
"docutils": {
|
||||
"hashes": [
|
||||
"sha256:02aec4bd92ab067f6ff27a38a38a41173bf01bed8f89157768c1573f53e474a6",
|
||||
"sha256:51e64ef2ebfb29cae1faa133b3710143496eca21c530f3f71424d77687764274",
|
||||
"sha256:7a4bd47eaf6596e1295ecb11361139febe29b084a87bf005bf899f9a42edc3c6"
|
||||
],
|
||||
"version": "==0.14"
|
||||
},
|
||||
"entrypoints": {
|
||||
"hashes": [
|
||||
"sha256:589f874b313739ad35be6e0cd7efde2a4e9b6fea91edcc34e58ecbb8dbe56d19",
|
||||
"sha256:c70dd71abe5a8c85e55e12c19bd91ccfeec11a6e99044204511f9ed547d48451"
|
||||
],
|
||||
"version": "==0.3"
|
||||
},
|
||||
"flake8": {
|
||||
"hashes": [
|
||||
"sha256:859996073f341f2670741b51ec1e67a01da142831aa1fdc6242dbf88dffbe661",
|
||||
"sha256:a796a115208f5c03b18f332f7c11729812c8c3ded6c46319c59b53efd3819da8"
|
||||
],
|
||||
"index": "pypi",
|
||||
"version": "==3.7.7"
|
||||
},
|
||||
"future": {
|
||||
"hashes": [
|
||||
"sha256:67045236dcfd6816dc439556d009594abf643e5eb48992e36beac09c2ca659b8"
|
||||
],
|
||||
"version": "==0.17.1"
|
||||
},
|
||||
"idna": {
|
||||
"hashes": [
|
||||
"sha256:c357b3f628cf53ae2c4c05627ecc484553142ca23264e593d327bcde5e9c3407",
|
||||
"sha256:ea8b7f6188e6fa117537c3df7da9fc686d485087abf6ac197f9c46432f7e4a3c"
|
||||
],
|
||||
"version": "==2.8"
|
||||
},
|
||||
"imagesize": {
|
||||
"hashes": [
|
||||
"sha256:3f349de3eb99145973fefb7dbe38554414e5c30abd0c8e4b970a7c9d09f3a1d8",
|
||||
"sha256:f3832918bc3c66617f92e35f5d70729187676313caa60c187eb0f28b8fe5e3b5"
|
||||
],
|
||||
"version": "==1.1.0"
|
||||
},
|
||||
"ipdb": {
|
||||
"hashes": [
|
||||
"sha256:dce2112557edfe759742ca2d0fee35c59c97b0cc7a05398b791079d78f1519ce"
|
||||
],
|
||||
"index": "pypi",
|
||||
"version": "==0.12"
|
||||
},
|
||||
"ipython": {
|
||||
"hashes": [
|
||||
"sha256:54c5a8aa1eadd269ac210b96923688ccf01ebb2d0f21c18c3c717909583579a8",
|
||||
"sha256:e840810029224b56cd0d9e7719dc3b39cf84d577f8ac686547c8ba7a06eeab26"
|
||||
],
|
||||
"index": "pypi",
|
||||
"version": "==7.5.0"
|
||||
},
|
||||
"ipython-genutils": {
|
||||
"hashes": [
|
||||
"sha256:72dd37233799e619666c9f639a9da83c34013a73e8bbc79a7a6348d93c61fab8",
|
||||
"sha256:eb2e116e75ecef9d4d228fdc66af54269afa26ab4463042e33785b887c628ba8"
|
||||
],
|
||||
"version": "==0.2.0"
|
||||
},
|
||||
"jedi": {
|
||||
"hashes": [
|
||||
"sha256:2bb0603e3506f708e792c7f4ad8fc2a7a9d9c2d292a358fbbd58da531695595b",
|
||||
"sha256:2c6bcd9545c7d6440951b12b44d373479bf18123a401a52025cf98563fbd826c"
|
||||
],
|
||||
"version": "==0.13.3"
|
||||
},
|
||||
"jinja2": {
|
||||
"hashes": [
|
||||
"sha256:065c4f02ebe7f7cf559e49ee5a95fb800a9e4528727aec6f24402a5374c65013",
|
||||
"sha256:14dd6caf1527abb21f08f86c784eac40853ba93edb79552aa1e4b8aef1b61c7b"
|
||||
],
|
||||
"version": "==2.10.1"
|
||||
},
|
||||
"markupsafe": {
|
||||
"hashes": [
|
||||
"sha256:00bc623926325b26bb9605ae9eae8a215691f33cae5df11ca5424f06f2d1f473",
|
||||
"sha256:09027a7803a62ca78792ad89403b1b7a73a01c8cb65909cd876f7fcebd79b161",
|
||||
"sha256:09c4b7f37d6c648cb13f9230d847adf22f8171b1ccc4d5682398e77f40309235",
|
||||
"sha256:1027c282dad077d0bae18be6794e6b6b8c91d58ed8a8d89a89d59693b9131db5",
|
||||
"sha256:24982cc2533820871eba85ba648cd53d8623687ff11cbb805be4ff7b4c971aff",
|
||||
"sha256:29872e92839765e546828bb7754a68c418d927cd064fd4708fab9fe9c8bb116b",
|
||||
"sha256:43a55c2930bbc139570ac2452adf3d70cdbb3cfe5912c71cdce1c2c6bbd9c5d1",
|
||||
"sha256:46c99d2de99945ec5cb54f23c8cd5689f6d7177305ebff350a58ce5f8de1669e",
|
||||
"sha256:500d4957e52ddc3351cabf489e79c91c17f6e0899158447047588650b5e69183",
|
||||
"sha256:535f6fc4d397c1563d08b88e485c3496cf5784e927af890fb3c3aac7f933ec66",
|
||||
"sha256:62fe6c95e3ec8a7fad637b7f3d372c15ec1caa01ab47926cfdf7a75b40e0eac1",
|
||||
"sha256:6dd73240d2af64df90aa7c4e7481e23825ea70af4b4922f8ede5b9e35f78a3b1",
|
||||
"sha256:717ba8fe3ae9cc0006d7c451f0bb265ee07739daf76355d06366154ee68d221e",
|
||||
"sha256:79855e1c5b8da654cf486b830bd42c06e8780cea587384cf6545b7d9ac013a0b",
|
||||
"sha256:7c1699dfe0cf8ff607dbdcc1e9b9af1755371f92a68f706051cc8c37d447c905",
|
||||
"sha256:88e5fcfb52ee7b911e8bb6d6aa2fd21fbecc674eadd44118a9cc3863f938e735",
|
||||
"sha256:8defac2f2ccd6805ebf65f5eeb132adcf2ab57aa11fdf4c0dd5169a004710e7d",
|
||||
"sha256:98c7086708b163d425c67c7a91bad6e466bb99d797aa64f965e9d25c12111a5e",
|
||||
"sha256:9add70b36c5666a2ed02b43b335fe19002ee5235efd4b8a89bfcf9005bebac0d",
|
||||
"sha256:9bf40443012702a1d2070043cb6291650a0841ece432556f784f004937f0f32c",
|
||||
"sha256:ade5e387d2ad0d7ebf59146cc00c8044acbd863725f887353a10df825fc8ae21",
|
||||
"sha256:b00c1de48212e4cc9603895652c5c410df699856a2853135b3967591e4beebc2",
|
||||
"sha256:b1282f8c00509d99fef04d8ba936b156d419be841854fe901d8ae224c59f0be5",
|
||||
"sha256:b2051432115498d3562c084a49bba65d97cf251f5a331c64a12ee7e04dacc51b",
|
||||
"sha256:ba59edeaa2fc6114428f1637ffff42da1e311e29382d81b339c1817d37ec93c6",
|
||||
"sha256:c8716a48d94b06bb3b2524c2b77e055fb313aeb4ea620c8dd03a105574ba704f",
|
||||
"sha256:cd5df75523866410809ca100dc9681e301e3c27567cf498077e8551b6d20e42f",
|
||||
"sha256:e249096428b3ae81b08327a63a485ad0878de3fb939049038579ac0ef61e17e7"
|
||||
],
|
||||
"version": "==1.1.1"
|
||||
},
|
||||
"mccabe": {
|
||||
"hashes": [
|
||||
"sha256:ab8a6258860da4b6677da4bd2fe5dc2c659cff31b3ee4f7f5d64e79735b80d42",
|
||||
"sha256:dd8d182285a0fe56bace7f45b5e7d1a6ebcbf524e8f3bd87eb0f125271b8831f"
|
||||
],
|
||||
"version": "==0.6.1"
|
||||
},
|
||||
"mypy": {
|
||||
"hashes": [
|
||||
"sha256:2afe51527b1f6cdc4a5f34fc90473109b22bf7f21086ba3e9451857cf11489e6",
|
||||
"sha256:56a16df3e0abb145d8accd5dbb70eba6c4bd26e2f89042b491faa78c9635d1e2",
|
||||
"sha256:5764f10d27b2e93c84f70af5778941b8f4aa1379b2430f85c827e0f5464e8714",
|
||||
"sha256:5bbc86374f04a3aa817622f98e40375ccb28c4836f36b66706cf3c6ccce86eda",
|
||||
"sha256:6a9343089f6377e71e20ca734cd8e7ac25d36478a9df580efabfe9059819bf82",
|
||||
"sha256:6c9851bc4a23dc1d854d3f5dfd5f20a016f8da86bcdbb42687879bb5f86434b0",
|
||||
"sha256:b8e85956af3fcf043d6f87c91cbe8705073fc67029ba6e22d3468bfee42c4823",
|
||||
"sha256:b9a0af8fae490306bc112229000aa0c2ccc837b49d29a5c42e088c132a2334dd",
|
||||
"sha256:bbf643528e2a55df2c1587008d6e3bda5c0445f1240dfa85129af22ae16d7a9a",
|
||||
"sha256:c46ab3438bd21511db0f2c612d89d8344154c0c9494afc7fbc932de514cf8d15",
|
||||
"sha256:f7a83d6bd805855ef83ec605eb01ab4fa42bcef254b13631e451cbb44914a9b0"
|
||||
],
|
||||
"index": "pypi",
|
||||
"version": "==0.701"
|
||||
},
|
||||
"mypy-extensions": {
|
||||
"hashes": [
|
||||
"sha256:37e0e956f41369209a3d5f34580150bcacfabaa57b33a15c0b25f4b5725e0812",
|
||||
"sha256:b16cabe759f55e3409a7d231ebd2841378fb0c27a5d1994719e340e4f429ac3e"
|
||||
],
|
||||
"index": "pypi",
|
||||
"version": "==0.4.1"
|
||||
},
|
||||
"packaging": {
|
||||
"hashes": [
|
||||
"sha256:0c98a5d0be38ed775798ece1b9727178c4469d9c3b4ada66e8e6b7849f8732af",
|
||||
"sha256:9e1cbf8c12b1f1ce0bb5344b8d7ecf66a6f8a6e91bcb0c84593ed6d3ab5c4ab3"
|
||||
],
|
||||
"version": "==19.0"
|
||||
},
|
||||
"parso": {
|
||||
"hashes": [
|
||||
"sha256:17cc2d7a945eb42c3569d4564cdf49bde221bc2b552af3eca9c1aad517dcdd33",
|
||||
"sha256:2e9574cb12e7112a87253e14e2c380ce312060269d04bd018478a3c92ea9a376"
|
||||
],
|
||||
"version": "==0.4.0"
|
||||
},
|
||||
"pexpect": {
|
||||
"hashes": [
|
||||
"sha256:2094eefdfcf37a1fdbfb9aa090862c1a4878e5c7e0e7e7088bdb511c558e5cd1",
|
||||
"sha256:9e2c1fd0e6ee3a49b28f95d4b33bc389c89b20af6a1255906e90ff1262ce62eb"
|
||||
],
|
||||
"markers": "sys_platform != 'win32'",
|
||||
"version": "==4.7.0"
|
||||
},
|
||||
"pickleshare": {
|
||||
"hashes": [
|
||||
"sha256:87683d47965c1da65cdacaf31c8441d12b8044cdec9aca500cd78fc2c683afca",
|
||||
"sha256:9649af414d74d4df115d5d718f82acb59c9d418196b7b4290ed47a12ce62df56"
|
||||
],
|
||||
"version": "==0.7.5"
|
||||
},
|
||||
"prompt-toolkit": {
|
||||
"hashes": [
|
||||
"sha256:11adf3389a996a6d45cc277580d0d53e8a5afd281d0c9ec71b28e6f121463780",
|
||||
"sha256:2519ad1d8038fd5fc8e770362237ad0364d16a7650fb5724af6997ed5515e3c1",
|
||||
"sha256:977c6583ae813a37dc1c2e1b715892461fcbdaa57f6fc62f33a528c4886c8f55"
|
||||
],
|
||||
"version": "==2.0.9"
|
||||
},
|
||||
"ptyprocess": {
|
||||
"hashes": [
|
||||
"sha256:923f299cc5ad920c68f2bc0bc98b75b9f838b93b599941a6b63ddbc2476394c0",
|
||||
"sha256:d7cc528d76e76342423ca640335bd3633420dc1366f258cb31d05e865ef5ca1f"
|
||||
],
|
||||
"version": "==0.6.0"
|
||||
},
|
||||
"pycodestyle": {
|
||||
"hashes": [
|
||||
"sha256:95a2219d12372f05704562a14ec30bc76b05a5b297b21a5dfe3f6fac3491ae56",
|
||||
"sha256:e40a936c9a450ad81df37f549d676d127b1b66000a6c500caa2b085bc0ca976c"
|
||||
],
|
||||
"version": "==2.5.0"
|
||||
},
|
||||
"pyflakes": {
|
||||
"hashes": [
|
||||
"sha256:17dbeb2e3f4d772725c777fabc446d5634d1038f234e77343108ce445ea69ce0",
|
||||
"sha256:d976835886f8c5b31d47970ed689944a0262b5f3afa00a5a7b4dc81e5449f8a2"
|
||||
],
|
||||
"version": "==2.1.1"
|
||||
},
|
||||
"pygments": {
|
||||
"hashes": [
|
||||
"sha256:5ffada19f6203563680669ee7f53b64dabbeb100eb51b61996085e99c03b284a",
|
||||
"sha256:e8218dd399a61674745138520d0d4cf2621d7e032439341bc3f647bff125818d"
|
||||
],
|
||||
"version": "==2.3.1"
|
||||
},
|
||||
"pyparsing": {
|
||||
"hashes": [
|
||||
"sha256:1873c03321fc118f4e9746baf201ff990ceb915f433f23b395f5580d1840cb2a",
|
||||
"sha256:9b6323ef4ab914af344ba97510e966d64ba91055d6b9afa6b30799340e89cc03"
|
||||
],
|
||||
"version": "==2.4.0"
|
||||
},
|
||||
"pytz": {
|
||||
"hashes": [
|
||||
"sha256:303879e36b721603cc54604edcac9d20401bdbe31e1e4fdee5b9f98d5d31dfda",
|
||||
"sha256:d747dd3d23d77ef44c6a3526e274af6efeb0a6f1afd5a69ba4d5be4098c8e141"
|
||||
],
|
||||
"version": "==2019.1"
|
||||
},
|
||||
"recommonmark": {
|
||||
"hashes": [
|
||||
"sha256:a520b8d25071a51ae23a27cf6252f2fe387f51bdc913390d83b2b50617f5bb48",
|
||||
"sha256:c85228b9b7aea7157662520e74b4e8791c5eacd375332ec68381b52bf10165be"
|
||||
],
|
||||
"index": "pypi",
|
||||
"version": "==0.5.0"
|
||||
},
|
||||
"requests": {
|
||||
"hashes": [
|
||||
"sha256:502a824f31acdacb3a35b6690b5fbf0bc41d63a24a45c4004352b0242707598e",
|
||||
"sha256:7bf2a778576d825600030a110f3c0e3e8edc51dfaafe1c146e39a2027784957b"
|
||||
],
|
||||
"version": "==2.21.0"
|
||||
},
|
||||
"six": {
|
||||
"hashes": [
|
||||
"sha256:3350809f0555b11f552448330d0b52d5f24c91a322ea4a15ef22629740f3761c",
|
||||
"sha256:d16a0141ec1a18405cd4ce8b4613101da75da0e9a7aec5bdd4fa804d0e0eba73"
|
||||
],
|
||||
"version": "==1.12.0"
|
||||
},
|
||||
"snowballstemmer": {
|
||||
"hashes": [
|
||||
"sha256:919f26a68b2c17a7634da993d91339e288964f93c274f1343e3bbbe2096e1128",
|
||||
"sha256:9f3bcd3c401c3e862ec0ebe6d2c069ebc012ce142cce209c098ccb5b09136e89"
|
||||
],
|
||||
"version": "==1.2.1"
|
||||
},
|
||||
"sphinx": {
|
||||
"hashes": [
|
||||
"sha256:423280646fb37944dd3c85c58fb92a20d745793a9f6c511f59da82fa97cd404b",
|
||||
"sha256:de930f42600a4fef993587633984cc5027dedba2464bcf00ddace26b40f8d9ce"
|
||||
],
|
||||
"index": "pypi",
|
||||
"version": "==2.0.1"
|
||||
},
|
||||
"sphinx-rtd-theme": {
|
||||
"hashes": [
|
||||
"sha256:00cf895504a7895ee433807c62094cf1e95f065843bf3acd17037c3e9a2becd4",
|
||||
"sha256:728607e34d60456d736cc7991fd236afb828b21b82f956c5ea75f94c8414040a"
|
||||
],
|
||||
"index": "pypi",
|
||||
"version": "==0.4.3"
|
||||
},
|
||||
"sphinxcontrib-applehelp": {
|
||||
"hashes": [
|
||||
"sha256:edaa0ab2b2bc74403149cb0209d6775c96de797dfd5b5e2a71981309efab3897",
|
||||
"sha256:fb8dee85af95e5c30c91f10e7eb3c8967308518e0f7488a2828ef7bc191d0d5d"
|
||||
],
|
||||
"version": "==1.0.1"
|
||||
},
|
||||
"sphinxcontrib-devhelp": {
|
||||
"hashes": [
|
||||
"sha256:6c64b077937330a9128a4da74586e8c2130262f014689b4b89e2d08ee7294a34",
|
||||
"sha256:9512ecb00a2b0821a146736b39f7aeb90759834b07e81e8cc23a9c70bacb9981"
|
||||
],
|
||||
"version": "==1.0.1"
|
||||
},
|
||||
"sphinxcontrib-htmlhelp": {
|
||||
"hashes": [
|
||||
"sha256:4670f99f8951bd78cd4ad2ab962f798f5618b17675c35c5ac3b2132a14ea8422",
|
||||
"sha256:d4fd39a65a625c9df86d7fa8a2d9f3cd8299a3a4b15db63b50aac9e161d8eff7"
|
||||
],
|
||||
"version": "==1.0.2"
|
||||
},
|
||||
"sphinxcontrib-jsmath": {
|
||||
"hashes": [
|
||||
"sha256:2ec2eaebfb78f3f2078e73666b1415417a116cc848b72e5172e596c871103178",
|
||||
"sha256:a9925e4a4587247ed2191a22df5f6970656cb8ca2bd6284309578f2153e0c4b8"
|
||||
],
|
||||
"version": "==1.0.1"
|
||||
},
|
||||
"sphinxcontrib-qthelp": {
|
||||
"hashes": [
|
||||
"sha256:513049b93031beb1f57d4daea74068a4feb77aa5630f856fcff2e50de14e9a20",
|
||||
"sha256:79465ce11ae5694ff165becda529a600c754f4bc459778778c7017374d4d406f"
|
||||
],
|
||||
"version": "==1.0.2"
|
||||
},
|
||||
"sphinxcontrib-serializinghtml": {
|
||||
"hashes": [
|
||||
"sha256:c0efb33f8052c04fd7a26c0a07f1678e8512e0faec19f4aa8f2473a8b81d5227",
|
||||
"sha256:db6615af393650bf1151a6cd39120c29abaf93cc60db8c48eb2dddbfdc3a9768"
|
||||
],
|
||||
"version": "==1.1.3"
|
||||
},
|
||||
"traitlets": {
|
||||
"hashes": [
|
||||
"sha256:9c4bd2d267b7153df9152698efb1050a5d84982d3384a37b2c1f7723ba3e7835",
|
||||
"sha256:c6cb5e6f57c5a9bdaa40fa71ce7b4af30298fbab9ece9815b5d995ab6217c7d9"
|
||||
],
|
||||
"version": "==4.3.2"
|
||||
},
|
||||
"typed-ast": {
|
||||
"hashes": [
|
||||
"sha256:04894d268ba6eab7e093d43107869ad49e7b5ef40d1a94243ea49b352061b200",
|
||||
"sha256:16616ece19daddc586e499a3d2f560302c11f122b9c692bc216e821ae32aa0d0",
|
||||
"sha256:252fdae740964b2d3cdfb3f84dcb4d6247a48a6abe2579e8029ab3be3cdc026c",
|
||||
"sha256:2af80a373af123d0b9f44941a46df67ef0ff7a60f95872412a145f4500a7fc99",
|
||||
"sha256:2c88d0a913229a06282b285f42a31e063c3bf9071ff65c5ea4c12acb6977c6a7",
|
||||
"sha256:2ea99c029ebd4b5a308d915cc7fb95b8e1201d60b065450d5d26deb65d3f2bc1",
|
||||
"sha256:3d2e3ab175fc097d2a51c7a0d3fda442f35ebcc93bb1d7bd9b95ad893e44c04d",
|
||||
"sha256:4766dd695548a15ee766927bf883fb90c6ac8321be5a60c141f18628fb7f8da8",
|
||||
"sha256:56b6978798502ef66625a2e0f80cf923da64e328da8bbe16c1ff928c70c873de",
|
||||
"sha256:5cddb6f8bce14325b2863f9d5ac5c51e07b71b462361fd815d1d7706d3a9d682",
|
||||
"sha256:644ee788222d81555af543b70a1098f2025db38eaa99226f3a75a6854924d4db",
|
||||
"sha256:64cf762049fc4775efe6b27161467e76d0ba145862802a65eefc8879086fc6f8",
|
||||
"sha256:68c362848d9fb71d3c3e5f43c09974a0ae319144634e7a47db62f0f2a54a7fa7",
|
||||
"sha256:6c1f3c6f6635e611d58e467bf4371883568f0de9ccc4606f17048142dec14a1f",
|
||||
"sha256:b213d4a02eec4ddf622f4d2fbc539f062af3788d1f332f028a2e19c42da53f15",
|
||||
"sha256:bb27d4e7805a7de0e35bd0cb1411bc85f807968b2b0539597a49a23b00a622ae",
|
||||
"sha256:c9d414512eaa417aadae7758bc118868cd2396b0e6138c1dd4fda96679c079d3",
|
||||
"sha256:f0937165d1e25477b01081c4763d2d9cdc3b18af69cb259dd4f640c9b900fe5e",
|
||||
"sha256:fb96a6e2c11059ecf84e6741a319f93f683e440e341d4489c9b161eca251cf2a",
|
||||
"sha256:fc71d2d6ae56a091a8d94f33ec9d0f2001d1cb1db423d8b4355debfe9ce689b7"
|
||||
],
|
||||
"version": "==1.3.4"
|
||||
},
|
||||
"typing-extensions": {
|
||||
"hashes": [
|
||||
"sha256:07b2c978670896022a43c4b915df8958bec4a6b84add7f2c87b2b728bda3ba64",
|
||||
"sha256:f3f0e67e1d42de47b5c67c32c9b26641642e9170fe7e292991793705cd5fef7c",
|
||||
"sha256:fb2cd053238d33a8ec939190f30cfd736c00653a85a2919415cecf7dc3d9da71"
|
||||
],
|
||||
"version": "==3.7.2"
|
||||
},
|
||||
"urllib3": {
|
||||
"hashes": [
|
||||
"sha256:4c291ca23bbb55c76518905869ef34bdd5f0e46af7afe6861e8375643ffee1a0",
|
||||
"sha256:9a247273df709c4fedb38c711e44292304f73f39ab01beda9f6b9fc375669ac3"
|
||||
],
|
||||
"version": "==1.24.2"
|
||||
},
|
||||
"wcwidth": {
|
||||
"hashes": [
|
||||
"sha256:3df37372226d6e63e1b1e1eda15c594bca98a22d33a23832a90998faa96bc65e",
|
||||
"sha256:f4ebe71925af7b40a864553f761ed559b43544f8f71746c2d756c7fe788ade7c"
|
||||
],
|
||||
"version": "==0.1.7"
|
||||
}
|
||||
}
|
||||
}
|
||||
README.md (271 changes)
@@ -3,7 +3,7 @@
<h1>ArchiveBox<br/><sub>The open-source self-hosted web archive.</sub></h1>

▶️ <a href="https://github.com/pirate/ArchiveBox/wiki/Quickstart">Quickstart</a> |
<a href="https://archive.sweeting.me">Demo</a> |
<a href="https://archivebox.zervice.io/">Demo</a> |
<a href="https://github.com/pirate/ArchiveBox">Github</a> |
<a href="https://github.com/pirate/ArchiveBox/wiki">Documentation</a> |
<a href="#background--motivation">Info & Motivation</a> |
@@ -14,35 +14,41 @@
"Your own personal internet archive" (网站存档 / 爬虫)
</pre>

<a href="http://webchat.freenode.net?channels=ArchiveBox&uio=d4"><img src="https://img.shields.io/badge/Community_chat-IRC-%2328A745.svg"/></a>
<!--<a href="http://webchat.freenode.net?channels=ArchiveBox&uio=d4"><img src="https://img.shields.io/badge/Community_chat-IRC-%2328A745.svg"/></a>-->

<a href="https://github.com/pirate/ArchiveBox/blob/master/LICENSE"><img src="https://img.shields.io/badge/Open_source-MIT-green.svg?logo=git&logoColor=green"/></a>
<a href="https://github.com/pirate/ArchiveBox/commits/dev"><img src="https://img.shields.io/github/last-commit/pirate/ArchiveBox.svg?logo=Sublime+Text&logoColor=green&label=Active"/></a>
<a href="https://github.com/pirate/ArchiveBox"><img src="https://img.shields.io/github/stars/pirate/ArchiveBox.svg?logo=github&label=Stars&logoColor=blue"/></a>
<a href="https://test.pypi.org/project/archivebox/"><img src="https://img.shields.io/badge/Python-%3E%3D3.5-yellow.svg?logo=python&logoColor=yellow"/></a>
<a href="https://test.pypi.org/project/archivebox/"><img src="https://img.shields.io/badge/Python-%3E%3D3.7-yellow.svg?logo=python&logoColor=yellow"/></a>
<a href="https://github.com/pirate/ArchiveBox/wiki/Install#dependencies"><img src="https://img.shields.io/badge/Chromium-%3E%3D59-orange.svg?logo=Google+Chrome&logoColor=orange"/></a>
<a href="https://hub.docker.com/r/nikisweeting/archivebox"><img src="https://img.shields.io/badge/Docker-all%20platforms-lightblue.svg?logo=docker&logoColor=lightblue"/></a>

<hr/>
</div>

**ArchiveBox takes a list of website URLs you want to archive, and creates a local, static, browsable HTML clone of the content from those websites (it saves HTML, JS, media files, PDFs, images and more).**
**ArchiveBox takes a list of website URLs you want to archive, and creates a local, static, browsable HTML clone of the content from those websites (it saves HTML, JS, media files, PDFs, images and more).**

You can use it to preserve access to websites you care about by storing them locally offline. ArchiveBox imports lists of URLs, renders the pages in a headless, autheticated, user-scriptable browser, and then archives the content in multiple redundant common formats (HTML, PDF, PNG, WARC) that will last long after the originals disappear off the internet. It automatically extracts assets and media from pages and saves them in easily-accessible folders, with out-of-the-box support for extracting git repositories, audio, video, subtitles, images, PDFs, and more.
You can use it to preserve access to websites you care about by storing them locally offline. ArchiveBox imports lists of URLs, renders the pages in a headless, authenticated, user-scriptable browser, and then archives the content in multiple redundant common formats (HTML, PDF, PNG, WARC) that will last long after the originals disappear off the internet. It automatically extracts assets and media from pages and saves them in easily-accessible folders, with out-of-the-box support for extracting git repositories, audio, video, subtitles, images, PDFs, and more.

#### How does it work?

```bash
echo 'http://example.com' | ./archive
mkdir data && cd data
archivebox init
archivebox add 'https://example.com'
archivebox add 'https://getpocket.com/users/USERNAME/feed/all' --depth=1
archivebox server
```
After installing the dependencies, just pipe some new links into the `./archive` command to start your archive.

ArchiveBox is written in Python 3.5 and uses wget, Chrome headless, youtube-dl, pywb, and other common unix tools to save each page you add in multiple redundant formats. It doesn't require a constantly running server or backend, just open the generated `output/index.html` in a browser to view the archive. It can import and export links as JSON (among other formats), so it's easy to script or hook up to other APIs. If you run it on a schedule and import from browser history or bookmarks regularly, you can sleep soundly knowing that the slice of the internet you care about will be automatically preserved in multiple, durable long-term formats that will be accessible for decades (or longer).
After installing archivebox, just pass some new links to the `archivebox add` command to start your collection.

ArchiveBox is written in Python 3.7 and uses wget, Chrome headless, youtube-dl, pywb, and other common UNIX tools to save each page you add in multiple redundant formats. It doesn't require a constantly running server or backend (though it does include an optional one), just open the generated `data/index.html` in a browser to view the archive or run `archivebox server` to use the interactive Web UI. It can import and export links as JSON (among other formats), so it's easy to script or hook up to other APIs. If you run it on a schedule and import from browser history or bookmarks regularly, you can sleep soundly knowing that the slice of the internet you care about will be automatically preserved in multiple, durable long-term formats that will be accessible for decades (or longer).
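As a concrete illustration of the JSON import/export point above, a hedged sketch of how it is typically used; the `list --json` flag is assumed from the CLI of this release rather than shown elsewhere in this diff:

```bash
# export the current index as JSON for scripting, backups, or other APIs
archivebox list --json > archive-index.json

# pipe a newline-separated list of URLs back in via stdin
cat urls.txt | archivebox add
```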

<div align="center">

<img src="https://i.imgur.com/3tBL7PU.png" width="30%" alt="CLI Screenshot" align="top">
<img src="https://i.imgur.com/viklZNG.png" width="30%" alt="Desktop index screenshot" align="top">
<img src="https://i.imgur.com/RefWsXB.jpg" width="30%" alt="Desktop details page Screenshot"/><br/>
<img src="https://i.imgur.com/3tBL7PU.png" width="22%" alt="CLI Screenshot" align="top">
<img src="https://i.imgur.com/viklZNG.png" width="22%" alt="Desktop index screenshot" align="top">
<img src="https://i.imgur.com/RefWsXB.jpg" width="22%" alt="Desktop details page Screenshot"/>
<img src="https://i.imgur.com/M6HhzVx.png" width="22%" alt="Desktop details page Screenshot"/><br/>
<sup><a href="https://archive.sweeting.me/">Demo</a> | <a href="https://github.com/pirate/ArchiveBox/wiki/Usage">Usage</a> | <a href="#screenshots">Screenshots</a></sup>
<br/>
<sub>. . . . . . . . . . . . . . . . . . . . . . . . . . . .</sub>
@@ -50,26 +56,56 @@ ArchiveBox is written in Python 3.5 and uses wget, Chrome headless, youtube-dl,

## Quickstart

ArchiveBox has [3 main dependencies](https://github.com/pirate/ArchiveBox/wiki/Install#dependencies) beyond `python3`: `wget`, `chromium`, and `youtube-dl`.
To get started, you can [install them manually](https://github.com/pirate/ArchiveBox/wiki/Install) using your system's package manager, use the [automated helper script](https://github.com/pirate/ArchiveBox/wiki/Quickstart), or use the official [Docker](https://github.com/pirate/ArchiveBox/wiki/Docker) container. All three dependencies are optional if [disabled](https://github.com/pirate/ArchiveBox/wiki/Configuration#archive-method-toggles) in settings.
ArchiveBox is written in `python3.7` and has [3 main binary dependencies](https://github.com/pirate/ArchiveBox/wiki/Install#dependencies): `wget`, `chromium`, and `youtube-dl`.
To get started, you can [install them manually](https://github.com/pirate/ArchiveBox/wiki/Install) using your system's package manager, use the [automated helper script](https://github.com/pirate/ArchiveBox/wiki/Quickstart), or use the official [Docker](https://github.com/pirate/ArchiveBox/wiki/Docker) container. All three dependencies are optional if [disabled](https://github.com/pirate/ArchiveBox/wiki/Configuration#archive-method-toggles) in settings.

```bash
# 1. Install dependencies (use apt on ubuntu, brew on mac, or pkg on BSD)
apt install python3 python3-pip git curl wget youtube-dl chromium-browser

# 2. Download ArchiveBox
git clone https://github.com/pirate/ArchiveBox.git && cd ArchiveBox

# 3. Add your first links to your archive
echo 'https://example.com' | ./archive                   # pass URLs to archive via stdin

./archive https://getpocket.com/users/example/feed/all   # or import an RSS/JSON/XML/TXT feed
# Docker
mkdir data && cd data
docker run -v $PWD:/data nikisweeting/archivebox init
docker run -v $PWD:/data nikisweeting/archivebox add 'https://example.com'
docker run -v $PWD:/data -it nikisweeting/archivebox manage createsuperuser
docker run -v $PWD:/data -p 8000:8000 nikisweeting/archivebox server 0.0.0.0:8000
open http://127.0.0.1:8000
```

One you've added your first links, open `output/index.html` in a browser to view the archive. [DEMO: archive.sweeting.me](https://archive.sweeting.me)
For more information, see the [full Quickstart guide](https://github.com/pirate/ArchiveBox/wiki/Quickstart), [Usage](https://github.com/pirate/ArchiveBox/wiki/Usage), and [Configuration](https://github.com/pirate/ArchiveBox/wiki/Configuration) docs.
```bash
# Docker Compose
# first download: https://github.com/pirate/ArchiveBox/blob/master/docker-compose.yml
docker-compose run archivebox init
docker-compose run archivebox add 'https://example.com'
docker-compose run archivebox manage createsuperuser
docker-compose up
open http://127.0.0.1:8000
```

*(`pip install archivebox` will be available in the near future, follow our [Roadmap](https://github.com/pirate/ArchiveBox/wiki/Roadmap) for progress)*
```bash
# Bare Metal
# Use apt on Ubuntu/Debian, brew on mac, or pkg on BSD
apt install python3 python3-pip git curl wget youtube-dl chromium-browser

pip install archivebox   # install archivebox

mkdir data && cd data    # (doesn't have to be called data)
archivebox init
archivebox add 'https://example.com'   # add URLs via args or stdin

# or import an RSS/JSON/XML/TXT feed/list of links
archivebox add https://getpocket.com/users/USERNAME/feed/all --depth=1
```

Once you've added your first links, open `data/index.html` in a browser to view the static archive.

You can also start it as a server with a full web UI to manage your links:
```bash
archivebox manage createsuperuser
archivebox server
```

You can visit `http://127.0.0.1:8000` in your browser to access it.

[DEMO: archivebox.zervice.io/](https://archivebox.zervice.io)
For more information, see the [full Quickstart guide](https://github.com/pirate/ArchiveBox/wiki/Quickstart), [Usage](https://github.com/pirate/ArchiveBox/wiki/Usage), and [Configuration](https://github.com/pirate/ArchiveBox/wiki/Configuration) docs.
|
||||
---
|
||||
|
||||
|
|
@ -79,72 +115,73 @@ For more information, see the [full Quickstart guide](https://github.com/pirate/
|
|||
|
||||
# Overview
|
||||
|
||||
Because modern websites are complicated and often rely on dynamic content,
|
||||
ArchiveBox archives the sites in **several different formats** beyond what public
|
||||
archiving services like Archive.org and Archive.is are capable of saving. Using multiple
|
||||
methods and the market-dominant browser to execute JS ensures we can save even the most
|
||||
Because modern websites are complicated and often rely on dynamic content,
|
||||
ArchiveBox archives the sites in **several different formats** beyond what public
|
||||
archiving services like Archive.org and Archive.is are capable of saving. Using multiple
|
||||
methods and the market-dominant browser to execute JS ensures we can save even the most
|
||||
complex, finicky websites in at least a few high-quality, long-term data formats.
|
||||
|
||||
ArchiveBox imports a list of URLs from stdin, a remote URL, or a local file, then adds the pages to a local archive folder using wget to create a browsable HTML clone, youtube-dl to extract media, a full instance of headless Chrome for PDF, screenshot, and DOM dumps, and more...
|
||||
|
||||
Running `./archive` adds only new, unique links into `output/` on each run. Because it will ignore duplicates and only archive each link the first time you add it, you can schedule it to [run on a timer](https://github.com/pirate/ArchiveBox/wiki/Scheduled-Archiving) and re-import all your feeds multiple times a day. It will run quickly even if the feeds are large, because it's only archiving the newest links since the last run. For each link, it runs through all the archive methods. Methods that fail will save `None` and be automatically retried on the next run, methods that succeed save their output into the data folder and are never retried/overwritten by subsequent runs. Support for saving multiple snapshots of each site over time will be [added soon](https://github.com/pirate/ArchiveBox/issues/179) (along with the ability to view diffs of the changes between runs).
|
||||
Running `archivebox add` adds only new, unique links into your collection on each run. Because it will ignore duplicates and only archive each link the first time you add it, you can schedule it to [run on a timer](https://github.com/pirate/ArchiveBox/wiki/Scheduled-Archiving) and re-import all your feeds multiple times a day. It will run quickly even if the feeds are large, because it's only archiving the newest links since the last run. For each link, it runs through all the archive methods. Methods that fail will save `None` and be automatically retried on the next run, methods that succeed save their output into the data folder and are never retried/overwritten by subsequent runs. Support for saving multiple snapshots of each site over time will be [added soon](https://github.com/pirate/ArchiveBox/issues/179) (along with the ability to view diffs of the changes between runs).
|
||||
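For example, a single cron entry is enough to keep the archive topped up from a feed. The schedule, data directory, and feed URL below are placeholders for illustration, not values from this repo:

```bash
# crontab -e
# re-import a Pocket RSS feed every 6 hours; links already in the archive are skipped
0 */6 * * *   cd /home/you/archivebox-data && archivebox add 'https://getpocket.com/users/USERNAME/feed/all' >> archivebox-cron.log 2>&1
```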
|
||||
All the archived links are stored by date bookmarked in `output/archive/<timestamp>`, and everything is indexed nicely with JSON & HTML files. The intent is for all the content to be viewable with common software in 50 - 100 years without needing to run ArchiveBox in a VM.
|
||||
All the archived links are stored by date bookmarked in `./archive/<timestamp>`, and everything is indexed nicely with JSON & HTML files. The intent is for all the content to be viewable with common software in 50 - 100 years without needing to run ArchiveBox in a VM.
|
||||
|
||||
#### Can import links from many formats:
|
||||
|
||||
```bash
|
||||
echo 'http://example.com' | ./archive
|
||||
./archive ~/Downloads/firefox_bookmarks_export.html
|
||||
./archive https://example.com/some/rss/feed.xml
|
||||
echo 'http://example.com' | archivebox add
|
||||
archivebox add ~/Downloads/firefox_bookmarks_export.html --depth=1
|
||||
archivebox add https://example.com/some/rss/feed.xml --depth=1
|
||||
```
|
||||
- <img src="https://nicksweeting.com/images/bookmarks.png" height="22px"/> Browser history or bookmarks exports (Chrome, Firefox, Safari, IE, Opera, and more)
|
||||
- <img src="https://nicksweeting.com/images/rss.svg" height="22px"/> RSS, XML, JSON, CSV, SQL, HTML, Markdown, TXT, or any other text-based format
|
||||
- <img src="https://getpocket.com/favicon.ico" height="22px"/> Pocket, Pinboard, Instapaper, Shaarli, Delicious, Reddit Saved Posts, Wallabag, Unmark.it, OneTab, and more
|
||||
|
||||
- <img src="https://nicksweeting.com/images/bookmarks.png" height="22px"/> Browser history or bookmarks exports (Chrome, Firefox, Safari, IE, Opera, and more)
|
||||
- <img src="https://nicksweeting.com/images/rss.svg" height="22px"/> RSS, XML, JSON, CSV, SQL, HTML, Markdown, TXT, or any other text-based format
|
||||
- <img src="https://getpocket.com/favicon.ico" height="22px"/> Pocket, Pinboard, Instapaper, Shaarli, Delicious, Reddit Saved Posts, Wallabag, Unmark.it, OneTab, and more
|
||||
|
||||
See the [Usage: CLI](https://github.com/pirate/ArchiveBox/wiki/Usage#CLI-Usage) page for documentation and examples.
|
||||
|
||||
#### Saves lots of useful stuff for each imported link:
|
||||
|
||||
```bash
|
||||
ls output/archive/<timestamp>/
|
||||
ls ./archive/<timestamp>/
|
||||
```
|
||||
|
||||
- **Index:** `index.html` & `index.json` HTML and JSON index files containing metadata and details
|
||||
- **Title:** `title` title of the site
|
||||
- **Favicon:** `favicon.ico` favicon of the site
|
||||
- **WGET Clone:** `example.com/page-name.html` wget clone of the site, with .html appended if not present
|
||||
- **WARC:** `warc/<timestamp>.gz` gzipped WARC of all the resources fetched while archiving
|
||||
- **PDF:** `output.pdf` Printed PDF of site using headless chrome
|
||||
- **Screenshot:** `screenshot.png` 1440x900 screenshot of site using headless chrome
|
||||
- **DOM Dump:** `output.html` DOM Dump of the HTML after rendering using headless chrome
|
||||
- **URL to Archive.org:** `archive.org.txt` A link to the saved site on archive.org
|
||||
- **Audio & Video:** `media/` all audio/video files + playlists, including subtitles & metadata with youtube-dl
|
||||
- **Source Code:** `git/` clone of any repository found on github, bitbucket, or gitlab links
|
||||
- *More coming soon! See the [Roadmap](https://github.com/pirate/ArchiveBox/wiki/Roadmap)...*
|
||||
- **Index:** `index.html` & `index.json` HTML and JSON index files containing metadata and details
|
||||
- **Title:** `title` title of the site
|
||||
- **Favicon:** `favicon.ico` favicon of the site
|
||||
- **WGET Clone:** `example.com/page-name.html` wget clone of the site, with .html appended if not present
|
||||
- **WARC:** `warc/<timestamp>.gz` gzipped WARC of all the resources fetched while archiving
|
||||
- **PDF:** `output.pdf` Printed PDF of site using headless chrome
|
||||
- **Screenshot:** `screenshot.png` 1440x900 screenshot of site using headless chrome
|
||||
- **DOM Dump:** `output.html` DOM Dump of the HTML after rendering using headless chrome
|
||||
- **URL to Archive.org:** `archive.org.txt` A link to the saved site on archive.org
|
||||
- **Audio & Video:** `media/` all audio/video files + playlists, including subtitles & metadata with youtube-dl
|
||||
- **Source Code:** `git/` clone of any repository found on github, bitbucket, or gitlab links
|
||||
- _More coming soon! See the [Roadmap](https://github.com/pirate/ArchiveBox/wiki/Roadmap)..._
|
||||
|
||||
It does everything out-of-the-box by default, but you can disable or tweak [individual archive methods](https://github.com/pirate/ArchiveBox/wiki/Configuration) via environment variables or config file.
|
||||
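As a minimal sketch of what that looks like (these keys appear in the config defaults; see the Configuration wiki page for the full list), options can be passed as environment variables for a single run, or set persistently in the config file in your data folder:

```bash
# skip the headless-Chrome outputs and give slow sites a longer timeout, for this run only
env SAVE_PDF=False SAVE_SCREENSHOT=False TIMEOUT=120 archivebox add 'https://example.com'
```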
|
||||
If you're importing URLs with secret tokens in them (e.g Google Docs, CodiMD notepads, etc), you may want to disable some of these methods to avoid leaking private URLs to 3rd party APIs during the archiving process. See the [Security Overview](https://github.com/pirate/ArchiveBox/wiki/Security-Overview#stealth-mode) page for more details.
|
||||
If you're importing URLs with secret tokens in them (e.g. Google Docs, CodiMD notepads, etc.), you may want to disable some of these methods to avoid leaking private URLs to 3rd party APIs during the archiving process. See the [Security Overview](https://github.com/pirate/ArchiveBox/wiki/Security-Overview#stealth-mode) page for more details.
|
||||
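For instance, disabling the Archive.org submission step keeps those URLs from being sent to a third party during archiving (the Google Docs URL below is purely illustrative):

```bash
# don't submit this link to archive.org, but still archive it locally
env SAVE_ARCHIVE_DOT_ORG=False archivebox add 'https://docs.google.com/document/d/SOME_PRIVATE_TOKEN/edit'
```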
|
||||
## Key Features
|
||||
|
||||
- [**Free & open source**](https://github.com/pirate/ArchiveBox/blob/master/LICENSE), doesn't require signing up for anything, stores all data locally
|
||||
- [**Few dependencies**](https://github.com/pirate/ArchiveBox/wiki/Install#dependencies) and [simple command line interface](https://github.com/pirate/ArchiveBox/wiki/Usage#CLI-Usage)
|
||||
- [**Comprehensive documentation**](https://github.com/pirate/ArchiveBox/wiki), [active development](https://github.com/pirate/ArchiveBox/wiki/Roadmap), and [rich community](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community)
|
||||
- **Doesn't require a constantly-running server**, proxy, or native app
|
||||
- Easy to set up **[scheduled importing](https://github.com/pirate/ArchiveBox/wiki/Scheduled-Archiving) from multiple sources**
|
||||
- Uses common, **durable, [long-term formats](#saves-lots-of-useful-stuff-for-each-imported-link)** like HTML, JSON, PDF, PNG, and WARC
|
||||
- **Suitable for paywalled / [authenticated content](https://github.com/pirate/ArchiveBox/wiki/Configuration#chrome_user_data_dir)** (can use your cookies)
|
||||
- Can [**run scripts during archiving**](https://github.com/pirate/ArchiveBox/issues/51) to [scroll pages](https://github.com/pirate/ArchiveBox/issues/80), [close modals](https://github.com/pirate/ArchiveBox/issues/175), expand comment threads, etc.
|
||||
- Can also [**mirror content to 3rd-party archiving services**](https://github.com/pirate/ArchiveBox/wiki/Configuration#submit_archive_dot_org) automatically for redundancy
|
||||
- [**Free & open source**](https://github.com/pirate/ArchiveBox/blob/master/LICENSE), doesn't require signing up for anything, stores all data locally
|
||||
- [**Few dependencies**](https://github.com/pirate/ArchiveBox/wiki/Install#dependencies) and [simple command line interface](https://github.com/pirate/ArchiveBox/wiki/Usage#CLI-Usage)
|
||||
- [**Comprehensive documentation**](https://github.com/pirate/ArchiveBox/wiki), [active development](https://github.com/pirate/ArchiveBox/wiki/Roadmap), and [rich community](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community)
|
||||
- **Doesn't require a constantly-running server**, proxy, or native app
|
||||
- Easy to set up **[scheduled importing](https://github.com/pirate/ArchiveBox/wiki/Scheduled-Archiving) from multiple sources**
|
||||
- Uses common, **durable, [long-term formats](#saves-lots-of-useful-stuff-for-each-imported-link)** like HTML, JSON, PDF, PNG, and WARC
|
||||
- ~~**Suitable for paywalled / [authenticated content](https://github.com/pirate/ArchiveBox/wiki/Configuration#chrome_user_data_dir)** (can use your cookies)~~ (do not do this until v0.5 is released with some security fixes)
|
||||
- Can [**run scripts during archiving**](https://github.com/pirate/ArchiveBox/issues/51) to [scroll pages](https://github.com/pirate/ArchiveBox/issues/80), [close modals](https://github.com/pirate/ArchiveBox/issues/175), expand comment threads, etc.
|
||||
- Can also [**mirror content to 3rd-party archiving services**](https://github.com/pirate/ArchiveBox/wiki/Configuration#submit_archive_dot_org) automatically for redundancy
|
||||
|
||||
## Background & Motivation
|
||||
|
||||
Vast treasure troves of knowledge are lost every day on the internet to link rot. As a society, we have an imperative to preserve some important parts of that treasure, just like we preserve our books, paintings, and music in physical libraries long after the originals go out of print or fade into obscurity.
|
||||
Vast treasure troves of knowledge are lost every day on the internet to link rot. As a society, we have an imperative to preserve some important parts of that treasure, just like we preserve our books, paintings, and music in physical libraries long after the originals go out of print or fade into obscurity.
|
||||
|
||||
Whether it's to resist censorship by saving articles before they get taken down or edited, or
|
||||
just to save a collection of early 2010's flash games you love to play, having the tools to
|
||||
just to save a collection of early 2010's flash games you love to play, having the tools to
|
||||
archive internet content enables you to save the stuff you care about most before it disappears.
|
||||
|
||||
<div align="center">
|
||||
|
|
@ -152,10 +189,9 @@ archive internet content enables to you save the stuff you care most about befor
|
|||
<sup><i>Image from <a href="https://digiday.com/media/wtf-link-rot/">WTF is Link Rot?</a>...</i><br/></sup>
|
||||
</div>
|
||||
|
||||
The balance between the permanence and ephemeral nature of content on the internet is part of what makes it beautiful.
|
||||
The balance between the permanence and ephemeral nature of content on the internet is part of what makes it beautiful.
|
||||
I don't think everything should be preserved in an automated fashion, making all content permanent and never removable, but I do think people should be able to decide for themselves and effectively archive specific content that they care about.
|
||||
|
||||
|
||||
## Comparison to Other Projects
|
||||
|
||||
▶ **Check out our [community page](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community) for an index of web archiving initiatives and projects.**
|
||||
|
|
@ -164,41 +200,39 @@ I don't think everything should be preserved in an automated fashion, making all
|
|||
|
||||
#### User Interface & Intended Purpose
|
||||
|
||||
ArchiveBox differentiates itself from [similar projects](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community#Web-Archiving-Projects) by being a simple, one-shot CLI inferface for users to ingest built feeds of URLs over extended periods, as opposed to being a backend service that ingests individual, manually-submitted URLs from a web UI.
|
||||
|
||||
An alternative tool [pywb](https://github.com/webrecorder/pywb) allows you to run a browser through an always-running archiving proxy which records the traffic to WARC files. ArchiveBox intends to support this style of live proxy-archiving using `pywb` in the future, but for now it only ingests lists of links at a time via browser history, bookmarks, RSS, etc.
|
||||
ArchiveBox differentiates itself from [similar projects](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community#Web-Archiving-Projects) by being a simple, one-shot CLI interface for users to ingest bulk feeds of URLs over extended periods, as opposed to being a backend service that ingests individual, manually-submitted URLs from a web UI. However, you can also add URLs via a web interface through our Django frontend.
|
||||
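This changeset also adds an `archivebox oneshot` subcommand for the literal one-shot case: archiving a single URL into a standalone folder. A rough usage sketch based on the new `archivebox_oneshot.py` CLI (the folder name is just the example from its help text):

```bash
# archive a single page into its own folder (per the --out-dir help text)
archivebox oneshot --out-dir=./example.com_archive 'https://example.com'
```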
|
||||
#### Private Local Archives vs Centralized Public Archives
|
||||
|
||||
Unlike crawler software that starts from a seed URL and works outwards, or public tools like Archive.org designed for users to manually submit links from the public internet, ArchiveBox tries to be a set-and-forget archiver suitable for archiving your entire browsing history, RSS feeds, or bookmarks, including private/authenticated content that you wouldn't otherwise share with a centralized service. Also by having each user store their own content locally, we can save much larger portions of everyone's browsing history than a shared centralized service would be able to handle.
|
||||
Unlike crawler software that starts from a seed URL and works outwards, or public tools like Archive.org designed for users to manually submit links from the public internet, ArchiveBox tries to be a set-and-forget archiver suitable for archiving your entire browsing history, RSS feeds, or bookmarks, ~~including private/authenticated content that you wouldn't otherwise share with a centralized service~~ (do not do this until v0.5 is released with some security fixes). Also by having each user store their own content locally, we can save much larger portions of everyone's browsing history than a shared centralized service would be able to handle.
|
||||
|
||||
#### Storage Requirements
|
||||
|
||||
Because ArchiveBox is designed to ingest a firehose of browser history and bookmark feeds to a local disk, it can be much more disk-space intensive than a centralized service like the Internet Archive or Archive.today. However, as storage space gets cheaper and compression improves, you should be able to use it continuously over the years without having to delete anything. In my experience, ArchiveBox uses about 5gb per 1000 articles, but your milage may vary depending on which options you have enabled and what types of sites you're archiving. By default, it archives everything in as many formats as possible, meaning it takes more space than a using a single method, but more content is accurately replayable over extended periods of time. Storage requirements can be reduced by using a compressed/deduplicated filesystem like ZFS/BTRFS, or by setting `SAVE_MEDIA=False` to skip audio & video files.
|
||||
Because ArchiveBox is designed to ingest a firehose of browser history and bookmark feeds to a local disk, it can be much more disk-space intensive than a centralized service like the Internet Archive or Archive.today. However, as storage space gets cheaper and compression improves, you should be able to use it continuously over the years without having to delete anything. In my experience, ArchiveBox uses about 5gb per 1000 articles, but your mileage may vary depending on which options you have enabled and what types of sites you're archiving. By default, it archives everything in as many formats as possible, meaning it takes more space than using a single method, but more content is accurately replayable over extended periods of time. Storage requirements can be reduced by using a compressed/deduplicated filesystem like ZFS/BTRFS, or by setting `SAVE_MEDIA=False` to skip audio & video files.
|
||||
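A rough way to see where the space is going, and to trim the largest consumer (media files), using only the `SAVE_MEDIA` option mentioned above (assumes GNU coreutils):

```bash
# list the ten largest snapshot folders by disk usage
du -sh ./archive/* | sort -h | tail -n 10

# skip audio/video downloads for this run
env SAVE_MEDIA=False archivebox add 'https://example.com'
```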
|
||||
## Learn more
|
||||
|
||||
▶ **Join out our [community chat](http://webchat.freenode.net?channels=ArchiveBox&uio=d4) hosted on IRC freenode.net:`#ArchiveBox`!**
|
||||
<!--▶ **Join our [community chat](http://webchat.freenode.net?channels=ArchiveBox&uio=d4) hosted on IRC freenode.net:`#ArchiveBox`!**-->
|
||||
|
||||
Whether you want learn which organizations are the big players in the web archiving space, want to find a specific open source tool for your web archiving need, or just want to see where archivists hang out online, our Community Wiki page serves as an index of the broader web archiving community. Check it out to learn about some of the coolest web archiving projects and communities on the web!
|
||||
Whether you want to learn which organizations are the big players in the web archiving space, want to find a specific open-source tool for your web archiving need, or just want to see where archivists hang out online, our Community Wiki page serves as an index of the broader web archiving community. Check it out to learn about some of the coolest web archiving projects and communities on the web!
|
||||
|
||||
<img src="https://i.imgur.com/0ZOmOvN.png" width="14%" align="right"/>
|
||||
|
||||
- [Community Wiki](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community)
|
||||
+ [The Master Lists](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community#The-Master-Lists)
|
||||
*Community-maintained indexes of archiving tools and institutions.*
|
||||
+ [Web Archiving Software](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community#Web-Archiving-Projects)
|
||||
*Open source tools and projects in the internet archiving space.*
|
||||
+ [Reading List](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community#Reading-List)
|
||||
*Articles, posts, and blogs relevant to ArchiveBox and web archiving in general.*
|
||||
+ [Communities](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community#Communities)
|
||||
*A collection of the most active internet archiving communities and initiatives.*
|
||||
- Check out the ArchiveBox [Roadmap](https://github.com/pirate/ArchiveBox/wiki/Roadmap) and [Changelog](https://github.com/pirate/ArchiveBox/wiki/Changelog)
|
||||
- Learn why archiving the internet is important by reading the "[On the Importance of Web Archiving](https://parameters.ssrc.org/2018/09/on-the-importance-of-web-archiving/)" blog post.
|
||||
- Or reach out to me for questions and comments via [@theSquashSH](https://twitter.com/thesquashSH) on Twitter.
|
||||
|
||||
- [Community Wiki](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community)
|
||||
- [The Master Lists](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community#The-Master-Lists)
|
||||
_Community-maintained indexes of archiving tools and institutions._
|
||||
- [Web Archiving Software](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community#Web-Archiving-Projects)
|
||||
_Open source tools and projects in the internet archiving space._
|
||||
- [Reading List](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community#Reading-List)
|
||||
_Articles, posts, and blogs relevant to ArchiveBox and web archiving in general._
|
||||
- [Communities](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community#Communities)
|
||||
_A collection of the most active internet archiving communities and initiatives._
|
||||
- Check out the ArchiveBox [Roadmap](https://github.com/pirate/ArchiveBox/wiki/Roadmap) and [Changelog](https://github.com/pirate/ArchiveBox/wiki/Changelog)
|
||||
- Learn why archiving the internet is important by reading the "[On the Importance of Web Archiving](https://parameters.ssrc.org/2018/09/on-the-importance-of-web-archiving/)" blog post.
|
||||
- Or reach out to me for questions and comments via [@theSquashSH](https://twitter.com/thesquashSH) on Twitter.
|
||||
|
||||
---
|
||||
|
||||
|
||||
# Documentation
|
||||
|
||||
<img src="https://read-the-docs-guidelines.readthedocs-hosted.com/_images/logo-dark.png" width="13%" align="right"/>
|
||||
|
|
@ -208,6 +242,7 @@ We use the [Github wiki system](https://github.com/pirate/ArchiveBox/wiki) and [
|
|||
You can also access the docs locally by looking in the [`ArchiveBox/docs/`](https://github.com/pirate/ArchiveBox/wiki/Home) folder.
|
||||
|
||||
You can build the docs by running:
|
||||
|
||||
```bash
|
||||
cd ArchiveBox
|
||||
pipenv install --dev
|
||||
|
|
@ -219,41 +254,29 @@ make html
|
|||
|
||||
## Getting Started
|
||||
|
||||
- [Quickstart](https://github.com/pirate/ArchiveBox/wiki/Quickstart)
|
||||
- [Install](https://github.com/pirate/ArchiveBox/wiki/Install)
|
||||
- [Docker](https://github.com/pirate/ArchiveBox/wiki/Docker)
|
||||
- [Quickstart](https://github.com/pirate/ArchiveBox/wiki/Quickstart)
|
||||
- [Install](https://github.com/pirate/ArchiveBox/wiki/Install)
|
||||
- [Docker](https://github.com/pirate/ArchiveBox/wiki/Docker)
|
||||
|
||||
## Reference
|
||||
|
||||
- [Usage](https://github.com/pirate/ArchiveBox/wiki/Usage)
|
||||
- [Configuration](https://github.com/pirate/ArchiveBox/wiki/Configuration)
|
||||
- [Supported Sources](https://github.com/pirate/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive)
|
||||
- [Supported Outputs](https://github.com/pirate/ArchiveBox/wiki#can-save-these-things-for-each-site)
|
||||
- [Scheduled Archiving](https://github.com/pirate/ArchiveBox/wiki/Scheduled-Archiving)
|
||||
- [Publishing Your Archive](https://github.com/pirate/ArchiveBox/wiki/Publishing-Your-Archive)
|
||||
- [Chromium Install](https://github.com/pirate/ArchiveBox/wiki/Install-Chromium)
|
||||
- [Security Overview](https://github.com/pirate/ArchiveBox/wiki/Security-Overview)
|
||||
- [Troubleshooting](https://github.com/pirate/ArchiveBox/wiki/Troubleshooting)
|
||||
- [Usage](https://github.com/pirate/ArchiveBox/wiki/Usage)
|
||||
- [Configuration](https://github.com/pirate/ArchiveBox/wiki/Configuration)
|
||||
- [Supported Sources](https://github.com/pirate/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive)
|
||||
- [Supported Outputs](https://github.com/pirate/ArchiveBox/wiki#can-save-these-things-for-each-site)
|
||||
- [Scheduled Archiving](https://github.com/pirate/ArchiveBox/wiki/Scheduled-Archiving)
|
||||
- [Publishing Your Archive](https://github.com/pirate/ArchiveBox/wiki/Publishing-Your-Archive)
|
||||
- [Chromium Install](https://github.com/pirate/ArchiveBox/wiki/Install-Chromium)
|
||||
- [Security Overview](https://github.com/pirate/ArchiveBox/wiki/Security-Overview)
|
||||
- [Troubleshooting](https://github.com/pirate/ArchiveBox/wiki/Troubleshooting)
|
||||
|
||||
## More Info
|
||||
|
||||
- [Roadmap](https://github.com/pirate/ArchiveBox/wiki/Roadmap)
|
||||
- [Changelog](https://github.com/pirate/ArchiveBox/wiki/Changelog)
|
||||
- [Donations](https://github.com/pirate/ArchiveBox/wiki/Donations)
|
||||
- [Background & Motivation](https://github.com/pirate/ArchiveBox#background--motivation)
|
||||
- [Web Archiving Community](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community)
|
||||
|
||||
---
|
||||
|
||||
|
||||
# Screenshots
|
||||
|
||||
<div align="center">
|
||||
<img src="https://i.imgur.com/biVfFYr.png" width="18%" alt="CLI Screenshot" align="top">
|
||||
<img src="https://i.imgur.com/viklZNG.png" width="40%" alt="Desktop index screenshot" align="top">
|
||||
<img src="https://i.imgur.com/wnpdAVM.jpg" width="30%" alt="Desktop details page Screenshot" align="top">
|
||||
<img src="https://i.imgur.com/mW2dITg.png" width="8%" alt="Mobile details page screenshot" align="top">
|
||||
</div>
|
||||
- [Roadmap](https://github.com/pirate/ArchiveBox/wiki/Roadmap)
|
||||
- [Changelog](https://github.com/pirate/ArchiveBox/wiki/Changelog)
|
||||
- [Donations](https://github.com/pirate/ArchiveBox/wiki/Donations)
|
||||
- [Background & Motivation](https://github.com/pirate/ArchiveBox#background--motivation)
|
||||
- [Web Archiving Community](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community)
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -261,24 +284,18 @@ make html
|
|||
<br/><br/>
|
||||
<img src="https://raw.githubusercontent.com/Monadical-SAS/redux-time/HEAD/examples/static/jeremy.jpg" height="40px"/>
|
||||
<br/>
|
||||
<sub><i>This project is maintained mostly in <a href="https://nicksweeting.com/blog#About">my spare time</a> with the help from generous contributors.</i></sub>
|
||||
<sub><i>This project is maintained mostly in <a href="https://nicksweeting.com/blog#About">my spare time</a> with help from generous contributors and Monadical.com.</i></sub>
|
||||
<br/><br/>
|
||||
Contributor Spotlight:<br/><br/>
|
||||
|
||||
<a href="https://sourcerer.io/fame/pirate/pirate/ArchiveBox/links/0"><img src="https://sourcerer.io/fame/pirate/pirate/ArchiveBox/images/0"></a>
|
||||
<a href="https://sourcerer.io/fame/pirate/pirate/ArchiveBox/links/1"><img src="https://sourcerer.io/fame/pirate/pirate/ArchiveBox/images/1"></a>
|
||||
<a href="https://sourcerer.io/fame/pirate/pirate/ArchiveBox/links/2"><img src="https://sourcerer.io/fame/pirate/pirate/ArchiveBox/images/2"></a>
|
||||
<a href="https://sourcerer.io/fame/pirate/pirate/ArchiveBox/links/3"><img src="https://sourcerer.io/fame/pirate/pirate/ArchiveBox/images/3"></a>
|
||||
<a href="https://sourcerer.io/fame/pirate/pirate/ArchiveBox/links/4"><img src="https://sourcerer.io/fame/pirate/pirate/ArchiveBox/images/4"></a>
|
||||
<a href="https://sourcerer.io/fame/pirate/pirate/ArchiveBox/links/5"><img src="https://sourcerer.io/fame/pirate/pirate/ArchiveBox/images/5"></a>
|
||||
|
||||
<br/>
|
||||
<a href="https://github.com/sponsors/pirate">Sponsor us on Github</a>
|
||||
<br>
|
||||
<br>
|
||||
<a href="https://www.patreon.com/theSquashSH"><img src="https://img.shields.io/badge/Donate_to_support_development-via_Patreon-%23DD5D76.svg?style=flat"/></a>
|
||||
<br/>
|
||||
<br/>
|
||||
|
||||
<a href="https://twitter.com/thesquashSH"><img src="https://img.shields.io/badge/Tweet-%40theSquashSH-blue.svg?style=flat"/></a>
|
||||
<a href="https://github.com/pirate/ArchiveBox"><img src="https://img.shields.io/github/stars/pirate/ArchiveBox.svg?style=flat&label=Star+on+Github"/></a>
|
||||
<a href="http://webchat.freenode.net?channels=ArchiveBox&uio=d4"><img src="https://img.shields.io/badge/Community_chat-IRC-%2328A745.svg"/></a>
|
||||
|
||||
<br/><br/>
|
||||
|
||||
|
|
|
|||
|
|
@ -1 +1 @@
|
|||
theme: jekyll-theme-merlot
|
||||
theme: jekyll-theme-minimal
|
||||
|
|
@ -1,4 +1,6 @@
|
|||
[flake8]
|
||||
ignore = D100,D101,D102,D103,D104,D105,D202,D203,D205,D400,E127,E131,E241,E252,E266,E272,E701,E731,W293,W503
|
||||
select = F,E9
|
||||
exclude = migrations,util_scripts,node_modules,venv
|
||||
ignore = D100,D101,D102,D103,D104,D105,D202,D203,D205,D400,E131,E241,E252,E266,E272,E701,E731,W293,W503,W291,W391
|
||||
select = F,E9,W
|
||||
max-line-length = 130
|
||||
max-complexity = 10
|
||||
exclude = migrations,tests,node_modules,vendor,static,venv,.venv,.venv2,.docker-venv
|
||||
|
|
|
|||
|
|
@ -1 +1 @@
|
|||
0.4.0
|
||||
0.4.13
|
||||
|
|
|
|||
|
|
@ -1,6 +1 @@
|
|||
__package__ = 'archivebox'
|
||||
|
||||
from . import core
|
||||
from . import cli
|
||||
|
||||
from .main import *
|
||||
|
|
|
|||
|
|
@ -3,13 +3,9 @@
|
|||
__package__ = 'archivebox'
|
||||
|
||||
import sys
|
||||
from .cli import archivebox
|
||||
|
||||
|
||||
def main():
|
||||
archivebox.main(args=sys.argv[1:], stdin=sys.stdin)
|
||||
from .cli import main
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
archivebox.main(args=sys.argv[1:], stdin=sys.stdin)
|
||||
|
||||
main(args=sys.argv[1:], stdin=sys.stdin)
|
||||
|
|
|
|||
|
|
@ -1,8 +1,14 @@
|
|||
__package__ = 'archivebox.cli'
|
||||
__command__ = 'archivebox'
|
||||
|
||||
import os
|
||||
import sys
|
||||
import argparse
|
||||
|
||||
from typing import Optional, Dict, List, IO
|
||||
|
||||
from ..config import OUTPUT_DIR
|
||||
|
||||
from typing import Dict, List, Optional, IO
|
||||
from importlib import import_module
|
||||
|
||||
CLI_DIR = os.path.dirname(os.path.abspath(__file__))
|
||||
|
|
@ -24,6 +30,7 @@ is_valid_cli_module = lambda module, subcommand: (
|
|||
and module.__command__.split(' ')[-1] == subcommand
|
||||
)
|
||||
|
||||
|
||||
def list_subcommands() -> Dict[str, str]:
|
||||
"""find and import all valid archivebox_<subcommand>.py files in CLI_DIR"""
|
||||
|
||||
|
|
@ -57,6 +64,69 @@ def run_subcommand(subcommand: str,
|
|||
|
||||
SUBCOMMANDS = list_subcommands()
|
||||
|
||||
class NotProvided:
|
||||
pass
|
||||
|
||||
|
||||
def main(args: Optional[List[str]]=NotProvided, stdin: Optional[IO]=NotProvided, pwd: Optional[str]=None) -> None:
|
||||
args = sys.argv[1:] if args is NotProvided else args
|
||||
stdin = sys.stdin if stdin is NotProvided else stdin
|
||||
|
||||
subcommands = list_subcommands()
|
||||
parser = argparse.ArgumentParser(
|
||||
prog=__command__,
|
||||
description='ArchiveBox: The self-hosted internet archive',
|
||||
add_help=False,
|
||||
)
|
||||
group = parser.add_mutually_exclusive_group()
|
||||
group.add_argument(
|
||||
'--help', '-h',
|
||||
action='store_true',
|
||||
help=subcommands['help'],
|
||||
)
|
||||
group.add_argument(
|
||||
'--version',
|
||||
action='store_true',
|
||||
help=subcommands['version'],
|
||||
)
|
||||
group.add_argument(
|
||||
"subcommand",
|
||||
type=str,
|
||||
help= "The name of the subcommand to run",
|
||||
nargs='?',
|
||||
choices=subcommands.keys(),
|
||||
default=None,
|
||||
)
|
||||
parser.add_argument(
|
||||
"subcommand_args",
|
||||
help="Arguments for the subcommand",
|
||||
nargs=argparse.REMAINDER,
|
||||
)
|
||||
command = parser.parse_args(args or ())
|
||||
|
||||
if command.help or command.subcommand is None:
|
||||
command.subcommand = 'help'
|
||||
elif command.version:
|
||||
command.subcommand = 'version'
|
||||
|
||||
if command.subcommand not in ('help', 'version', 'status'):
|
||||
from ..logging_util import log_cli_command
|
||||
|
||||
log_cli_command(
|
||||
subcommand=command.subcommand,
|
||||
subcommand_args=command.subcommand_args,
|
||||
stdin=stdin,
|
||||
pwd=pwd or OUTPUT_DIR
|
||||
)
|
||||
|
||||
run_subcommand(
|
||||
subcommand=command.subcommand,
|
||||
subcommand_args=command.subcommand_args,
|
||||
stdin=stdin,
|
||||
pwd=pwd or OUTPUT_DIR,
|
||||
)
|
||||
|
||||
|
||||
__all__ = (
|
||||
'SUBCOMMANDS',
|
||||
'list_subcommands',
|
||||
|
|
|
|||
|
|
@ -1,63 +0,0 @@
|
|||
#!/usr/bin/env python3
|
||||
# archivebox [command]
|
||||
|
||||
__package__ = 'archivebox.cli'
|
||||
__command__ = 'archivebox'
|
||||
|
||||
import sys
|
||||
import argparse
|
||||
|
||||
from typing import Optional, List, IO
|
||||
|
||||
from . import list_subcommands, run_subcommand
|
||||
from ..config import OUTPUT_DIR
|
||||
|
||||
|
||||
def main(args: Optional[List[str]]=None, stdin: Optional[IO]=None, pwd: Optional[str]=None) -> None:
|
||||
subcommands = list_subcommands()
|
||||
parser = argparse.ArgumentParser(
|
||||
prog=__command__,
|
||||
description='ArchiveBox: The self-hosted internet archive',
|
||||
add_help=False,
|
||||
)
|
||||
group = parser.add_mutually_exclusive_group()
|
||||
group.add_argument(
|
||||
'--help', '-h',
|
||||
action='store_true',
|
||||
help=subcommands['help'],
|
||||
)
|
||||
group.add_argument(
|
||||
'--version',
|
||||
action='store_true',
|
||||
help=subcommands['version'],
|
||||
)
|
||||
group.add_argument(
|
||||
"subcommand",
|
||||
type=str,
|
||||
help= "The name of the subcommand to run",
|
||||
nargs='?',
|
||||
choices=subcommands.keys(),
|
||||
default=None,
|
||||
)
|
||||
parser.add_argument(
|
||||
"subcommand_args",
|
||||
help="Arguments for the subcommand",
|
||||
nargs=argparse.REMAINDER,
|
||||
)
|
||||
command = parser.parse_args(args or ())
|
||||
|
||||
if command.help or command.subcommand is None:
|
||||
command.subcommand = 'help'
|
||||
if command.version:
|
||||
command.subcommand = 'version'
|
||||
|
||||
run_subcommand(
|
||||
subcommand=command.subcommand,
|
||||
subcommand_args=command.subcommand_args,
|
||||
stdin=stdin,
|
||||
pwd=pwd or OUTPUT_DIR,
|
||||
)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main(args=sys.argv[1:], stdin=sys.stdin)
|
||||
|
|
@ -8,9 +8,10 @@ import argparse
|
|||
|
||||
from typing import List, Optional, IO
|
||||
|
||||
from ..main import add, docstring
|
||||
from ..main import add
|
||||
from ..util import docstring
|
||||
from ..config import OUTPUT_DIR, ONLY_NEW
|
||||
from .logging import SmartFormatter, accept_stdin
|
||||
from ..logging_util import SmartFormatter, accept_stdin, stderr
|
||||
|
||||
|
||||
@docstring(add.__doc__)
|
||||
|
|
@ -33,23 +34,39 @@ def main(args: Optional[List[str]]=None, stdin: Optional[IO]=None, pwd: Optional
|
|||
help="Add the links to the main index without archiving them",
|
||||
)
|
||||
parser.add_argument(
|
||||
'import_path',
|
||||
nargs='?',
|
||||
'urls',
|
||||
nargs='*',
|
||||
type=str,
|
||||
default=None,
|
||||
help=(
|
||||
'URL or path to local file containing a list of links to import. e.g.:\n'
|
||||
'URLs or paths to archive e.g.:\n'
|
||||
' https://getpocket.com/users/USERNAME/feed/all\n'
|
||||
' https://example.com/some/rss/feed.xml\n'
|
||||
' https://example.com\n'
|
||||
' ~/Downloads/firefox_bookmarks_export.html\n'
|
||||
' ~/Desktop/sites_list.csv\n'
|
||||
)
|
||||
)
|
||||
parser.add_argument(
|
||||
"--depth",
|
||||
action="store",
|
||||
default=0,
|
||||
choices=[0, 1],
|
||||
type=int,
|
||||
help="Recursively archive all linked pages up to this many hops away"
|
||||
)
|
||||
command = parser.parse_args(args or ())
|
||||
import_str = accept_stdin(stdin)
|
||||
urls = command.urls
|
||||
stdin_urls = accept_stdin(stdin)
|
||||
if (stdin_urls and urls) or (not stdin and not urls):
|
||||
stderr(
|
||||
'[X] You must pass URLs/paths to add via stdin or CLI arguments.\n',
|
||||
color='red',
|
||||
)
|
||||
raise SystemExit(2)
|
||||
add(
|
||||
import_str=import_str,
|
||||
import_path=command.import_path,
|
||||
urls=stdin_urls or urls,
|
||||
depth=command.depth,
|
||||
update_all=command.update_all,
|
||||
index_only=command.index_only,
|
||||
out_dir=pwd or OUTPUT_DIR,
|
||||
|
|
@ -63,12 +80,6 @@ if __name__ == '__main__':
|
|||
# TODO: Implement these
|
||||
#
|
||||
# parser.add_argument(
|
||||
# '--depth', #'-d',
|
||||
# type=int,
|
||||
# help='Recursively archive all linked pages up to this many hops away',
|
||||
# default=0,
|
||||
# )
|
||||
# parser.add_argument(
|
||||
# '--mirror', #'-m',
|
||||
# action='store_true',
|
||||
# help='Archive an entire site (finding all linked pages below it on the same domain)',
|
||||
|
|
|
|||
|
|
@ -8,9 +8,10 @@ import argparse
|
|||
|
||||
from typing import Optional, List, IO
|
||||
|
||||
from ..main import config, docstring
|
||||
from ..main import config
|
||||
from ..util import docstring
|
||||
from ..config import OUTPUT_DIR
|
||||
from .logging import SmartFormatter, accept_stdin
|
||||
from ..logging_util import SmartFormatter, accept_stdin
|
||||
|
||||
|
||||
@docstring(config.__doc__)
|
||||
|
|
|
|||
|
|
@ -8,9 +8,10 @@ import argparse
|
|||
|
||||
from typing import Optional, List, IO
|
||||
|
||||
from ..main import help, docstring
|
||||
from ..main import help
|
||||
from ..util import docstring
|
||||
from ..config import OUTPUT_DIR
|
||||
from .logging import SmartFormatter, reject_stdin
|
||||
from ..logging_util import SmartFormatter, reject_stdin
|
||||
|
||||
|
||||
@docstring(help.__doc__)
|
||||
|
|
|
|||
|
|
@ -8,9 +8,10 @@ import argparse
|
|||
|
||||
from typing import Optional, List, IO
|
||||
|
||||
from ..main import init, docstring
|
||||
from ..main import init
|
||||
from ..util import docstring
|
||||
from ..config import OUTPUT_DIR
|
||||
from .logging import SmartFormatter, reject_stdin
|
||||
from ..logging_util import SmartFormatter, reject_stdin
|
||||
|
||||
|
||||
@docstring(init.__doc__)
|
||||
|
|
|
|||
|
|
@ -8,7 +8,8 @@ import argparse
|
|||
|
||||
from typing import Optional, List, IO
|
||||
|
||||
from ..main import list_all, docstring
|
||||
from ..main import list_all
|
||||
from ..util import docstring
|
||||
from ..config import OUTPUT_DIR
|
||||
from ..index import (
|
||||
get_indexed_folders,
|
||||
|
|
@ -22,7 +23,7 @@ from ..index import (
|
|||
get_corrupted_folders,
|
||||
get_unrecognized_folders,
|
||||
)
|
||||
from .logging import SmartFormatter, accept_stdin
|
||||
from ..logging_util import SmartFormatter, accept_stdin
|
||||
|
||||
|
||||
@docstring(list_all.__doc__)
|
||||
|
|
|
|||
|
|
@ -7,7 +7,8 @@ import sys
|
|||
|
||||
from typing import Optional, List, IO
|
||||
|
||||
from ..main import manage, docstring
|
||||
from ..main import manage
|
||||
from ..util import docstring
|
||||
from ..config import OUTPUT_DIR
|
||||
|
||||
|
||||
|
|
|
|||
62
archivebox/cli/archivebox_oneshot.py
Normal file
|
|
@ -0,0 +1,62 @@
|
|||
#!/usr/bin/env python3
|
||||
|
||||
__package__ = 'archivebox.cli'
|
||||
__command__ = 'archivebox oneshot'
|
||||
|
||||
import sys
|
||||
import argparse
|
||||
|
||||
from pathlib import Path
|
||||
from typing import List, Optional, IO
|
||||
|
||||
from ..main import oneshot
|
||||
from ..util import docstring
|
||||
from ..config import OUTPUT_DIR
|
||||
from ..logging_util import SmartFormatter, accept_stdin, stderr
|
||||
|
||||
|
||||
@docstring(oneshot.__doc__)
|
||||
def main(args: Optional[List[str]]=None, stdin: Optional[IO]=None, pwd: Optional[str]=None) -> None:
|
||||
parser = argparse.ArgumentParser(
|
||||
prog=__command__,
|
||||
description=oneshot.__doc__,
|
||||
add_help=True,
|
||||
formatter_class=SmartFormatter,
|
||||
)
|
||||
parser.add_argument(
|
||||
'url',
|
||||
type=str,
|
||||
default=None,
|
||||
help=(
|
||||
'URLs or paths to archive e.g.:\n'
|
||||
' https://getpocket.com/users/USERNAME/feed/all\n'
|
||||
' https://example.com/some/rss/feed.xml\n'
|
||||
' https://example.com\n'
|
||||
' ~/Downloads/firefox_bookmarks_export.html\n'
|
||||
' ~/Desktop/sites_list.csv\n'
|
||||
)
|
||||
)
|
||||
parser.add_argument(
|
||||
'--out-dir',
|
||||
type=str,
|
||||
default=OUTPUT_DIR,
|
||||
help= "Path to save the single archive folder to, e.g. ./example.com_archive"
|
||||
)
|
||||
command = parser.parse_args(args or ())
|
||||
url = command.url
|
||||
stdin_url = accept_stdin(stdin)
|
||||
if (stdin_url and url) or (not stdin and not url):
|
||||
stderr(
|
||||
'[X] You must pass a URL/path to add via stdin or CLI arguments.\n',
|
||||
color='red',
|
||||
)
|
||||
raise SystemExit(2)
|
||||
|
||||
oneshot(
|
||||
url=stdin_url or url,
|
||||
out_dir=str(Path(command.out_dir).absolute()),
|
||||
)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main(args=sys.argv[1:], stdin=sys.stdin)
|
||||
|
|
@ -8,9 +8,10 @@ import argparse
|
|||
|
||||
from typing import Optional, List, IO
|
||||
|
||||
from ..main import remove, docstring
|
||||
from ..main import remove
|
||||
from ..util import docstring
|
||||
from ..config import OUTPUT_DIR
|
||||
from .logging import SmartFormatter, accept_stdin
|
||||
from ..logging_util import SmartFormatter, accept_stdin
|
||||
|
||||
|
||||
@docstring(remove.__doc__)
|
||||
|
|
|
|||
|
|
@ -8,9 +8,10 @@ import argparse
|
|||
|
||||
from typing import Optional, List, IO
|
||||
|
||||
from ..main import schedule, docstring
|
||||
from ..main import schedule
|
||||
from ..util import docstring
|
||||
from ..config import OUTPUT_DIR
|
||||
from .logging import SmartFormatter, reject_stdin
|
||||
from ..logging_util import SmartFormatter, reject_stdin
|
||||
|
||||
|
||||
@docstring(schedule.__doc__)
|
||||
|
|
|
|||
|
|
@ -8,9 +8,10 @@ import argparse
|
|||
|
||||
from typing import Optional, List, IO
|
||||
|
||||
from ..main import server, docstring
|
||||
from ..main import server
|
||||
from ..util import docstring
|
||||
from ..config import OUTPUT_DIR
|
||||
from .logging import SmartFormatter, reject_stdin
|
||||
from ..logging_util import SmartFormatter, reject_stdin
|
||||
|
||||
|
||||
@docstring(server.__doc__)
|
||||
|
|
@ -38,6 +39,11 @@ def main(args: Optional[List[str]]=None, stdin: Optional[IO]=None, pwd: Optional
|
|||
action='store_true',
|
||||
help='Enable DEBUG=True mode with more verbose errors',
|
||||
)
|
||||
parser.add_argument(
|
||||
'--init',
|
||||
action='store_true',
|
||||
help='Run archivebox init before starting the server',
|
||||
)
|
||||
command = parser.parse_args(args or ())
|
||||
reject_stdin(__command__, stdin)
|
||||
|
||||
|
|
@ -45,6 +51,7 @@ def main(args: Optional[List[str]]=None, stdin: Optional[IO]=None, pwd: Optional
|
|||
runserver_args=command.runserver_args,
|
||||
reload=command.reload,
|
||||
debug=command.debug,
|
||||
init=command.init,
|
||||
out_dir=pwd or OUTPUT_DIR,
|
||||
)
|
||||
|
||||
|
|
|
|||
|
|
@ -8,9 +8,10 @@ import argparse
|
|||
|
||||
from typing import Optional, List, IO
|
||||
|
||||
from ..main import shell, docstring
|
||||
from ..main import shell
|
||||
from ..util import docstring
|
||||
from ..config import OUTPUT_DIR
|
||||
from .logging import SmartFormatter, reject_stdin
|
||||
from ..logging_util import SmartFormatter, reject_stdin
|
||||
|
||||
|
||||
@docstring(shell.__doc__)
|
||||
|
|
|
|||
|
|
@ -1,30 +1,31 @@
|
|||
#!/usr/bin/env python3
|
||||
|
||||
__package__ = 'archivebox.cli'
|
||||
__command__ = 'archivebox info'
|
||||
__command__ = 'archivebox status'
|
||||
|
||||
import sys
|
||||
import argparse
|
||||
|
||||
from typing import Optional, List, IO
|
||||
|
||||
from ..main import info, docstring
|
||||
from ..main import status
|
||||
from ..util import docstring
|
||||
from ..config import OUTPUT_DIR
|
||||
from .logging import SmartFormatter, reject_stdin
|
||||
from ..logging_util import SmartFormatter, reject_stdin
|
||||
|
||||
|
||||
@docstring(info.__doc__)
|
||||
@docstring(status.__doc__)
|
||||
def main(args: Optional[List[str]]=None, stdin: Optional[IO]=None, pwd: Optional[str]=None) -> None:
|
||||
parser = argparse.ArgumentParser(
|
||||
prog=__command__,
|
||||
description=info.__doc__,
|
||||
description=status.__doc__,
|
||||
add_help=True,
|
||||
formatter_class=SmartFormatter,
|
||||
)
|
||||
parser.parse_args(args or ())
|
||||
reject_stdin(__command__, stdin)
|
||||
|
||||
info(out_dir=pwd or OUTPUT_DIR)
|
||||
status(out_dir=pwd or OUTPUT_DIR)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
|
|
@ -8,7 +8,8 @@ import argparse
|
|||
|
||||
from typing import List, Optional, IO
|
||||
|
||||
from ..main import update, docstring
|
||||
from ..main import update
|
||||
from ..util import docstring
|
||||
from ..config import OUTPUT_DIR
|
||||
from ..index import (
|
||||
get_indexed_folders,
|
||||
|
|
@ -22,7 +23,7 @@ from ..index import (
|
|||
get_corrupted_folders,
|
||||
get_unrecognized_folders,
|
||||
)
|
||||
from .logging import SmartFormatter, accept_stdin
|
||||
from ..logging_util import SmartFormatter, accept_stdin
|
||||
|
||||
|
||||
@docstring(update.__doc__)
|
||||
|
|
|
|||
|
|
@ -8,9 +8,10 @@ import argparse
|
|||
|
||||
from typing import Optional, List, IO
|
||||
|
||||
from ..main import version, docstring
|
||||
from ..main import version
|
||||
from ..util import docstring
|
||||
from ..config import OUTPUT_DIR
|
||||
from .logging import SmartFormatter, reject_stdin
|
||||
from ..logging_util import SmartFormatter, reject_stdin
|
||||
|
||||
|
||||
@docstring(version.__doc__)
|
||||
|
|
|
|||
|
|
@ -198,7 +198,7 @@ class TestRemove(unittest.TestCase):
|
|||
|
||||
def test_remove_regex(self):
|
||||
with output_hidden():
|
||||
archivebox_remove.main(['--yes', '--delete', '--filter-type=regex', 'http(s)?:\/\/(.+\.)?(example\d\.com)'])
|
||||
archivebox_remove.main(['--yes', '--delete', '--filter-type=regex', r'http(s)?:\/\/(.+\.)?(example\d\.com)'])
|
||||
|
||||
all_links = load_main_index(out_dir=OUTPUT_DIR)
|
||||
assert len(all_links) == 4
|
||||
|
|
|
|||
|
|
@ -9,9 +9,11 @@ import getpass
|
|||
import shutil
|
||||
|
||||
from hashlib import md5
|
||||
from pathlib import Path
|
||||
from typing import Optional, Type, Tuple, Dict
|
||||
from subprocess import run, PIPE, DEVNULL
|
||||
from configparser import ConfigParser
|
||||
from collections import defaultdict
|
||||
|
||||
from .stubs import (
|
||||
SimpleConfigValueDict,
|
||||
|
|
@ -21,6 +23,14 @@ from .stubs import (
|
|||
ConfigDefaultDict,
|
||||
)
|
||||
|
||||
# precedence order for config:
|
||||
# 1. cli args
|
||||
# 2. shell environment vars
|
||||
# 3. config file
|
||||
# 4. defaults
|
||||
|
||||
# env USE_COLOR=false archivebox add '...'
|
||||
# env SHOW_PROGRESS=1 archivebox add '...'
|
||||
|
||||
# ******************************************************************************
|
||||
# Documentation: https://github.com/pirate/ArchiveBox/wiki/Configuration
|
||||
|
|
@ -35,6 +45,8 @@ CONFIG_DEFAULTS: Dict[str, ConfigDefaultDict] = {
|
|||
'IS_TTY': {'type': bool, 'default': lambda _: sys.stdout.isatty()},
|
||||
'USE_COLOR': {'type': bool, 'default': lambda c: c['IS_TTY']},
|
||||
'SHOW_PROGRESS': {'type': bool, 'default': lambda c: c['IS_TTY']},
|
||||
'IN_DOCKER': {'type': bool, 'default': False},
|
||||
# TODO: 'SHOW_HINTS': {'type: bool, 'default': True},
|
||||
},
|
||||
|
||||
'GENERAL_CONFIG': {
|
||||
|
|
@ -44,21 +56,33 @@ CONFIG_DEFAULTS: Dict[str, ConfigDefaultDict] = {
|
|||
'TIMEOUT': {'type': int, 'default': 60},
|
||||
'MEDIA_TIMEOUT': {'type': int, 'default': 3600},
|
||||
'OUTPUT_PERMISSIONS': {'type': str, 'default': '755'},
|
||||
'FOOTER_INFO': {'type': str, 'default': 'Content is hosted for personal archiving purposes only. Contact server owner for any takedown requests.'},
|
||||
'RESTRICT_FILE_NAMES': {'type': str, 'default': 'windows'},
|
||||
'URL_BLACKLIST': {'type': str, 'default': None},
|
||||
},
|
||||
|
||||
'SERVER_CONFIG': {
|
||||
'SECRET_KEY': {'type': str, 'default': None},
|
||||
'ALLOWED_HOSTS': {'type': str, 'default': '*'},
|
||||
'DEBUG': {'type': bool, 'default': False},
|
||||
'PUBLIC_INDEX': {'type': bool, 'default': True},
|
||||
'PUBLIC_SNAPSHOTS': {'type': bool, 'default': True},
|
||||
'FOOTER_INFO': {'type': str, 'default': 'Content is hosted for personal archiving purposes only. Contact server owner for any takedown requests.'},
|
||||
'ACTIVE_THEME': {'type': str, 'default': 'default'},
|
||||
},
|
||||
|
||||
'ARCHIVE_METHOD_TOGGLES': {
|
||||
'SAVE_TITLE': {'type': bool, 'default': True, 'aliases': ('FETCH_TITLE',)},
|
||||
'SAVE_FAVICON': {'type': bool, 'default': True, 'aliases': ('FETCH_FAVICON',)},
|
||||
'SAVE_WGET': {'type': bool, 'default': True, 'aliases': ('FETCH_WGET',)},
|
||||
'SAVE_WGET_REQUISITES': {'type': bool, 'default': True, 'aliases': ('FETCH_WGET_REQUISITES',)},
|
||||
'SAVE_SINGLEFILE': {'type': bool, 'default': True, 'aliases': ('FETCH_SINGLEFILE',)},
|
||||
'SAVE_PDF': {'type': bool, 'default': True, 'aliases': ('FETCH_PDF',)},
|
||||
'SAVE_SCREENSHOT': {'type': bool, 'default': True, 'aliases': ('FETCH_SCREENSHOT',)},
|
||||
'SAVE_DOM': {'type': bool, 'default': True, 'aliases': ('FETCH_DOM',)},
|
||||
'SAVE_WARC': {'type': bool, 'default': True, 'aliases': ('FETCH_WARC',)},
|
||||
'SAVE_GIT': {'type': bool, 'default': True, 'aliases': ('FETCH_GIT',)},
|
||||
'SAVE_MEDIA': {'type': bool, 'default': True, 'aliases': ('FETCH_MEDIA',)},
|
||||
'SAVE_PLAYLISTS': {'type': bool, 'default': True, 'aliases': ('FETCH_PLAYLISTS',)},
|
||||
'SAVE_ARCHIVE_DOT_ORG': {'type': bool, 'default': True, 'aliases': ('SUBMIT_ARCHIVE_DOT_ORG',)},
|
||||
},
|
||||
|
||||
|
|
@ -67,6 +91,7 @@ CONFIG_DEFAULTS: Dict[str, ConfigDefaultDict] = {
|
|||
'GIT_DOMAINS': {'type': str, 'default': 'github.com,bitbucket.org,gitlab.com'},
|
||||
'CHECK_SSL_VALIDITY': {'type': bool, 'default': True},
|
||||
|
||||
'CURL_USER_AGENT': {'type': str, 'default': 'ArchiveBox/{VERSION} (+https://github.com/pirate/ArchiveBox/) curl/{CURL_VERSION}'},
|
||||
'WGET_USER_AGENT': {'type': str, 'default': 'ArchiveBox/{VERSION} (+https://github.com/pirate/ArchiveBox/) wget/{WGET_VERSION}'},
|
||||
'CHROME_USER_AGENT': {'type': str, 'default': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36'},
|
||||
|
||||
|
|
@ -75,11 +100,13 @@ CONFIG_DEFAULTS: Dict[str, ConfigDefaultDict] = {
|
|||
|
||||
'CHROME_HEADLESS': {'type': bool, 'default': True},
|
||||
'CHROME_SANDBOX': {'type': bool, 'default': True},
|
||||
|
||||
},
|
||||
|
||||
'DEPENDENCY_CONFIG': {
|
||||
'USE_CURL': {'type': bool, 'default': True},
|
||||
'USE_WGET': {'type': bool, 'default': True},
|
||||
'USE_SINGLEFILE': {'type': bool, 'default': True},
|
||||
'USE_GIT': {'type': bool, 'default': True},
|
||||
'USE_CHROME': {'type': bool, 'default': True},
|
||||
'USE_YOUTUBEDL': {'type': bool, 'default': True},
|
||||
|
|
@ -87,6 +114,7 @@ CONFIG_DEFAULTS: Dict[str, ConfigDefaultDict] = {
|
|||
'CURL_BINARY': {'type': str, 'default': 'curl'},
|
||||
'GIT_BINARY': {'type': str, 'default': 'git'},
|
||||
'WGET_BINARY': {'type': str, 'default': 'wget'},
|
||||
'SINGLEFILE_BINARY': {'type': str, 'default': 'single-file'},
|
||||
'YOUTUBEDL_BINARY': {'type': str, 'default': 'youtube-dl'},
|
||||
'CHROME_BINARY': {'type': str, 'default': None},
|
||||
},
|
||||
|
|
@ -119,8 +147,20 @@ DEFAULT_CLI_COLORS = {
|
|||
}
|
||||
ANSI = {k: '' for k in DEFAULT_CLI_COLORS.keys()}
|
||||
|
||||
COLOR_DICT = defaultdict(lambda: [(0, 0, 0), (0, 0, 0)], {
|
||||
'00': [(0, 0, 0), (0, 0, 0)],
|
||||
'30': [(0, 0, 0), (0, 0, 0)],
|
||||
'31': [(255, 0, 0), (128, 0, 0)],
|
||||
'32': [(0, 200, 0), (0, 128, 0)],
|
||||
'33': [(255, 255, 0), (128, 128, 0)],
|
||||
'34': [(0, 0, 255), (0, 0, 128)],
|
||||
'35': [(255, 0, 255), (128, 0, 128)],
|
||||
'36': [(0, 255, 255), (0, 128, 128)],
|
||||
'37': [(255, 255, 255), (255, 255, 255)],
|
||||
})
|
||||
|
||||
STATICFILE_EXTENSIONS = {
|
||||
# 99.999% of the time, URLs ending in these extentions are static files
|
||||
# 99.999% of the time, URLs ending in these extensions are static files
|
||||
# that can be downloaded as-is, not html pages that need to be rendered
|
||||
'gif', 'jpeg', 'jpg', 'png', 'tif', 'tiff', 'wbmp', 'ico', 'jng', 'bmp',
|
||||
'svg', 'svgz', 'webp', 'ps', 'eps', 'ai',
|
||||
|
|
@ -137,7 +177,7 @@ STATICFILE_EXTENSIONS = {
|
|||
# pl pm, prc pdb, rar, rpm, sea, sit, tcl tk, der, pem, crt, xpi, xspf,
|
||||
# ra, mng, asx, asf, 3gpp, 3gp, mid, midi, kar, jad, wml, htc, mml
|
||||
|
||||
# Thse are always treated as pages, not as static files, never add them:
|
||||
# These are always treated as pages, not as static files, never add them:
|
||||
# html, htm, shtml, xhtml, xml, aspx, php, cgi
|
||||
}
|
||||
|
||||
|
|
@ -175,11 +215,11 @@ DERIVED_CONFIG_DEFAULTS: ConfigDefaultDict = {
    'TERM_WIDTH': {'default': lambda c: lambda: shutil.get_terminal_size((100, 10)).columns},
    'USER': {'default': lambda c: getpass.getuser() or os.getlogin()},
    'ANSI': {'default': lambda c: DEFAULT_CLI_COLORS if c['USE_COLOR'] else {k: '' for k in DEFAULT_CLI_COLORS.keys()}},

    'REPO_DIR': {'default': lambda c: os.path.abspath(os.path.join(os.path.dirname(os.path.abspath(__file__)), '..', '..'))},
    'PYTHON_DIR': {'default': lambda c: os.path.join(c['REPO_DIR'], PYTHON_DIR_NAME)},
    'TEMPLATES_DIR': {'default': lambda c: os.path.join(c['PYTHON_DIR'], TEMPLATES_DIR_NAME, 'legacy')},

    'OUTPUT_DIR': {'default': lambda c: os.path.abspath(os.path.expanduser(c['OUTPUT_DIR'])) if c['OUTPUT_DIR'] else os.path.abspath(os.curdir)},
    'ARCHIVE_DIR': {'default': lambda c: os.path.join(c['OUTPUT_DIR'], ARCHIVE_DIR_NAME)},
    'SOURCES_DIR': {'default': lambda c: os.path.join(c['OUTPUT_DIR'], SOURCES_DIR_NAME)},

@ -195,13 +235,14 @@ DERIVED_CONFIG_DEFAULTS: ConfigDefaultDict = {
    'PYTHON_BINARY': {'default': lambda c: sys.executable},
    'PYTHON_ENCODING': {'default': lambda c: sys.stdout.encoding.upper()},
    'PYTHON_VERSION': {'default': lambda c: '{}.{}'.format(sys.version_info.major, sys.version_info.minor)},
    'PYTHON_VERSION': {'default': lambda c: '{}.{}.{}'.format(*sys.version_info[:3])},

    'DJANGO_BINARY': {'default': lambda c: django.__file__.replace('__init__.py', 'bin/django-admin.py')},
    'DJANGO_VERSION': {'default': lambda c: '{}.{}.{} {} ({})'.format(*django.VERSION)},

    'USE_CURL': {'default': lambda c: c['USE_CURL'] and (c['SAVE_FAVICON'] or c['SAVE_ARCHIVE_DOT_ORG'])},
    'USE_CURL': {'default': lambda c: c['USE_CURL'] and (c['SAVE_FAVICON'] or c['SAVE_TITLE'] or c['SAVE_ARCHIVE_DOT_ORG'])},
    'CURL_VERSION': {'default': lambda c: bin_version(c['CURL_BINARY']) if c['USE_CURL'] else None},
    'CURL_USER_AGENT': {'default': lambda c: c['CURL_USER_AGENT'].format(**c)},
    'SAVE_FAVICON': {'default': lambda c: c['USE_CURL'] and c['SAVE_FAVICON']},
    'SAVE_ARCHIVE_DOT_ORG': {'default': lambda c: c['USE_CURL'] and c['SAVE_ARCHIVE_DOT_ORG']},

@ -212,6 +253,9 @@ DERIVED_CONFIG_DEFAULTS: ConfigDefaultDict = {
    'SAVE_WGET': {'default': lambda c: c['USE_WGET'] and c['SAVE_WGET']},
    'SAVE_WARC': {'default': lambda c: c['USE_WGET'] and c['SAVE_WARC']},

    'USE_SINGLEFILE': {'default': lambda c: c['USE_SINGLEFILE'] and c['SAVE_SINGLEFILE']},
    'SINGLEFILE_VERSION': {'default': lambda c: bin_version(c['SINGLEFILE_BINARY']) if c['USE_SINGLEFILE'] else None},

    'USE_GIT': {'default': lambda c: c['USE_GIT'] and c['SAVE_GIT']},
    'GIT_VERSION': {'default': lambda c: bin_version(c['GIT_BINARY']) if c['USE_GIT'] else None},
    'SAVE_GIT': {'default': lambda c: c['USE_GIT'] and c['SAVE_GIT']},

@ -219,13 +263,15 @@ DERIVED_CONFIG_DEFAULTS: ConfigDefaultDict = {
    'USE_YOUTUBEDL': {'default': lambda c: c['USE_YOUTUBEDL'] and c['SAVE_MEDIA']},
    'YOUTUBEDL_VERSION': {'default': lambda c: bin_version(c['YOUTUBEDL_BINARY']) if c['USE_YOUTUBEDL'] else None},
    'SAVE_MEDIA': {'default': lambda c: c['USE_YOUTUBEDL'] and c['SAVE_MEDIA']},
    'SAVE_PLAYLISTS': {'default': lambda c: c['SAVE_PLAYLISTS'] and c['SAVE_MEDIA']},

    'USE_CHROME': {'default': lambda c: c['USE_CHROME'] and (c['SAVE_PDF'] or c['SAVE_SCREENSHOT'] or c['SAVE_DOM'])},
    'USE_CHROME': {'default': lambda c: c['USE_CHROME'] and (c['SAVE_PDF'] or c['SAVE_SCREENSHOT'] or c['SAVE_DOM'] or c['SAVE_SINGLEFILE'])},
    'CHROME_BINARY': {'default': lambda c: c['CHROME_BINARY'] if c['CHROME_BINARY'] else find_chrome_binary()},
    'CHROME_VERSION': {'default': lambda c: bin_version(c['CHROME_BINARY']) if c['USE_CHROME'] else None},
    'SAVE_PDF': {'default': lambda c: c['USE_CHROME'] and c['SAVE_PDF']},
    'SAVE_SCREENSHOT': {'default': lambda c: c['USE_CHROME'] and c['SAVE_SCREENSHOT']},
    'SAVE_DOM': {'default': lambda c: c['USE_CHROME'] and c['SAVE_DOM']},
    'SAVE_SINGLEFILE': {'default': lambda c: c['USE_CHROME'] and c['USE_SINGLEFILE']},

    'DEPENDENCIES': {'default': lambda c: get_dependency_info(c)},
    'CODE_LOCATIONS': {'default': lambda c: get_code_locations(c)},
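A minimal sketch of how these derived defaults can be resolved in dependency order, assuming (as the hunk above suggests) that each entry's `'default'` callable receives the partially-built config dict `c`; the `resolve_derived` helper below is hypothetical, not ArchiveBox's actual loader:

```python
# Hypothetical illustration only -- ArchiveBox's real config loader is not shown in this hunk.
from typing import Any, Callable, Dict

ConfigValue = Any

def resolve_derived(defaults: Dict[str, Dict[str, Callable]], config: Dict[str, ConfigValue]) -> Dict[str, ConfigValue]:
    """Evaluate each {'default': lambda c: ...} entry against the config built so far."""
    for key, attrs in defaults.items():
        if key not in config:                 # user-supplied values always win
            config[key] = attrs['default'](config)
    return config

# e.g. USE_CURL resolves to False when SAVE_FAVICON, SAVE_TITLE, and SAVE_ARCHIVE_DOT_ORG
# are all disabled, which in turn forces the later SAVE_FAVICON / SAVE_ARCHIVE_DOT_ORG
# entries to False as well.
```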
@ -245,6 +291,8 @@ def load_config_val(key: str,
|
|||
config: Optional[ConfigDict]=None,
|
||||
env_vars: Optional[os._Environ]=None,
|
||||
config_file_vars: Optional[Dict[str, str]]=None) -> ConfigValue:
|
||||
"""parse bool, int, and str key=value pairs from env"""
|
||||
|
||||
|
||||
config_keys_to_check = (key, *(aliases or ()))
|
||||
for key in config_keys_to_check:
|
||||
|
|
@ -263,7 +311,7 @@ def load_config_val(key: str,
|
|||
return default(config)
|
||||
|
||||
return default
|
||||
|
||||
|
||||
elif type is bool:
|
||||
if val.lower() in ('true', 'yes', '1'):
|
||||
return True
|
||||
|
|
@ -284,6 +332,7 @@ def load_config_val(key: str,
|
|||
|
||||
raise Exception('Config values can only be str, bool, or int')
|
||||
|
||||
|
||||
def load_config_file(out_dir: str=None) -> Optional[Dict[str, str]]:
|
||||
"""load the ini-formatted config file from OUTPUT_DIR/Archivebox.conf"""
|
||||
|
||||
|
|
@ -304,53 +353,67 @@ def load_config_file(out_dir: str=None) -> Optional[Dict[str, str]]:
        return config_file_vars
    return None


def write_config_file(config: Dict[str, str], out_dir: str=None) -> ConfigDict:
    """load the ini-formatted config file from OUTPUT_DIR/Archivebox.conf"""

    from ..system import atomic_write

    out_dir = out_dir or os.path.abspath(os.getenv('OUTPUT_DIR', '.'))
    config_path = os.path.join(out_dir, CONFIG_FILENAME)

    if not os.path.exists(config_path):
        with open(config_path, 'w+') as f:
            f.write(CONFIG_HEADER)

    if not config:
        return {}
        atomic_write(config_path, CONFIG_HEADER)

    config_file = ConfigParser()
    config_file.optionxform = str
    config_file.read(config_path)

    with open(config_path, 'r') as old:
        atomic_write(f'{config_path}.bak', old.read())

    find_section = lambda key: [name for name, opts in CONFIG_DEFAULTS.items() if key in opts][0]

    with open(f'{config_path}.old', 'w+') as old:
        with open(config_path, 'r') as new:
            old.write(new.read())
    # Set up sections in empty config file
    for key, val in config.items():
        section = find_section(key)
        if section in config_file:
            existing_config = dict(config_file[section])
        else:
            existing_config = {}
        config_file[section] = {**existing_config, key: val}

    with open(config_path, 'w+') as f:
        for key, val in config.items():
            section = find_section(key)
            if section in config_file:
                existing_config = dict(config_file[section])
            else:
                existing_config = {}
    # always make sure there's a SECRET_KEY defined for Django
    existing_secret_key = None
    if 'SERVER_CONFIG' in config_file and 'SECRET_KEY' in config_file['SERVER_CONFIG']:
        existing_secret_key = config_file['SERVER_CONFIG']['SECRET_KEY']

            config_file[section] = {**existing_config, key: val}

        config_file.write(f)
    if (not existing_secret_key) or ('not a valid secret' in existing_secret_key):
        from django.utils.crypto import get_random_string
        chars = 'abcdefghijklmnopqrstuvwxyz0123456789-_+!.'
        random_secret_key = get_random_string(50, chars)
        if 'SERVER_CONFIG' in config_file:
            config_file['SERVER_CONFIG']['SECRET_KEY'] = random_secret_key
        else:
            config_file['SERVER_CONFIG'] = {'SECRET_KEY': random_secret_key}

    with open(config_path, 'w+') as new:
        config_file.write(new)

    try:
        # validate the config by attempting to re-parse it
        CONFIG = load_all_config()
        return {
            key.upper(): CONFIG.get(key.upper())
            for key in config.keys()
        }
    except:
        with open(f'{config_path}.old', 'r') as old:
            with open(config_path, 'w+') as new:
                new.write(old.read())
        # something went horribly wrong, revert to the previous version
        with open(f'{config_path}.bak', 'r') as old:
            atomic_write(config_path, old.read())

        if os.path.exists(f'{config_path}.old'):
            os.remove(f'{config_path}.old')
        if os.path.exists(f'{config_path}.bak'):
            os.remove(f'{config_path}.bak')

    return {}

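The new write_config_file flow above boils down to: back up the current file, merge the new keys into their sections, re-parse to validate, and restore the backup on any failure. A condensed sketch of that pattern using only the standard library (atomic_write is ArchiveBox's own helper and `validate` here stands in for load_all_config(); this is not the exact implementation):

```python
# Condensed sketch of the backup -> merge -> validate -> restore pattern used above.
from configparser import ConfigParser
from typing import Callable, Dict

def safe_update_ini(config_path: str, updates: Dict[str, Dict[str, str]], validate: Callable[[], None]) -> None:
    with open(config_path, 'r') as f:
        backup = f.read()                       # equivalent of writing ArchiveBox.conf.bak

    parser = ConfigParser()
    parser.optionxform = str                    # preserve key casing, as the diff does
    parser.read(config_path)
    for section, values in updates.items():
        existing = dict(parser[section]) if section in parser else {}
        parser[section] = {**existing, **values}

    with open(config_path, 'w+') as f:
        parser.write(f)

    try:
        validate()                              # re-parse everything, like load_all_config()
    except Exception:
        with open(config_path, 'w+') as f:      # roll back to the backed-up contents
            f.write(backup)
        raise
```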
@ -438,8 +501,10 @@ def bin_path(binary: Optional[str]) -> Optional[str]:
    return shutil.which(os.path.expanduser(binary)) or binary

def bin_hash(binary: Optional[str]) -> Optional[str]:
    if binary is None:
        return None
    abs_path = bin_path(binary)
    if abs_path is None:
    if abs_path is None or not Path(abs_path).exists():
        return None

    file_hash = md5()

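The rest of bin_hash() is cut off by this hunk; a typical chunked md5 read over the binary looks like the following (an assumption about what follows, not a quote of the actual code):

```python
# Sketch only -- the remainder of bin_hash() is not shown in the hunk above.
from hashlib import md5

def file_md5(abs_path: str, chunk_size: int = 65536) -> str:
    file_hash = md5()
    with open(abs_path, 'rb') as f:
        # read in fixed-size chunks so large binaries don't get loaded into memory at once
        for chunk in iter(lambda: f.read(chunk_size), b''):
            file_hash.update(chunk)
    return file_hash.hexdigest()
```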
@ -457,6 +522,7 @@ def find_chrome_binary() -> Optional[str]:
|
|||
'chromium-browser',
|
||||
'chromium',
|
||||
'/Applications/Chromium.app/Contents/MacOS/Chromium',
|
||||
'chrome',
|
||||
'google-chrome',
|
||||
'/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
|
||||
'google-chrome-stable',
|
||||
|
|
@ -483,6 +549,7 @@ def find_chrome_data_dir() -> Optional[str]:
|
|||
'~/.config/chromium',
|
||||
'~/Library/Application Support/Chromium',
|
||||
'~/AppData/Local/Chromium/User Data',
|
||||
'~/.config/chrome',
|
||||
'~/.config/google-chrome',
|
||||
'~/Library/Application Support/Google/Chrome',
|
||||
'~/AppData/Local/Google/Chrome/User Data',
|
||||
|
|
@ -615,6 +682,13 @@ def get_dependency_info(config: ConfigDict) -> ConfigValue:
|
|||
'enabled': config['USE_WGET'],
|
||||
'is_valid': bool(config['WGET_VERSION']),
|
||||
},
|
||||
'SINGLEFILE_BINARY': {
|
||||
'path': bin_path(config['SINGLEFILE_BINARY']),
|
||||
'version': config['SINGLEFILE_VERSION'],
|
||||
'hash': bin_hash(config['SINGLEFILE_BINARY']),
|
||||
'enabled': config['USE_SINGLEFILE'],
|
||||
'is_valid': bool(config['SINGLEFILE_VERSION']),
|
||||
},
|
||||
'GIT_BINARY': {
|
||||
'path': bin_path(config['GIT_BINARY']),
|
||||
'version': config['GIT_VERSION'],
|
||||
|
|
@ -664,6 +738,9 @@ def load_all_config():
|
|||
CONFIG = load_all_config()
|
||||
globals().update(CONFIG)
|
||||
|
||||
# Timezone set as UTC
|
||||
os.environ["TZ"] = 'UTC'
|
||||
|
||||
|
||||
############################## Importable Checkers #############################
|
||||
|
||||
|
|
@ -676,7 +753,7 @@ def check_system_config(config: ConfigDict=CONFIG) -> None:
        raise SystemExit(2)

    ### Check Python environment
    if float(config['PYTHON_VERSION']) < 3.6:
    if sys.version_info[:3] < (3, 6, 0):
        stderr(f'[X] Python version is not new enough: {config["PYTHON_VERSION"]} (>3.6 is required)', color='red')
        stderr(' See https://github.com/pirate/ArchiveBox/wiki/Troubleshooting#python for help upgrading your Python installation.')
        raise SystemExit(2)

@ -705,9 +782,16 @@ def check_system_config(config: ConfigDict=CONFIG) -> None:
|
|||
stderr(' CHROME_USER_DATA_DIR="{}"'.format(config['CHROME_USER_DATA_DIR'].split('/Default')[0]))
|
||||
raise SystemExit(2)
|
||||
|
||||
def dependency_additional_info(dependency: str) -> str:
|
||||
if dependency == "SINGLEFILE_BINARY":
|
||||
return "Please follow the installation instructions at https://github.com/gildas-lormeau/SingleFile/tree/master/cli and set SINGLEFILE_BINARY or set USE_SINGLEFILE=false"
|
||||
return ""
|
||||
|
||||
|
||||
def check_dependencies(config: ConfigDict=CONFIG, show_help: bool=True) -> None:
|
||||
invalid = [
|
||||
'{}: {} ({})'.format(name, info['path'] or 'unable to find binary', info['version'] or 'unable to detect version')
|
||||
'{}: {} ({}). {}'.format(name, info['path'] or 'unable to find binary', info['version'] or 'unable to detect version',
|
||||
dependency_additional_info(name))
|
||||
for name, info in config['DEPENDENCIES'].items()
|
||||
if info['enabled'] and not info['is_valid']
|
||||
]
|
||||
|
|
@ -726,7 +810,7 @@ def check_dependencies(config: ConfigDict=CONFIG, show_help: bool=True) -> None:
        stderr()
        stderr(f'[!] Warning: TIMEOUT is set too low! (currently set to TIMEOUT={config["TIMEOUT"]} seconds)', color='red')
        stderr(' You must allow *at least* 5 seconds for indexing and archive methods to run successfully.')
        stderr(' (Setting it to somewhere between 30 and 300 seconds is recommended)')
        stderr(' (Setting it to somewhere between 30 and 3000 seconds is recommended)')
        stderr()
        stderr(' If you want to make ArchiveBox run faster, disable specific archive methods instead:')
        stderr(' https://github.com/pirate/ArchiveBox/wiki/Configuration#archive-method-toggles')

@ -756,14 +840,14 @@ def check_data_folder(out_dir: Optional[str]=None, config: ConfigDict=CONFIG) ->
|
|||
|
||||
json_index_exists = os.path.exists(os.path.join(output_dir, JSON_INDEX_FILENAME))
|
||||
if not json_index_exists:
|
||||
stderr('[X] No archive main index was found in current directory.', color='red')
|
||||
stderr(f' {output_dir}')
|
||||
stderr('[X] No archivebox index found in the current directory.', color='red')
|
||||
stderr(f' {output_dir}', color='lightyellow')
|
||||
stderr()
|
||||
stderr(' Are you running archivebox in the right folder?')
|
||||
stderr(' {lightred}Hint{reset}: Are you running archivebox in the right folder?'.format(**config['ANSI']))
|
||||
stderr(' cd path/to/your/archive/folder')
|
||||
stderr(' archivebox [command]')
|
||||
stderr()
|
||||
stderr(' To create a new archive collection or import existing data in this folder, run:')
|
||||
stderr(' {lightred}Hint{reset}: To create a new archive collection or import existing data in this folder, run:'.format(**config['ANSI']))
|
||||
stderr(' archivebox init')
|
||||
raise SystemExit(2)
|
||||
|
||||
|
|
@ -785,9 +869,15 @@ def check_data_folder(out_dir: Optional[str]=None, config: ConfigDict=CONFIG) ->
|
|||
stderr(' archivebox init')
|
||||
raise SystemExit(3)
|
||||
|
||||
sources_dir = os.path.join(output_dir, SOURCES_DIR_NAME)
|
||||
if not os.path.exists(sources_dir):
|
||||
os.makedirs(sources_dir)
|
||||
|
||||
|
||||
|
||||
def setup_django(out_dir: str=None, check_db=False, config: ConfigDict=CONFIG) -> None:
|
||||
check_system_config()
|
||||
|
||||
output_dir = out_dir or config['OUTPUT_DIR']
|
||||
|
||||
assert isinstance(output_dir, str) and isinstance(config['PYTHON_DIR'], str)
|
||||
|
|
@ -806,4 +896,4 @@ def setup_django(out_dir: str=None, check_db=False, config: ConfigDict=CONFIG) -
|
|||
except KeyboardInterrupt:
|
||||
raise SystemExit(2)
|
||||
|
||||
check_system_config()
|
||||
os.umask(0o777 - int(OUTPUT_PERMISSIONS, base=8)) # noqa: F821
|
||||
|
|
|
|||
|
|
@ -12,9 +12,24 @@ class BaseConfig(TypedDict):
|
|||
pass
|
||||
|
||||
class ConfigDict(BaseConfig, total=False):
|
||||
"""
|
||||
# Regenerate by pasting this quine into `archivebox shell` 🥚
|
||||
from archivebox.config import ConfigDict, CONFIG_DEFAULTS
|
||||
print('class ConfigDict(BaseConfig, total=False):')
|
||||
print(' ' + '"'*3 + ConfigDict.__doc__ + '"'*3)
|
||||
for section, configs in CONFIG_DEFAULTS.items():
|
||||
for key, attrs in configs.items():
|
||||
Type, default = attrs['type'], attrs['default']
|
||||
if default is None:
|
||||
print(f' {key}: Optional[{Type.__name__}]')
|
||||
else:
|
||||
print(f' {key}: {Type.__name__}')
|
||||
print()
|
||||
"""
|
||||
IS_TTY: bool
|
||||
USE_COLOR: bool
|
||||
SHOW_PROGRESS: bool
|
||||
IN_DOCKER: bool
|
||||
|
||||
OUTPUT_DIR: str
|
||||
CONFIG_FILE: str
|
||||
|
|
@ -22,9 +37,16 @@ class ConfigDict(BaseConfig, total=False):
|
|||
TIMEOUT: int
|
||||
MEDIA_TIMEOUT: int
|
||||
OUTPUT_PERMISSIONS: str
|
||||
FOOTER_INFO: str
|
||||
URL_BLACKLIST: Optional[str]
|
||||
|
||||
SECRET_KEY: str
|
||||
ALLOWED_HOSTS: str
|
||||
DEBUG: bool
|
||||
PUBLIC_INDEX: bool
|
||||
PUBLIC_SNAPSHOTS: bool
|
||||
FOOTER_INFO: str
|
||||
ACTIVE_THEME: str
|
||||
|
||||
SAVE_TITLE: bool
|
||||
SAVE_FAVICON: bool
|
||||
SAVE_WGET: bool
|
||||
|
|
@ -32,14 +54,17 @@ class ConfigDict(BaseConfig, total=False):
|
|||
SAVE_PDF: bool
|
||||
SAVE_SCREENSHOT: bool
|
||||
SAVE_DOM: bool
|
||||
SAVE_SINGLEFILE: bool
|
||||
SAVE_WARC: bool
|
||||
SAVE_GIT: bool
|
||||
SAVE_MEDIA: bool
|
||||
SAVE_PLAYLISTS: bool
|
||||
SAVE_ARCHIVE_DOT_ORG: bool
|
||||
|
||||
RESOLUTION: str
|
||||
GIT_DOMAINS: str
|
||||
CHECK_SSL_VALIDITY: bool
|
||||
CURL_USER_AGENT: str
|
||||
WGET_USER_AGENT: str
|
||||
CHROME_USER_AGENT: str
|
||||
COOKIES_FILE: Optional[str]
|
||||
|
|
@ -52,12 +77,14 @@ class ConfigDict(BaseConfig, total=False):
|
|||
USE_GIT: bool
|
||||
USE_CHROME: bool
|
||||
USE_YOUTUBEDL: bool
|
||||
USE_SINGLEFILE: bool
|
||||
|
||||
CURL_BINARY: Optional[str]
|
||||
GIT_BINARY: Optional[str]
|
||||
WGET_BINARY: Optional[str]
|
||||
YOUTUBEDL_BINARY: Optional[str]
|
||||
CHROME_BINARY: Optional[str]
|
||||
SINGLEFILE_BINARY: Optional[str]
|
||||
|
||||
TERM_WIDTH: Callable[[], int]
|
||||
USER: str
|
||||
|
|
|
|||
|
|
@ -1,17 +1,202 @@
|
|||
__package__ = 'archivebox.core'
|
||||
|
||||
from io import StringIO
|
||||
from contextlib import redirect_stdout
|
||||
from pathlib import Path
|
||||
|
||||
from django.contrib import admin
|
||||
from django.urls import path
|
||||
from django.utils.html import format_html
|
||||
from django.utils.safestring import mark_safe
|
||||
from django.shortcuts import render, redirect
|
||||
from django.contrib.auth import get_user_model
|
||||
|
||||
from core.models import Snapshot
|
||||
from core.forms import AddLinkForm
|
||||
|
||||
from util import htmldecode, urldecode, ansi_to_html
|
||||
from logging_util import printable_filesize
|
||||
from main import add, remove
|
||||
from config import OUTPUT_DIR
|
||||
from extractors import archive_links
|
||||
|
||||
# TODO: https://stackoverflow.com/questions/40760880/add-custom-button-to-django-admin-panel
|
||||
|
||||
def update_snapshots(modeladmin, request, queryset):
|
||||
archive_links([
|
||||
snapshot.as_link()
|
||||
for snapshot in queryset
|
||||
], out_dir=OUTPUT_DIR)
|
||||
update_snapshots.short_description = "Archive"
|
||||
|
||||
def update_titles(modeladmin, request, queryset):
|
||||
archive_links([
|
||||
snapshot.as_link()
|
||||
for snapshot in queryset
|
||||
], overwrite=True, methods=('title',), out_dir=OUTPUT_DIR)
|
||||
update_titles.short_description = "Pull title"
|
||||
|
||||
def overwrite_snapshots(modeladmin, request, queryset):
|
||||
archive_links([
|
||||
snapshot.as_link()
|
||||
for snapshot in queryset
|
||||
], overwrite=True, out_dir=OUTPUT_DIR)
|
||||
overwrite_snapshots.short_description = "Re-archive (overwrite)"
|
||||
|
||||
def verify_snapshots(modeladmin, request, queryset):
|
||||
for snapshot in queryset:
|
||||
print(snapshot.timestamp, snapshot.url, snapshot.is_archived, snapshot.archive_size, len(snapshot.history))
|
||||
|
||||
verify_snapshots.short_description = "Check"
|
||||
|
||||
def delete_snapshots(modeladmin, request, queryset):
|
||||
remove(links=[snapshot.as_link() for snapshot in queryset], yes=True, delete=True, out_dir=OUTPUT_DIR)
|
||||
|
||||
delete_snapshots.short_description = "Delete"
|
||||
|
||||
|
||||
class SnapshotAdmin(admin.ModelAdmin):
|
||||
list_display = ('timestamp', 'short_url', 'title', 'is_archived', 'num_outputs', 'added', 'updated', 'url_hash')
|
||||
readonly_fields = ('num_outputs', 'is_archived', 'added', 'updated', 'bookmarked')
|
||||
fields = ('url', 'timestamp', 'title', 'tags', *readonly_fields)
|
||||
list_display = ('added', 'title_str', 'url_str', 'files', 'size')
|
||||
sort_fields = ('title_str', 'url_str', 'added')
|
||||
readonly_fields = ('id', 'url', 'timestamp', 'num_outputs', 'is_archived', 'url_hash', 'added', 'updated')
|
||||
search_fields = ('url', 'timestamp', 'title', 'tags')
|
||||
fields = ('title', 'tags', *readonly_fields)
|
||||
list_filter = ('added', 'updated', 'tags')
|
||||
ordering = ['-added']
|
||||
actions = [delete_snapshots, overwrite_snapshots, update_snapshots, update_titles, verify_snapshots]
|
||||
actions_template = 'admin/actions_as_select.html'
|
||||
|
||||
def short_url(self, obj):
|
||||
return obj.url[:64]
|
||||
def id_str(self, obj):
|
||||
return format_html(
|
||||
'<code style="font-size: 10px">{}</code>',
|
||||
obj.url_hash[:8],
|
||||
)
|
||||
|
||||
def updated(self, obj):
|
||||
return obj.isoformat()
|
||||
def title_str(self, obj):
|
||||
canon = obj.as_link().canonical_outputs()
|
||||
tags = ''.join(
|
||||
format_html('<span>{}</span>', tag.strip())
|
||||
for tag in obj.tags.split(',')
|
||||
) if obj.tags else ''
|
||||
return format_html(
|
||||
'<a href="/{}">'
|
||||
'<img src="/{}/{}" class="favicon" onerror="this.remove()">'
|
||||
'</a>'
|
||||
'<a href="/{}/{}">'
|
||||
'<b class="status-{}">{}</b>'
|
||||
'</a>',
|
||||
obj.archive_path,
|
||||
obj.archive_path, canon['favicon_path'],
|
||||
obj.archive_path, canon['wget_path'] or '',
|
||||
'fetched' if obj.latest_title or obj.title else 'pending',
|
||||
urldecode(htmldecode(obj.latest_title or obj.title or ''))[:128] or 'Pending...'
|
||||
) + mark_safe(f'<span class="tags">{tags}</span>')
|
||||
|
||||
def files(self, obj):
|
||||
link = obj.as_link()
|
||||
canon = link.canonical_outputs()
|
||||
out_dir = Path(link.link_dir)
|
||||
|
||||
link_tuple = lambda link, method: (link.archive_path, canon[method] or '', canon[method] and (out_dir / (canon[method] or 'notdone')).exists())
|
||||
|
||||
return format_html(
|
||||
'<span class="files-icons" style="font-size: 1.2em; opacity: 0.8">'
|
||||
'<a href="/{}/{}/" class="exists-{}" title="Wget clone">🌐 </a> '
|
||||
'<a href="/{}/{}" class="exists-{}" title="PDF">📄</a> '
|
||||
'<a href="/{}/{}" class="exists-{}" title="Screenshot">🖥 </a> '
|
||||
'<a href="/{}/{}" class="exists-{}" title="HTML dump">🅷 </a> '
|
||||
'<a href="/{}/{}/" class="exists-{}" title="WARC">🆆 </a> '
|
||||
'<a href="/{}/{}" class="exists-{}" title="SingleFile">🗜 </a>'
|
||||
'<a href="/{}/{}/" class="exists-{}" title="Media files">📼 </a> '
|
||||
'<a href="/{}/{}/" class="exists-{}" title="Git repos">📦 </a> '
|
||||
'<a href="{}" class="exists-{}" title="Archive.org snapshot">🏛 </a> '
|
||||
'</span>',
|
||||
*link_tuple(link, 'wget_path'),
|
||||
*link_tuple(link, 'pdf_path'),
|
||||
*link_tuple(link, 'screenshot_path'),
|
||||
*link_tuple(link, 'dom_path'),
|
||||
*link_tuple(link, 'warc_path')[:2], any((out_dir / canon['warc_path']).glob('*.warc.gz')),
|
||||
*link_tuple(link, 'singlefile_path'),
|
||||
*link_tuple(link, 'media_path')[:2], any((out_dir / canon['media_path']).glob('*')),
|
||||
*link_tuple(link, 'git_path')[:2], any((out_dir / canon['git_path']).glob('*')),
|
||||
canon['archive_org_path'], (out_dir / 'archive.org.txt').exists(),
|
||||
)
|
||||
|
||||
def size(self, obj):
|
||||
return format_html(
|
||||
'<a href="/{}" title="View all files">{}</a>',
|
||||
obj.archive_path,
|
||||
printable_filesize(obj.archive_size) if obj.archive_size else 'pending',
|
||||
)
|
||||
|
||||
def url_str(self, obj):
|
||||
return format_html(
|
||||
'<a href="{}">{}</a>',
|
||||
obj.url,
|
||||
obj.url.split('://www.', 1)[-1].split('://', 1)[-1][:64],
|
||||
)
|
||||
|
||||
id_str.short_description = 'ID'
|
||||
title_str.short_description = 'Title'
|
||||
url_str.short_description = 'Original URL'
|
||||
|
||||
id_str.admin_order_field = 'id'
|
||||
title_str.admin_order_field = 'title'
|
||||
url_str.admin_order_field = 'url'
|
||||
|
||||
|
||||
|
||||
class ArchiveBoxAdmin(admin.AdminSite):
|
||||
site_header = 'ArchiveBox'
|
||||
index_title = 'Links'
|
||||
site_title = 'Index'
|
||||
|
||||
def get_urls(self):
|
||||
return [
|
||||
path('core/snapshot/add/', self.add_view, name='Add'),
|
||||
] + super().get_urls()
|
||||
|
||||
def add_view(self, request):
|
||||
if not request.user.is_authenticated:
|
||||
return redirect(f'/admin/login/?next={request.path}')
|
||||
|
||||
request.current_app = self.name
|
||||
context = {
|
||||
**self.each_context(request),
|
||||
'title': 'Add URLs',
|
||||
}
|
||||
|
||||
if request.method == 'GET':
|
||||
context['form'] = AddLinkForm()
|
||||
|
||||
elif request.method == 'POST':
|
||||
form = AddLinkForm(request.POST)
|
||||
if form.is_valid():
|
||||
url = form.cleaned_data["url"]
|
||||
print(f'[+] Adding URL: {url}')
|
||||
depth = 0 if form.cleaned_data["depth"] == "0" else 1
|
||||
input_kwargs = {
|
||||
"urls": url,
|
||||
"depth": depth,
|
||||
"update_all": False,
|
||||
"out_dir": OUTPUT_DIR,
|
||||
}
|
||||
add_stdout = StringIO()
|
||||
with redirect_stdout(add_stdout):
|
||||
add(**input_kwargs)
|
||||
print(add_stdout.getvalue())
|
||||
|
||||
context.update({
|
||||
"stdout": ansi_to_html(add_stdout.getvalue().strip()),
|
||||
"form": AddLinkForm()
|
||||
})
|
||||
else:
|
||||
context["form"] = form
|
||||
|
||||
return render(template_name='add_links.html', request=request, context=context)
|
||||
|
||||
|
||||
admin.site = ArchiveBoxAdmin()
admin.site.register(get_user_model())
admin.site.register(Snapshot, SnapshotAdmin)
admin.site.disable_action('delete_selected')

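The add_view above captures the CLI output of add() by swapping stdout for a StringIO buffer before rendering it into the template. A self-contained sketch of that pattern (standard library only, independent of ArchiveBox):

```python
# Minimal demo of the stdout-capture pattern used by ArchiveBoxAdmin.add_view above.
from io import StringIO
from contextlib import redirect_stdout

def noisy_task() -> None:
    print('[+] pretend this is archivebox add output')

buffer = StringIO()
with redirect_stdout(buffer):
    noisy_task()              # anything printed here lands in `buffer` instead of the terminal

captured = buffer.getvalue()  # e.g. passed to the template as the `stdout` context variable
print(captured.strip())
```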
archivebox/core/forms.py (new file, 14 lines)

@ -0,0 +1,14 @@
__package__ = 'archivebox.core'

from django import forms

from ..util import URL_REGEX

CHOICES = (
    ('0', 'depth = 0 (archive just these URLs)'),
    ('1', 'depth = 1 (archive these URLs and all URLs one hop away)'),
)

class AddLinkForm(forms.Form):
    url = forms.RegexField(label="URLs (one per line)", regex=URL_REGEX, min_length='6', strip=True, widget=forms.Textarea, required=True)
    depth = forms.ChoiceField(label="Archive depth", choices=CHOICES, widget=forms.RadioSelect, initial='0')
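A quick sanity check of how this form validates input, using standard Django form handling (assumes Django settings are already loaded, e.g. inside `archivebox shell`; the URL value is illustrative only):

```python
# Example usage of AddLinkForm; the URL below is just a placeholder.
from core.forms import AddLinkForm

form = AddLinkForm(data={'url': 'https://example.com', 'depth': '0'})
if form.is_valid():
    url = form.cleaned_data['url']      # validated against URL_REGEX
    depth = form.cleaned_data['depth']  # '0' or '1'
    print(url, depth)
else:
    print(form.errors)
```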
18
archivebox/core/migrations/0002_auto_20200625_1521.py
Normal file
|
|
@ -0,0 +1,18 @@
|
|||
# Generated by Django 3.0.7 on 2020-06-25 15:21
|
||||
|
||||
from django.db import migrations, models
|
||||
|
||||
|
||||
class Migration(migrations.Migration):
|
||||
|
||||
dependencies = [
|
||||
('core', '0001_initial'),
|
||||
]
|
||||
|
||||
operations = [
|
||||
migrations.AlterField(
|
||||
model_name='snapshot',
|
||||
name='timestamp',
|
||||
field=models.CharField(default=None, max_length=32, null=True),
|
||||
),
|
||||
]
|
||||
38
archivebox/core/migrations/0003_auto_20200630_1034.py
Normal file
|
|
@ -0,0 +1,38 @@
|
|||
# Generated by Django 3.0.7 on 2020-06-30 10:34
|
||||
|
||||
from django.db import migrations, models
|
||||
|
||||
|
||||
class Migration(migrations.Migration):
|
||||
|
||||
dependencies = [
|
||||
('core', '0002_auto_20200625_1521'),
|
||||
]
|
||||
|
||||
operations = [
|
||||
migrations.AlterField(
|
||||
model_name='snapshot',
|
||||
name='added',
|
||||
field=models.DateTimeField(auto_now_add=True, db_index=True),
|
||||
),
|
||||
migrations.AlterField(
|
||||
model_name='snapshot',
|
||||
name='tags',
|
||||
field=models.CharField(db_index=True, default=None, max_length=256, null=True),
|
||||
),
|
||||
migrations.AlterField(
|
||||
model_name='snapshot',
|
||||
name='timestamp',
|
||||
field=models.CharField(db_index=True, default=None, max_length=32, null=True),
|
||||
),
|
||||
migrations.AlterField(
|
||||
model_name='snapshot',
|
||||
name='title',
|
||||
field=models.CharField(db_index=True, default=None, max_length=128, null=True),
|
||||
),
|
||||
migrations.AlterField(
|
||||
model_name='snapshot',
|
||||
name='updated',
|
||||
field=models.DateTimeField(db_index=True, default=None, null=True),
|
||||
),
|
||||
]
|
||||
19
archivebox/core/migrations/0004_auto_20200713_1552.py
Normal file
|
|
@ -0,0 +1,19 @@
|
|||
# Generated by Django 3.0.7 on 2020-07-13 15:52
|
||||
|
||||
from django.db import migrations, models
|
||||
|
||||
|
||||
class Migration(migrations.Migration):
|
||||
|
||||
dependencies = [
|
||||
('core', '0003_auto_20200630_1034'),
|
||||
]
|
||||
|
||||
operations = [
|
||||
migrations.AlterField(
|
||||
model_name='snapshot',
|
||||
name='timestamp',
|
||||
field=models.CharField(db_index=True, default=None, max_length=32, unique=True),
|
||||
preserve_default=False,
|
||||
),
|
||||
]
|
||||
28
archivebox/core/migrations/0005_auto_20200728_0326.py
Normal file
|
|
@ -0,0 +1,28 @@
|
|||
# Generated by Django 3.0.7 on 2020-07-28 03:26
|
||||
|
||||
from django.db import migrations, models
|
||||
|
||||
|
||||
class Migration(migrations.Migration):
|
||||
|
||||
dependencies = [
|
||||
('core', '0004_auto_20200713_1552'),
|
||||
]
|
||||
|
||||
operations = [
|
||||
migrations.AlterField(
|
||||
model_name='snapshot',
|
||||
name='tags',
|
||||
field=models.CharField(blank=True, db_index=True, max_length=256, null=True),
|
||||
),
|
||||
migrations.AlterField(
|
||||
model_name='snapshot',
|
||||
name='title',
|
||||
field=models.CharField(blank=True, db_index=True, max_length=128, null=True),
|
||||
),
|
||||
migrations.AlterField(
|
||||
model_name='snapshot',
|
||||
name='updated',
|
||||
field=models.DateTimeField(blank=True, db_index=True, null=True),
|
||||
),
|
||||
]
|
||||
|
|
@ -3,6 +3,7 @@ __package__ = 'archivebox.core'
|
|||
import uuid
|
||||
|
||||
from django.db import models
|
||||
from django.utils.functional import cached_property
|
||||
|
||||
from ..util import parse_date
|
||||
from ..index.schema import Link
|
||||
|
|
@ -12,22 +13,24 @@ class Snapshot(models.Model):
|
|||
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
|
||||
|
||||
url = models.URLField(unique=True)
|
||||
timestamp = models.CharField(unique=True, max_length=32, null=True, default=None)
|
||||
timestamp = models.CharField(max_length=32, unique=True, db_index=True)
|
||||
|
||||
title = models.CharField(max_length=128, null=True, default=None)
|
||||
tags = models.CharField(max_length=256, null=True, default=None)
|
||||
title = models.CharField(max_length=128, null=True, blank=True, db_index=True)
|
||||
tags = models.CharField(max_length=256, null=True, blank=True, db_index=True)
|
||||
|
||||
created = models.DateTimeField(auto_now_add=True)
|
||||
updated = models.DateTimeField(null=True, default=None)
|
||||
added = models.DateTimeField(auto_now_add=True, db_index=True)
|
||||
updated = models.DateTimeField(null=True, blank=True, db_index=True)
|
||||
# bookmarked = models.DateTimeField()
|
||||
|
||||
keys = ('url', 'timestamp', 'title', 'tags', 'updated')
|
||||
|
||||
|
||||
def __repr__(self) -> str:
|
||||
return f'[{self.timestamp}] {self.url[:64]} ({self.title[:64]})'
|
||||
title = self.title or '-'
|
||||
return f'[{self.timestamp}] {self.url[:64]} ({title[:64]})'
|
||||
|
||||
def __str__(self) -> str:
|
||||
return f'[{self.timestamp}] {self.url[:64]} ({self.title[:64]})'
|
||||
title = self.title or '-'
|
||||
return f'[{self.timestamp}] {self.url[:64]} ({title[:64]})'
|
||||
|
||||
@classmethod
|
||||
def from_json(cls, info: dict):
|
||||
|
|
@ -44,30 +47,52 @@ class Snapshot(models.Model):
|
|||
def as_link(self) -> Link:
|
||||
return Link.from_json(self.as_json())
|
||||
|
||||
@property
|
||||
@cached_property
|
||||
def bookmarked(self):
|
||||
return parse_date(self.timestamp)
|
||||
|
||||
@property
|
||||
@cached_property
|
||||
def is_archived(self):
|
||||
return self.as_link().is_archived
|
||||
|
||||
@property
|
||||
@cached_property
|
||||
def num_outputs(self):
|
||||
return self.as_link().num_outputs
|
||||
|
||||
@property
|
||||
@cached_property
|
||||
def url_hash(self):
|
||||
return self.as_link().url_hash
|
||||
|
||||
@property
|
||||
@cached_property
|
||||
def base_url(self):
|
||||
return self.as_link().base_url
|
||||
|
||||
@property
|
||||
@cached_property
|
||||
def link_dir(self):
|
||||
return self.as_link().link_dir
|
||||
|
||||
@cached_property
|
||||
def archive_path(self):
|
||||
return self.as_link().archive_path
|
||||
|
||||
@cached_property
|
||||
def archive_size(self):
|
||||
return self.as_link().archive_size
|
||||
|
||||
@cached_property
|
||||
def history(self):
|
||||
from ..index import load_link_details
|
||||
return load_link_details(self.as_link()).history
|
||||
|
||||
@cached_property
|
||||
def latest_title(self):
|
||||
if ('title' in self.history
|
||||
and self.history['title']
|
||||
and (self.history['title'][-1].status == 'succeeded')
|
||||
and self.history['title'][-1].output.strip()):
|
||||
return self.history['title'][-1].output.strip()
|
||||
return None
|
||||
|
||||
|
||||
class SnapshotResult(models.Model):
|
||||
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
|
||||
|
|
|
|||
|
|
@ -2,10 +2,7 @@ __package__ = 'archivebox.core'
|
|||
|
||||
import os
|
||||
import sys
|
||||
|
||||
SECRET_KEY = '---------------- not a valid secret key ! ----------------'
|
||||
DEBUG = os.getenv('DEBUG', 'False').lower() == 'true'
|
||||
ALLOWED_HOSTS = ['*']
|
||||
from django.utils.crypto import get_random_string
|
||||
|
||||
IS_PUBLIC = True # whether archive data requires logging in to view
|
||||
|
||||
|
|
@ -14,20 +11,29 @@ OUTPUT_DIR = os.path.abspath(os.getenv('OUTPUT_DIR', os.curdir))
|
|||
ARCHIVE_DIR = os.path.join(OUTPUT_DIR, 'archive')
|
||||
DATABASE_FILE = os.path.join(OUTPUT_DIR, 'index.sqlite3')
|
||||
|
||||
ACTIVE_THEME = 'default'
|
||||
|
||||
from ..config import ( # noqa: F401
|
||||
DEBUG,
|
||||
SECRET_KEY,
|
||||
ALLOWED_HOSTS,
|
||||
PYTHON_DIR,
|
||||
ACTIVE_THEME,
|
||||
SQL_INDEX_FILENAME,
|
||||
OUTPUT_DIR,
|
||||
)
|
||||
|
||||
ALLOWED_HOSTS = ALLOWED_HOSTS.split(',')
|
||||
IS_SHELL = 'shell' in sys.argv[:3] or 'shell_plus' in sys.argv[:3]
|
||||
|
||||
APPEND_SLASH = True
|
||||
SECRET_KEY = SECRET_KEY or get_random_string(50, 'abcdefghijklmnopqrstuvwxyz0123456789-_+!.')
|
||||
|
||||
INSTALLED_APPS = [
|
||||
'django.contrib.auth',
|
||||
'django.contrib.contenttypes',
|
||||
'django.contrib.sessions',
|
||||
# 'django.contrib.sites',
|
||||
'django.contrib.messages',
|
||||
'django.contrib.admin',
|
||||
'django.contrib.staticfiles',
|
||||
'django.contrib.admin',
|
||||
|
||||
'core',
|
||||
|
||||
|
|
@ -42,17 +48,17 @@ MIDDLEWARE = [
|
|||
'django.middleware.csrf.CsrfViewMiddleware',
|
||||
'django.contrib.auth.middleware.AuthenticationMiddleware',
|
||||
'django.contrib.messages.middleware.MessageMiddleware',
|
||||
# 'django.middleware.clickjacking.XFrameOptionsMiddleware',
|
||||
]
|
||||
|
||||
ROOT_URLCONF = 'core.urls'
|
||||
APPEND_SLASH = True
|
||||
TEMPLATES = [
|
||||
{
|
||||
'BACKEND': 'django.template.backends.django.DjangoTemplates',
|
||||
'DIRS': [
|
||||
os.path.join(REPO_DIR, 'themes', ACTIVE_THEME),
|
||||
os.path.join(REPO_DIR, 'themes', 'default'),
|
||||
os.path.join(REPO_DIR, 'themes'),
|
||||
os.path.join(PYTHON_DIR, 'themes', ACTIVE_THEME),
|
||||
os.path.join(PYTHON_DIR, 'themes', 'default'),
|
||||
os.path.join(PYTHON_DIR, 'themes'),
|
||||
],
|
||||
'APP_DIRS': True,
|
||||
'OPTIONS': {
|
||||
|
|
@ -71,7 +77,7 @@ WSGI_APPLICATION = 'core.wsgi.application'
|
|||
DATABASES = {
|
||||
'default': {
|
||||
'ENGINE': 'django.db.backends.sqlite3',
|
||||
'NAME': DATABASE_FILE,
|
||||
'NAME': os.path.join(OUTPUT_DIR, SQL_INDEX_FILENAME),
|
||||
}
|
||||
}
|
||||
|
||||
|
|
@ -106,25 +112,23 @@ SHELL_PLUS_PRINT_SQL = False
|
|||
IPYTHON_ARGUMENTS = ['--no-confirm-exit', '--no-banner']
|
||||
IPYTHON_KERNEL_DISPLAY_NAME = 'ArchiveBox Django Shell'
|
||||
if IS_SHELL:
|
||||
os.environ['PYTHONSTARTUP'] = os.path.join(REPO_DIR, 'core', 'welcome_message.py')
|
||||
os.environ['PYTHONSTARTUP'] = os.path.join(PYTHON_DIR, 'core', 'welcome_message.py')
|
||||
|
||||
|
||||
LANGUAGE_CODE = 'en-us'
|
||||
TIME_ZONE = 'UTC'
|
||||
USE_I18N = True
|
||||
USE_L10N = True
|
||||
USE_I18N = False
|
||||
USE_L10N = False
|
||||
USE_TZ = False
|
||||
|
||||
DATETIME_FORMAT = 'Y-m-d g:iA'
|
||||
SHORT_DATETIME_FORMAT = 'Y-m-d h:iA'
|
||||
|
||||
|
||||
EMAIL_BACKEND = 'django.core.mail.backends.console.EmailBackend'
|
||||
|
||||
STATIC_URL = '/static/'
|
||||
STATICFILES_DIRS = [
|
||||
os.path.join(REPO_DIR, 'themes', ACTIVE_THEME, 'static'),
|
||||
os.path.join(REPO_DIR, 'themes', 'default', 'static'),
|
||||
os.path.join(REPO_DIR, 'themes', 'static'),
|
||||
os.path.join(PYTHON_DIR, 'themes', ACTIVE_THEME, 'static'),
|
||||
os.path.join(PYTHON_DIR, 'themes', 'default', 'static'),
|
||||
]
|
||||
|
||||
SERVE_STATIC = True
|
||||
|
||||
|
||||
|
|
|
|||
|
|
@ -1,3 +1,3 @@
|
|||
from django.test import TestCase
|
||||
#from django.test import TestCase
|
||||
|
||||
# Create your tests here.
|
||||
|
|
|
|||
|
|
@ -3,28 +3,32 @@ from django.contrib import admin
|
|||
from django.urls import path, include
|
||||
from django.views import static
|
||||
from django.conf import settings
|
||||
from django.contrib.staticfiles import views
|
||||
from django.views.generic.base import RedirectView
|
||||
|
||||
from core.views import MainIndex, AddLinks, LinkDetails
|
||||
from core.views import MainIndex, OldIndex, LinkDetails
|
||||
|
||||
admin.site.site_header = 'ArchiveBox Admin'
|
||||
admin.site.index_title = 'Archive Administration'
|
||||
|
||||
# print('DEBUG', settings.DEBUG)
|
||||
|
||||
urlpatterns = [
|
||||
path('index.html', RedirectView.as_view(url='/')),
|
||||
path('index.json', static.serve, {'document_root': settings.OUTPUT_DIR, 'path': 'index.json'}),
|
||||
path('robots.txt', static.serve, {'document_root': settings.OUTPUT_DIR, 'path': 'robots.txt'}),
|
||||
path('favicon.ico', static.serve, {'document_root': settings.OUTPUT_DIR, 'path': 'favicon.ico'}),
|
||||
|
||||
path('docs/', RedirectView.as_view(url='https://github.com/pirate/ArchiveBox/wiki'), name='Docs'),
|
||||
|
||||
path('archive/', RedirectView.as_view(url='/')),
|
||||
path('archive/<path:path>', LinkDetails.as_view(), name='LinkAssets'),
|
||||
path('add/', AddLinks.as_view(), name='AddLinks'),
|
||||
path('add/', RedirectView.as_view(url='/admin/core/snapshot/add/')),
|
||||
|
||||
path('static/<path>', views.serve),
|
||||
path('accounts/login/', RedirectView.as_view(url='/admin/login/')),
|
||||
path('accounts/logout/', RedirectView.as_view(url='/admin/logout/')),
|
||||
|
||||
|
||||
path('accounts/', include('django.contrib.auth.urls')),
|
||||
path('admin/', admin.site.urls),
|
||||
|
||||
path('old.html', OldIndex.as_view(), name='OldHome'),
|
||||
path('index.html', RedirectView.as_view(url='/')),
|
||||
path('index.json', static.serve, {'document_root': settings.OUTPUT_DIR, 'path': 'index.json'}),
|
||||
path('', MainIndex.as_view(), name='Home'),
|
||||
]
|
||||
|
||||
|
||||
|
|
|
|||
|
|
@ -8,7 +8,13 @@ from django.views import View, static
|
|||
from core.models import Snapshot
|
||||
|
||||
from ..index import load_main_index, load_main_index_meta
|
||||
from ..config import OUTPUT_DIR, VERSION, FOOTER_INFO
|
||||
from ..config import (
|
||||
OUTPUT_DIR,
|
||||
VERSION,
|
||||
FOOTER_INFO,
|
||||
PUBLIC_INDEX,
|
||||
PUBLIC_SNAPSHOTS,
|
||||
)
|
||||
from ..util import base_url
|
||||
|
||||
|
||||
|
|
@ -16,36 +22,35 @@ class MainIndex(View):
|
|||
template = 'main_index.html'
|
||||
|
||||
def get(self, request):
|
||||
all_links = load_main_index(out_dir=OUTPUT_DIR)
|
||||
meta_info = load_main_index_meta(out_dir=OUTPUT_DIR)
|
||||
if request.user.is_authenticated:
|
||||
return redirect('/admin/core/snapshot/')
|
||||
|
||||
context = {
|
||||
'updated': meta_info['updated'],
|
||||
'num_links': meta_info['num_links'],
|
||||
'links': all_links,
|
||||
'VERSION': VERSION,
|
||||
'FOOTER_INFO': FOOTER_INFO,
|
||||
}
|
||||
if PUBLIC_INDEX:
|
||||
return redirect('OldHome')
|
||||
|
||||
return redirect(f'/admin/login/?next={request.path}')
|
||||
|
||||
return render(template_name=self.template, request=request, context=context)
|
||||
|
||||
|
||||
|
||||
class AddLinks(View):
|
||||
template = 'add_links.html'
|
||||
class OldIndex(View):
|
||||
template = 'main_index.html'
|
||||
|
||||
def get(self, request):
|
||||
context = {}
|
||||
if PUBLIC_INDEX or request.user.is_authenticated:
|
||||
all_links = load_main_index(out_dir=OUTPUT_DIR)
|
||||
meta_info = load_main_index_meta(out_dir=OUTPUT_DIR)
|
||||
|
||||
return render(template_name=self.template, request=request, context=context)
|
||||
context = {
|
||||
'updated': meta_info['updated'],
|
||||
'num_links': meta_info['num_links'],
|
||||
'links': all_links,
|
||||
'VERSION': VERSION,
|
||||
'FOOTER_INFO': FOOTER_INFO,
|
||||
}
|
||||
|
||||
return render(template_name=self.template, request=request, context=context)
|
||||
|
||||
def post(self, request):
|
||||
import_path = request.POST['url']
|
||||
|
||||
# TODO: add the links to the index here using archivebox.main.add
|
||||
print(f'Adding URL: {import_path}')
|
||||
|
||||
return render(template_name=self.template, request=request, context={})
|
||||
return redirect(f'/admin/login/?next={request.path}')
|
||||
|
||||
|
||||
class LinkDetails(View):
|
||||
|
|
@ -54,6 +59,9 @@ class LinkDetails(View):
|
|||
if '/' not in path:
|
||||
return redirect(f'{path}/index.html')
|
||||
|
||||
if not request.user.is_authenticated and not PUBLIC_SNAPSHOTS:
|
||||
return redirect(f'/admin/login/?next={request.path}')
|
||||
|
||||
try:
|
||||
slug, archivefile = path.split('/', 1)
|
||||
except (IndexError, ValueError):
|
||||
|
|
@ -64,7 +72,10 @@ class LinkDetails(View):
|
|||
# slug is a timestamp
|
||||
by_ts = {page.timestamp: page for page in all_pages}
|
||||
try:
|
||||
return static.serve(request, archivefile, by_ts[slug].link_dir, show_indexes=True)
|
||||
# print('SERVING STATICFILE', by_ts[slug].link_dir, request.path, path)
|
||||
response = static.serve(request, archivefile, document_root=by_ts[slug].link_dir, show_indexes=True)
|
||||
response["Link"] = f'<{by_ts[slug].url}>; rel="canonical"'
|
||||
return response
|
||||
except KeyError:
|
||||
pass
|
||||
|
||||
|
|
|
|||
|
|
@ -1,6 +1,5 @@
|
|||
from cli.logging import log_shell_welcome_msg
|
||||
from archivebox.logging_util import log_shell_welcome_msg
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
from main import *
|
||||
log_shell_welcome_msg()
|
||||
|
|
|
|||
|
|
@ -2,7 +2,7 @@ __package__ = 'archivebox.extractors'
|
|||
|
||||
import os
|
||||
|
||||
from typing import Optional
|
||||
from typing import Optional, List, Iterable
|
||||
from datetime import datetime
|
||||
|
||||
from ..index.schema import Link
|
||||
|
|
@ -12,7 +12,10 @@ from ..index import (
|
|||
patch_main_index,
|
||||
)
|
||||
from ..util import enforce_types
|
||||
from ..cli.logging import (
|
||||
from ..logging_util import (
|
||||
log_archiving_started,
|
||||
log_archiving_paused,
|
||||
log_archiving_finished,
|
||||
log_link_archiving_started,
|
||||
log_link_archiving_finished,
|
||||
log_archive_method_started,
|
||||
|
|
@ -22,6 +25,7 @@ from ..cli.logging import (
|
|||
from .title import should_save_title, save_title
|
||||
from .favicon import should_save_favicon, save_favicon
|
||||
from .wget import should_save_wget, save_wget
|
||||
from .singlefile import should_save_singlefile, save_singlefile
|
||||
from .pdf import should_save_pdf, save_pdf
|
||||
from .screenshot import should_save_screenshot, save_screenshot
|
||||
from .dom import should_save_dom, save_dom
|
||||
|
|
@ -29,23 +33,39 @@ from .git import should_save_git, save_git
|
|||
from .media import should_save_media, save_media
|
||||
from .archive_org import should_save_archive_dot_org, save_archive_dot_org
|
||||
|
||||
def get_default_archive_methods():
|
||||
return [
|
||||
('title', should_save_title, save_title),
|
||||
('favicon', should_save_favicon, save_favicon),
|
||||
('wget', should_save_wget, save_wget),
|
||||
('singlefile', should_save_singlefile, save_singlefile),
|
||||
('pdf', should_save_pdf, save_pdf),
|
||||
('screenshot', should_save_screenshot, save_screenshot),
|
||||
('dom', should_save_dom, save_dom),
|
||||
('git', should_save_git, save_git),
|
||||
('media', should_save_media, save_media),
|
||||
('archive_org', should_save_archive_dot_org, save_archive_dot_org),
|
||||
]
|
||||
|
||||
@enforce_types
|
||||
def archive_link(link: Link, overwrite: bool=False, out_dir: Optional[str]=None) -> Link:
|
||||
def ignore_methods(to_ignore: List[str]):
|
||||
ARCHIVE_METHODS = get_default_archive_methods()
|
||||
methods = filter(lambda x: x[0] not in to_ignore, ARCHIVE_METHODS)
|
||||
methods = map(lambda x: x[1], methods)
|
||||
return list(methods)
|
||||
|
||||
@enforce_types
|
||||
def archive_link(link: Link, overwrite: bool=False, methods: Optional[Iterable[str]]=None, out_dir: Optional[str]=None, skip_index: bool=False) -> Link:
|
||||
"""download the DOM, PDF, and a screenshot into a folder named after the link's timestamp"""
|
||||
|
||||
ARCHIVE_METHODS = (
|
||||
('title', should_save_title, save_title),
|
||||
('favicon', should_save_favicon, save_favicon),
|
||||
('wget', should_save_wget, save_wget),
|
||||
('pdf', should_save_pdf, save_pdf),
|
||||
('screenshot', should_save_screenshot, save_screenshot),
|
||||
('dom', should_save_dom, save_dom),
|
||||
('git', should_save_git, save_git),
|
||||
('media', should_save_media, save_media),
|
||||
('archive_org', should_save_archive_dot_org, save_archive_dot_org),
|
||||
)
|
||||
ARCHIVE_METHODS = get_default_archive_methods()
|
||||
|
||||
if methods is not None:
|
||||
ARCHIVE_METHODS = [
|
||||
method for method in ARCHIVE_METHODS
|
||||
if method[1] in methods
|
||||
]
|
||||
|
||||
out_dir = out_dir or link.link_dir
|
||||
try:
|
||||
is_new = not os.path.exists(out_dir)
|
||||
|
|
@ -53,6 +73,7 @@ def archive_link(link: Link, overwrite: bool=False, out_dir: Optional[str]=None)
|
|||
os.makedirs(out_dir)
|
||||
|
||||
link = load_link_details(link, out_dir=out_dir)
|
||||
write_link_details(link, out_dir=out_dir, skip_sql_index=skip_index)
|
||||
log_link_archiving_started(link, out_dir, is_new)
|
||||
link = link.overwrite(updated=datetime.now())
|
||||
stats = {'skipped': 0, 'succeeded': 0, 'failed': 0}
|
||||
|
|
@ -61,7 +82,7 @@ def archive_link(link: Link, overwrite: bool=False, out_dir: Optional[str]=None)
|
|||
try:
|
||||
if method_name not in link.history:
|
||||
link.history[method_name] = []
|
||||
|
||||
|
||||
if should_run(link, out_dir) or overwrite:
|
||||
log_archive_method_started(method_name)
|
||||
|
||||
|
|
@ -81,9 +102,17 @@ def archive_link(link: Link, overwrite: bool=False, out_dir: Optional[str]=None)
|
|||
|
||||
# print(' ', stats)
|
||||
|
||||
write_link_details(link, out_dir=link.link_dir)
|
||||
patch_main_index(link)
|
||||
|
||||
try:
|
||||
latest_title = link.history['title'][-1].output.strip()
|
||||
if latest_title and len(latest_title) >= len(link.title or ''):
|
||||
link = link.overwrite(title=latest_title)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
write_link_details(link, out_dir=out_dir, skip_sql_index=skip_index)
|
||||
if not skip_index:
|
||||
patch_main_index(link)
|
||||
|
||||
# # If any changes were made, update the main links index json and html
|
||||
# was_changed = stats['succeeded'] or stats['failed']
|
||||
# if was_changed:
|
||||
|
|
@ -103,3 +132,25 @@ def archive_link(link: Link, overwrite: bool=False, out_dir: Optional[str]=None)
        raise

    return link


@enforce_types
def archive_links(links: List[Link], overwrite: bool=False, methods: Optional[Iterable[str]]=None, out_dir: Optional[str]=None) -> List[Link]:
    if not links:
        return []

    log_archiving_started(len(links))
    idx: int = 0
    link: Link = links[0]
    try:
        for idx, link in enumerate(links):
            archive_link(link, overwrite=overwrite, methods=methods, out_dir=link.link_dir)
    except KeyboardInterrupt:
        log_archiving_paused(len(links), idx, link.timestamp)
        raise SystemExit(0)
    except BaseException:
        print()
        raise

    log_archiving_finished(len(links))
    return links

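A sketch of how the new batch entry point is meant to be driven, mirroring the admin actions earlier in this diff; the import paths assume the package is importable as `archivebox`, and the methods being skipped are just an example:

```python
# Illustrative only: drive the new batch API the same way the admin actions above do.
from archivebox.extractors import archive_links, ignore_methods
from archivebox.config import OUTPUT_DIR

links = [...]  # a list of Link objects, e.g. [snapshot.as_link() for snapshot in queryset]

# Re-archive everything except the slow media/git extractors:
archive_links(
    links,
    overwrite=True,
    methods=ignore_methods(['media', 'git']),  # ignore_methods() returns the remaining should_save_* checks
    out_dir=OUTPUT_DIR,
)
```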
|
|
|
|||
|
|
@ -6,20 +6,20 @@ from typing import Optional, List, Dict, Tuple
|
|||
from collections import defaultdict
|
||||
|
||||
from ..index.schema import Link, ArchiveResult, ArchiveOutput, ArchiveError
|
||||
from ..system import run, PIPE, DEVNULL, chmod_file
|
||||
from ..system import run, chmod_file
|
||||
from ..util import (
|
||||
enforce_types,
|
||||
is_static_file,
|
||||
)
|
||||
from ..config import (
|
||||
VERSION,
|
||||
TIMEOUT,
|
||||
CHECK_SSL_VALIDITY,
|
||||
SAVE_ARCHIVE_DOT_ORG,
|
||||
CURL_BINARY,
|
||||
CURL_VERSION,
|
||||
CHECK_SSL_VALIDITY
|
||||
CURL_USER_AGENT,
|
||||
)
|
||||
from ..cli.logging import TimedProgress
|
||||
from ..logging_util import TimedProgress
|
||||
|
||||
|
||||
|
||||
|
|
@ -45,17 +45,19 @@ def save_archive_dot_org(link: Link, out_dir: Optional[str]=None, timeout: int=T
|
|||
submit_url = 'https://web.archive.org/save/{}'.format(link.url)
|
||||
cmd = [
|
||||
CURL_BINARY,
|
||||
'--silent',
|
||||
'--location',
|
||||
'--head',
|
||||
'--user-agent', 'ArchiveBox/{} (+https://github.com/pirate/ArchiveBox/)'.format(VERSION), # be nice to the Archive.org people and show them where all this ArchiveBox traffic is coming from
|
||||
'--compressed',
|
||||
'--max-time', str(timeout),
|
||||
*(['--user-agent', '{}'.format(CURL_USER_AGENT)] if CURL_USER_AGENT else []),
|
||||
*([] if CHECK_SSL_VALIDITY else ['--insecure']),
|
||||
submit_url,
|
||||
]
|
||||
status = 'succeeded'
|
||||
timer = TimedProgress(timeout, prefix=' ')
|
||||
try:
|
||||
result = run(cmd, stdout=PIPE, stderr=DEVNULL, cwd=out_dir, timeout=timeout)
|
||||
result = run(cmd, cwd=out_dir, timeout=timeout)
|
||||
content_location, errors = parse_archive_dot_org_response(result.stdout)
|
||||
if content_location:
|
||||
archive_org_url = 'https://web.archive.org{}'.format(content_location[0])
|
||||
|
|
@ -105,7 +107,7 @@ def parse_archive_dot_org_response(response: bytes) -> Tuple[List[str], List[str
|
|||
headers[name.lower().strip()].append(val.strip())
|
||||
|
||||
# Get successful archive url in "content-location" header or any errors
|
||||
content_location = headers['content-location']
|
||||
content_location = headers.get('content-location', headers['location'])
|
||||
errors = headers['x-archive-wayback-runtime-error']
|
||||
return content_location, errors
|
||||
|
||||
|
|
|
|||
|
|
@ -5,7 +5,7 @@ import os
|
|||
from typing import Optional
|
||||
|
||||
from ..index.schema import Link, ArchiveResult, ArchiveOutput, ArchiveError
|
||||
from ..system import run, PIPE, chmod_file
|
||||
from ..system import run, chmod_file, atomic_write
|
||||
from ..util import (
|
||||
enforce_types,
|
||||
is_static_file,
|
||||
|
|
@ -16,7 +16,7 @@ from ..config import (
|
|||
SAVE_DOM,
|
||||
CHROME_VERSION,
|
||||
)
|
||||
from ..cli.logging import TimedProgress
|
||||
from ..logging_util import TimedProgress
|
||||
|
||||
|
||||
|
||||
|
|
@ -46,8 +46,8 @@ def save_dom(link: Link, out_dir: Optional[str]=None, timeout: int=TIMEOUT) -> A
|
|||
status = 'succeeded'
|
||||
timer = TimedProgress(timeout, prefix=' ')
|
||||
try:
|
||||
with open(output_path, 'w+') as f:
|
||||
result = run(cmd, stdout=f, stderr=PIPE, cwd=out_dir, timeout=timeout)
|
||||
result = run(cmd, cwd=out_dir, timeout=timeout)
|
||||
atomic_write(output_path, result.stdout)
|
||||
|
||||
if result.returncode:
|
||||
hints = result.stderr.decode()
|
||||
|
|
|
|||
|
|
@ -5,7 +5,7 @@ import os
|
|||
from typing import Optional
|
||||
|
||||
from ..index.schema import Link, ArchiveResult, ArchiveOutput
|
||||
from ..system import chmod_file, run, PIPE
|
||||
from ..system import chmod_file, run
|
||||
from ..util import enforce_types, domain
|
||||
from ..config import (
|
||||
TIMEOUT,
|
||||
|
|
@ -13,8 +13,9 @@ from ..config import (
|
|||
CURL_BINARY,
|
||||
CURL_VERSION,
|
||||
CHECK_SSL_VALIDITY,
|
||||
CURL_USER_AGENT,
|
||||
)
|
||||
from ..cli.logging import TimedProgress
|
||||
from ..logging_util import TimedProgress
|
||||
|
||||
|
||||
@enforce_types
|
||||
|
|
@ -33,17 +34,21 @@ def save_favicon(link: Link, out_dir: Optional[str]=None, timeout: int=TIMEOUT)
|
|||
output: ArchiveOutput = 'favicon.ico'
|
||||
cmd = [
|
||||
CURL_BINARY,
|
||||
'--silent',
|
||||
'--max-time', str(timeout),
|
||||
'--location',
|
||||
'--compressed',
|
||||
'--output', str(output),
|
||||
*(['--user-agent', '{}'.format(CURL_USER_AGENT)] if CURL_USER_AGENT else []),
|
||||
*([] if CHECK_SSL_VALIDITY else ['--insecure']),
|
||||
'https://www.google.com/s2/favicons?domain={}'.format(domain(link.url)),
|
||||
]
|
||||
status = 'succeeded'
|
||||
status = 'pending'
|
||||
timer = TimedProgress(timeout, prefix=' ')
|
||||
try:
|
||||
run(cmd, stdout=PIPE, stderr=PIPE, cwd=out_dir, timeout=timeout)
|
||||
run(cmd, cwd=out_dir, timeout=timeout)
|
||||
chmod_file(output, cwd=out_dir)
|
||||
status = 'succeeded'
|
||||
except Exception as err:
|
||||
status = 'failed'
|
||||
output = err
|
||||
|
|
|
|||
|
|
@ -5,7 +5,7 @@ import os
|
|||
from typing import Optional
|
||||
|
||||
from ..index.schema import Link, ArchiveResult, ArchiveOutput, ArchiveError
|
||||
from ..system import run, PIPE, chmod_file
|
||||
from ..system import run, chmod_file
|
||||
from ..util import (
|
||||
enforce_types,
|
||||
is_static_file,
|
||||
|
|
@ -22,7 +22,7 @@ from ..config import (
|
|||
GIT_DOMAINS,
|
||||
CHECK_SSL_VALIDITY
|
||||
)
|
||||
from ..cli.logging import TimedProgress
|
||||
from ..logging_util import TimedProgress
|
||||
|
||||
|
||||
|
||||
|
|
@ -56,7 +56,6 @@ def save_git(link: Link, out_dir: Optional[str]=None, timeout: int=TIMEOUT) -> A
|
|||
cmd = [
|
||||
GIT_BINARY,
|
||||
'clone',
|
||||
'--mirror',
|
||||
'--recursive',
|
||||
*([] if CHECK_SSL_VALIDITY else ['-c', 'http.sslVerify=false']),
|
||||
without_query(without_fragment(link.url)),
|
||||
|
|
@ -64,8 +63,7 @@ def save_git(link: Link, out_dir: Optional[str]=None, timeout: int=TIMEOUT) -> A
|
|||
status = 'succeeded'
|
||||
timer = TimedProgress(timeout, prefix=' ')
|
||||
try:
|
||||
result = run(cmd, stdout=PIPE, stderr=PIPE, cwd=output_path, timeout=timeout + 1)
|
||||
|
||||
result = run(cmd, cwd=output_path, timeout=timeout + 1)
|
||||
if result.returncode == 128:
|
||||
# ignore failed re-download when the folder already exists
|
||||
pass
|
||||
|
|
|
|||
|
|
@ -5,7 +5,7 @@ import os
|
|||
from typing import Optional
|
||||
|
||||
from ..index.schema import Link, ArchiveResult, ArchiveOutput, ArchiveError
|
||||
from ..system import run, PIPE, chmod_file
|
||||
from ..system import run, chmod_file
|
||||
from ..util import (
|
||||
enforce_types,
|
||||
is_static_file,
|
||||
|
|
@ -13,11 +13,12 @@ from ..util import (
|
|||
from ..config import (
|
||||
MEDIA_TIMEOUT,
|
||||
SAVE_MEDIA,
|
||||
SAVE_PLAYLISTS,
|
||||
YOUTUBEDL_BINARY,
|
||||
YOUTUBEDL_VERSION,
|
||||
CHECK_SSL_VALIDITY
|
||||
)
|
||||
from ..cli.logging import TimedProgress
|
||||
from ..logging_util import TimedProgress
|
||||
|
||||
|
||||
@enforce_types
|
||||
|
|
@ -45,7 +46,6 @@ def save_media(link: Link, out_dir: Optional[str]=None, timeout: int=MEDIA_TIMEO
|
|||
'--write-description',
|
||||
'--write-info-json',
|
||||
'--write-annotations',
|
||||
'--yes-playlist',
|
||||
'--write-thumbnail',
|
||||
'--no-call-home',
|
||||
'--no-check-certificate',
|
||||
|
|
@ -59,13 +59,14 @@ def save_media(link: Link, out_dir: Optional[str]=None, timeout: int=MEDIA_TIMEO
|
|||
'--audio-quality', '320K',
|
||||
'--embed-thumbnail',
|
||||
'--add-metadata',
|
||||
*(['--yes-playlist'] if SAVE_PLAYLISTS else []),
|
||||
*([] if CHECK_SSL_VALIDITY else ['--no-check-certificate']),
|
||||
link.url,
|
||||
]
|
||||
status = 'succeeded'
|
||||
timer = TimedProgress(timeout, prefix=' ')
|
||||
try:
|
||||
result = run(cmd, stdout=PIPE, stderr=PIPE, cwd=output_path, timeout=timeout + 1)
|
||||
result = run(cmd, cwd=output_path, timeout=timeout + 1)
|
||||
chmod_file(output, cwd=out_dir)
|
||||
if result.returncode:
|
||||
if (b'ERROR: Unsupported URL' in result.stderr
|
||||
|
|
|
|||
|
|
@ -5,7 +5,7 @@ import os
|
|||
from typing import Optional
|
||||
|
||||
from ..index.schema import Link, ArchiveResult, ArchiveOutput, ArchiveError
|
||||
from ..system import run, PIPE, chmod_file
|
||||
from ..system import run, chmod_file
|
||||
from ..util import (
|
||||
enforce_types,
|
||||
is_static_file,
|
||||
|
|
@ -16,7 +16,7 @@ from ..config import (
|
|||
SAVE_PDF,
|
||||
CHROME_VERSION,
|
||||
)
|
||||
from ..cli.logging import TimedProgress
|
||||
from ..logging_util import TimedProgress
|
||||
|
||||
|
||||
@enforce_types
|
||||
|
|
@ -45,7 +45,7 @@ def save_pdf(link: Link, out_dir: Optional[str]=None, timeout: int=TIMEOUT) -> A
|
|||
status = 'succeeded'
|
||||
timer = TimedProgress(timeout, prefix=' ')
|
||||
try:
|
||||
result = run(cmd, stdout=PIPE, stderr=PIPE, cwd=out_dir, timeout=timeout)
|
||||
result = run(cmd, cwd=out_dir, timeout=timeout)
|
||||
|
||||
if result.returncode:
|
||||
hints = (result.stderr or result.stdout).decode()
|
||||
|
|
@ -58,6 +58,7 @@ def save_pdf(link: Link, out_dir: Optional[str]=None, timeout: int=TIMEOUT) -> A
|
|||
finally:
|
||||
timer.end()
|
||||
|
||||
|
||||
return ArchiveResult(
|
||||
cmd=cmd,
|
||||
pwd=out_dir,
|
||||
|
|
|
|||
|
|
@ -5,7 +5,7 @@ import os
|
|||
from typing import Optional
|
||||
|
||||
from ..index.schema import Link, ArchiveResult, ArchiveOutput, ArchiveError
|
||||
from ..system import run, PIPE, chmod_file
|
||||
from ..system import run, chmod_file
|
||||
from ..util import (
|
||||
enforce_types,
|
||||
is_static_file,
|
||||
|
|
@ -16,7 +16,7 @@ from ..config import (
|
|||
SAVE_SCREENSHOT,
|
||||
CHROME_VERSION,
|
||||
)
|
||||
from ..cli.logging import TimedProgress
|
||||
from ..logging_util import TimedProgress
|
||||
|
||||
|
||||
|
||||
|
|
@ -45,7 +45,7 @@ def save_screenshot(link: Link, out_dir: Optional[str]=None, timeout: int=TIMEOU
|
|||
status = 'succeeded'
|
||||
timer = TimedProgress(timeout, prefix=' ')
|
||||
try:
|
||||
result = run(cmd, stdout=PIPE, stderr=PIPE, cwd=out_dir, timeout=timeout)
|
||||
result = run(cmd, cwd=out_dir, timeout=timeout)
|
||||
|
||||
if result.returncode:
|
||||
hints = (result.stderr or result.stdout).decode()
|
||||
|
|
|
|||
87
archivebox/extractors/singlefile.py
Normal file

@ -0,0 +1,87 @@
__package__ = 'archivebox.extractors'

from pathlib import Path

from typing import Optional
import json

from ..index.schema import Link, ArchiveResult, ArchiveError
from ..system import run, chmod_file
from ..util import (
    enforce_types,
    is_static_file,
    chrome_args,
)
from ..config import (
    TIMEOUT,
    SAVE_SINGLEFILE,
    SINGLEFILE_BINARY,
    SINGLEFILE_VERSION,
    CHROME_BINARY,
)
from ..logging_util import TimedProgress


@enforce_types
def should_save_singlefile(link: Link, out_dir: Optional[str]=None) -> bool:
    out_dir = out_dir or link.link_dir
    if is_static_file(link.url):
        return False

    output = Path(out_dir or link.link_dir) / 'singlefile.html'
    return SAVE_SINGLEFILE and (not output.exists())


@enforce_types
def save_singlefile(link: Link, out_dir: Optional[str]=None, timeout: int=TIMEOUT) -> ArchiveResult:
    """download full site using single-file"""

    out_dir = out_dir or link.link_dir
    output = str(Path(out_dir).absolute() / "singlefile.html")

    browser_args = chrome_args(TIMEOUT=0)

    # SingleFile CLI Docs: https://github.com/gildas-lormeau/SingleFile/tree/master/cli
    cmd = [
        SINGLEFILE_BINARY,
        '--browser-executable-path={}'.format(CHROME_BINARY),
        '--browser-args="{}"'.format(json.dumps(browser_args[1:])),
        link.url,
        output
    ]

    status = 'succeeded'
    timer = TimedProgress(timeout, prefix=' ')
    try:
        result = run(cmd, cwd=out_dir, timeout=timeout)

        # parse out number of files downloaded from last line of stderr:
        # "Downloaded: 76 files, 4.0M in 1.6s (2.52 MB/s)"
        output_tail = [
            line.strip()
            for line in (result.stdout + result.stderr).decode().rsplit('\n', 3)[-3:]
            if line.strip()
        ]
        hints = (
            'Got single-file response code: {}.'.format(result.returncode),
            *output_tail,
        )

        # Check for common failure cases
        if (result.returncode > 0):
            raise ArchiveError('SingleFile was not able to archive the page', hints)
        chmod_file(output)
    except Exception as err:
        status = 'failed'
        output = err
    finally:
        timer.end()

    return ArchiveResult(
        cmd=cmd,
        pwd=out_dir,
        cmd_version=SINGLEFILE_VERSION,
        output=output,
        status=status,
        **timer.stats,
    )
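The new single-file extractor passes Chrome flags to the `single-file` CLI as one JSON-encoded `--browser-args` value, slicing off the first element of `chrome_args()` because that element is the browser binary itself. A small illustration of that encoding step with made-up flag values:

```python
import json

# hypothetical output of chrome_args(TIMEOUT=0); the real list depends on config
browser_args = ['/usr/bin/chromium-browser', '--headless', '--window-size=1440,2000']

# drop the binary path (element 0) and pass the rest as one JSON-encoded argument
flag = '--browser-args="{}"'.format(json.dumps(browser_args[1:]))
print(flag)   # --browser-args="["--headless", "--window-size=1440,2000"]"
```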
@ -12,11 +12,14 @@ from ..util import (
)
from ..config import (
    TIMEOUT,
    CHECK_SSL_VALIDITY,
    SAVE_TITLE,
    CURL_BINARY,
    CURL_VERSION,
    CURL_USER_AGENT,
    setup_django,
)
from ..cli.logging import TimedProgress
from ..logging_util import TimedProgress


HTML_TITLE_REGEX = re.compile(

@ -41,13 +44,19 @@ def should_save_title(link: Link, out_dir: Optional[str]=None) -> bool:
def save_title(link: Link, out_dir: Optional[str]=None, timeout: int=TIMEOUT) -> ArchiveResult:
    """try to guess the page's title from its content"""

    setup_django(out_dir=out_dir)
    from core.models import Snapshot

    output: ArchiveOutput = None
    cmd = [
        CURL_BINARY,
        '--silent',
        '--max-time', str(timeout),
        '--location',
        '--compressed',
        *(['--user-agent', '{}'.format(CURL_USER_AGENT)] if CURL_USER_AGENT else []),
        *([] if CHECK_SSL_VALIDITY else ['--insecure']),
        link.url,
        '|',
        'grep',
        '<title',
    ]
    status = 'succeeded'
    timer = TimedProgress(timeout, prefix=' ')

@ -55,7 +64,10 @@ def save_title(link: Link, out_dir: Optional[str]=None, timeout: int=TIMEOUT) ->
        html = download_url(link.url, timeout=timeout)
        match = re.search(HTML_TITLE_REGEX, html)
        output = htmldecode(match.group(1).strip()) if match else None
        if not output:
        if output:
            if not link.title or len(output) >= len(link.title):
                Snapshot.objects.filter(url=link.url, timestamp=link.timestamp).update(title=output)
        else:
            raise ArchiveError('Unable to detect page title')
    except Exception as err:
        status = 'failed'
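The reworked `save_title` only writes a scraped title back to the Snapshot row when one was found, and only if it is at least as long as the title already stored. A standalone sketch of that decision rule (the regex here is a simplified stand-in for `HTML_TITLE_REGEX`, and the Django update call is left out):

```python
import re
from html import unescape

TITLE_RE = re.compile(r'<title[^>]*>([^<]+)</title>', re.IGNORECASE)  # simplified stand-in

def pick_title(html: str, existing_title: str = '') -> str:
    """Return the title that should end up in the index, mirroring the
    keep-the-longer-one rule from the diff above (sketch, not the real function)."""
    match = TITLE_RE.search(html)
    scraped = unescape(match.group(1).strip()) if match else None
    if scraped and (not existing_title or len(scraped) >= len(existing_title)):
        return scraped
    return existing_title

print(pick_title('<html><title>Example Domain</title></html>', 'Example'))  # Example Domain
```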
|||
|
|
@ -7,7 +7,7 @@ from typing import Optional
|
|||
from datetime import datetime
|
||||
|
||||
from ..index.schema import Link, ArchiveResult, ArchiveOutput, ArchiveError
|
||||
from ..system import run, PIPE
|
||||
from ..system import run, chmod_file
|
||||
from ..util import (
|
||||
enforce_types,
|
||||
is_static_file,
|
||||
|
|
@ -24,13 +24,14 @@ from ..config import (
|
|||
SAVE_WARC,
|
||||
WGET_BINARY,
|
||||
WGET_VERSION,
|
||||
RESTRICT_FILE_NAMES,
|
||||
CHECK_SSL_VALIDITY,
|
||||
SAVE_WGET_REQUISITES,
|
||||
WGET_AUTO_COMPRESSION,
|
||||
WGET_USER_AGENT,
|
||||
COOKIES_FILE,
|
||||
)
|
||||
from ..cli.logging import TimedProgress
|
||||
from ..logging_util import TimedProgress
|
||||
|
||||
|
||||
@enforce_types
|
||||
|
|
@ -66,21 +67,22 @@ def save_wget(link: Link, out_dir: Optional[str]=None, timeout: int=TIMEOUT) ->
|
|||
'--span-hosts',
|
||||
'--no-parent',
|
||||
'-e', 'robots=off',
|
||||
'--restrict-file-names=windows',
|
||||
'--timeout={}'.format(timeout),
|
||||
*([] if SAVE_WARC else ['--timestamping']),
|
||||
*(['--restrict-file-names={}'.format(RESTRICT_FILE_NAMES)] if RESTRICT_FILE_NAMES else []),
|
||||
*(['--warc-file={}'.format(warc_path)] if SAVE_WARC else []),
|
||||
*(['--page-requisites'] if SAVE_WGET_REQUISITES else []),
|
||||
*(['--user-agent={}'.format(WGET_USER_AGENT)] if WGET_USER_AGENT else []),
|
||||
*(['--load-cookies', COOKIES_FILE] if COOKIES_FILE else []),
|
||||
*(['--compression=auto'] if WGET_AUTO_COMPRESSION else []),
|
||||
*([] if SAVE_WARC else ['--timestamping']),
|
||||
*([] if CHECK_SSL_VALIDITY else ['--no-check-certificate', '--no-hsts']),
|
||||
link.url,
|
||||
]
|
||||
|
||||
status = 'succeeded'
|
||||
timer = TimedProgress(timeout, prefix=' ')
|
||||
try:
|
||||
result = run(cmd, stdout=PIPE, stderr=PIPE, cwd=out_dir, timeout=timeout)
|
||||
result = run(cmd, cwd=out_dir, timeout=timeout)
|
||||
output = wget_output_path(link)
|
||||
|
||||
# parse out number of files downloaded from last line of stderr:
|
||||
|
|
@ -95,22 +97,21 @@ def save_wget(link: Link, out_dir: Optional[str]=None, timeout: int=TIMEOUT) ->
|
|||
if 'Downloaded:' in output_tail[-1]
|
||||
else 0
|
||||
)
|
||||
hints = (
|
||||
'Got wget response code: {}.'.format(result.returncode),
|
||||
*output_tail,
|
||||
)
|
||||
|
||||
# Check for common failure cases
|
||||
if result.returncode > 0 and files_downloaded < 1:
|
||||
hints = (
|
||||
'Got wget response code: {}.'.format(result.returncode),
|
||||
*output_tail,
|
||||
)
|
||||
if (result.returncode > 0 and files_downloaded < 1) or output is None:
|
||||
if b'403: Forbidden' in result.stderr:
|
||||
raise ArchiveError('403 Forbidden (try changing WGET_USER_AGENT)', hints)
|
||||
if b'404: Not Found' in result.stderr:
|
||||
raise ArchiveError('404 Not Found', hints)
|
||||
if b'ERROR 500: Internal Server Error' in result.stderr:
|
||||
raise ArchiveError('500 Internal Server Error', hints)
|
||||
raise ArchiveError('Got an error from the server', hints)
|
||||
|
||||
# chmod_file(output, cwd=out_dir)
|
||||
raise ArchiveError('Wget failed or got an error from the server', hints)
|
||||
chmod_file(output, cwd=out_dir)
|
||||
except Exception as err:
|
||||
status = 'failed'
|
||||
output = err
|
||||
|
|
@ -134,7 +135,6 @@ def wget_output_path(link: Link) -> Optional[str]:

    See docs on wget --adjust-extension (-E)
    """

    if is_static_file(link.url):
        return without_scheme(without_fragment(link.url))


@ -172,10 +172,9 @@ def wget_output_path(link: Link) -> Optional[str]:
    full_path = without_fragment(without_query(path(link.url))).strip('/')
    search_dir = os.path.join(
        link.link_dir,
        domain(link.url),
        domain(link.url).replace(":", "+"),
        urldecode(full_path),
    )

    for _ in range(4):
        if os.path.exists(search_dir):
            if os.path.isdir(search_dir):
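The functional change in `wget_output_path` is that `:` in the domain is replaced with `+` before the directory lookup, because wget writes hosts with a non-default port (e.g. `example.com:8080`) into folders named `example.com+8080`. A quick sketch of the path construction under that assumption:

```python
import os
from urllib.parse import urlparse

def wget_domain_dir(link_dir: str, url: str) -> str:
    """Build the directory wget is expected to have written, replacing ':' with '+'
    in host:port the same way the patched wget_output_path does (illustrative only)."""
    parsed = urlparse(url)
    host = parsed.netloc.replace(':', '+')   # example.com:8080 -> example.com+8080
    path = parsed.path.strip('/')
    return os.path.join(link_dir, host, path)

print(wget_domain_dir('output/archive/1478739709', 'http://example.com:8080/some/page'))
# output/archive/1478739709/example.com+8080/some/page
```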
|||
|
|
@ -26,15 +26,16 @@ from ..config import (
|
|||
URL_BLACKLIST_PTN,
|
||||
ANSI,
|
||||
stderr,
|
||||
OUTPUT_PERMISSIONS
|
||||
)
|
||||
from ..cli.logging import (
|
||||
from ..logging_util import (
|
||||
TimedProgress,
|
||||
log_indexing_process_started,
|
||||
log_indexing_process_finished,
|
||||
log_indexing_started,
|
||||
log_indexing_finished,
|
||||
log_parsing_started,
|
||||
log_parsing_finished,
|
||||
log_deduping_finished,
|
||||
)
|
||||
|
||||
from .schema import Link, ArchiveResult
|
||||
|
|
@ -51,6 +52,7 @@ from .json import (
|
|||
from .sql import (
|
||||
write_sql_main_index,
|
||||
parse_sql_main_index,
|
||||
write_sql_link_details,
|
||||
)
|
||||
|
||||
### Link filtering and checking
|
||||
|
|
@ -231,6 +233,8 @@ def write_main_index(links: List[Link], out_dir: str=OUTPUT_DIR, finished: bool=
|
|||
|
||||
with timed_index_update(os.path.join(out_dir, SQL_INDEX_FILENAME)):
|
||||
write_sql_main_index(links, out_dir=out_dir)
|
||||
os.chmod(os.path.join(out_dir, SQL_INDEX_FILENAME), int(OUTPUT_PERMISSIONS, base=8)) # set here because we don't write it with atomic writes
|
||||
|
||||
|
||||
with timed_index_update(os.path.join(out_dir, JSON_INDEX_FILENAME)):
|
||||
write_json_main_index(links, out_dir=out_dir)
|
||||
|
|
@ -267,20 +271,29 @@ def load_main_index_meta(out_dir: str=OUTPUT_DIR) -> Optional[dict]:
|
|||
|
||||
return None
|
||||
|
||||
|
||||
@enforce_types
|
||||
def import_new_links(existing_links: List[Link],
|
||||
import_path: str,
|
||||
out_dir: str=OUTPUT_DIR) -> Tuple[List[Link], List[Link]]:
|
||||
def parse_links_from_source(source_path: str) -> Tuple[List[Link], List[Link]]:
|
||||
|
||||
from ..parsers import parse_links
|
||||
|
||||
new_links: List[Link] = []
|
||||
|
||||
# parse and validate the import file
|
||||
log_parsing_started(import_path)
|
||||
raw_links, parser_name = parse_links(import_path)
|
||||
raw_links, parser_name = parse_links(source_path)
|
||||
new_links = validate_links(raw_links)
|
||||
|
||||
if parser_name:
|
||||
num_parsed = len(raw_links)
|
||||
log_parsing_finished(num_parsed, parser_name)
|
||||
|
||||
return new_links
|
||||
|
||||
|
||||
@enforce_types
|
||||
def dedupe_links(existing_links: List[Link],
|
||||
new_links: List[Link]) -> Tuple[List[Link], List[Link]]:
|
||||
|
||||
# merge existing links in out_dir and new links
|
||||
all_links = validate_links(existing_links + new_links)
|
||||
all_link_urls = {link.url for link in existing_links}
|
||||
|
|
@ -290,10 +303,11 @@ def import_new_links(existing_links: List[Link],
|
|||
if link.url not in all_link_urls
|
||||
]
|
||||
|
||||
if parser_name:
|
||||
num_parsed = len(raw_links)
|
||||
num_new_links = len(all_links) - len(existing_links)
|
||||
log_parsing_finished(num_parsed, num_new_links, parser_name)
|
||||
all_links_deduped = {link.url: link for link in all_links}
|
||||
for i in range(len(new_links)):
|
||||
if new_links[i].url in all_links_deduped.keys():
|
||||
new_links[i] = all_links_deduped[new_links[i].url]
|
||||
log_deduping_finished(len(new_links))
|
||||
|
||||
return all_links, new_links
|
||||
|
||||
|
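`import_new_links` is split into `parse_links_from_source` (parse and validate one source file) and `dedupe_links` (merge against the existing index and reuse already-indexed Link objects). A rough sketch of how a caller wires the two together, assuming the signatures shown in this hunk; it is only meaningful inside an initialized ArchiveBox collection:

```python
# Sketch only: assumes parse_links_from_source / dedupe_links / write_main_index
# as defined in this file, and an already-initialized collection on disk.
def import_source(existing_links, source_path, out_dir='.'):
    from archivebox.index import parse_links_from_source, dedupe_links, write_main_index

    new_links = parse_links_from_source(source_path)                 # parse + validate only
    all_links, new_links = dedupe_links(existing_links, new_links)   # merge, dedupe by URL
    write_main_index(links=all_links, out_dir=out_dir)               # persist the merged index
    return new_links
```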
|
@ -325,7 +339,8 @@ def patch_main_index(link: Link, out_dir: str=OUTPUT_DIR) -> None:
|
|||
# Patch HTML main index
|
||||
html_path = os.path.join(out_dir, 'index.html')
|
||||
with open(html_path, 'r') as f:
|
||||
html = f.read().split('\n')
|
||||
html = f.read().splitlines()
|
||||
|
||||
for idx, line in enumerate(html):
|
||||
if title and ('<span data-title-for="{}"'.format(link.url) in line):
|
||||
html[idx] = '<span>{}</span>'.format(title)
|
||||
|
|
@ -333,17 +348,19 @@ def patch_main_index(link: Link, out_dir: str=OUTPUT_DIR) -> None:
|
|||
html[idx] = '<span>{}</span>'.format(successful)
|
||||
break
|
||||
|
||||
atomic_write('\n'.join(html), html_path)
|
||||
atomic_write(html_path, '\n'.join(html))
|
||||
|
||||
|
||||
### Link Details Index
|
||||
|
||||
@enforce_types
|
||||
def write_link_details(link: Link, out_dir: Optional[str]=None) -> None:
|
||||
def write_link_details(link: Link, out_dir: Optional[str]=None, skip_sql_index: bool=False) -> None:
|
||||
out_dir = out_dir or link.link_dir
|
||||
|
||||
write_json_link_details(link, out_dir=out_dir)
|
||||
write_html_link_details(link, out_dir=out_dir)
|
||||
if not skip_sql_index:
|
||||
write_sql_link_details(link)
|
||||
|
||||
|
||||
@enforce_types
|
||||
|
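Several call sites in this diff flip the `atomic_write` argument order from `(contents, path)` to `(path, contents)`. For context, an atomic write helper of this kind is usually a temp-file-plus-`os.replace` pattern; the sketch below assumes roughly that shape and is not the exact helper from `archivebox.system`:

```python
import json
import os
import tempfile

def atomic_write(path: str, contents) -> None:
    """Write contents to path without leaving a half-written file behind
    (sketch of the usual tempfile + os.replace pattern, not the exact source)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix='.tmp-')
    try:
        with os.fdopen(fd, 'w', encoding='utf-8') as f:
            if isinstance(contents, (dict, list)):
                json.dump(contents, f, indent=4, default=str)
            else:
                f.write(contents)
        os.replace(tmp_path, path)   # atomic on the same filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise

# atomic_write('./index.json', {'version': '0.4.x', 'links': []})
```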
|
@ -512,8 +529,16 @@ def get_unrecognized_folders(links, out_dir: str=OUTPUT_DIR) -> Dict[str, Option
|
|||
link = None
|
||||
try:
|
||||
link = parse_json_link_details(entry.path)
|
||||
except Exception:
|
||||
pass
|
||||
except KeyError:
|
||||
# Try to fix index
|
||||
if index_exists:
|
||||
try:
|
||||
# Last attempt to repair the detail index
|
||||
link_guessed = parse_json_link_details(entry.path, guess=True)
|
||||
write_json_link_details(link_guessed, out_dir=entry.path)
|
||||
link = parse_json_link_details(entry.path)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
if index_exists and link is None:
|
||||
# index exists but it's corrupted or unparseable
|
||||
|
|
@ -538,7 +563,7 @@ def is_valid(link: Link) -> bool:
|
|||
return False
|
||||
if dir_exists and index_exists:
|
||||
try:
|
||||
parsed_link = parse_json_link_details(link.link_dir)
|
||||
parsed_link = parse_json_link_details(link.link_dir, guess=True)
|
||||
return link.url == parsed_link.url
|
||||
except Exception:
|
||||
pass
|
||||
|
|
@ -569,7 +594,10 @@ def fix_invalid_folder_locations(out_dir: str=OUTPUT_DIR) -> Tuple[List[str], Li
|
|||
for entry in os.scandir(os.path.join(out_dir, ARCHIVE_DIR_NAME)):
|
||||
if entry.is_dir(follow_symlinks=True):
|
||||
if os.path.exists(os.path.join(entry.path, 'index.json')):
|
||||
link = parse_json_link_details(entry.path)
|
||||
try:
|
||||
link = parse_json_link_details(entry.path)
|
||||
except KeyError:
|
||||
link = None
|
||||
if not link:
|
||||
continue
|
||||
|
||||
|
|
|
|||
|
|
@ -41,7 +41,7 @@ TITLE_LOADING_MSG = 'Not yet archived...'
|
|||
def parse_html_main_index(out_dir: str=OUTPUT_DIR) -> Iterator[str]:
|
||||
"""parse an archive index html file and return the list of urls"""
|
||||
|
||||
index_path = os.path.join(out_dir, HTML_INDEX_FILENAME)
|
||||
index_path = join(out_dir, HTML_INDEX_FILENAME)
|
||||
if os.path.exists(index_path):
|
||||
with open(index_path, 'r', encoding='utf-8') as f:
|
||||
for line in f:
|
||||
|
|
@ -58,7 +58,7 @@ def write_html_main_index(links: List[Link], out_dir: str=OUTPUT_DIR, finished:
|
|||
copy_and_overwrite(join(TEMPLATES_DIR, STATIC_DIR_NAME), join(out_dir, STATIC_DIR_NAME))
|
||||
|
||||
rendered_html = main_index_template(links, finished=finished)
|
||||
atomic_write(rendered_html, join(out_dir, HTML_INDEX_FILENAME))
|
||||
atomic_write(join(out_dir, HTML_INDEX_FILENAME), rendered_html)
|
||||
|
||||
|
||||
@enforce_types
|
||||
|
|
@ -90,7 +90,7 @@ def main_index_row_template(link: Link) -> str:
|
|||
**link._asdict(extended=True),
|
||||
|
||||
# before pages are finished archiving, show loading msg instead of title
|
||||
'title': (
|
||||
'title': htmlencode(
|
||||
link.title
|
||||
or (link.base_url if link.is_archived else TITLE_LOADING_MSG)
|
||||
),
|
||||
|
|
@ -116,7 +116,7 @@ def write_html_link_details(link: Link, out_dir: Optional[str]=None) -> None:
|
|||
out_dir = out_dir or link.link_dir
|
||||
|
||||
rendered_html = link_details_template(link)
|
||||
atomic_write(rendered_html, join(out_dir, HTML_INDEX_FILENAME))
|
||||
atomic_write(join(out_dir, HTML_INDEX_FILENAME), rendered_html)
|
||||
|
||||
|
||||
@enforce_types
|
||||
|
|
@ -129,15 +129,15 @@ def link_details_template(link: Link) -> str:
|
|||
return render_legacy_template(LINK_DETAILS_TEMPLATE, {
|
||||
**link_info,
|
||||
**link_info['canonical'],
|
||||
'title': (
|
||||
'title': htmlencode(
|
||||
link.title
|
||||
or (link.base_url if link.is_archived else TITLE_LOADING_MSG)
|
||||
),
|
||||
'url_str': htmlencode(urldecode(link.base_url)),
|
||||
'archive_url': urlencode(
|
||||
wget_output_path(link)
|
||||
or (link.domain if link.is_archived else 'about:blank')
|
||||
),
|
||||
or (link.domain if link.is_archived else '')
|
||||
) or 'about:blank',
|
||||
'extension': link.extension or 'html',
|
||||
'tags': link.tags or 'untagged',
|
||||
'status': 'archived' if link.is_archived else 'not yet archived',
|
||||
|
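The HTML templates now run titles through `htmlencode` (and the displayed URL through `urldecode`) so markup inside a page title cannot break or script the rendered index. A tiny illustration of the effect using the standard-library equivalents:

```python
from html import escape
from urllib.parse import unquote

title = '<script>alert(1)</script> My Bookmarks'
url = 'https://example.com/search?q=hello%20world'

print(escape(title))   # &lt;script&gt;alert(1)&lt;/script&gt; My Bookmarks
print(unquote(url))    # https://example.com/search?q=hello world
```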
|
|
|||
|
|
@ -3,6 +3,7 @@ __package__ = 'archivebox.index'
|
|||
import os
|
||||
import sys
|
||||
import json as pyjson
|
||||
from pathlib import Path
|
||||
|
||||
from datetime import datetime
|
||||
from typing import List, Optional, Iterator, Any
|
||||
|
|
@ -18,6 +19,7 @@ from ..config import (
|
|||
DEPENDENCIES,
|
||||
JSON_INDEX_FILENAME,
|
||||
ARCHIVE_DIR_NAME,
|
||||
ANSI
|
||||
)
|
||||
|
||||
|
||||
|
|
@ -37,7 +39,6 @@ MAIN_INDEX_HEADER = {
|
|||
},
|
||||
}
|
||||
|
||||
|
||||
### Main Links Index
|
||||
|
||||
@enforce_types
|
||||
|
|
@ -49,8 +50,19 @@ def parse_json_main_index(out_dir: str=OUTPUT_DIR) -> Iterator[Link]:
|
|||
with open(index_path, 'r', encoding='utf-8') as f:
|
||||
links = pyjson.load(f)['links']
|
||||
for link_json in links:
|
||||
yield Link.from_json(link_json)
|
||||
|
||||
try:
|
||||
yield Link.from_json(link_json)
|
||||
except KeyError:
|
||||
try:
|
||||
detail_index_path = Path(OUTPUT_DIR) / ARCHIVE_DIR_NAME / link_json['timestamp']
|
||||
yield parse_json_link_details(str(detail_index_path))
|
||||
except KeyError:
|
||||
# as a last effort, try to guess the missing values out of existing ones
|
||||
try:
|
||||
yield Link.from_json(link_json, guess=True)
|
||||
except KeyError:
|
||||
print(" {lightyellow}! Failed to load the index.json from {}".format(detail_index_path, **ANSI))
|
||||
continue
|
||||
return ()
|
||||
|
||||
@enforce_types
|
||||
|
|
@ -74,7 +86,7 @@ def write_json_main_index(links: List[Link], out_dir: str=OUTPUT_DIR) -> None:
|
|||
'last_run_cmd': sys.argv,
|
||||
'links': links,
|
||||
}
|
||||
atomic_write(main_index_json, os.path.join(out_dir, JSON_INDEX_FILENAME))
|
||||
atomic_write(os.path.join(out_dir, JSON_INDEX_FILENAME), main_index_json)
|
||||
|
||||
|
||||
### Link Details Index
|
||||
|
|
@ -85,19 +97,18 @@ def write_json_link_details(link: Link, out_dir: Optional[str]=None) -> None:
|
|||
|
||||
out_dir = out_dir or link.link_dir
|
||||
path = os.path.join(out_dir, JSON_INDEX_FILENAME)
|
||||
|
||||
atomic_write(link._asdict(extended=True), path)
|
||||
atomic_write(path, link._asdict(extended=True))
|
||||
|
||||
|
||||
@enforce_types
|
||||
def parse_json_link_details(out_dir: str) -> Optional[Link]:
|
||||
def parse_json_link_details(out_dir: str, guess: Optional[bool]=False) -> Optional[Link]:
|
||||
"""load the json link index from a given directory"""
|
||||
existing_index = os.path.join(out_dir, JSON_INDEX_FILENAME)
|
||||
if os.path.exists(existing_index):
|
||||
with open(existing_index, 'r', encoding='utf-8') as f:
|
||||
try:
|
||||
link_json = pyjson.load(f)
|
||||
return Link.from_json(link_json)
|
||||
return Link.from_json(link_json, guess)
|
||||
except pyjson.JSONDecodeError:
|
||||
pass
|
||||
return None
|
||||
|
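`parse_json_link_details` gains a `guess` flag that is forwarded to `Link.from_json`, so an older `index.json` missing newer fields can still be loaded and then re-written in the current schema. A sketch of the repair pattern used by the folder-status helpers above, assuming those function names:

```python
# Sketch of the "last attempt to repair the detail index" pattern from the diff,
# assuming parse_json_link_details / write_json_link_details as shown above.
def repair_detail_index(entry_path: str):
    from archivebox.index.json import parse_json_link_details, write_json_link_details

    try:
        return parse_json_link_details(entry_path)                        # strict parse first
    except KeyError:
        link_guessed = parse_json_link_details(entry_path, guess=True)    # fill missing fields
        write_json_link_details(link_guessed, out_dir=entry_path)         # persist repaired index
        return parse_json_link_details(entry_path)                        # should now parse cleanly
```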
|
@ -110,7 +121,10 @@ def parse_json_links_details(out_dir: str) -> Iterator[Link]:
|
|||
for entry in os.scandir(os.path.join(out_dir, ARCHIVE_DIR_NAME)):
|
||||
if entry.is_dir(follow_symlinks=True):
|
||||
if os.path.exists(os.path.join(entry.path, 'index.json')):
|
||||
link = parse_json_link_details(entry.path)
|
||||
try:
|
||||
link = parse_json_link_details(entry.path)
|
||||
except KeyError:
|
||||
link = None
|
||||
if link:
|
||||
yield link
|
||||
|
||||
|
|
@ -149,5 +163,3 @@ class ExtendedEncoder(pyjson.JSONEncoder):
|
|||
def to_json(obj: Any, indent: Optional[int]=4, sort_keys: bool=True, cls=ExtendedEncoder) -> str:
|
||||
return pyjson.dumps(obj, indent=indent, sort_keys=sort_keys, cls=ExtendedEncoder)
|
||||
|
||||
|
||||
|
||||
|
|
|
|||
|
|
@ -1,14 +1,19 @@
|
|||
__package__ = 'archivebox.index'
|
||||
|
||||
import os
|
||||
from pathlib import Path
|
||||
|
||||
from datetime import datetime
|
||||
from datetime import datetime, timedelta
|
||||
|
||||
from typing import List, Dict, Any, Optional, Union
|
||||
|
||||
from dataclasses import dataclass, asdict, field, fields
|
||||
|
||||
|
||||
from ..system import get_dir_size
|
||||
|
||||
from ..config import OUTPUT_DIR, ARCHIVE_DIR_NAME
|
||||
|
||||
class ArchiveError(Exception):
|
||||
def __init__(self, message, hints=None):
|
||||
super().__init__(message)
|
||||
|
|
@ -49,7 +54,15 @@ class ArchiveResult:
|
|||
assert self.output
|
||||
|
||||
@classmethod
|
||||
def from_json(cls, json_info):
|
||||
def guess_ts(_cls, dict_info):
|
||||
from ..util import parse_date
|
||||
parsed_timestamp = parse_date(dict_info["timestamp"])
|
||||
start_ts = parsed_timestamp
|
||||
end_ts = parsed_timestamp + timedelta(seconds=int(dict_info["duration"]))
|
||||
return start_ts, end_ts
|
||||
|
||||
@classmethod
|
||||
def from_json(cls, json_info, guess=False):
|
||||
from ..util import parse_date
|
||||
|
||||
info = {
|
||||
|
|
@ -57,8 +70,25 @@ class ArchiveResult:
|
|||
for key, val in json_info.items()
|
||||
if key in cls.field_names()
|
||||
}
|
||||
info['start_ts'] = parse_date(info['start_ts'])
|
||||
info['end_ts'] = parse_date(info['end_ts'])
|
||||
if guess:
|
||||
keys = info.keys()
|
||||
if "start_ts" not in keys:
|
||||
info["start_ts"], info["end_ts"] = cls.guess_ts(json_info)
|
||||
else:
|
||||
info['start_ts'] = parse_date(info['start_ts'])
|
||||
info['end_ts'] = parse_date(info['end_ts'])
|
||||
if "pwd" not in keys:
|
||||
info["pwd"] = str(Path(OUTPUT_DIR) / ARCHIVE_DIR_NAME / json_info["timestamp"])
|
||||
if "cmd_version" not in keys:
|
||||
info["cmd_version"] = "Undefined"
|
||||
if "cmd" not in keys:
|
||||
info["cmd"] = []
|
||||
else:
|
||||
info['start_ts'] = parse_date(info['start_ts'])
|
||||
info['end_ts'] = parse_date(info['end_ts'])
|
||||
info['cmd_version'] = info.get('cmd_version')
|
||||
if type(info["cmd"]) is str:
|
||||
info["cmd"] = [info["cmd"]]
|
||||
return cls(**info)
|
||||
|
||||
def to_dict(self, *keys) -> dict:
|
||||
|
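With `guess=True`, `ArchiveResult.from_json` back-fills fields that old records never stored: `start_ts`/`end_ts` are derived from `timestamp` plus `duration`, `pwd` from the archive folder, and `cmd`/`cmd_version` get placeholder values. The timestamp-guessing step on its own, with made-up input data (the real code parses dates via `parse_date`, which accepts more formats):

```python
from datetime import datetime, timedelta

# a legacy history record that predates start_ts/end_ts (made-up example data)
legacy = {'timestamp': '2019-03-22 13:46:45', 'duration': '12'}

start_ts = datetime.strptime(legacy['timestamp'], '%Y-%m-%d %H:%M:%S')
end_ts = start_ts + timedelta(seconds=int(legacy['duration']))

print(start_ts.isoformat(), end_ts.isoformat())
# 2019-03-22T13:46:45 2019-03-22T13:46:57
```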
|
@ -95,6 +125,7 @@ class Link:
|
|||
updated: Optional[datetime] = None
|
||||
schema: str = 'Link'
|
||||
|
||||
|
||||
def __str__(self) -> str:
|
||||
return f'[{self.timestamp}] {self.base_url} "{self.title}"'
|
||||
|
||||
|
|
@ -178,7 +209,7 @@ class Link:
|
|||
return info
|
||||
|
||||
@classmethod
|
||||
def from_json(cls, json_info):
|
||||
def from_json(cls, json_info, guess=False):
|
||||
from ..util import parse_date
|
||||
|
||||
info = {
|
||||
|
|
@ -196,7 +227,7 @@ class Link:
|
|||
cast_history[method] = []
|
||||
for json_result in method_history:
|
||||
assert isinstance(json_result, dict), 'Items in Link["history"][method] must be dicts'
|
||||
cast_result = ArchiveResult.from_json(json_result)
|
||||
cast_result = ArchiveResult.from_json(json_result, guess)
|
||||
cast_history[method].append(cast_result)
|
||||
|
||||
info['history'] = cast_history
|
||||
|
|
@ -226,6 +257,13 @@ class Link:
|
|||
from ..config import ARCHIVE_DIR_NAME
|
||||
return '{}/{}'.format(ARCHIVE_DIR_NAME, self.timestamp)
|
||||
|
||||
@property
|
||||
def archive_size(self) -> float:
|
||||
try:
|
||||
return get_dir_size(self.archive_path)[0]
|
||||
except Exception:
|
||||
return 0
|
||||
|
||||
### URL Helpers
|
||||
@property
|
||||
def url_hash(self):
|
||||
|
|
@ -267,7 +305,16 @@ class Link:
|
|||
@property
|
||||
def bookmarked_date(self) -> Optional[str]:
|
||||
from ..util import ts_to_date
|
||||
return ts_to_date(self.timestamp) if self.timestamp else None
|
||||
|
||||
max_ts = (datetime.now() + timedelta(days=30)).timestamp()
|
||||
|
||||
if self.timestamp and self.timestamp.replace('.', '').isdigit():
|
||||
if 0 < float(self.timestamp) < max_ts:
|
||||
return ts_to_date(datetime.fromtimestamp(float(self.timestamp)))
|
||||
else:
|
||||
return str(self.timestamp)
|
||||
return None
|
||||
|
||||
|
||||
@property
|
||||
def updated_date(self) -> Optional[str]:
|
||||
|
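`bookmarked_date` now refuses to treat an arbitrary string as a unix timestamp: only purely numeric values between 0 and roughly thirty days in the future get converted, anything else is shown raw. A compact standalone version of that check (formatting differs from the real `ts_to_date` helper):

```python
from datetime import datetime, timedelta
from typing import Optional

def bookmarked_date(timestamp: Optional[str]) -> Optional[str]:
    """Render a bookmark timestamp defensively (sketch of the rule in the diff above)."""
    if not timestamp:
        return None
    max_ts = (datetime.now() + timedelta(days=30)).timestamp()
    if timestamp.replace('.', '').isdigit() and 0 < float(timestamp) < max_ts:
        return datetime.fromtimestamp(float(timestamp)).strftime('%Y-%m-%d %H:%M')
    return str(timestamp)

print(bookmarked_date('1478739709.123'))   # e.g. 2016-11-10 01:01 (local time)
print(bookmarked_date('not-a-timestamp'))  # not-a-timestamp
```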
|
@ -318,6 +365,7 @@ class Link:
|
|||
'screenshot.png',
|
||||
'output.html',
|
||||
'media',
|
||||
'singlefile.html'
|
||||
)
|
||||
|
||||
return any(
|
||||
|
|
@ -329,7 +377,7 @@ class Link:
|
|||
"""get the latest output that each archive method produced for link"""
|
||||
|
||||
ARCHIVE_METHODS = (
|
||||
'title', 'favicon', 'wget', 'warc', 'pdf',
|
||||
'title', 'favicon', 'wget', 'warc', 'singlefile', 'pdf',
|
||||
'screenshot', 'dom', 'git', 'media', 'archive_org',
|
||||
)
|
||||
latest: Dict[str, ArchiveOutput] = {}
|
||||
|
|
@ -345,7 +393,6 @@ class Link:
|
|||
latest[archive_method] = history[0].output
|
||||
else:
|
||||
latest[archive_method] = None
|
||||
|
||||
return latest
|
||||
|
||||
|
||||
|
|
@ -359,6 +406,7 @@ class Link:
|
|||
'google_favicon_path': 'https://www.google.com/s2/favicons?domain={}'.format(self.domain),
|
||||
'wget_path': wget_output_path(self),
|
||||
'warc_path': 'warc',
|
||||
'singlefile_path': 'singlefile.html',
|
||||
'pdf_path': 'output.pdf',
|
||||
'screenshot_path': 'screenshot.png',
|
||||
'dom_path': 'output.html',
|
||||
|
|
@ -378,7 +426,7 @@ class Link:
|
|||
'pdf_path': static_path,
|
||||
'screenshot_path': static_path,
|
||||
'dom_path': static_path,
|
||||
'singlefile_path': static_path,
|
||||
})
|
||||
return canonical
|
||||
|
||||
|
||||
|
|
|
|||
|
|
@ -20,31 +20,38 @@ def parse_sql_main_index(out_dir: str=OUTPUT_DIR) -> Iterator[Link]:
|
|||
for page in Snapshot.objects.all()
|
||||
)
|
||||
|
||||
@enforce_types
|
||||
def remove_from_sql_main_index(links: List[Link], out_dir: str=OUTPUT_DIR) -> None:
|
||||
setup_django(out_dir, check_db=True)
|
||||
from core.models import Snapshot
|
||||
from django.db import transaction
|
||||
|
||||
with transaction.atomic():
|
||||
for link in links:
|
||||
Snapshot.objects.filter(url=link.url).delete()
|
||||
|
||||
@enforce_types
|
||||
def write_sql_main_index(links: List[Link], out_dir: str=OUTPUT_DIR) -> None:
|
||||
setup_django(out_dir, check_db=True)
|
||||
from core.models import Snapshot
|
||||
from django.db import transaction
|
||||
|
||||
all_urls = {link.url: link for link in links}
|
||||
all_ts = {link.timestamp: link for link in links}
|
||||
with transaction.atomic():
|
||||
for link in links:
|
||||
info = {k: v for k, v in link._asdict().items() if k in Snapshot.keys}
|
||||
Snapshot.objects.update_or_create(url=link.url, defaults=info)
|
||||
|
||||
@enforce_types
|
||||
def write_sql_link_details(link: Link, out_dir: str=OUTPUT_DIR) -> None:
|
||||
setup_django(out_dir, check_db=True)
|
||||
from core.models import Snapshot
|
||||
from django.db import transaction
|
||||
|
||||
with transaction.atomic():
|
||||
for snapshot in Snapshot.objects.all():
|
||||
if snapshot.timestamp in all_ts:
|
||||
info = {k: v for k, v in all_urls.pop(snapshot.url)._asdict().items() if k in Snapshot.keys}
|
||||
snapshot.delete()
|
||||
Snapshot.objects.create(**info)
|
||||
if snapshot.url in all_urls:
|
||||
info = {k: v for k, v in all_urls.pop(snapshot.url)._asdict().items() if k in Snapshot.keys}
|
||||
snapshot.delete()
|
||||
Snapshot.objects.create(**info)
|
||||
else:
|
||||
snapshot.delete()
|
||||
|
||||
for url, link in all_urls.items():
|
||||
info = {k: v for k, v in link._asdict().items() if k in Snapshot.keys}
|
||||
Snapshot.objects.update_or_create(url=url, defaults=info)
|
||||
snap = Snapshot.objects.get(url=link.url, timestamp=link.timestamp)
|
||||
snap.title = link.title
|
||||
snap.tags = link.tags
|
||||
snap.save()
|
||||
|
||||
|
||||
|
||||
|
|
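As far as the interleaved hunk above can be read, the reworked `write_sql_main_index` walks the existing Snapshot rows in one transaction, recreates the ones whose URL is still in the incoming set, deletes the rest, and finally `update_or_create`s whatever is left over. A hedged sketch of that shape using the ORM calls named in the diff; it only runs inside a configured ArchiveBox/Django project:

```python
# Sketch of the reworked sync, assuming the Snapshot model, setup_django(),
# and the Snapshot.keys field filter exactly as named in the diff above.
def write_sql_main_index(links, out_dir='.'):
    from archivebox.config import setup_django
    setup_django(out_dir, check_db=True)
    from core.models import Snapshot
    from django.db import transaction

    all_urls = {link.url: link for link in links}

    with transaction.atomic():
        for snapshot in Snapshot.objects.all():
            if snapshot.url in all_urls:
                info = {k: v for k, v in all_urls.pop(snapshot.url)._asdict().items() if k in Snapshot.keys}
                snapshot.delete()
                Snapshot.objects.create(**info)   # replace with the freshly parsed values
            else:
                snapshot.delete()                 # URL no longer present in the index

        for url, link in all_urls.items():        # anything not seen yet is new
            info = {k: v for k, v in link._asdict().items() if k in Snapshot.keys}
            Snapshot.objects.update_or_create(url=url, defaults=info)
```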
|
|||
|
|
@ -1,30 +1,32 @@
|
|||
__package__ = 'archivebox.cli'
|
||||
__package__ = 'archivebox'
|
||||
|
||||
import re
|
||||
import os
|
||||
import sys
|
||||
import time
|
||||
import argparse
|
||||
from multiprocessing import Process
|
||||
|
||||
from datetime import datetime
|
||||
from dataclasses import dataclass
|
||||
from multiprocessing import Process
|
||||
from typing import Optional, List, Dict, Union, IO
|
||||
from typing import Optional, List, Dict, Union, IO, TYPE_CHECKING
|
||||
|
||||
from ..index.schema import Link, ArchiveResult
|
||||
from ..index.json import to_json
|
||||
from ..index.csv import links_to_csv
|
||||
from ..util import enforce_types
|
||||
from ..config import (
|
||||
if TYPE_CHECKING:
|
||||
from .index.schema import Link, ArchiveResult
|
||||
|
||||
from .util import enforce_types
|
||||
from .config import (
|
||||
ConfigDict,
|
||||
PYTHON_ENCODING,
|
||||
ANSI,
|
||||
OUTPUT_DIR,
|
||||
IS_TTY,
|
||||
SHOW_PROGRESS,
|
||||
TERM_WIDTH,
|
||||
OUTPUT_DIR,
|
||||
SOURCES_DIR_NAME,
|
||||
HTML_INDEX_FILENAME,
|
||||
stderr,
|
||||
)
|
||||
|
||||
|
||||
@dataclass
|
||||
class RuntimeStats:
|
||||
"""mutable stats counter for logging archiving timing info to CLI output"""
|
||||
|
|
@ -66,6 +68,7 @@ def reject_stdin(caller: str, stdin: Optional[IO]=sys.stdin) -> None:
|
|||
stderr()
|
||||
raise SystemExit(1)
|
||||
|
||||
|
||||
def accept_stdin(stdin: Optional[IO]=sys.stdin) -> Optional[str]:
|
||||
"""accept any standard input and return it as a string or None"""
|
||||
if not stdin:
|
||||
|
|
@ -80,7 +83,9 @@ class TimedProgress:
|
|||
"""Show a progress bar and measure elapsed time until .end() is called"""
|
||||
|
||||
def __init__(self, seconds, prefix=''):
|
||||
if SHOW_PROGRESS:
|
||||
from .config import SHOW_PROGRESS
|
||||
self.SHOW_PROGRESS = SHOW_PROGRESS
|
||||
if self.SHOW_PROGRESS:
|
||||
self.p = Process(target=progress_bar, args=(seconds, prefix))
|
||||
self.p.start()
|
||||
|
||||
|
|
@ -91,28 +96,39 @@ class TimedProgress:
|
|||
|
||||
end_ts = datetime.now()
|
||||
self.stats['end_ts'] = end_ts
|
||||
if SHOW_PROGRESS:
|
||||
# protect from double termination
|
||||
#if p is None or not hasattr(p, 'kill'):
|
||||
# return
|
||||
if self.p is not None:
|
||||
self.p.terminate()
|
||||
|
||||
self.p = None
|
||||
|
||||
if self.SHOW_PROGRESS:
|
||||
# terminate if we havent already terminated
|
||||
self.p.terminate()
|
||||
self.p.join()
|
||||
self.p.close()
|
||||
|
||||
sys.stdout.write('\r{}{}\r'.format((' ' * TERM_WIDTH()), ANSI['reset'])) # clear whole terminal line
|
||||
# clear whole terminal line
|
||||
try:
|
||||
sys.stdout.write('\r{}{}\r'.format((' ' * TERM_WIDTH()), ANSI['reset']))
|
||||
except (IOError, BrokenPipeError):
|
||||
# ignore when the parent proc has stopped listening to our stdout
|
||||
pass
|
||||
|
||||
|
||||
@enforce_types
|
||||
def progress_bar(seconds: int, prefix: str='') -> None:
|
||||
"""show timer in the form of progress bar, with percentage and seconds remaining"""
|
||||
chunk = '█' if sys.stdout.encoding == 'UTF-8' else '#'
|
||||
chunks = TERM_WIDTH() - len(prefix) - 20 # number of progress chunks to show (aka max bar width)
|
||||
chunk = '█' if PYTHON_ENCODING == 'UTF-8' else '#'
|
||||
last_width = TERM_WIDTH()
|
||||
chunks = last_width - len(prefix) - 20 # number of progress chunks to show (aka max bar width)
|
||||
try:
|
||||
for s in range(seconds * chunks):
|
||||
chunks = TERM_WIDTH() - len(prefix) - 20
|
||||
max_width = TERM_WIDTH()
|
||||
if max_width < last_width:
|
||||
# when the terminal size is shrunk, we have to write a newline
|
||||
# otherwise the progress bar will keep wrapping incorrectly
|
||||
sys.stdout.write('\r\n')
|
||||
sys.stdout.flush()
|
||||
chunks = max_width - len(prefix) - 20
|
||||
progress = s / chunks / seconds * 100
|
||||
bar_width = round(progress/(100/chunks))
|
||||
last_width = max_width
|
||||
|
||||
# ████████████████████ 0.9% (1/60sec)
|
||||
sys.stdout.write('\r{0}{1}{2}{3} {4}% ({5}/{6}sec)'.format(
|
||||
|
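`TimedProgress` now resolves `SHOW_PROGRESS` lazily at construction time, joins and closes the bar subprocess in `.end()`, and swallows `BrokenPipeError` when the parent has stopped listening. A minimal self-contained version of the same pattern (much simpler than the real progress bar):

```python
import sys
import time
from datetime import datetime
from multiprocessing import Process

def fake_progress_bar(seconds: int, prefix: str = '') -> None:
    # simplified stand-in for the real progress_bar()
    try:
        for s in range(seconds):
            sys.stdout.write(f'\r{prefix}{s + 1}/{seconds}s')
            sys.stdout.flush()
            time.sleep(1)
    except (KeyboardInterrupt, BrokenPipeError):
        pass

class TimedProgress:
    """Run the bar in a child process, record start/end times, clean up in end()."""

    def __init__(self, seconds: int, prefix: str = '', show_progress: bool = True):
        self.stats = {'start_ts': datetime.now(), 'end_ts': None}
        self.p = Process(target=fake_progress_bar, args=(seconds, prefix)) if show_progress else None
        if self.p:
            self.p.start()

    def end(self):
        self.stats['end_ts'] = datetime.now()
        if self.p is not None:
            self.p.terminate()
            self.p.join()
            self.p.close()    # Python 3.7+: release the Process handle
            self.p = None
        try:
            sys.stdout.write('\r' + ' ' * 40 + '\r')   # clear the bar line
        except (IOError, BrokenPipeError):
            pass              # parent stopped listening to our stdout

if __name__ == '__main__':
    timer = TimedProgress(3, prefix='  ')
    time.sleep(1)
    timer.end()
    print('elapsed:', timer.stats['end_ts'] - timer.stats['start_ts'])
```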
|
@ -138,27 +154,51 @@ def progress_bar(seconds: int, prefix: str='') -> None:
|
|||
seconds,
|
||||
))
|
||||
sys.stdout.flush()
|
||||
except KeyboardInterrupt:
|
||||
except (KeyboardInterrupt, BrokenPipeError):
|
||||
print()
|
||||
pass
|
||||
|
||||
|
||||
def log_cli_command(subcommand: str, subcommand_args: List[str], stdin: Optional[str], pwd: str):
|
||||
from .config import VERSION, ANSI
|
||||
cmd = ' '.join(('archivebox', subcommand, *subcommand_args))
|
||||
stdin_hint = ' < /dev/stdin' if not stdin.isatty() else ''
|
||||
stderr('{black}[i] [{now}] ArchiveBox v{VERSION}: {cmd}{stdin_hint}{reset}'.format(
|
||||
now=datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
|
||||
VERSION=VERSION,
|
||||
cmd=cmd,
|
||||
stdin_hint=stdin_hint,
|
||||
**ANSI,
|
||||
))
|
||||
stderr('{black} > {pwd}{reset}'.format(pwd=pwd, **ANSI))
|
||||
stderr()
|
||||
|
||||
### Parsing Stage
|
||||
|
||||
def log_parsing_started(source_file: str):
|
||||
start_ts = datetime.now()
|
||||
_LAST_RUN_STATS.parse_start_ts = start_ts
|
||||
print('\n{green}[*] [{}] Parsing new links from output/sources/{}...{reset}'.format(
|
||||
start_ts.strftime('%Y-%m-%d %H:%M:%S'),
|
||||
source_file.rsplit('/', 1)[-1],
|
||||
|
||||
def log_importing_started(urls: Union[str, List[str]], depth: int, index_only: bool):
|
||||
_LAST_RUN_STATS.parse_start_ts = datetime.now()
|
||||
print('{green}[+] [{}] Adding {} links to index (crawl depth={}){}...{reset}'.format(
|
||||
_LAST_RUN_STATS.parse_start_ts.strftime('%Y-%m-%d %H:%M:%S'),
|
||||
len(urls) if isinstance(urls, list) else len(urls.split('\n')),
|
||||
depth,
|
||||
' (index only)' if index_only else '',
|
||||
**ANSI,
|
||||
))
|
||||
|
||||
def log_parsing_finished(num_parsed: int, num_new_links: int, parser_name: str):
|
||||
end_ts = datetime.now()
|
||||
_LAST_RUN_STATS.parse_end_ts = end_ts
|
||||
print(' > Parsed {} links as {} ({} new links added)'.format(num_parsed, parser_name, num_new_links))
|
||||
def log_source_saved(source_file: str):
|
||||
print(' > Saved verbatim input to {}/{}'.format(SOURCES_DIR_NAME, source_file.rsplit('/', 1)[-1]))
|
||||
|
||||
def log_parsing_finished(num_parsed: int, parser_name: str):
|
||||
_LAST_RUN_STATS.parse_end_ts = datetime.now()
|
||||
print(' > Parsed {} URLs from input ({})'.format(num_parsed, parser_name))
|
||||
|
||||
def log_deduping_finished(num_new_links: int):
|
||||
print(' > Found {} new URLs not already in index'.format(num_new_links))
|
||||
|
||||
|
||||
def log_crawl_started(new_links):
|
||||
print('{lightred}[*] Starting crawl of {} sites 1 hop out from starting point{reset}'.format(len(new_links), **ANSI))
|
||||
|
||||
### Indexing Stage
|
||||
|
||||
|
|
@ -166,20 +206,23 @@ def log_indexing_process_started(num_links: int):
|
|||
start_ts = datetime.now()
|
||||
_LAST_RUN_STATS.index_start_ts = start_ts
|
||||
print()
|
||||
print('{green}[*] [{}] Writing {} links to main index...{reset}'.format(
|
||||
print('{black}[*] [{}] Writing {} links to main index...{reset}'.format(
|
||||
start_ts.strftime('%Y-%m-%d %H:%M:%S'),
|
||||
num_links,
|
||||
**ANSI,
|
||||
))
|
||||
|
||||
|
||||
def log_indexing_process_finished():
|
||||
end_ts = datetime.now()
|
||||
_LAST_RUN_STATS.index_end_ts = end_ts
|
||||
|
||||
|
||||
def log_indexing_started(out_path: str):
|
||||
if IS_TTY:
|
||||
sys.stdout.write(f' > {out_path}')
|
||||
|
||||
|
||||
def log_indexing_finished(out_path: str):
|
||||
print(f'\r √ {out_path}')
|
||||
|
||||
|
|
@ -198,7 +241,7 @@ def log_archiving_started(num_links: int, resume: Optional[float]=None):
|
|||
**ANSI,
|
||||
))
|
||||
else:
|
||||
print('{green}[▶] [{}] Updating content for {} matching pages in archive...{reset}'.format(
|
||||
print('{green}[▶] [{}] Collecting content for {} Snapshots in archive...{reset}'.format(
|
||||
start_ts.strftime('%Y-%m-%d %H:%M:%S'),
|
||||
num_links,
|
||||
**ANSI,
|
||||
|
|
@ -216,8 +259,8 @@ def log_archiving_paused(num_links: int, idx: int, timestamp: str):
|
|||
total=num_links,
|
||||
))
|
||||
print()
|
||||
print(' To view your archive, open:')
|
||||
print(' {}/index.html'.format(OUTPUT_DIR))
|
||||
print(' {lightred}Hint:{reset} To view your archive index, open:'.format(**ANSI))
|
||||
print(' {}/{}'.format(OUTPUT_DIR, HTML_INDEX_FILENAME))
|
||||
print(' Continue archiving where you left off by running:')
|
||||
print(' archivebox update --resume={}'.format(timestamp))
|
||||
|
||||
|
|
@ -227,9 +270,9 @@ def log_archiving_finished(num_links: int):
|
|||
assert _LAST_RUN_STATS.archiving_start_ts is not None
|
||||
seconds = end_ts.timestamp() - _LAST_RUN_STATS.archiving_start_ts.timestamp()
|
||||
if seconds > 60:
|
||||
duration = '{0:.2f} min'.format(seconds / 60, 2)
|
||||
duration = '{0:.2f} min'.format(seconds / 60)
|
||||
else:
|
||||
duration = '{0:.2f} sec'.format(seconds, 2)
|
||||
duration = '{0:.2f} sec'.format(seconds)
|
||||
|
||||
print()
|
||||
print('{}[√] [{}] Update of {} pages complete ({}){}'.format(
|
||||
|
|
@ -243,13 +286,13 @@ def log_archiving_finished(num_links: int):
|
|||
print(' - {} links updated'.format(_LAST_RUN_STATS.succeeded))
|
||||
print(' - {} links had errors'.format(_LAST_RUN_STATS.failed))
|
||||
print()
|
||||
print(' To view your archive, open:')
|
||||
print(' {}/index.html'.format(OUTPUT_DIR))
|
||||
print(' {lightred}Hint:{reset} To view your archive index, open:'.format(**ANSI))
|
||||
print(' {}/{}'.format(OUTPUT_DIR, HTML_INDEX_FILENAME))
|
||||
print(' Or run the built-in webserver:')
|
||||
print(' archivebox server')
|
||||
|
||||
|
||||
def log_link_archiving_started(link: Link, link_dir: str, is_new: bool):
|
||||
def log_link_archiving_started(link: "Link", link_dir: str, is_new: bool):
|
||||
# [*] [2019-03-22 13:46:45] "Log Structured Merge Trees - ben stopford"
|
||||
# http://www.benstopford.com/2015/02/14/log-structured-merge-trees/
|
||||
# > output/archive/1478739709
|
||||
|
|
@ -267,7 +310,7 @@ def log_link_archiving_started(link: Link, link_dir: str, is_new: bool):
|
|||
pretty_path(link_dir),
|
||||
))
|
||||
|
||||
def log_link_archiving_finished(link: Link, link_dir: str, is_new: bool, stats: dict):
|
||||
def log_link_archiving_finished(link: "Link", link_dir: str, is_new: bool, stats: dict):
|
||||
total = sum(stats.values())
|
||||
|
||||
if stats['failed'] > 0 :
|
||||
|
|
@ -282,7 +325,7 @@ def log_archive_method_started(method: str):
|
|||
print(' > {}'.format(method))
|
||||
|
||||
|
||||
def log_archive_method_finished(result: ArchiveResult):
|
||||
def log_archive_method_finished(result: "ArchiveResult"):
|
||||
"""quote the argument with whitespace in a command so the user can
|
||||
copy-paste the outputted string directly to run the cmd
|
||||
"""
|
||||
|
|
@ -331,6 +374,7 @@ def log_list_started(filter_patterns: Optional[List[str]], filter_type: str):
|
|||
print(' {}'.format(' '.join(filter_patterns or ())))
|
||||
|
||||
def log_list_finished(links):
|
||||
from .index.csv import links_to_csv
|
||||
print()
|
||||
print('---------------------------------------------------------------------------------------------------')
|
||||
print(links_to_csv(links, cols=['timestamp', 'is_archived', 'num_outputs', 'url'], header=True, ljust=16, separator=' | '))
|
||||
|
|
@ -338,7 +382,7 @@ def log_list_finished(links):
|
|||
print()
|
||||
|
||||
|
||||
def log_removal_started(links: List[Link], yes: bool, delete: bool):
|
||||
def log_removal_started(links: List["Link"], yes: bool, delete: bool):
|
||||
print('{lightyellow}[i] Found {} matching URLs to remove.{reset}'.format(len(links), **ANSI))
|
||||
if delete:
|
||||
file_counts = [link.num_outputs for link in links if os.path.exists(link.link_dir)]
|
||||
|
|
@ -348,8 +392,8 @@ def log_removal_started(links: List[Link], yes: bool, delete: bool):
|
|||
)
|
||||
else:
|
||||
print(
|
||||
f' Matching links will be de-listed from the main index, but their archived content folders will remain in place on disk.\n'
|
||||
f' (Pass --delete if you also want to permanently delete the data folders)'
|
||||
' Matching links will be de-listed from the main index, but their archived content folders will remain in place on disk.\n'
|
||||
' (Pass --delete if you also want to permanently delete the data folders)'
|
||||
)
|
||||
|
||||
if not yes:
|
||||
|
|
@ -376,7 +420,7 @@ def log_removal_finished(all_links: int, to_keep: int):
|
|||
|
||||
|
||||
def log_shell_welcome_msg():
|
||||
from . import list_subcommands
|
||||
from .cli import list_subcommands
|
||||
|
||||
print('{green}# ArchiveBox Imports{reset}'.format(**ANSI))
|
||||
print('{green}from archivebox.core.models import Snapshot, User{reset}'.format(**ANSI))
|
||||
|
|
@ -412,13 +456,15 @@ def printable_filesize(num_bytes: Union[int, float]) -> str:
|
|||
|
||||
|
||||
@enforce_types
|
||||
def printable_folders(folders: Dict[str, Optional[Link]],
|
||||
def printable_folders(folders: Dict[str, Optional["Link"]],
|
||||
json: bool=False,
|
||||
csv: Optional[str]=None) -> str:
|
||||
if json:
|
||||
from .index.json import to_json
|
||||
return to_json(folders.values(), indent=4, sort_keys=True)
|
||||
|
||||
elif csv:
|
||||
from .index.csv import links_to_csv
|
||||
return links_to_csv(folders.values(), cols=csv.split(','), header=True)
|
||||
|
||||
return '\n'.join(f'{folder} {link}' for folder, link in folders.items())
|
||||
|
|
@ -472,6 +518,7 @@ def printable_folder_status(name: str, folder: Dict) -> str:
|
|||
|
||||
@enforce_types
|
||||
def printable_dependency_version(name: str, dependency: Dict) -> str:
|
||||
version = None
|
||||
if dependency['enabled']:
|
||||
if dependency['is_valid']:
|
||||
color, symbol, note, version = 'green', '√', 'valid', ''
|
||||
|
|
@ -4,8 +4,7 @@ import os
|
|||
import sys
|
||||
import shutil
|
||||
|
||||
from typing import Dict, List, Optional, Iterable, IO
|
||||
|
||||
from typing import Dict, List, Optional, Iterable, IO, Union
|
||||
from crontab import CronTab, CronSlices
|
||||
|
||||
from .cli import (
|
||||
|
|
@ -17,16 +16,17 @@ from .cli import (
|
|||
archive_cmds,
|
||||
)
|
||||
from .parsers import (
|
||||
save_stdin_to_sources,
|
||||
save_file_to_sources,
|
||||
save_text_as_source,
|
||||
save_file_as_source,
|
||||
parse_links_memory,
|
||||
)
|
||||
from .index.schema import Link
|
||||
from .util import enforce_types, docstring
|
||||
from .util import enforce_types # type: ignore
|
||||
from .system import get_dir_size, dedupe_cron_jobs, CRON_COMMENT
|
||||
from .index import (
|
||||
links_after_timestamp,
|
||||
load_main_index,
|
||||
import_new_links,
|
||||
parse_links_from_source,
|
||||
dedupe_links,
|
||||
write_main_index,
|
||||
link_matches_filter,
|
||||
get_indexed_folders,
|
||||
|
|
@ -49,14 +49,16 @@ from .index.sql import (
|
|||
parse_sql_main_index,
|
||||
get_admins,
|
||||
apply_migrations,
|
||||
remove_from_sql_main_index,
|
||||
)
|
||||
from .index.html import parse_html_main_index
|
||||
from .extractors import archive_link
|
||||
from .extractors import archive_links, archive_link, ignore_methods
|
||||
from .config import (
|
||||
stderr,
|
||||
ConfigDict,
|
||||
ANSI,
|
||||
IS_TTY,
|
||||
IN_DOCKER,
|
||||
USER,
|
||||
ARCHIVEBOX_BINARY,
|
||||
ONLY_NEW,
|
||||
|
|
@ -88,11 +90,11 @@ from .config import (
|
|||
USER_CONFIG,
|
||||
get_real_name,
|
||||
)
|
||||
from .cli.logging import (
|
||||
from .logging_util import (
|
||||
TERM_WIDTH,
|
||||
TimedProgress,
|
||||
log_archiving_started,
|
||||
log_archiving_paused,
|
||||
log_archiving_finished,
|
||||
log_importing_started,
|
||||
log_crawl_started,
|
||||
log_removal_started,
|
||||
log_removal_finished,
|
||||
log_list_started,
|
||||
|
|
@ -161,7 +163,7 @@ def help(out_dir: str=OUTPUT_DIR) -> None:
|
|||
{lightred}Example Use:{reset}
|
||||
mkdir my-archive; cd my-archive/
|
||||
archivebox init
|
||||
archivebox info
|
||||
archivebox status
|
||||
|
||||
archivebox add https://example.com/some/page
|
||||
archivebox add --depth=1 ~/Downloads/bookmarks_export.html
|
||||
|
|
@ -177,6 +179,10 @@ def help(out_dir: str=OUTPUT_DIR) -> None:
|
|||
else:
|
||||
print('{green}Welcome to ArchiveBox v{}!{reset}'.format(VERSION, **ANSI))
|
||||
print()
|
||||
if IN_DOCKER:
|
||||
print('When using Docker, you need to mount a volume to use as your data dir:')
|
||||
print(' docker run -v /some/path:/data archivebox ...')
|
||||
print()
|
||||
print('To import an existing archive (from a previous version of ArchiveBox):')
|
||||
print(' 1. cd into your data dir OUTPUT_DIR (usually ArchiveBox/output) and run:')
|
||||
print(' 2. archivebox init')
|
||||
|
|
@ -241,7 +247,6 @@ def run(subcommand: str,
|
|||
def init(force: bool=False, out_dir: str=OUTPUT_DIR) -> None:
|
||||
"""Initialize a new ArchiveBox collection in the current directory"""
|
||||
os.makedirs(out_dir, exist_ok=True)
|
||||
|
||||
is_empty = not len(set(os.listdir(out_dir)) - ALLOWED_IN_OUTPUT_DIR)
|
||||
existing_index = os.path.exists(os.path.join(out_dir, JSON_INDEX_FILENAME))
|
||||
|
||||
|
|
@ -291,15 +296,14 @@ def init(force: bool=False, out_dir: str=OUTPUT_DIR) -> None:
|
|||
print('\n{green}[+] Building main SQL index and running migrations...{reset}'.format(**ANSI))
|
||||
|
||||
setup_django(out_dir, check_db=False)
|
||||
from django.conf import settings
|
||||
assert settings.DATABASE_FILE == os.path.join(out_dir, SQL_INDEX_FILENAME)
|
||||
print(f' √ {settings.DATABASE_FILE}')
|
||||
DATABASE_FILE = os.path.join(out_dir, SQL_INDEX_FILENAME)
|
||||
print(f' √ {DATABASE_FILE}')
|
||||
print()
|
||||
for migration_line in apply_migrations(out_dir):
|
||||
print(f' {migration_line}')
|
||||
|
||||
|
||||
assert os.path.exists(settings.DATABASE_FILE)
|
||||
assert os.path.exists(DATABASE_FILE)
|
||||
|
||||
# from django.contrib.auth.models import User
|
||||
# if IS_TTY and not User.objects.filter(is_superuser=True).exists():
|
||||
|
|
@ -364,7 +368,7 @@ def init(force: bool=False, out_dir: str=OUTPUT_DIR) -> None:
|
|||
print(' X ' + '\n X '.join(f'{folder} {link}' for folder, link in invalid_folders.items()))
|
||||
print()
|
||||
print(' {lightred}Hint:{reset} For more information about the link data directories that were skipped, run:'.format(**ANSI))
|
||||
print(' archivebox info')
|
||||
print(' archivebox status')
|
||||
print(' archivebox list --status=invalid')
|
||||
|
||||
|
||||
|
|
@ -376,27 +380,31 @@ def init(force: bool=False, out_dir: str=OUTPUT_DIR) -> None:
|
|||
else:
|
||||
print('{green}[√] Done. A new ArchiveBox collection was initialized ({} links).{reset}'.format(len(all_links), **ANSI))
|
||||
print()
|
||||
print(' To view your archive index, open:')
|
||||
print(' {}'.format(os.path.join(out_dir, HTML_INDEX_FILENAME)))
|
||||
print(' {lightred}Hint:{reset} To view your archive index, run:'.format(**ANSI))
|
||||
print(' archivebox server # then visit http://127.0.0.1:8000')
|
||||
print()
|
||||
print(' To add new links, you can run:')
|
||||
print(" archivebox add 'https://example.com'")
|
||||
print(" archivebox add ~/some/path/or/url/to/list_of_links.txt")
|
||||
print()
|
||||
print(' For more usage and examples, run:')
|
||||
print(' archivebox help')
|
||||
|
||||
|
||||
@enforce_types
|
||||
def info(out_dir: str=OUTPUT_DIR) -> None:
|
||||
def status(out_dir: str=OUTPUT_DIR) -> None:
|
||||
"""Print out some info and statistics about the archive collection"""
|
||||
|
||||
check_data_folder(out_dir=out_dir)
|
||||
|
||||
print('{green}[*] Scanning archive collection main index...{reset}'.format(**ANSI))
|
||||
print(f' {out_dir}/*')
|
||||
from core.models import Snapshot
|
||||
from django.contrib.auth import get_user_model
|
||||
User = get_user_model()
|
||||
|
||||
print('{green}[*] Scanning archive main index...{reset}'.format(**ANSI))
|
||||
print(ANSI['lightyellow'], f' {out_dir}/*', ANSI['reset'])
|
||||
num_bytes, num_dirs, num_files = get_dir_size(out_dir, recursive=False, pattern='index.')
|
||||
size = printable_filesize(num_bytes)
|
||||
print(f' Size: {size} across {num_files} files')
|
||||
print(f' Index size: {size} across {num_files} files')
|
||||
print()
|
||||
|
||||
links = list(load_main_index(out_dir=out_dir))
|
||||
|
|
@ -404,33 +412,23 @@ def info(out_dir: str=OUTPUT_DIR) -> None:
|
|||
num_sql_links = sum(1 for link in parse_sql_main_index(out_dir=out_dir))
|
||||
num_html_links = sum(1 for url in parse_html_main_index(out_dir=out_dir))
|
||||
num_link_details = sum(1 for link in parse_json_links_details(out_dir=out_dir))
|
||||
users = get_admins().values_list('username', flat=True)
|
||||
print(f' > JSON Main Index: {num_json_links} links'.ljust(36), f'(found in {JSON_INDEX_FILENAME})')
|
||||
print(f' > SQL Main Index: {num_sql_links} links'.ljust(36), f'(found in {SQL_INDEX_FILENAME})')
|
||||
print(f' > HTML Main Index: {num_html_links} links'.ljust(36), f'(found in {HTML_INDEX_FILENAME})')
|
||||
print(f' > JSON Link Details: {num_link_details} links'.ljust(36), f'(found in {ARCHIVE_DIR_NAME}/*/index.json)')
|
||||
|
||||
print(f' > Admin: {len(users)} users {", ".join(users)}'.ljust(36), f'(found in {SQL_INDEX_FILENAME})')
|
||||
|
||||
if num_html_links != len(links) or num_sql_links != len(links):
|
||||
print()
|
||||
print(' {lightred}Hint:{reset} You can fix index count differences automatically by running:'.format(**ANSI))
|
||||
print(' archivebox init')
|
||||
|
||||
if not users:
|
||||
print()
|
||||
print(' {lightred}Hint:{reset} You can create an admin user by running:'.format(**ANSI))
|
||||
print(' archivebox manage createsuperuser')
|
||||
|
||||
print()
|
||||
print('{green}[*] Scanning archive collection link data directories...{reset}'.format(**ANSI))
|
||||
print(f' {ARCHIVE_DIR}/*')
|
||||
|
||||
print('{green}[*] Scanning archive data directories...{reset}'.format(**ANSI))
|
||||
print(ANSI['lightyellow'], f' {ARCHIVE_DIR}/*', ANSI['reset'])
|
||||
num_bytes, num_dirs, num_files = get_dir_size(ARCHIVE_DIR)
|
||||
size = printable_filesize(num_bytes)
|
||||
print(f' Size: {size} across {num_files} files in {num_dirs} directories')
|
||||
print()
|
||||
|
||||
print(ANSI['black'])
|
||||
num_indexed = len(get_indexed_folders(links, out_dir=out_dir))
|
||||
num_archived = len(get_archived_folders(links, out_dir=out_dir))
|
||||
num_unarchived = len(get_unarchived_folders(links, out_dir=out_dir))
|
||||
|
|
@ -454,91 +452,125 @@ def info(out_dir: str=OUTPUT_DIR) -> None:
|
|||
print(f' > orphaned: {len(orphaned)}'.ljust(36), f'({get_orphaned_folders.__doc__})')
|
||||
print(f' > corrupted: {len(corrupted)}'.ljust(36), f'({get_corrupted_folders.__doc__})')
|
||||
print(f' > unrecognized: {len(unrecognized)}'.ljust(36), f'({get_unrecognized_folders.__doc__})')
|
||||
|
||||
|
||||
print(ANSI['reset'])
|
||||
|
||||
if num_indexed:
|
||||
print()
|
||||
print(' {lightred}Hint:{reset} You can list link data directories by status like so:'.format(**ANSI))
|
||||
print(' archivebox list --status=<status> (e.g. indexed, corrupted, archived, etc.)')
|
||||
|
||||
if orphaned:
|
||||
print()
|
||||
print(' {lightred}Hint:{reset} To automatically import orphaned data directories into the main index, run:'.format(**ANSI))
|
||||
print(' archivebox init')
|
||||
|
||||
if num_invalid:
|
||||
print()
|
||||
print(' {lightred}Hint:{reset} You may need to manually remove or fix some invalid data directories, afterwards make sure to run:'.format(**ANSI))
|
||||
print(' archivebox init')
|
||||
|
||||
print()
|
||||
print('{green}[*] Scanning recent archive changes and user logins:{reset}'.format(**ANSI))
|
||||
print(ANSI['lightyellow'], f' {LOGS_DIR}/*', ANSI['reset'])
|
||||
users = get_admins().values_list('username', flat=True)
|
||||
print(f' UI users {len(users)}: {", ".join(users)}')
|
||||
last_login = User.objects.order_by('last_login').last()
|
||||
if last_login:
|
||||
print(f' Last UI login: {last_login.username} @ {str(last_login.last_login)[:16]}')
|
||||
last_updated = Snapshot.objects.order_by('updated').last()
|
||||
print(f' Last changes: {str(last_updated.updated)[:16]}')
|
||||
|
||||
if not users:
|
||||
print()
|
||||
print(' {lightred}Hint:{reset} You can create an admin user by running:'.format(**ANSI))
|
||||
print(' archivebox manage createsuperuser')
|
||||
|
||||
print()
|
||||
for snapshot in Snapshot.objects.order_by('-updated')[:10]:
|
||||
if not snapshot.updated:
|
||||
continue
|
||||
print(
|
||||
ANSI['black'],
|
||||
(
|
||||
f' > {str(snapshot.updated)[:16]} '
|
||||
f'[{snapshot.num_outputs} {("X", "√")[snapshot.is_archived]} {printable_filesize(snapshot.archive_size)}] '
|
||||
f'"{snapshot.title}": {snapshot.url}'
|
||||
)[:TERM_WIDTH()],
|
||||
ANSI['reset'],
|
||||
)
|
||||
print(ANSI['black'], ' ...', ANSI['reset'])
|
||||
|
||||
|
||||
@enforce_types
|
||||
def add(import_str: Optional[str]=None,
|
||||
import_path: Optional[str]=None,
|
||||
def oneshot(url: str, out_dir: str=OUTPUT_DIR):
|
||||
"""
|
||||
Create a single URL archive folder with an index.json and index.html, and all the archive method outputs.
|
||||
You can run this to archive single pages without needing to create a whole collection with archivebox init.
|
||||
"""
|
||||
oneshot_link, _ = parse_links_memory([url])
|
||||
if len(oneshot_link) > 1:
|
||||
stderr(
|
||||
'[X] You should pass a single url to the oneshot command',
|
||||
color='red'
|
||||
)
|
||||
raise SystemExit(2)
|
||||
methods = ignore_methods(['title'])
|
||||
archive_link(oneshot_link[0], out_dir=out_dir, methods=methods, skip_index=True)
|
||||
return oneshot_link
|
||||
|
||||
@enforce_types
|
||||
def add(urls: Union[str, List[str]],
|
||||
depth: int=0,
|
||||
update_all: bool=not ONLY_NEW,
|
||||
index_only: bool=False,
|
||||
out_dir: str=OUTPUT_DIR) -> List[Link]:
|
||||
"""Add a new URL or list of URLs to your archive"""
|
||||
|
||||
assert depth in (0, 1), 'Depth must be 0 or 1 (depth >1 is not supported yet)'
|
||||
|
||||
# Load list of links from the existing index
|
||||
check_data_folder(out_dir=out_dir)
|
||||
|
||||
if import_str and import_path:
|
||||
stderr(
|
||||
'[X] You should pass either an import path as an argument, '
|
||||
'or pass a list of links via stdin, but not both.\n',
|
||||
color='red',
|
||||
)
|
||||
raise SystemExit(2)
|
||||
elif import_str:
|
||||
import_path = save_stdin_to_sources(import_str, out_dir=out_dir)
|
||||
else:
|
||||
import_path = save_file_to_sources(import_path, out_dir=out_dir)
|
||||
|
||||
check_dependencies()
|
||||
|
||||
# Step 1: Load list of links from the existing index
|
||||
# merge in and dedupe new links from import_path
|
||||
all_links: List[Link] = []
|
||||
new_links: List[Link] = []
|
||||
all_links = load_main_index(out_dir=out_dir)
|
||||
if import_path:
|
||||
all_links, new_links = import_new_links(all_links, import_path, out_dir=out_dir)
|
||||
|
||||
# Step 2: Write updated index with deduped old and new links back to disk
|
||||
write_main_index(links=all_links, out_dir=out_dir)
|
||||
log_importing_started(urls=urls, depth=depth, index_only=index_only)
|
||||
if isinstance(urls, str):
|
||||
# save verbatim stdin to sources
|
||||
write_ahead_log = save_text_as_source(urls, filename='{ts}-import.txt', out_dir=out_dir)
|
||||
elif isinstance(urls, list):
|
||||
# save verbatim args to sources
|
||||
write_ahead_log = save_text_as_source('\n'.join(urls), filename='{ts}-import.txt', out_dir=out_dir)
|
||||
|
||||
new_links += parse_links_from_source(write_ahead_log)
|
||||
|
||||
# If we're going one level deeper, download each link and look for more links
|
||||
new_links_depth = []
|
||||
if new_links and depth == 1:
|
||||
log_crawl_started(new_links)
|
||||
for new_link in new_links:
|
||||
downloaded_file = save_file_as_source(new_link.url, filename='{ts}-crawl-{basename}.txt', out_dir=out_dir)
|
||||
new_links_depth += parse_links_from_source(downloaded_file)
|
||||
all_links, new_links = dedupe_links(all_links, new_links + new_links_depth)
|
||||
write_main_index(links=all_links, out_dir=out_dir, finished=not new_links)
|
||||
|
||||
if index_only:
|
||||
return all_links
|
||||
|
||||
# Step 3: Run the archive methods for each link
|
||||
links = all_links if update_all else new_links
|
||||
log_archiving_started(len(links))
|
||||
idx: int = 0
|
||||
link: Link = None # type: ignore
|
||||
try:
|
||||
for idx, link in enumerate(links):
|
||||
archive_link(link, out_dir=link.link_dir)
|
||||
|
||||
except KeyboardInterrupt:
|
||||
log_archiving_paused(len(links), idx, link.timestamp if link else '0')
|
||||
raise SystemExit(0)
|
||||
|
||||
except:
|
||||
print()
|
||||
raise
|
||||
|
||||
log_archiving_finished(len(links))
|
||||
# Run the archive methods for each link
|
||||
to_archive = all_links if update_all else new_links
|
||||
archive_links(to_archive, out_dir=out_dir)
|
||||
|
||||
# Step 4: Re-write links index with updated titles, icons, and resources
|
||||
all_links = load_main_index(out_dir=out_dir)
|
||||
write_main_index(links=list(all_links), out_dir=out_dir, finished=True)
|
||||
if to_archive:
|
||||
all_links = load_main_index(out_dir=out_dir)
|
||||
write_main_index(links=list(all_links), out_dir=out_dir, finished=True)
|
||||
return all_links
|
||||
|
||||
@enforce_types
|
||||
def remove(filter_str: Optional[str]=None,
|
||||
filter_patterns: Optional[List[str]]=None,
|
||||
filter_type: str='exact',
|
||||
links: Optional[List[Link]]=None,
|
||||
after: Optional[float]=None,
|
||||
before: Optional[float]=None,
|
||||
yes: bool=False,
|
||||
|
|
@ -548,38 +580,40 @@ def remove(filter_str: Optional[str]=None,
|
|||
|
||||
check_data_folder(out_dir=out_dir)
|
||||
|
||||
if filter_str and filter_patterns:
|
||||
stderr(
|
||||
'[X] You should pass either a pattern as an argument, '
|
||||
'or pass a list of patterns via stdin, but not both.\n',
|
||||
color='red',
|
||||
)
|
||||
raise SystemExit(2)
|
||||
elif not (filter_str or filter_patterns):
|
||||
stderr(
|
||||
'[X] You should pass either a pattern as an argument, '
|
||||
'or pass a list of patterns via stdin.',
|
||||
color='red',
|
||||
)
|
||||
stderr()
|
||||
stderr(' {lightred}Hint:{reset} To remove all urls you can run:'.format(**ANSI))
|
||||
stderr(" archivebox remove --filter-type=regex '.*'")
|
||||
stderr()
|
||||
raise SystemExit(2)
|
||||
elif filter_str:
|
||||
filter_patterns = [ptn.strip() for ptn in filter_str.split('\n')]
|
||||
if links is None:
|
||||
if filter_str and filter_patterns:
|
||||
stderr(
|
||||
'[X] You should pass either a pattern as an argument, '
|
||||
'or pass a list of patterns via stdin, but not both.\n',
|
||||
color='red',
|
||||
)
|
||||
raise SystemExit(2)
|
||||
elif not (filter_str or filter_patterns):
|
||||
stderr(
|
||||
'[X] You should pass either a pattern as an argument, '
|
||||
'or pass a list of patterns via stdin.',
|
||||
color='red',
|
||||
)
|
||||
stderr()
|
||||
stderr(' {lightred}Hint:{reset} To remove all urls you can run:'.format(**ANSI))
|
||||
stderr(" archivebox remove --filter-type=regex '.*'")
|
||||
stderr()
|
||||
raise SystemExit(2)
|
||||
elif filter_str:
|
||||
filter_patterns = [ptn.strip() for ptn in filter_str.split('\n')]
|
||||
|
||||
log_list_started(filter_patterns, filter_type)
|
||||
timer = TimedProgress(360, prefix=' ')
|
||||
try:
|
||||
links = list(list_links(
|
||||
filter_patterns=filter_patterns,
|
||||
filter_type=filter_type,
|
||||
after=after,
|
||||
before=before,
|
||||
))
|
||||
finally:
|
||||
timer.end()
|
||||
|
||||
log_list_started(filter_patterns, filter_type)
|
||||
timer = TimedProgress(360, prefix=' ')
|
||||
try:
|
||||
links = list(list_links(
|
||||
filter_patterns=filter_patterns,
|
||||
filter_type=filter_type,
|
||||
after=after,
|
||||
before=before,
|
||||
))
|
||||
finally:
|
||||
timer.end()
|
||||
|
||||
if not len(links):
|
||||
log_removal_finished(0, 0)
|
||||
|
|
@ -592,20 +626,26 @@ def remove(filter_str: Optional[str]=None,
|
|||
timer = TimedProgress(360, prefix=' ')
|
||||
try:
|
||||
to_keep = []
|
||||
to_delete = []
|
||||
all_links = load_main_index(out_dir=out_dir)
|
||||
for link in all_links:
|
||||
should_remove = (
|
||||
(after is not None and float(link.timestamp) < after)
|
||||
or (before is not None and float(link.timestamp) > before)
|
||||
or link_matches_filter(link, filter_patterns, filter_type)
|
||||
or link_matches_filter(link, filter_patterns or [], filter_type)
|
||||
or link in links
|
||||
)
|
||||
if not should_remove:
|
||||
if should_remove:
|
||||
to_delete.append(link)
|
||||
|
||||
if delete:
|
||||
shutil.rmtree(link.link_dir, ignore_errors=True)
|
||||
else:
|
||||
to_keep.append(link)
|
||||
elif should_remove and delete:
|
||||
shutil.rmtree(link.link_dir, ignore_errors=True)
|
||||
finally:
|
||||
timer.end()
|
||||
|
||||
remove_from_sql_main_index(links=to_delete, out_dir=out_dir)
|
||||
write_main_index(links=to_keep, out_dir=out_dir, finished=True)
|
||||
log_removal_finished(len(all_links), len(to_keep))
|
||||
|
||||
|
|
@ -625,8 +665,8 @@ def update(resume: Optional[float]=None,
|
|||
out_dir: str=OUTPUT_DIR) -> List[Link]:
|
||||
"""Import any new links from subscriptions and retry any previously failed/skipped links"""
|
||||
|
||||
check_dependencies()
|
||||
check_data_folder(out_dir=out_dir)
|
||||
check_dependencies()
|
||||
|
||||
# Step 1: Load list of links from the existing index
|
||||
# merge in and dedupe new links from import_path
|
||||
|
|
@ -655,23 +695,8 @@ def update(resume: Optional[float]=None,
|
|||
return all_links
|
||||
|
||||
# Step 3: Run the archive methods for each link
|
||||
links = new_links if only_new else all_links
|
||||
log_archiving_started(len(links), resume)
|
||||
idx: int = 0
|
||||
link: Link = None # type: ignore
|
||||
try:
|
||||
for idx, link in enumerate(links_after_timestamp(links, resume)):
|
||||
archive_link(link, overwrite=overwrite, out_dir=link.link_dir)
|
||||
|
||||
except KeyboardInterrupt:
|
||||
log_archiving_paused(len(links), idx, link.timestamp if link else '0')
|
||||
raise SystemExit(0)
|
||||
|
||||
except:
|
||||
print()
|
||||
raise
|
||||
|
||||
log_archiving_finished(len(links))
|
||||
to_archive = new_links if only_new else all_links
|
||||
archive_links(to_archive, overwrite=overwrite, out_dir=out_dir)
|
||||
|
||||
# Step 4: Re-write links index with updated titles, icons, and resources
|
||||
all_links = load_main_index(out_dir=out_dir)
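Note that both add() and update() now hand the per-link loop to archive_links() instead of inlining it. Its body is not shown in this diff; the sketch below simply mirrors the loop being removed above, so the pause-on-KeyboardInterrupt behavior is assumed to carry over unchanged.

```python
# Hedged sketch of what archive_links() plausibly wraps (mirrors the removed
# loop above; the import locations are assumptions, not confirmed by this diff).
from archivebox.extractors import archive_link
from archivebox.logging_util import (
    log_archiving_started, log_archiving_paused, log_archiving_finished,
)

def archive_links_sketch(links, overwrite: bool=False):
    log_archiving_started(len(links))
    idx, link = 0, None
    try:
        for idx, link in enumerate(links):
            archive_link(link, overwrite=overwrite, out_dir=link.link_dir)
    except KeyboardInterrupt:
        log_archiving_paused(len(links), idx, link.timestamp if link else '0')
        raise SystemExit(0)
    log_archiving_finished(len(links))
    return links
```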
|
||||
|
|
@ -860,7 +885,7 @@ def config(config_options_str: Optional[str]=None,
|
|||
print(' {}'.format(printable_config(side_effect_changes, prefix=' ')))
|
||||
if failed_options:
|
||||
stderr()
|
||||
stderr('[X] These options failed to set:', color='red')
|
||||
stderr('[X] These options failed to set (check for typos):', color='red')
|
||||
stderr(' {}'.format('\n '.join(failed_options)))
|
||||
raise SystemExit(bool(failed_options))
|
||||
elif reset:
|
||||
|
|
@ -974,7 +999,7 @@ def schedule(add: bool=False,
|
|||
if total_runs > 60 and not quiet:
|
||||
stderr()
|
||||
stderr('{lightyellow}[!] With the current cron config, ArchiveBox is estimated to run >{} times per year.{reset}'.format(total_runs, **ANSI))
|
||||
stderr(f' Congrats on being an enthusiastic internet archiver! 👌')
|
||||
stderr(' Congrats on being an enthusiastic internet archiver! 👌')
|
||||
stderr()
|
||||
stderr(' Make sure you have enough storage space available to hold all the data.')
|
||||
stderr(' Using a compressed/deduped filesystem like ZFS is recommended if you plan on archiving a lot.')
|
||||
|
|
@ -985,32 +1010,50 @@ def schedule(add: bool=False,
|
|||
def server(runserver_args: Optional[List[str]]=None,
|
||||
reload: bool=False,
|
||||
debug: bool=False,
|
||||
init: bool=False,
|
||||
out_dir: str=OUTPUT_DIR) -> None:
|
||||
"""Run the ArchiveBox HTTP server"""
|
||||
|
||||
runserver_args = runserver_args or []
|
||||
|
||||
if init:
|
||||
run_subcommand('init', stdin=None, pwd=out_dir)
|
||||
|
||||
# setup config for django runserver
|
||||
from . import config
|
||||
config.SHOW_PROGRESS = False
|
||||
config.DEBUG = config.DEBUG or debug
|
||||
|
||||
check_data_folder(out_dir=out_dir)
|
||||
|
||||
if debug:
|
||||
os.environ['DEBUG'] = 'True'
|
||||
else:
|
||||
runserver_args.append('--insecure')
|
||||
|
||||
setup_django(out_dir)
|
||||
|
||||
from django.core.management import call_command
|
||||
from django.contrib.auth.models import User
|
||||
|
||||
if IS_TTY and not User.objects.filter(is_superuser=True).exists():
|
||||
admin_user = User.objects.filter(is_superuser=True).order_by('date_joined').only('username').last()
|
||||
|
||||
print('{green}[+] Starting ArchiveBox webserver...{reset}'.format(**ANSI))
|
||||
if admin_user:
|
||||
print("{lightred}[i] The admin username is:{lightblue} {}{reset}".format(admin_user.username, **ANSI))
|
||||
else:
|
||||
print('{lightyellow}[!] No admin users exist yet, you will not be able to edit links in the UI.{reset}'.format(**ANSI))
|
||||
print()
|
||||
print(' To create an admin user, run:')
|
||||
print(' archivebox manage createsuperuser')
|
||||
print()
|
||||
|
||||
print('{green}[+] Starting ArchiveBox webserver...{reset}'.format(**ANSI))
|
||||
# fallback to serving staticfiles insecurely with django when DEBUG=False
|
||||
if not config.DEBUG:
|
||||
runserver_args.append('--insecure') # TODO: serve statics w/ nginx instead
|
||||
|
||||
# toggle autoreloading when archivebox code changes (it's on by default)
|
||||
if not reload:
|
||||
runserver_args.append('--noreload')
|
||||
|
||||
config.SHOW_PROGRESS = False
|
||||
config.DEBUG = config.DEBUG or debug
|
||||
|
||||
|
||||
call_command("runserver", *runserver_args)
|
||||
|
||||
|
||||
|
|
@ -1019,10 +1062,14 @@ def manage(args: Optional[List[str]]=None, out_dir: str=OUTPUT_DIR) -> None:
|
|||
"""Run an ArchiveBox Django management command"""
|
||||
|
||||
check_data_folder(out_dir=out_dir)
|
||||
|
||||
setup_django(out_dir)
|
||||
from django.core.management import execute_from_command_line
|
||||
|
||||
if (args and "createsuperuser" in args) and (IN_DOCKER and not IS_TTY):
|
||||
stderr('[!] Warning: you need to pass -it to use interactive commands in docker', color='lightyellow')
|
||||
stderr(' docker run -it archivebox manage {}'.format(' '.join(args or ['...'])), color='lightyellow')
|
||||
stderr()
|
||||
|
||||
execute_from_command_line([f'{ARCHIVEBOX_BINARY} manage', *(args or ['help'])])
|
||||
|
||||
|
||||
|
|
@ -1035,3 +1082,4 @@ def shell(out_dir: str=OUTPUT_DIR) -> None:
|
|||
setup_django(OUTPUT_DIR)
|
||||
from django.core.management import call_command
|
||||
call_command("shell_plus")
|
||||
|
||||
|
|
|
|||
|
|
@ -3,6 +3,21 @@ import os
|
|||
import sys
|
||||
|
||||
if __name__ == '__main__':
|
||||
# if you're a developer working on archivebox, still prefer the archivebox
|
||||
# versions of ./manage.py commands whenever possible. When that's not possible
|
||||
# (e.g. makemigrations), you can comment out this check temporarily
|
||||
|
||||
if not ('makemigrations' in sys.argv or 'migrate' in sys.argv):
|
||||
print("[X] Don't run ./manage.py directly, use the archivebox CLI instead e.g.:")
|
||||
print(' archivebox manage createsuperuser')
|
||||
print()
|
||||
print(' Hint: Use these archivebox commands instead of the ./manage.py equivalents:')
|
||||
print(' archivebox init (migrates the database to the latest version)')
|
||||
print(' archivebox server (runs the Django web server)')
|
||||
print(' archivebox shell (opens an iPython Django shell with all models imported)')
|
||||
print(' archivebox manage [cmd] (any other management commands)')
|
||||
raise SystemExit(2)
|
||||
|
||||
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'core.settings')
|
||||
try:
|
||||
from django.core.management import execute_from_command_line
|
||||
|
|
|
|||
|
|
@ -9,27 +9,26 @@ __package__ = 'archivebox.parsers'
|
|||
|
||||
import re
|
||||
import os
|
||||
from io import StringIO
|
||||
|
||||
from typing import Tuple, List
|
||||
from typing import IO, Tuple, List
|
||||
from datetime import datetime
|
||||
|
||||
from ..index.schema import Link
|
||||
from ..system import atomic_write
|
||||
from ..config import (
|
||||
ANSI,
|
||||
OUTPUT_DIR,
|
||||
SOURCES_DIR_NAME,
|
||||
TIMEOUT,
|
||||
check_data_folder,
|
||||
)
|
||||
from ..util import (
|
||||
basename,
|
||||
domain,
|
||||
download_url,
|
||||
enforce_types,
|
||||
URL_REGEX,
|
||||
)
|
||||
from ..cli.logging import pretty_path, TimedProgress
|
||||
from ..index.schema import Link
|
||||
from ..logging_util import TimedProgress, log_source_saved
|
||||
from .pocket_html import parse_pocket_html_export
|
||||
from .pinboard_rss import parse_pinboard_rss_export
|
||||
from .shaarli_rss import parse_shaarli_rss_export
|
||||
|
|
@ -39,15 +38,7 @@ from .generic_rss import parse_generic_rss_export
|
|||
from .generic_json import parse_generic_json_export
|
||||
from .generic_txt import parse_generic_txt_export
|
||||
|
||||
|
||||
@enforce_types
|
||||
def parse_links(source_file: str) -> Tuple[List[Link], str]:
|
||||
"""parse a list of URLs with their metadata from an
|
||||
RSS feed, bookmarks export, or text file
|
||||
"""
|
||||
|
||||
check_url_parsing_invariants()
|
||||
PARSERS = (
|
||||
PARSERS = (
|
||||
# Specialized parsers
|
||||
('Pocket HTML', parse_pocket_html_export),
|
||||
('Pinboard RSS', parse_pinboard_rss_export),
|
||||
|
|
@ -62,57 +53,79 @@ def parse_links(source_file: str) -> Tuple[List[Link], str]:
|
|||
# Fallback parser
|
||||
('Plain Text', parse_generic_txt_export),
|
||||
)
|
||||
|
||||
@enforce_types
|
||||
def parse_links_memory(urls: List[str]):
|
||||
"""
|
||||
parse a list of URLs without touching the filesystem
|
||||
"""
|
||||
check_url_parsing_invariants()
|
||||
|
||||
timer = TimedProgress(TIMEOUT * 4)
|
||||
#urls = list(map(lambda x: x + "\n", urls))
|
||||
file = StringIO()
|
||||
file.writelines(urls)
|
||||
file.name = "io_string"
|
||||
output = _parse(file, timer)
|
||||
|
||||
if output is not None:
|
||||
return output
|
||||
|
||||
timer.end()
|
||||
return [], 'Failed to parse'
|
||||
|
||||
|
||||
@enforce_types
|
||||
def parse_links(source_file: str) -> Tuple[List[Link], str]:
|
||||
"""parse a list of URLs with their metadata from an
|
||||
RSS feed, bookmarks export, or text file
|
||||
"""
|
||||
|
||||
check_url_parsing_invariants()
|
||||
|
||||
timer = TimedProgress(TIMEOUT * 4)
|
||||
with open(source_file, 'r', encoding='utf-8') as file:
|
||||
for parser_name, parser_func in PARSERS:
|
||||
try:
|
||||
links = list(parser_func(file))
|
||||
if links:
|
||||
timer.end()
|
||||
return links, parser_name
|
||||
except Exception as err: # noqa
|
||||
pass
|
||||
# Parsers are tried one by one down the list, and the first one
|
||||
# that succeeds is used. To see why a certain parser was not used
|
||||
# due to error or format incompatibility, uncomment this line:
|
||||
# print('[!] Parser {} failed: {} {}'.format(parser_name, err.__class__.__name__, err))
|
||||
# raise
|
||||
output = _parse(file, timer)
|
||||
|
||||
if output is not None:
|
||||
return output
|
||||
|
||||
timer.end()
|
||||
return [], 'Failed to parse'
|
||||
|
||||
def _parse(to_parse: IO[str], timer) -> Tuple[List[Link], str]:
|
||||
for parser_name, parser_func in PARSERS:
|
||||
try:
|
||||
links = list(parser_func(to_parse))
|
||||
if links:
|
||||
timer.end()
|
||||
return links, parser_name
|
||||
except Exception as err: # noqa
|
||||
pass
|
||||
# Parsers are tried one by one down the list, and the first one
|
||||
# that succeeds is used. To see why a certain parser was not used
|
||||
# due to error or format incompatibility, uncomment this line:
|
||||
# print('[!] Parser {} failed: {} {}'.format(parser_name, err.__class__.__name__, err))
|
||||
# raise
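The parsing entrypoints are split so the shared try-each-parser loop lives in _parse(), with parse_links_memory() feeding it a StringIO instead of a file on disk. A usage sketch, assuming the package import path implied by the __package__ declaration above:

```python
# Both entrypoints return (links, parser_name), or ([], 'Failed to parse')
# when no parser produced any links.
from archivebox.parsers import parse_links_memory

links, parser_name = parse_links_memory([
    'https://example.com/some/page\n',
    'https://example.com/feed.rss\n',
])
print(f'parsed {len(links)} links using the {parser_name} parser')
```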
|
||||
|
||||
|
||||
@enforce_types
|
||||
def save_stdin_to_sources(raw_text: str, out_dir: str=OUTPUT_DIR) -> str:
|
||||
check_data_folder(out_dir=out_dir)
|
||||
|
||||
sources_dir = os.path.join(out_dir, SOURCES_DIR_NAME)
|
||||
if not os.path.exists(sources_dir):
|
||||
os.makedirs(sources_dir)
|
||||
|
||||
def save_text_as_source(raw_text: str, filename: str='{ts}-stdin.txt', out_dir: str=OUTPUT_DIR) -> str:
|
||||
ts = str(datetime.now().timestamp()).split('.', 1)[0]
|
||||
|
||||
source_path = os.path.join(sources_dir, '{}-{}.txt'.format('stdin', ts))
|
||||
|
||||
atomic_write(raw_text, source_path)
|
||||
source_path = os.path.join(out_dir, SOURCES_DIR_NAME, filename.format(ts=ts))
|
||||
atomic_write(source_path, raw_text)
|
||||
log_source_saved(source_file=source_path)
|
||||
return source_path
|
||||
|
||||
|
||||
@enforce_types
|
||||
def save_file_to_sources(path: str, timeout: int=TIMEOUT, out_dir: str=OUTPUT_DIR) -> str:
|
||||
def save_file_as_source(path: str, timeout: int=TIMEOUT, filename: str='{ts}-{basename}.txt', out_dir: str=OUTPUT_DIR) -> str:
|
||||
"""download a given url's content into output/sources/domain-<timestamp>.txt"""
|
||||
check_data_folder(out_dir=out_dir)
|
||||
|
||||
sources_dir = os.path.join(out_dir, SOURCES_DIR_NAME)
|
||||
if not os.path.exists(sources_dir):
|
||||
os.makedirs(sources_dir)
|
||||
|
||||
ts = str(datetime.now().timestamp()).split('.', 1)[0]
|
||||
|
||||
source_path = os.path.join(sources_dir, '{}-{}.txt'.format(basename(path), ts))
|
||||
source_path = os.path.join(OUTPUT_DIR, SOURCES_DIR_NAME, filename.format(basename=basename(path), ts=ts))
|
||||
|
||||
if any(path.startswith(s) for s in ('http://', 'https://', 'ftp://')):
|
||||
source_path = os.path.join(sources_dir, '{}-{}.txt'.format(domain(path), ts))
|
||||
# Source is a URL that needs to be downloaded
|
||||
print('{}[*] [{}] Downloading {}{}'.format(
|
||||
ANSI['green'],
|
||||
datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
|
||||
|
|
@ -134,12 +147,13 @@ def save_file_to_sources(path: str, timeout: int=TIMEOUT, out_dir: str=OUTPUT_DI
|
|||
raise SystemExit(1)
|
||||
|
||||
else:
|
||||
# Source is a path to a local file on the filesystem
|
||||
with open(path, 'r') as f:
|
||||
raw_source_text = f.read()
|
||||
|
||||
atomic_write(raw_source_text, source_path)
|
||||
atomic_write(source_path, raw_source_text)
|
||||
|
||||
print(' > {}'.format(pretty_path(source_path)))
|
||||
log_source_saved(source_file=source_path)
|
||||
|
||||
return source_path
|
||||
|
||||
|
|
|
|||
|
|
@ -5,6 +5,7 @@ import re
|
|||
|
||||
from typing import IO, Iterable
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
|
||||
from ..index.schema import Link
|
||||
from ..util import (
|
||||
|
|
@ -13,14 +14,28 @@ from ..util import (
|
|||
URL_REGEX
|
||||
)
|
||||
|
||||
|
||||
@enforce_types
|
||||
def parse_generic_txt_export(text_file: IO[str]) -> Iterable[Link]:
|
||||
"""Parse raw links from each line in a text file"""
|
||||
|
||||
text_file.seek(0)
|
||||
for line in text_file.readlines():
|
||||
urls = re.findall(URL_REGEX, line) if line.strip() else ()
|
||||
for url in urls: # type: ignore
|
||||
if not line.strip():
|
||||
continue
|
||||
|
||||
# if the line is a local file path that resolves, then we can archive it
|
||||
if Path(line).exists():
|
||||
yield Link(
|
||||
url=line,
|
||||
timestamp=str(datetime.now().timestamp()),
|
||||
title=None,
|
||||
tags=None,
|
||||
sources=[text_file.name],
|
||||
)
|
||||
|
||||
# otherwise look for anything that looks like a URL in the line
|
||||
for url in re.findall(URL_REGEX, line):
|
||||
yield Link(
|
||||
url=htmldecode(url),
|
||||
timestamp=str(datetime.now().timestamp()),
|
||||
|
|
@ -28,3 +43,15 @@ def parse_generic_txt_export(text_file: IO[str]) -> Iterable[Link]:
|
|||
tags=None,
|
||||
sources=[text_file.name],
|
||||
)
|
||||
|
||||
# look inside the URL for any sub-urls, e.g. for archive.org links
|
||||
# https://web.archive.org/web/20200531203453/https://www.reddit.com/r/socialism/comments/gu24ke/nypd_officers_claim_they_are_protecting_the_rule/fsfq0sw/
|
||||
# -> https://www.reddit.com/r/socialism/comments/gu24ke/nypd_officers_claim_they_are_protecting_the_rule/fsfq0sw/
|
||||
for url in re.findall(URL_REGEX, line[1:]):
|
||||
yield Link(
|
||||
url=htmldecode(url),
|
||||
timestamp=str(datetime.now().timestamp()),
|
||||
title=None,
|
||||
tags=None,
|
||||
sources=[text_file.name],
|
||||
)
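The extra pass over line[1:] is there to catch URLs embedded inside other URLs (such as the original page inside a web.archive.org snapshot link), which a single greedy findall over the full line would swallow. A small illustration with a simplified stand-in pattern (the real URL_REGEX in util.py is stricter):

```python
# Simplified stand-in for URL_REGEX, only to show why the line[1:] re-scan works.
import re

URL_REGEX = re.compile(r'https?://[^\s<>"]+', re.IGNORECASE)

line = ('https://web.archive.org/web/20200531203453/'
        'https://www.reddit.com/r/socialism/comments/gu24ke/')
re.findall(URL_REGEX, line)      # one match: the whole archive.org URL
re.findall(URL_REGEX, line[1:])  # ['https://www.reddit.com/r/socialism/comments/gu24ke/']
```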
|
||||
|
|
|
|||
|
|
@ -4,96 +4,60 @@ __package__ = 'archivebox'
|
|||
import os
|
||||
import shutil
|
||||
|
||||
import json as pyjson
|
||||
from json import dump
|
||||
from pathlib import Path
|
||||
from typing import Optional, Union, Set, Tuple
|
||||
from subprocess import run as subprocess_run
|
||||
|
||||
from crontab import CronTab
|
||||
|
||||
from subprocess import (
|
||||
Popen,
|
||||
PIPE,
|
||||
DEVNULL,
|
||||
CompletedProcess,
|
||||
TimeoutExpired,
|
||||
CalledProcessError,
|
||||
)
|
||||
from atomicwrites import atomic_write as lib_atomic_write
|
||||
|
||||
from .util import enforce_types, ExtendedEncoder
|
||||
from .config import OUTPUT_PERMISSIONS
|
||||
|
||||
|
||||
def run(*popenargs, input=None, capture_output=False, timeout=None, check=False, **kwargs):
|
||||
def run(*args, input=None, capture_output=True, text=False, **kwargs):
|
||||
"""Patched of subprocess.run to fix blocking io making timeout=innefective"""
|
||||
|
||||
if input is not None:
|
||||
if 'stdin' in kwargs:
|
||||
raise ValueError('stdin and input arguments may not both be used.')
|
||||
kwargs['stdin'] = PIPE
|
||||
|
||||
if capture_output:
|
||||
if ('stdout' in kwargs) or ('stderr' in kwargs):
|
||||
raise ValueError('stdout and stderr arguments may not be used '
|
||||
'with capture_output.')
|
||||
kwargs['stdout'] = PIPE
|
||||
kwargs['stderr'] = PIPE
|
||||
|
||||
with Popen(*popenargs, **kwargs) as process:
|
||||
try:
|
||||
stdout, stderr = process.communicate(input, timeout=timeout)
|
||||
except TimeoutExpired:
|
||||
process.kill()
|
||||
try:
|
||||
stdout, stderr = process.communicate(input, timeout=2)
|
||||
except:
|
||||
pass
|
||||
raise TimeoutExpired(popenargs[0][0], timeout)
|
||||
except BaseException:
|
||||
process.kill()
|
||||
# We don't call process.wait() as .__exit__ does that for us.
|
||||
raise
|
||||
retcode = process.poll()
|
||||
if check and retcode:
|
||||
raise CalledProcessError(retcode, process.args,
|
||||
output=stdout, stderr=stderr)
|
||||
return CompletedProcess(process.args, retcode, stdout, stderr)
|
||||
|
||||
|
||||
def atomic_write(contents: Union[dict, str, bytes], path: str) -> None:
|
||||
"""Safe atomic write to filesystem by writing to temp file + atomic rename"""
|
||||
try:
|
||||
tmp_file = '{}.tmp'.format(path)
|
||||
|
||||
if isinstance(contents, bytes):
|
||||
args = {'mode': 'wb+'}
|
||||
else:
|
||||
args = {'mode': 'w+', 'encoding': 'utf-8'}
|
||||
|
||||
with open(tmp_file, **args) as f:
|
||||
if isinstance(contents, dict):
|
||||
pyjson.dump(contents, f, indent=4, sort_keys=True, cls=ExtendedEncoder)
|
||||
else:
|
||||
f.write(contents)
|
||||
|
||||
os.fsync(f.fileno())
|
||||
|
||||
os.rename(tmp_file, path)
|
||||
chmod_file(path)
|
||||
finally:
|
||||
if os.path.exists(tmp_file):
|
||||
os.remove(tmp_file)
|
||||
return subprocess_run(*args, input=input, capture_output=capture_output, text=text, **kwargs)
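The custom Popen-based shim above is dropped in favor of a one-line delegation to subprocess.run, with output capture on by default. A usage sketch (the archivebox.system import path is an assumption based on the `from ..system import atomic_write` imports elsewhere in this diff):

```python
from archivebox.system import run

result = run(['echo', 'hello'], timeout=5)   # extra kwargs pass straight through
print(result.returncode)                     # 0
print(result.stdout)                         # b'hello\n' (bytes, text defaults to False)
```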
|
||||
|
||||
|
||||
@enforce_types
|
||||
def chmod_file(path: str, cwd: str='.', permissions: str=OUTPUT_PERMISSIONS, timeout: int=30) -> None:
|
||||
def atomic_write(path: Union[Path, str], contents: Union[dict, str, bytes], overwrite: bool=True) -> None:
|
||||
"""Safe atomic write to filesystem by writing to temp file + atomic rename"""
|
||||
|
||||
mode = 'wb+' if isinstance(contents, bytes) else 'w'
|
||||
|
||||
# print('\n> Atomic Write:', mode, path, len(contents), f'overwrite={overwrite}')
|
||||
with lib_atomic_write(path, mode=mode, overwrite=overwrite) as f:
|
||||
if isinstance(contents, dict):
|
||||
dump(contents, f, indent=4, sort_keys=True, cls=ExtendedEncoder)
|
||||
elif isinstance(contents, (bytes, str)):
|
||||
f.write(contents)
|
||||
os.chmod(path, int(OUTPUT_PERMISSIONS, base=8))
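The new atomic_write() delegates to the atomicwrites library and flips the argument order to (path, contents), which is why every call site in this diff changes from atomic_write(contents, path) to atomic_write(path, contents). A minimal usage sketch under that assumption:

```python
# Dicts are JSON-encoded via ExtendedEncoder; str/bytes are written verbatim,
# and the file is chmod-ed to OUTPUT_PERMISSIONS afterwards.
from archivebox.system import atomic_write

atomic_write('example-import.txt', 'https://example.com\n')
atomic_write('example-index.json', {'version': '0.5.0', 'links': []})
```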
|
||||
|
||||
@enforce_types
|
||||
def chmod_file(path: str, cwd: str='.', permissions: str=OUTPUT_PERMISSIONS) -> None:
|
||||
"""chmod -R <permissions> <cwd>/<path>"""
|
||||
|
||||
if not os.path.exists(os.path.join(cwd, path)):
|
||||
root = Path(cwd) / path
|
||||
if not root.exists():
|
||||
raise Exception('Failed to chmod: {} does not exist (did the previous step fail?)'.format(path))
|
||||
|
||||
chmod_result = run(['chmod', '-R', permissions, path], cwd=cwd, stdout=DEVNULL, stderr=PIPE, timeout=timeout)
|
||||
if chmod_result.returncode == 1:
|
||||
print(' ', chmod_result.stderr.decode())
|
||||
raise Exception('Failed to chmod {}/{}'.format(cwd, path))
|
||||
if not root.is_dir():
|
||||
os.chmod(root, int(OUTPUT_PERMISSIONS, base=8))
|
||||
else:
|
||||
for subpath in Path(path).glob('**/*'):
|
||||
os.chmod(subpath, int(OUTPUT_PERMISSIONS, base=8))
|
||||
|
||||
|
||||
@enforce_types
|
||||
|
|
@ -104,7 +68,8 @@ def copy_and_overwrite(from_path: str, to_path: str):
|
|||
shutil.copytree(from_path, to_path)
|
||||
else:
|
||||
with open(from_path, 'rb') as src:
|
||||
atomic_write(src.read(), to_path)
|
||||
contents = src.read()
|
||||
atomic_write(to_path, contents)
|
||||
|
||||
|
||||
@enforce_types
|
||||
|
|
@ -132,6 +97,7 @@ def get_dir_size(path: str, recursive: bool=True, pattern: Optional[str]=None) -
|
|||
|
||||
CRON_COMMENT = 'archivebox_schedule'
|
||||
|
||||
|
||||
@enforce_types
|
||||
def dedupe_cron_jobs(cron: CronTab) -> CronTab:
|
||||
deduped: Set[Tuple[str, str]] = set()
|
||||
|
|
|
|||
1
archivebox/themes/admin/actions_as_select.html
Normal file
|
|
@ -0,0 +1 @@
|
|||
actions_as_select
|
||||
|
|
@ -2,7 +2,7 @@
|
|||
{% get_current_language as LANGUAGE_CODE %}{% get_current_language_bidi as LANGUAGE_BIDI %}
|
||||
<html lang="{{ LANGUAGE_CODE|default:"en-us" }}" {% if LANGUAGE_BIDI %}dir="rtl"{% endif %}>
|
||||
<head>
|
||||
<title>{% block title %}{% endblock %}</title>
|
||||
<title>{% block title %}{% endblock %} | ArchiveBox</title>
|
||||
<link rel="stylesheet" type="text/css" href="{% block stylesheet %}{% static "admin/css/base.css" %}{% endblock %}">
|
||||
{% block extrastyle %}{% endblock %}
|
||||
{% if LANGUAGE_BIDI %}<link rel="stylesheet" type="text/css" href="{% block stylesheet_rtl %}{% static "admin/css/rtl.css" %}{% endblock %}">{% endif %}
|
||||
|
|
@ -13,12 +13,61 @@
|
|||
{% if LANGUAGE_BIDI %}<link rel="stylesheet" type="text/css" href="{% static "admin/css/responsive_rtl.css" %}">{% endif %}
|
||||
{% endblock %}
|
||||
{% block blockbots %}<meta name="robots" content="NONE,NOARCHIVE">{% endblock %}
|
||||
<link rel="stylesheet" type="text/css" href="{% static "admin.css" %}">
|
||||
</head>
|
||||
{% load i18n %}
|
||||
|
||||
<body class="{% if is_popup %}popup {% endif %}{% block bodyclass %}{% endblock %}"
|
||||
data-admin-utc-offset="{% now "Z" %}">
|
||||
|
||||
<style nonce="{{nonce}}">
|
||||
/* Loading Progress Bar */
|
||||
#progress {
|
||||
position: absolute;
|
||||
z-index: 1000;
|
||||
top: 0px;
|
||||
left: -6px;
|
||||
width: 2%;
|
||||
opacity: 1;
|
||||
height: 2px;
|
||||
background: #1a1a1a;
|
||||
border-radius: 1px;
|
||||
transition: width 4s ease-out, opacity 400ms linear;
|
||||
}
|
||||
|
||||
@-moz-keyframes bugfix { from { padding-right: 1px ; } to { padding-right: 0; } }
|
||||
</style>
|
||||
|
||||
<script>
|
||||
// Page Loading Bar
|
||||
window.loadStart = function(distance) {
|
||||
var distance = distance || 0;
|
||||
// only add progress bar if not already present
|
||||
if (django.jQuery("#loading-bar").length == 0) {
|
||||
django.jQuery("body").add("<div id=\"loading-bar\"></div>");
|
||||
}
|
||||
if (django.jQuery("#progress").length === 0) {
|
||||
django.jQuery("body").append(django.jQuery("<div></div>").attr("id", "progress"));
|
||||
let last_distance = (distance || (30 + (Math.random() * 30)))
|
||||
django.jQuery("#progress").width(last_distance + "%");
|
||||
setInterval(function() {
|
||||
last_distance += Math.random()
|
||||
django.jQuery("#progress").width(last_distance + "%");
|
||||
}, 1000)
|
||||
}
|
||||
};
|
||||
|
||||
window.loadFinish = function() {
|
||||
django.jQuery("#progress").width("101%").delay(200).fadeOut(400, function() {
|
||||
django.jQuery(this).remove();
|
||||
});
|
||||
};
|
||||
window.loadStart();
|
||||
window.addEventListener('beforeunload', function() {window.loadStart(27)});
|
||||
document.addEventListener('DOMContentLoaded', function() {window.loadFinish()});
|
||||
</script>
|
||||
|
||||
|
||||
<!-- Container -->
|
||||
<div id="container">
|
||||
|
||||
|
|
@ -26,14 +75,22 @@
|
|||
<!-- Header -->
|
||||
<div id="header">
|
||||
<div id="branding">
|
||||
{% block branding %}{% endblock %}
|
||||
<h1 id="site-name">
|
||||
<a href="{% url 'Home' %}">
|
||||
<img src="{% static 'archive.png' %}" id="logo">
|
||||
ArchiveBox
|
||||
</a>
|
||||
</h1>
|
||||
|
||||
</div>
|
||||
{% block usertools %}
|
||||
{% if has_permission %}
|
||||
<div id="user-tools">
|
||||
<a href="/add/">Add Links</a> /
|
||||
<a href="/">Main Index</a> /
|
||||
<a href="https://github.com/pirate/ArchiveBox/wiki">Docs</a>
|
||||
<a href="{% url 'admin:Add' %}">Add ➕</a> /
|
||||
<a href="{% url 'Home' %}">Snapshots</a> /
|
||||
<a href="/admin/auth/user/">Users</a> /
|
||||
<a href="{% url 'OldHome' %}">Old UI</a> /
|
||||
<a href="{% url 'Docs' %}">Docs</a>
|
||||
|
||||
{% block welcome-msg %}
|
||||
{% trans 'User' %}
|
||||
|
|
@ -56,13 +113,13 @@
|
|||
{% endblock %}
|
||||
{% block nav-global %}{% endblock %}
|
||||
</div>
|
||||
<!-- END Header -->
|
||||
{% block breadcrumbs %}
|
||||
<div class="breadcrumbs">
|
||||
<a href="{% url 'admin:index' %}">{% trans 'Home' %}</a>
|
||||
{% if title %} › {{ title }}{% endif %}
|
||||
</div>
|
||||
{% endblock %}
|
||||
<!-- END Header -->
|
||||
{% block breadcrumbs %}
|
||||
<div class="breadcrumbs">
|
||||
<a href="{% url 'admin:index' %}">{% trans 'Home' %}</a>
|
||||
{% if title %} › {{ title }}{% endif %}
|
||||
</div>
|
||||
{% endblock %}
|
||||
{% endif %}
|
||||
|
||||
{% block messages %}
|
||||
|
|
@ -76,10 +133,10 @@
|
|||
<!-- Content -->
|
||||
<div id="content" class="{% block coltype %}colM{% endblock %}">
|
||||
{% block pretitle %}{% endblock %}
|
||||
{% block content_title %}{% if title %}<h1>{{ title }}</h1>{% endif %}{% endblock %}
|
||||
{% block content_title %}{# {% if title %}<h1>{{ title }}</h1>{% endif %} #}{% endblock %}
|
||||
{% block content %}
|
||||
{% block object-tools %}{% endblock %}
|
||||
{{ content }}
|
||||
{% block object-tools %}{% endblock %}
|
||||
{{ content }}
|
||||
{% endblock %}
|
||||
{% block sidebar %}{% endblock %}
|
||||
<br class="clear">
|
||||
|
|
@ -90,5 +147,42 @@
|
|||
</div>
|
||||
<!-- END Container -->
|
||||
|
||||
<script>
|
||||
(function ($) {
|
||||
$.fn.reverse = [].reverse;
|
||||
|
||||
function fix_actions() {
|
||||
var container = $('div.actions');
|
||||
|
||||
if (container.find('option').length < 10) {
|
||||
container.find('label, button').hide();
|
||||
|
||||
var buttons = $('<div></div>')
|
||||
.prependTo(container)
|
||||
.css('display', 'inline')
|
||||
.addClass('class', 'action-buttons');
|
||||
|
||||
container.find('option:gt(0)').reverse().each(function () {
|
||||
const name = this.value
|
||||
$('<button>')
|
||||
.appendTo(buttons)
|
||||
.attr('name', this.value)
|
||||
.addClass('button')
|
||||
.text(this.text)
|
||||
.click(function () {
|
||||
container.find('select')
|
||||
.find(':selected').attr('selected', '').end()
|
||||
.find('[value=' + this.name + ']').attr('selected', 'selected');
|
||||
$('#changelist-form button[name="index"]').click();
|
||||
document.querySelector('#logo').outerHTML = '<div class="loader"></div>'
|
||||
});
|
||||
});
|
||||
}
|
||||
};
|
||||
$(function () {
|
||||
fix_actions();
|
||||
});
|
||||
})(django.jQuery);
|
||||
</script>
|
||||
</body>
|
||||
</html>
|
||||
|
|
|
|||
|
|
@ -11,7 +11,7 @@
|
|||
|
||||
{% block usertools %}
|
||||
<br/>
|
||||
<a href="/">Back to Main Index</a>
|
||||
<a href="{% url 'Home' %}">Back to Main Index</a>
|
||||
{% endblock %}
|
||||
|
||||
{% block nav-global %}{% endblock %}
|
||||
|
|
|
|||
|
|
@ -1,209 +1,100 @@
|
|||
{% load static %}
|
||||
{% extends "admin/index.html" %}
|
||||
{% load i18n %}
|
||||
|
||||
<!DOCTYPE html>
|
||||
<html lang="en">
|
||||
<head>
|
||||
<title>Archived Sites</title>
|
||||
<meta charset="utf-8" name="viewport" content="width=device-width, initial-scale=1">
|
||||
<style>
|
||||
html, body {
|
||||
width: 100%;
|
||||
height: 100%;
|
||||
font-size: 18px;
|
||||
font-weight: 200;
|
||||
text-align: center;
|
||||
margin: 0px;
|
||||
padding: 0px;
|
||||
font-family: "Gill Sans", Helvetica, sans-serif;
|
||||
}
|
||||
.header-top small {
|
||||
font-weight: 200;
|
||||
color: #efefef;
|
||||
}
|
||||
|
||||
.header-top {
|
||||
width: 100%;
|
||||
height: auto;
|
||||
min-height: 40px;
|
||||
margin: 0px;
|
||||
text-align: center;
|
||||
color: white;
|
||||
font-size: calc(11px + 0.84vw);
|
||||
font-weight: 200;
|
||||
padding: 4px 4px;
|
||||
border-bottom: 3px solid #aa1e55;
|
||||
background-color: #aa1e55;
|
||||
}
|
||||
input[type=search] {
|
||||
width: 22vw;
|
||||
border-radius: 4px;
|
||||
border: 1px solid #aeaeae;
|
||||
padding: 3px 5px;
|
||||
}
|
||||
.nav > div {
|
||||
min-height: 30px;
|
||||
}
|
||||
.header-top a {
|
||||
text-decoration: none;
|
||||
color: rgba(0,0,0,0.6);
|
||||
}
|
||||
.header-top a:hover {
|
||||
text-decoration: none;
|
||||
color: rgba(0,0,0,0.9);
|
||||
}
|
||||
.header-top .col-lg-4 {
|
||||
text-align: center;
|
||||
padding-top: 4px;
|
||||
padding-bottom: 4px;
|
||||
}
|
||||
.header-archivebox img {
|
||||
display: inline-block;
|
||||
margin-right: 3px;
|
||||
height: 30px;
|
||||
margin-left: 12px;
|
||||
margin-top: -4px;
|
||||
margin-bottom: 2px;
|
||||
}
|
||||
.header-archivebox img:hover {
|
||||
opacity: 0.5;
|
||||
}
|
||||
{% block breadcrumbs %}
|
||||
<div class="breadcrumbs">
|
||||
<a href="{% url 'admin:index' %}">{% trans 'Home' %}</a>
|
||||
{% if title %} › {{ title }}{% endif %}
|
||||
</div>
|
||||
{% endblock %}
|
||||
|
||||
#table-bookmarks_length, #table-bookmarks_filter {
|
||||
padding-top: 12px;
|
||||
opacity: 0.8;
|
||||
padding-left: 24px;
|
||||
padding-right: 22px;
|
||||
margin-bottom: -16px;
|
||||
}
|
||||
table {
|
||||
padding: 6px;
|
||||
width: 100%;
|
||||
}
|
||||
table thead th {
|
||||
font-weight: 400;
|
||||
}
|
||||
table tr {
|
||||
height: 35px;
|
||||
}
|
||||
tbody tr:nth-child(odd) {
|
||||
background-color: #ffebeb !important;
|
||||
}
|
||||
table tr td {
|
||||
white-space: nowrap;
|
||||
overflow: hidden;
|
||||
/*padding-bottom: 0.4em;*/
|
||||
/*padding-top: 0.4em;*/
|
||||
padding-left: 2px;
|
||||
text-align: center;
|
||||
}
|
||||
table tr td a {
|
||||
text-decoration: none;
|
||||
}
|
||||
table tr td img, table tr td object {
|
||||
display: inline-block;
|
||||
margin: auto;
|
||||
height: 24px;
|
||||
width: 24px;
|
||||
padding: 0px;
|
||||
padding-right: 5px;
|
||||
vertical-align: middle;
|
||||
margin-left: 4px;
|
||||
}
|
||||
#table-bookmarks {
|
||||
width: 100%;
|
||||
overflow-y: scroll;
|
||||
table-layout: fixed;
|
||||
}
|
||||
.dataTables_wrapper {
|
||||
background-color: #fafafa;
|
||||
}
|
||||
table tr a span[data-archived~=False] {
|
||||
opacity: 0.4;
|
||||
}
|
||||
.files-spinner {
|
||||
height: 15px;
|
||||
width: auto;
|
||||
opacity: 0.5;
|
||||
vertical-align: -2px;
|
||||
}
|
||||
.in-progress {
|
||||
display: none;
|
||||
}
|
||||
body[data-status~=finished] .files-spinner {
|
||||
display: none;
|
||||
}
|
||||
/*body[data-status~=running] .in-progress {
|
||||
display: inline-block;
|
||||
}*/
|
||||
tr td a.favicon img {
|
||||
padding-left: 6px;
|
||||
padding-right: 12px;
|
||||
vertical-align: -4px;
|
||||
}
|
||||
tr td a.title {
|
||||
font-size: 1.4em;
|
||||
text-decoration:none;
|
||||
color:black;
|
||||
}
|
||||
tr td a.title small {
|
||||
background-color: #efefef;
|
||||
border-radius: 4px;
|
||||
float:right
|
||||
}
|
||||
input[type=search]::-webkit-search-cancel-button {
|
||||
-webkit-appearance: searchfield-cancel-button;
|
||||
}
|
||||
.title-col {
|
||||
text-align: left;
|
||||
}
|
||||
.title-col a {
|
||||
color: black;
|
||||
}
|
||||
</style>
|
||||
<link rel="stylesheet" href="{% static 'bootstrap.min.css' %}">
|
||||
<link rel="stylesheet" href="{% static 'jquery.dataTables.min.css' %}"/>
|
||||
<script src="{% static 'jquery.min.js' %}"></script>
|
||||
<script src="{% static 'jquery.dataTables.min.js' %}"></script>
|
||||
<script>
|
||||
document.addEventListener('error', function(e) {
|
||||
e.target.style.opacity = 0;
|
||||
}, true)
|
||||
jQuery(document).ready(function() {
|
||||
jQuery('#table-bookmarks').DataTable({
|
||||
stateSave: true, // save state (filtered input, number of entries shown, etc) in localStorage
|
||||
dom: '<lf<t>ip>', // how to show the table and its helpers (filter, etc) in the DOM
|
||||
order: [[0, 'desc']],
|
||||
iDisplayLength: 100,
|
||||
});
|
||||
});
|
||||
</script>
|
||||
</head>
|
||||
<body data-status="finished">
|
||||
<header>
|
||||
<div class="header-top container-fluid">
|
||||
<div class="row nav">
|
||||
<div class="col-sm-2">
|
||||
<a href="/" class="header-archivebox" title="Last updated: {{updated}}">
|
||||
<img src="{% static 'archive.png' %}" alt="Logo"/>
|
||||
ArchiveBox: Add
|
||||
</a>
|
||||
</div>
|
||||
<div class="col-sm-10" style="text-align: right">
|
||||
<a href="/">Main Index</a> |
|
||||
<a href="/admin/">Admin</a> |
|
||||
<a href="https://github.com/pirate/ArchiveBox/wiki">Docs</a>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</header>
|
||||
<center>
|
||||
<br/><br/>
|
||||
<form action="?" method="POST">{% csrf_token %}
|
||||
Add new links...<br/>
|
||||
<input type="text" name="url" placeholder="URL of page or feed..."/><br/>
|
||||
<button role="submit">Add</button>
|
||||
{% block content %}
|
||||
<style>
|
||||
.dashboard #content {
|
||||
width: 100%;
|
||||
margin-right: 0px;
|
||||
margin-left: 0px;
|
||||
}
|
||||
#submit {
|
||||
border: 1px solid rgba(0,0,0,0.2);
|
||||
padding: 10px;
|
||||
border-radius: 4px;
|
||||
background-color: #f5dd5d;
|
||||
color: #333;
|
||||
font-size: 18px;
|
||||
font-weight: 800;
|
||||
}
|
||||
#add-form button[role=submit]:hover {
|
||||
background-color: #e5cd4d;
|
||||
}
|
||||
#add-form label {
|
||||
display: block;
|
||||
font-size: 16px;
|
||||
}
|
||||
#add-form textarea {
|
||||
width: 100%;
|
||||
min-height: 300px;
|
||||
}
|
||||
#delay-warning div {
|
||||
border: 1px solid red;
|
||||
border-radius: 4px;
|
||||
margin: 10px;
|
||||
padding: 10px;
|
||||
font-size: 15px;
|
||||
background-color: #F5DD5D;
|
||||
}
|
||||
#stdout {
|
||||
background-color: #ded;
|
||||
padding: 10px 10px;
|
||||
border-radius: 4px;
|
||||
white-space: normal;
|
||||
}
|
||||
</style>
|
||||
<div style="max-width: 550px; margin: auto; float: none">
|
||||
<br/><br/>
|
||||
{% if stdout %}
|
||||
<h1>Add new URLs to your archive: results</h1>
|
||||
<pre id="stdout">
|
||||
{{ stdout | safe }}
|
||||
<br/><br/>
|
||||
</pre>
|
||||
<br/>
|
||||
<center>
|
||||
<a href="/add" id="submit"> Add more URLs ➕</a>
|
||||
</center>
|
||||
{% else %}
|
||||
<form id="add-form" action="?" method="POST" class="p-form">{% csrf_token %}
|
||||
<h1>Add new URLs to your archive</h1>
|
||||
<br/>
|
||||
{{ form.as_p }}
|
||||
<center>
|
||||
<button role="submit" id="submit"> Add URLs and archive ➕</button>
|
||||
</center>
|
||||
</form>
|
||||
</center>
|
||||
|
||||
</body>
|
||||
</html>
|
||||
<br/><br/><br/>
|
||||
<center id="delay-warning" style="display: none">
|
||||
<b><i>This page will be unresponsive until the process is completely finished.</i></b>
|
||||
<br/><br/>
|
||||
<div>
|
||||
Warning: it may take several minutes to finish adding!<br/>
|
||||
<br/>
|
||||
Progress will be displayed in the <code>archivebox server</code> stdout,<br/>
|
||||
and on this page once the archiving process completes.<br/>
|
||||
<br/>
|
||||
<small>(it's safe to leave this page, adding will continue in the background)</small>
|
||||
</div>
|
||||
</center>
|
||||
<script>
|
||||
document.getElementById('add-form').addEventListener('submit', function(event) {
|
||||
setTimeout(function() {
|
||||
document.getElementById('add-form').innerHTML = '<center><h3>Adding URLs to index and running archive methods...<h3><br/><div class="loader"></div><br/>(see terminal for progress)</center>'
|
||||
document.getElementById('delay-warning').style.display = 'block'
|
||||
}, 200)
|
||||
return true
|
||||
})
|
||||
</script>
|
||||
{% endif %}
|
||||
</div>
|
||||
{% endblock %}
|
||||
|
||||
{% block sidebar %}{% endblock %}
|
||||
|
|
|
|||
|
|
@ -6,6 +6,37 @@
|
|||
<title>Archived Sites</title>
|
||||
<meta charset="utf-8" name="viewport" content="width=device-width, initial-scale=1">
|
||||
<style>
|
||||
:root {
|
||||
--bg-main: #efefef;
|
||||
--accent-1: #aa1e55;
|
||||
--accent-2: #ffebeb;
|
||||
--accent-3: #efefef;
|
||||
|
||||
--text-1: #1c1c1c;
|
||||
--text-2: #eaeaea;
|
||||
--text-main: #1a1a1a;
|
||||
--font-main: "Gill Sans", Helvetica, sans-serif;
|
||||
}
|
||||
/* Dark Mode (WIP) */
|
||||
/*
|
||||
@media (prefers-color-scheme: dark) {
|
||||
:root {
|
||||
--accent-2: hsl(160, 100%, 96%);
|
||||
|
||||
--text-1: #eaeaea;
|
||||
--text-2: #1a1a1a;
|
||||
--bg-main: #101010;
|
||||
}
|
||||
|
||||
#table-bookmarks_wrapper,
|
||||
#table-bookmarks_wrapper img,
|
||||
tbody td:nth-child(3),
|
||||
tbody td:nth-child(3) span,
|
||||
footer {
|
||||
filter: invert(100%);
|
||||
}
|
||||
}*/
|
||||
|
||||
html, body {
|
||||
width: 100%;
|
||||
height: 100%;
|
||||
|
|
@ -14,11 +45,12 @@
|
|||
text-align: center;
|
||||
margin: 0px;
|
||||
padding: 0px;
|
||||
font-family: "Gill Sans", Helvetica, sans-serif;
|
||||
font-family: var(--font-main);
|
||||
}
|
||||
|
||||
.header-top small {
|
||||
font-weight: 200;
|
||||
color: #efefef;
|
||||
color: var(--accent-3);
|
||||
}
|
||||
|
||||
.header-top {
|
||||
|
|
@ -31,8 +63,8 @@
|
|||
font-size: calc(11px + 0.84vw);
|
||||
font-weight: 200;
|
||||
padding: 4px 4px;
|
||||
border-bottom: 3px solid #aa1e55;
|
||||
background-color: #aa1e55;
|
||||
border-bottom: 3px solid var(--accent-1);
|
||||
background-color: var(--accent-1);
|
||||
}
|
||||
input[type=search] {
|
||||
width: 22vw;
|
||||
|
|
@ -86,7 +118,7 @@
|
|||
height: 35px;
|
||||
}
|
||||
tbody tr:nth-child(odd) {
|
||||
background-color: #ffebeb !important;
|
||||
background-color: var(--accent-2) !important;
|
||||
}
|
||||
table tr td {
|
||||
white-space: nowrap;
|
||||
|
|
@ -146,7 +178,7 @@
|
|||
color:black;
|
||||
}
|
||||
tr td a.title small {
|
||||
background-color: #efefef;
|
||||
background-color: var(--accent-3);
|
||||
border-radius: 4px;
|
||||
float:right
|
||||
}
|
||||
|
|
@ -190,7 +222,7 @@
|
|||
</div>
|
||||
<div class="col-sm-10" style="text-align: right">
|
||||
<a href="/add/">Add Links</a> |
|
||||
<a href="/admin/core/page/">Admin</a> |
|
||||
<a href="/admin/core/snapshot/">Admin</a> |
|
||||
<a href="https://github.com/pirate/ArchiveBox/wiki">Docs</a>
|
||||
</div>
|
||||
</div>
|
||||
|
|
@ -216,7 +248,7 @@
|
|||
<a href="archive/{{link.timestamp}}/index.html"><img src="{% static 'spinner.gif' %}" class="link-favicon" decoding="async"></a>
|
||||
{% endif %}
|
||||
<a href="archive/{{link.timestamp}}/{{link.canonical_outputs.wget_path}}" title="{{link.title}}">
|
||||
<span data-title-for="{{link.url}}" data-archived="{{link.is_archived}}">{{link.title}}</span>
|
||||
<span data-title-for="{{link.url}}" data-archived="{{link.is_archived}}">{{link.title|default:'Loading...'}}</span>
|
||||
<small style="float:right">{{link.tags|default:''}}</small>
|
||||
</a>
|
||||
</td>
|
||||
|
|
|
|||
224
archivebox/themes/default/static/admin.css
Normal file
|
|
@ -0,0 +1,224 @@
|
|||
#logo {
|
||||
height: 30px;
|
||||
vertical-align: -6px;
|
||||
padding-right: 5px;
|
||||
}
|
||||
#site-name:hover a {
|
||||
opacity: 0.9;
|
||||
}
|
||||
#site-name .loader {
|
||||
height: 25px;
|
||||
width: 25px;
|
||||
display: inline-block;
|
||||
border-width: 3px;
|
||||
vertical-align: -3px;
|
||||
margin-right: 5px;
|
||||
margin-top: 2px;
|
||||
}
|
||||
#branding h1, #branding h1 a:link, #branding h1 a:visited {
|
||||
color: mintcream;
|
||||
}
|
||||
#header {
|
||||
background: #aa1e55;
|
||||
padding: 6px 14px;
|
||||
}
|
||||
#content {
|
||||
padding: 8px 8px;
|
||||
}
|
||||
#user-tools {
|
||||
font-size: 13px;
|
||||
|
||||
}
|
||||
|
||||
div.breadcrumbs {
|
||||
background: #772948;
|
||||
color: #f5dd5d;
|
||||
padding: 6px 15px;
|
||||
}
|
||||
|
||||
body.model-snapshot.change-list div.breadcrumbs,
|
||||
body.model-snapshot.change-list #content .object-tools {
|
||||
display: none;
|
||||
}
|
||||
|
||||
.module h2, .module caption, .inline-group h2 {
|
||||
background: #772948;
|
||||
}
|
||||
|
||||
#content .object-tools {
|
||||
margin-top: -35px;
|
||||
margin-right: -10px;
|
||||
float: right;
|
||||
}
|
||||
|
||||
#content .object-tools a:link, #content .object-tools a:visited {
|
||||
border-radius: 0px;
|
||||
background-color: #f5dd5d;
|
||||
color: #333;
|
||||
font-size: 12px;
|
||||
font-weight: 800;
|
||||
}
|
||||
|
||||
#content .object-tools a.addlink {
|
||||
background-blend-mode: difference;
|
||||
}
|
||||
|
||||
#content #changelist #toolbar {
|
||||
padding: 0px;
|
||||
background: none;
|
||||
margin-bottom: 10px;
|
||||
border-top: 0px;
|
||||
border-bottom: 0px;
|
||||
}
|
||||
|
||||
#content #changelist #toolbar form input[type="submit"] {
|
||||
border-color: #aa1e55;
|
||||
}
|
||||
|
||||
#content #changelist-filter li.selected a {
|
||||
color: #aa1e55;
|
||||
}
|
||||
|
||||
|
||||
/*#content #changelist .actions {
|
||||
position: fixed;
|
||||
bottom: 0px;
|
||||
z-index: 800;
|
||||
}*/
|
||||
#content #changelist .actions {
|
||||
float: right;
|
||||
margin-top: -34px;
|
||||
padding: 0px;
|
||||
background: none;
|
||||
margin-right: 0px;
|
||||
}
|
||||
|
||||
#content #changelist .actions .button {
|
||||
border-radius: 2px;
|
||||
background-color: #f5dd5d;
|
||||
color: #333;
|
||||
font-size: 12px;
|
||||
font-weight: 800;
|
||||
margin-right: 4px;
|
||||
box-shadow: 4px 4px 4px rgba(0,0,0,0.02);
|
||||
border: 1px solid rgba(0,0,0,0.08);
|
||||
}
|
||||
#content #changelist .actions .button:hover {
|
||||
border: 1px solid rgba(0,0,0,0.2);
|
||||
opacity: 0.9;
|
||||
}
|
||||
#content #changelist .actions .button[name=verify_snapshots], #content #changelist .actions .button[name=update_titles] {
|
||||
background-color: #dedede;
|
||||
color: #333;
|
||||
}
|
||||
#content #changelist .actions .button[name=update_snapshots] {
|
||||
background-color:lightseagreen;
|
||||
color: #333;
|
||||
}
|
||||
#content #changelist .actions .button[name=overwrite_snapshots] {
|
||||
background-color: #ffaa31;
|
||||
color: #333;
|
||||
}
|
||||
#content #changelist .actions .button[name=delete_snapshots] {
|
||||
background-color: #f91f74;
|
||||
color: rgb(255 248 252 / 64%);
|
||||
}
|
||||
|
||||
|
||||
#content #changelist-filter h2 {
|
||||
border-radius: 4px 4px 0px 0px;
|
||||
}
|
||||
|
||||
@media (min-width: 767px) {
|
||||
#content #changelist-filter {
|
||||
top: 35px;
|
||||
width: 110px;
|
||||
margin-bottom: 35px;
|
||||
}
|
||||
|
||||
.change-list .filtered .results,
|
||||
.change-list .filtered .paginator,
|
||||
.filtered #toolbar,
|
||||
.filtered div.xfull {
|
||||
margin-right: 115px;
|
||||
}
|
||||
}
|
||||
|
||||
@media (max-width: 1127px) {
|
||||
#content #changelist .actions {
|
||||
position: fixed;
|
||||
bottom: 6px;
|
||||
left: 10px;
|
||||
float: left;
|
||||
z-index: 1000;
|
||||
}
|
||||
}
|
||||
|
||||
#content a img.favicon {
|
||||
height: 20px;
|
||||
width: 20px;
|
||||
vertical-align: -5px;
|
||||
padding-right: 6px;
|
||||
}
|
||||
|
||||
#content td, #content th {
|
||||
vertical-align: middle;
|
||||
padding: 4px;
|
||||
}
|
||||
|
||||
#content #changelist table input {
|
||||
vertical-align: -2px;
|
||||
}
|
||||
|
||||
#content thead th .text a {
|
||||
padding: 8px 4px;
|
||||
}
|
||||
|
||||
#content th.field-added, #content td.field-updated {
|
||||
word-break: break-word;
|
||||
min-width: 128px;
|
||||
white-space: normal;
|
||||
}
|
||||
|
||||
#content th.field-title_str {
|
||||
min-width: 300px;
|
||||
}
|
||||
|
||||
#content td.field-files {
|
||||
white-space: nowrap;
|
||||
}
|
||||
#content td.field-files .exists-True {
|
||||
opacity: 1;
|
||||
}
|
||||
#content td.field-files .exists-False {
|
||||
opacity: 0.1;
|
||||
filter: grayscale(100%);
|
||||
}
|
||||
#content td.field-size {
|
||||
white-space: nowrap;
|
||||
}
|
||||
|
||||
#content td.field-url_str {
|
||||
word-break: break-all;
|
||||
min-width: 200px;
|
||||
}
|
||||
|
||||
#content tr b.status-pending {
|
||||
font-weight: 200;
|
||||
opacity: 0.6;
|
||||
}
|
||||
|
||||
.loader {
|
||||
border: 16px solid #f3f3f3; /* Light grey */
|
||||
border-top: 16px solid #3498db; /* Blue */
|
||||
border-radius: 50%;
|
||||
width: 30px;
|
||||
height: 30px;
|
||||
box-sizing: border-box;
|
||||
animation: spin 2s linear infinite;
|
||||
}
|
||||
|
||||
@keyframes spin {
|
||||
0% { transform: rotate(0deg); }
|
||||
100% { transform: rotate(360deg); }
|
||||
}
|
||||
|
|
|
@ -79,6 +79,7 @@
|
|||
.card {
|
||||
overflow: hidden;
|
||||
box-shadow: 2px 3px 14px 0px rgba(0,0,0,0.02);
|
||||
margin-top: 10px;
|
||||
}
|
||||
.card h4 {
|
||||
font-size: 1.4vw;
|
||||
|
|
@ -335,6 +336,18 @@
|
|||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<div class="col-lg-2">
|
||||
<div class="card">
|
||||
<iframe class="card-img-top" src="$singlefile_path" sandbox="allow-same-origin allow-scripts allow-forms" scrolling="no"></iframe>
|
||||
<div class="card-body">
|
||||
<a href="$singlefile_path" style="float:right" title="Open in new tab..." target="_blank" rel="noopener">
|
||||
<img src="../../static/external.png" class="external"/>
|
||||
</a>
|
||||
<a href="$singlefile_path" target="preview"><h4 class="card-title">SingleFile</h4></a>
|
||||
<p class="card-text">archive/singlefile.html</p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<div class="col-lg-2">
|
||||
<div class="card">
|
||||
<iframe class="card-img-top pdf-frame" src="$pdf_path" scrolling="no"></iframe>
|
||||
|
|
@ -359,18 +372,6 @@
|
|||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<div class="col-lg-2">
|
||||
<div class="card">
|
||||
<iframe class="card-img-top" src="$url" sandbox="allow-same-origin allow-scripts allow-forms" scrolling="no"></iframe>
|
||||
<div class="card-body">
|
||||
<a href="$url" style="float:right" title="Open in new tab..." target="_blank" rel="noopener">
|
||||
<img src="../../static/external.png" class="external"/>
|
||||
</a>
|
||||
<a href="$url" target="preview"><h4 class="card-title">Original</h4></a>
|
||||
<p class="card-text">$domain</p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<div class="col-lg-2">
|
||||
<div class="card">
|
||||
<iframe class="card-img-top" src="$archive_org_path" sandbox="allow-same-origin allow-scripts allow-forms" scrolling="no"></iframe>
|
||||
|
|
@ -383,6 +384,18 @@
|
|||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<div class="col-lg-2">
|
||||
<div class="card">
|
||||
<iframe class="card-img-top" src="$url" sandbox="allow-same-origin allow-scripts allow-forms" scrolling="no"></iframe>
|
||||
<div class="card-body">
|
||||
<a href="$url" style="float:right" title="Open in new tab..." target="_blank" rel="noopener">
|
||||
<img src="../../static/external.png" class="external"/>
|
||||
</a>
|
||||
<a href="$url" target="preview"><h4 class="card-title">Original</h4></a>
|
||||
<p class="card-text">$domain</p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</header>
|
||||
|
|
|
|||
|
|
@ -1,26 +1,27 @@
|
|||
__package__ = 'archivebox'
|
||||
|
||||
import re
|
||||
import ssl
|
||||
import json as pyjson
|
||||
|
||||
|
||||
from typing import List, Optional, Any
|
||||
from inspect import signature
|
||||
from functools import wraps
|
||||
from hashlib import sha256
|
||||
from urllib.request import Request, urlopen
|
||||
from urllib.parse import urlparse, quote, unquote
|
||||
from html import escape, unescape
|
||||
from datetime import datetime
|
||||
from dateparser import parse as dateparser
|
||||
|
||||
from base32_crockford import encode as base32_encode # type: ignore
|
||||
import json as pyjson
|
||||
import requests
|
||||
from base32_crockford import encode as base32_encode # type: ignore
|
||||
from w3lib.encoding import html_body_declared_encoding, http_content_type_encoding
|
||||
|
||||
from .config import (
|
||||
TIMEOUT,
|
||||
STATICFILE_EXTENSIONS,
|
||||
CHECK_SSL_VALIDITY,
|
||||
WGET_USER_AGENT,
|
||||
CHROME_OPTIONS,
|
||||
)
|
||||
try:
|
||||
import chardet
|
||||
detect_encoding = lambda rawdata: chardet.detect(rawdata)["encoding"]
|
||||
except ImportError:
|
||||
detect_encoding = lambda rawdata: "utf-8"
|
||||
|
||||
### Parsing Helpers
|
||||
|
||||
|
|
@ -42,7 +43,6 @@ base_url = lambda url: without_scheme(url) # uniq base url used to dedupe links
|
|||
without_www = lambda url: url.replace('://www.', '://', 1)
|
||||
without_trailing_slash = lambda url: url[:-1] if url[-1] == '/' else url.replace('/?', '?')
|
||||
hashurl = lambda url: base32_encode(int(sha256(base_url(url).encode('utf-8')).hexdigest(), 16))[:20]
|
||||
is_static_file = lambda url: extension(url).lower() in STATICFILE_EXTENSIONS # TODO: the proper way is with MIME type detection, not using extension
|
||||
|
||||
urlencode = lambda s: s and quote(s, encoding='utf-8', errors='replace')
|
||||
urldecode = lambda s: s and unquote(s)
|
||||
|
|
@ -63,6 +63,13 @@ URL_REGEX = re.compile(
|
|||
re.IGNORECASE,
|
||||
)
|
||||
|
||||
COLOR_REGEX = re.compile(r'\[(?P<arg_1>\d+)(;(?P<arg_2>\d+)(;(?P<arg_3>\d+))?)?m')
|
||||
|
||||
def is_static_file(url: str):
|
||||
# TODO: the proper way is with MIME type detection + ext, not only extension
|
||||
from .config import STATICFILE_EXTENSIONS
|
||||
return extension(url).lower() in STATICFILE_EXTENSIONS
|
||||
|
||||
|
||||
def enforce_types(func):
|
||||
"""
|
||||
|
|
@ -140,74 +147,38 @@ def parse_date(date: Any) -> Optional[datetime]:
|
|||
date = str(date)
|
||||
|
||||
if isinstance(date, str):
|
||||
if date.replace('.', '').isdigit():
|
||||
# this is a brittle attempt at unix timestamp parsing (which is
|
||||
# notoriously hard to do). It may lead to dates being off by
|
||||
# anything from hours to decades, depending on which app, OS,
|
||||
# and system time configuration was used for the original timestamp
|
||||
# more info: https://github.com/pirate/ArchiveBox/issues/119
|
||||
return dateparser(date)
|
||||
|
||||
# Note: always always always store the original timestamp string
|
||||
# somewhere indepentendly of the parsed datetime, so that later
|
||||
# bugs dont repeatedly misparse and rewrite increasingly worse dates.
|
||||
# the correct date can always be re-derived from the timestamp str
|
||||
timestamp = float(date)
|
||||
|
||||
EARLIEST_POSSIBLE = 473403600.0 # 1985
|
||||
LATEST_POSSIBLE = 1735707600.0 # 2025
|
||||
|
||||
if EARLIEST_POSSIBLE < timestamp < LATEST_POSSIBLE:
|
||||
# number is seconds
|
||||
return datetime.fromtimestamp(timestamp)
|
||||
|
||||
elif EARLIEST_POSSIBLE * 1000 < timestamp < LATEST_POSSIBLE * 1000:
|
||||
# number is milliseconds
|
||||
return datetime.fromtimestamp(timestamp / 1000)
|
||||
|
||||
elif EARLIEST_POSSIBLE * 1000*1000 < timestamp < LATEST_POSSIBLE * 1000*1000:
|
||||
# number is microseconds
|
||||
return datetime.fromtimestamp(timestamp / (1000*1000))
|
||||
|
||||
else:
|
||||
# continue to the end and raise a parsing failed error.
|
||||
# we don't want to even attempt parsing timestamp strings that
|
||||
# aren't within these ranges
|
||||
pass
|
||||
|
||||
if '-' in date:
|
||||
# 2019-04-07T05:44:39.227520
|
||||
try:
|
||||
return datetime.fromisoformat(date)
|
||||
except Exception:
|
||||
pass
|
||||
try:
|
||||
return datetime.strptime(date, '%Y-%m-%d %H:%M')
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
raise ValueError('Tried to parse invalid date! {}'.format(date))
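The numeric branch above guesses the timestamp's unit purely from its magnitude, treating 1985 to 2025 as the plausible range for seconds, and the same range scaled by 10^3 and 10^6 for milliseconds and microseconds. A small worked example of that heuristic using the cutoffs from this hunk:

```python
# The same instant expressed in seconds, milliseconds, and microseconds all
# resolve to the same datetime (2020-06-04 01:36:07 UTC).
from datetime import datetime, timezone

EARLIEST, LATEST = 473403600, 1735707600   # 1985, 2025
SECONDS = 1591234567

for ts in (SECONDS, SECONDS * 1000, SECONDS * 1000 * 1000):
    if EARLIEST < ts < LATEST:
        ts_seconds = ts
    elif EARLIEST * 1000 < ts < LATEST * 1000:
        ts_seconds = ts / 1000
    else:
        ts_seconds = ts / (1000 * 1000)
    print(datetime.fromtimestamp(ts_seconds, tz=timezone.utc))
```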
|
||||
|
||||
|
||||
@enforce_types
|
||||
def download_url(url: str, timeout: int=TIMEOUT) -> str:
|
||||
def download_url(url: str, timeout: int=None) -> str:
|
||||
"""Download the contents of a remote url and return the text"""
|
||||
from .config import TIMEOUT, CHECK_SSL_VALIDITY, WGET_USER_AGENT
|
||||
timeout = timeout or TIMEOUT
|
||||
response = requests.get(
|
||||
url,
|
||||
headers={'User-Agent': WGET_USER_AGENT},
|
||||
verify=CHECK_SSL_VALIDITY,
|
||||
timeout=timeout,
|
||||
)
|
||||
|
||||
req = Request(url, headers={'User-Agent': WGET_USER_AGENT})
|
||||
content_type = response.headers.get('Content-Type', '')
|
||||
encoding = http_content_type_encoding(content_type) or html_body_declared_encoding(response.text)
|
||||
|
||||
if CHECK_SSL_VALIDITY:
|
||||
resp = urlopen(req, timeout=timeout)
|
||||
else:
|
||||
insecure = ssl._create_unverified_context()
|
||||
resp = urlopen(req, timeout=timeout, context=insecure)
|
||||
if encoding is not None:
|
||||
response.encoding = encoding
|
||||
|
||||
encoding = resp.headers.get_content_charset() or 'utf-8' # type: ignore
|
||||
return resp.read().decode(encoding)
|
||||
return response.text
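download_url() now goes through requests and fixes up the response encoding from the Content-Type header or the HTML body declaration before returning decoded text. A usage sketch, assuming it stays importable from archivebox.util:

```python
from archivebox.util import download_url

html = download_url('https://example.com', timeout=30)   # timeout falls back to config.TIMEOUT
print(len(html), html[:60])
```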
|
||||
|
||||
|
||||
@enforce_types
|
||||
def chrome_args(**options) -> List[str]:
|
||||
"""helper to build up a chrome shell command with arguments"""
|
||||
|
||||
from .config import CHROME_OPTIONS
|
||||
|
||||
options = {**CHROME_OPTIONS, **options}
|
||||
|
||||
cmd_args = [options['CHROME_BINARY']]
|
||||
|
|
@ -216,8 +187,16 @@ def chrome_args(**options) -> List[str]:
|
|||
cmd_args += ('--headless',)
|
||||
|
||||
if not options['CHROME_SANDBOX']:
|
||||
# dont use GPU or sandbox when running inside docker container
|
||||
cmd_args += ('--no-sandbox', '--disable-gpu')
|
||||
# assume this means we are running inside a docker container
|
||||
# in docker, GPU support is limited, sandboxing is unecessary,
|
||||
# and SHM is limited to 64MB by default (which is too low to be usable).
|
||||
cmd_args += (
|
||||
'--no-sandbox',
|
||||
'--disable-gpu',
|
||||
'--disable-dev-shm-usage',
|
||||
'--disable-software-rasterizer',
|
||||
)
|
||||
|
||||
|
||||
if not options['CHECK_SSL_VALIDITY']:
|
||||
cmd_args += ('--disable-web-security', '--ignore-certificate-errors')
|
||||
|
|
@ -236,6 +215,46 @@ def chrome_args(**options) -> List[str]:
|
|||
|
||||
return cmd_args
|
||||
|
||||
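chrome_args() is essentially a dict merge (caller options override the CHROME_OPTIONS defaults) followed by conditional flag appending. A self-contained sketch of that pattern, with made-up default values standing in for the real ArchiveBox config:

```python
from typing import List

# Made-up defaults standing in for archivebox.config.CHROME_OPTIONS
CHROME_OPTIONS = {
    'CHROME_BINARY': 'chromium-browser',
    'CHROME_HEADLESS': True,
    'CHROME_SANDBOX': True,
    'CHECK_SSL_VALIDITY': True,
}

def build_chrome_cmd(**options) -> List[str]:
    """Merge defaults with overrides and flatten them into an argv list."""
    options = {**CHROME_OPTIONS, **options}
    cmd_args = [options['CHROME_BINARY']]
    if options['CHROME_HEADLESS']:
        cmd_args += ['--headless']
    if not options['CHROME_SANDBOX']:
        # same docker-friendly flags as in the diff above
        cmd_args += ['--no-sandbox', '--disable-gpu',
                     '--disable-dev-shm-usage', '--disable-software-rasterizer']
    if not options['CHECK_SSL_VALIDITY']:
        cmd_args += ['--disable-web-security', '--ignore-certificate-errors']
    return cmd_args

print(build_chrome_cmd(CHROME_SANDBOX=False, CHECK_SSL_VALIDITY=False))
```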
def ansi_to_html(text):
    """
    Based on: https://stackoverflow.com/questions/19212665/python-converting-ansi-color-codes-to-html
    """
    from .config import COLOR_DICT

    TEMPLATE = '<span style="color: rgb{}"><br>'
    text = text.replace('[m', '</span>')

    def single_sub(match):
        argsdict = match.groupdict()
        if argsdict['arg_3'] is None:
            if argsdict['arg_2'] is None:
                _, color = 0, argsdict['arg_1']
            else:
                _, color = argsdict['arg_1'], argsdict['arg_2']
        else:
            _, color = argsdict['arg_3'], argsdict['arg_2']

        return TEMPLATE.format(COLOR_DICT[color][0])

    return COLOR_REGEX.sub(single_sub, text)
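COLOR_REGEX and COLOR_DICT are defined in the config module and aren't shown in this hunk. As a rough, self-contained illustration of the same ANSI-to-HTML substitution idea (with an invented regex and color table, not the real ones):

```python
import re

# Invented stand-ins: the real COLOR_REGEX / COLOR_DICT live in archivebox.config
COLOR_DICT = {
    '31': 'rgb(204, 0, 0)',     # red
    '32': 'rgb(0, 153, 0)',     # green
    '33': 'rgb(204, 102, 0)',   # yellow
}
ANSI_RE = re.compile(r'\x1b\[(?P<code>\d+)m')

def ansi_to_html_simple(text: str) -> str:
    """Replace basic ANSI foreground color codes with <span> tags."""
    text = text.replace('\x1b[0m', '</span>').replace('\x1b[m', '</span>')
    return ANSI_RE.sub(
        lambda m: '<span style="color: {}">'.format(COLOR_DICT.get(m.group('code'), 'inherit')),
        text,
    )

print(ansi_to_html_simple('\x1b[31merror\x1b[0m ok'))
# -> <span style="color: rgb(204, 0, 0)">error</span> ok
```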
class AttributeDict(dict):
    """Helper to allow accessing dict values via Example.key or Example['key']"""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Recursively convert nested dicts to AttributeDicts (optional):
        # for key, val in self.items():
        #     if isinstance(val, dict) and type(val) is not AttributeDict:
        #         self[key] = AttributeDict(val)

    def __getattr__(self, attr: str) -> Any:
        return dict.__getitem__(self, attr)

    def __setattr__(self, attr: str, value: Any) -> None:
        return dict.__setitem__(self, attr, value)

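A quick usage sketch of the helper above (the class body is copied inline so the snippet runs on its own):

```python
from typing import Any

class AttributeDict(dict):
    """Inline copy of the helper above, minus the optional recursive __init__."""
    def __getattr__(self, attr: str) -> Any:
        return dict.__getitem__(self, attr)
    def __setattr__(self, attr: str, value: Any) -> None:
        return dict.__setitem__(self, attr, value)

config = AttributeDict({'OUTPUT_DIR': '/data', 'TIMEOUT': 60})
assert config.TIMEOUT == config['TIMEOUT'] == 60   # attribute and key access are equivalent
config.USE_COLOR = True                            # attribute writes go straight into the dict
assert config['USE_COLOR'] is True
# Note: missing keys raise KeyError (not AttributeError) on attribute access.
```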
class ExtendedEncoder(pyjson.JSONEncoder):
    """
7
bin/archive
Executable file

@ -0,0 +1,7 @@
#!/bin/sh

echo "[X] This method of running ArchiveBox is deprecated as of >= v0.4."
echo "    You should 'pip install archivebox' and use the installed 'archivebox' binary instead."
echo "    For more info, see the Quickstart section of the README.md:"
echo "    https://github.com/pirate/ArchiveBox#Quickstart"
exit 2

@ -1 +0,0 @@
../archivebox/__main__.py
34
bin/docker_entrypoint.sh
Executable file

@ -0,0 +1,34 @@
#!/usr/bin/env bash

# Autodetect UID,GID of host user based on ownership of files in the data volume
DATA_DIR="${DATA_DIR:-/data}"
ARCHIVEBOX_USER="${ARCHIVEBOX_USER:-archivebox}"

USID=$(stat --format="%u" "$DATA_DIR")
GRID=$(stat --format="%g" "$DATA_DIR")

# If user is not root, modify the archivebox user+files to have the same uid,gid
if [[ "$USID" != 0 && "$GRID" != 0 ]]; then
    usermod -u "$USID" "$ARCHIVEBOX_USER"
    groupmod -g "$GRID" "$ARCHIVEBOX_USER"
    chown -R "$USID":"$GRID" "/home/$ARCHIVEBOX_USER"
    chown "$USID":"$GRID" "$DATA_DIR"
    chown "$USID":"$GRID" "$DATA_DIR/*" > /dev/null 2>&1 || true
fi

# Run commands as the new archivebox user in Docker.
# Any files touched will have the same uid & gid
# inside Docker and outside on the host machine.
if [[ "$1" == /* || "$1" == "echo" || "$1" == "archivebox" ]]; then
    # arg 1 is a binary, execute it verbatim
    # e.g. "archivebox init"
    #      "/bin/bash"
    #      "echo"
    gosu "$ARCHIVEBOX_USER" bash -c "$*"
else
    # no command given, assume args were meant to be passed to archivebox cmd
    # e.g. "add https://example.com"
    #      "manage createsuperuser"
    #      "server 0.0.0.0:8000"
    gosu "$ARCHIVEBOX_USER" bash -c "archivebox $*"
fi
@ -35,3 +35,19 @@ if [[ "$1" == "--firefox" ]]; then
    echo "Firefox history exported to:"
    echo "    output/sources/firefox_history.json"
fi

if [[ "$1" == "--safari" ]]; then
    # Safari
    if [[ -e "$2" ]]; then
        cp "$2" "$REPO_DIR/output/sources/safari_history.db.tmp"
    else
        default="$HOME/Library/Safari/History.db"
        echo "Defaulting to history db: $default"
        echo "Optionally specify the path to a different sqlite history database as the 2nd argument."
        cp "$default" "$REPO_DIR/output/sources/safari_history.db.tmp"
    fi
    sqlite3 "$REPO_DIR/output/sources/safari_history.db.tmp" "select url from history_items" > "$REPO_DIR/output/sources/safari_history.json"
    rm "$REPO_DIR"/output/sources/safari_history.db.*
    echo "Safari history exported to:"
    echo "    output/sources/safari_history.json"
fi
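The Safari branch above is just a copy of History.db plus one SQL query. An equivalent standard-library Python sketch, assuming the default macOS database location (illustrative only, not part of the script):

```python
import sqlite3
from pathlib import Path

# Assumed default Safari history location on macOS; pass another path as needed.
db_path = Path.home() / 'Library' / 'Safari' / 'History.db'

# In real usage, query a temporary copy (as the script does) to avoid touching
# the live database; reading in place is fine for a quick illustration.
with sqlite3.connect(str(db_path)) as db:
    urls = [row[0] for row in db.execute('SELECT url FROM history_items')]

print('\n'.join(urls[:10]))
```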
23
bin/lint.sh
Executable file

@ -0,0 +1,23 @@
#!/usr/bin/env bash

### Bash Environment Setup
# http://redsymbol.net/articles/unofficial-bash-strict-mode/
# https://www.gnu.org/software/bash/manual/html_node/The-Set-Builtin.html
# set -o xtrace
set -o errexit
set -o errtrace
set -o nounset
set -o pipefail
IFS=$'\n'

DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && cd .. && pwd )"

source "$DIR/.venv/bin/activate"

echo "[*] Running flake8..."
flake8 archivebox && echo "√ No errors found."

echo

echo "[*] Running mypy..."
echo "(skipping for now, run 'mypy archivebox' to run it manually)"
80
bin/release.sh
Executable file

@ -0,0 +1,80 @@
#!/usr/bin/env bash

### Bash Environment Setup
# http://redsymbol.net/articles/unofficial-bash-strict-mode/
# https://www.gnu.org/software/bash/manual/html_node/The-Set-Builtin.html
# set -o xtrace
set -o errexit
set -o errtrace
set -o nounset
set -o pipefail
IFS=$'\n'

DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && cd .. && pwd )"
VERSION_FILE="$DIR/archivebox/VERSION"

function bump_semver {
    echo "$1" | awk -F. '{$NF = $NF + 1;} 1' | sed 's/ /./g'
}

source "$DIR/.venv/bin/activate"
cd "$DIR"

OLD_VERSION="$(cat "$VERSION_FILE")"
NEW_VERSION="$(bump_semver "$OLD_VERSION")"

echo "[*] Fetching latest docs version"
cd "$DIR/docs"
git pull
cd "$DIR"

echo "[+] Building docs"
sphinx-apidoc -o docs archivebox
cd "$DIR/docs"
make html
cd "$DIR"

if [ -z "$(git status --porcelain)" ] && [[ "$(git branch --show-current)" == "master" ]]; then
    git pull
else
    echo "[X] Commit your changes and make sure git is checked out on clean master."
    exit 4
fi

echo "[*] Bumping VERSION from $OLD_VERSION to $NEW_VERSION"
echo "$NEW_VERSION" > "$VERSION_FILE"
git add "$VERSION_FILE"
git commit -m "$NEW_VERSION release"
git tag -a "v$NEW_VERSION" -m "v$NEW_VERSION"
git push origin master
git push origin --tags

echo "[*] Cleaning up build dirs"
cd "$DIR"
rm -Rf build dist

echo "[+] Building sdist and bdist_wheel"
python3 setup.py sdist bdist_wheel

echo "[^] Uploading to test.pypi.org"
python3 -m twine upload --repository testpypi dist/*

echo "[^] Uploading to pypi.org"
python3 -m twine upload --repository pypi dist/*

echo "[+] Building docker image"
docker build . -t archivebox \
    -t archivebox:latest \
    -t archivebox:$NEW_VERSION \
    -t docker.io/nikisweeting/archivebox:latest \
    -t docker.io/nikisweeting/archivebox:$NEW_VERSION \
    -t docker.pkg.github.com/pirate/archivebox/archivebox:latest \
    -t docker.pkg.github.com/pirate/archivebox/archivebox:$NEW_VERSION

echo "[^] Uploading docker image"
# docker login --username=nikisweeting
# docker login docker.pkg.github.com --username=pirate
docker push docker.io/nikisweeting/archivebox
docker push docker.pkg.github.com/pirate/archivebox/archivebox

echo "[√] Done. Published version v$NEW_VERSION"
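The bump_semver helper near the top of this script just increments the last dot-separated field of the version string. A Python equivalent of that one-liner, purely for illustration:

```python
def bump_semver(version: str) -> str:
    """Increment the final dot-separated field, like the awk/sed pipeline above."""
    *head, last = version.strip().split('.')
    return '.'.join(head + [str(int(last) + 1)])

assert bump_semver('0.4.21') == '0.4.22'
assert bump_semver('0.5.0') == '0.5.1'
```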
17
bin/test.sh
Executable file

@ -0,0 +1,17 @@
#!/usr/bin/env bash

### Bash Environment Setup
# http://redsymbol.net/articles/unofficial-bash-strict-mode/
# https://www.gnu.org/software/bash/manual/html_node/The-Set-Builtin.html
# set -o xtrace
set -o errexit
set -o errtrace
set -o nounset
set -o pipefail
IFS=$'\n'

DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && cd .. && pwd )"

source "$DIR/.venv/bin/activate"

pytest
@ -1,32 +1,75 @@
# This docker-compose config for ArchiveBox runs the following containers:
#    - ArchiveBox (it creates the initial archive, then sleeps forever to allow commands to be run with exec to add links)
#    - nginx webserver running on https://127.0.0.1:8098
# Usage:
#     docker-compose up -d
#     echo "https://example.com" | docker-compose exec -T archivebox /bin/archive
#     docker-compose exec archivebox /bin/archive https://example.com/some/feed.rss
#     docker-compose run archivebox init
#     echo "https://example.com" | docker-compose run archivebox archivebox add
#     docker-compose run archivebox add --depth=1 https://example.com/some/feed.rss
#     docker-compose run archivebox config --set PUBLIC_INDEX=True
# Documentation:
#     https://github.com/pirate/ArchiveBox/wiki/Docker#docker-compose

version: '3'
version: '3.7'

services:
    archivebox:
        build: .
        # build: .
        image: nikisweeting/archivebox:latest
        command: server 0.0.0.0:8000
        stdin_open: true
        tty: true
        # env_file: path/to/your/ArchiveBox.conf
        ports:
            - 8000:8000
        environment:
            - USE_COLOR=False
            - USE_COLOR=True
            - SHOW_PROGRESS=False
        volumes:
            - ./data:/data
        command: bash -c 'echo "https://github.com/pirate/ArchiveBox" | /bin/archive; tail -f /dev/null'

    nginx:
        image: 'nginx'
        ports:
            - '8098:80'
        volumes:
            - ./etc/nginx/nginx.conf:/etc/nginx/nginx.conf
            - ./data:/var/www

    # Optional Addons: tweak these examples as needed for your specific use case

    # Example: Run scheduled imports in a docker instead of using cron on the
    # host machine, add tasks and see more info with archivebox schedule --help
    # scheduler:
    #     image: nikisweeting/archivebox:latest
    #     command: schedule --foreground
    #     environment:
    #         - USE_COLOR=True
    #         - SHOW_PROGRESS=False
    #     volumes:
    #         - ./data:/data

    # Example: Put Nginx in front of the ArchiveBox server for SSL termination
    # nginx:
    #     image: nginx:alpine
    #     ports:
    #         - 443:443
    #         - 80:80
    #     volumes:
    #         - ./etc/nginx/nginx.conf:/etc/nginx/nginx.conf
    #         - ./data:/var/www

    # Example: run all your ArchiveBox traffic through a WireGuard VPN tunnel
    # wireguard:
    #     image: linuxserver/wireguard
    #     network_mode: 'service:archivebox'
    #     cap_add:
    #         - NET_ADMIN
    #         - SYS_MODULE
    #     sysctls:
    #         - net.ipv4.conf.all.rp_filter=2
    #         - net.ipv4.conf.all.src_valid_mark=1
    #     volumes:
    #         - /lib/modules:/lib/modules
    #         - ./wireguard.conf:/config/wg0.conf:ro

    # Example: Run PYWB in parallel and auto-import WARCs from ArchiveBox
    # pywb:
    #     image: webrecorder/pywb:latest
    #     entrypoint: /bin/sh 'wb-manager add default /archivebox/archive/*/warc/*.warc.gz; wayback --proxy;'
    #     environment:
    #         - INIT_COLLECTION=archivebox
    #     ports:
    #         - 8080:8080
    #     volumes:
    #         - ./data:/archivebox
    #         - ./data/wayback:/webarchive