Repository: oduwsdl/archivenow
Branch: master
Commit: dbc688f4f238
Files: 16
Total size: 52.6 KB
Directory structure:
gitextract_3tv8j6jl/
├── .dockerignore
├── .gitignore
├── Dockerfile
├── LICENSE
├── README.rst
├── archivenow/
│ ├── __init__.py
│ ├── archivenow.py
│ ├── handlers/
│ │ ├── cc_handler.py
│ │ ├── ia_handler.py
│ │ ├── is_handler.py
│ │ ├── mg_handler.py
│ │ └── warc_handler.py
│ └── templates/
│ ├── api.txt
│ └── index.html
├── requirements.txt
└── setup.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .dockerignore
================================================
.git
.gitignore
LICENSE
Dockerfile
================================================
FILE: .gitignore
================================================
.DS_Store
archivenow.egg-info/
build/
dist/
__pycache__
================================================
FILE: Dockerfile
================================================
ARG PYTAG=latest
FROM python:${PYTAG}
LABEL maintainer "Mohamed Aturban <mohsci1@yahoo.com>"
WORKDIR /app
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY . ./
RUN chmod a+x ./archivenow/archivenow.py
ENTRYPOINT ["./archivenow/archivenow.py"]
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2017 ODU Web Science / Digital Libraries Research Group
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.rst
================================================
Archive Now (archivenow)
=============================
A Tool To Push Web Resources Into Web Archives
----------------------------------------------
Archive Now (**archivenow**) is currently configured to push resources into four public web archives. You can easily add more archives by writing a new archive handler (e.g., myarchive_handler.py) and placing it inside the "handlers" folder.
Update January 2021
~~~~~~~~~~~~~~~~~~~
Originally, **archivenow** was configured to push to six public web archives. Two of them have since been removed: `WebCite <https://www.webcitation.org/>`_, which no longer accepts archiving requests, and `archive.st <http://archive.st/>`_, which now presents a Captcha when a push is attempted. In addition, the method for pushing to `archive.today <https://archive.vn/>`_ and `megalodon.jp <https://megalodon.jp/>`_ has changed: **archivenow** now uses `Selenium <https://selenium-python.readthedocs.io/>`_ to push to these two archives.
As explained below, this library can be used through:
- Command Line Interface (CLI)
- A Web Service
- A Docker Container
- Python
Installing
----------
The latest release of **archivenow** can be installed using pip:
.. code-block:: bash
$ pip install archivenow
The latest development version containing changes not yet released can be installed from source:
.. code-block:: bash
$ git clone git@github.com:oduwsdl/archivenow.git
$ cd archivenow
$ pip install -r requirements.txt
$ pip install ./
Pushing to `archive.today <https://archive.vn/>`_ and `megalodon.jp <https://megalodon.jp/>`_ requires `Selenium <https://selenium-python.readthedocs.io/>`_, which is already listed in requirements.txt. Selenium additionally needs a driver to interface with the chosen browser; it is recommended to use **archivenow** with the latest versions of `Firefox <https://www.mozilla.org/en-US/firefox/releases/>`_ and its corresponding `GeckoDriver <https://github.com/mozilla/geckodriver/releases>`_. Once the driver is installed, **archivenow** can push to both archives.
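Before pushing to these two archives, you can check that Selenium will be able to locate the driver. The sketch below uses only the Python standard library; the helper name is ours, and Selenium itself performs an equivalent lookup on PATH:

```python
import shutil

def geckodriver_available():
    """Return True if the GeckoDriver binary is discoverable on PATH."""
    return shutil.which("geckodriver") is not None

if not geckodriver_available():
    print("GeckoDriver not found; pushes to archive.today and megalodon.jp will fail.")
```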
CLI Usage
---------
Usage of **archivenow** and its sub-commands can be viewed by providing the `-h` or `--help` flag, as shown below.
.. code-block:: bash
$ archivenow -h
usage: archivenow.py [-h] [--mg] [--cc] [--cc_api_key [CC_API_KEY]]
[--is] [--ia] [--warc [WARC]] [-v] [--all]
[--server] [--host [HOST]] [--agent [AGENT]]
[--port [PORT]]
[URI]
positional arguments:
URI URI of a web resource
optional arguments:
-h, --help show this help message and exit
--mg Use Megalodon.jp
--cc Use The Perma.cc Archive
--cc_api_key [CC_API_KEY]
An API KEY is required by The Perma.cc Archive
--is Use The Archive.is
--ia Use The Internet Archive
--warc [WARC] Generate WARC file
-v, --version Report the version of archivenow
--all Use all possible archives
--server Run archiveNow as a Web Service
--host [HOST] A server address
--agent [AGENT] Use "wget" or "squidwarc" for WARC generation
--port [PORT] A port number to run a Web Service
Examples
--------
Example 1
~~~~~~~~~
To save the web page (www.foxnews.com) in the Internet Archive:
.. code-block:: bash
$ archivenow --ia www.foxnews.com
https://web.archive.org/web/20170209135625/http://www.foxnews.com
Example 2
~~~~~~~~~
By default, the web page (e.g., www.foxnews.com) will be saved in the Internet Archive if no optional arguments are provided:
.. code-block:: bash
$ archivenow www.foxnews.com
https://web.archive.org/web/20170215164835/http://www.foxnews.com
Example 3
~~~~~~~~~
To save the web page (www.foxnews.com) in the Internet Archive (archive.org) and Archive.is:
.. code-block:: bash
$ archivenow --ia --is www.foxnews.com
https://web.archive.org/web/20170209140345/http://www.foxnews.com
http://archive.is/fPVyc
Example 4
~~~~~~~~~
To save the web page (https://nypost.com/) in all configured web archives and, in addition, create a WARC file locally:
.. code-block:: bash
$ archivenow --all https://nypost.com/ --cc_api_key $Your-Perma-CC-API-Key
http://archive.is/dcnan
https://perma.cc/53CC-5ST8
https://web.archive.org/web/20181002081445/https://nypost.com/
https://megalodon.jp/2018-1002-1714-24/https://nypost.com:443/
https_nypost.com__96ec2300.warc
Example 5
~~~~~~~~~
To download the web page (https://nypost.com/) and create a WARC file:
.. code-block:: bash
$ archivenow --warc=mypage --agent=wget https://nypost.com/
mypage.warc
Server
------
You can run **archivenow** as a web service and optionally specify the server address and/or the port number (e.g., --host localhost --port 12345):
.. code-block:: bash
$ archivenow --server
Running on http://0.0.0.0:12345/ (Press CTRL+C to quit)
Example 6
~~~~~~~~~
To save the web page (www.foxnews.com) in The Internet Archive through the web service:
.. code-block:: bash
$ curl -i http://0.0.0.0:12345/ia/www.foxnews.com
HTTP/1.0 200 OK
Content-Type: application/json
Content-Length: 95
Server: Werkzeug/0.11.15 Python/2.7.10
Date: Tue, 02 Oct 2018 08:20:18 GMT
{
"results": [
"https://web.archive.org/web/20181002082007/http://www.foxnews.com"
]
}
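A client in any language can consume this JSON response. A minimal Python sketch using only the standard library, with the example body shown above pasted in as a string:

```python
import json

# Response body from the web service (copied from the example above).
body = '{"results": ["https://web.archive.org/web/20181002082007/http://www.foxnews.com"]}'

data = json.loads(body)
# Entries that start with "Error" report failures for individual archives.
archived = [u for u in data["results"] if not u.startswith("Error")]
print(archived)
```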
Example 7
~~~~~~~~~
To save the web page (www.foxnews.com) in all configured archives through the web service:
.. code-block:: bash
$ curl -i http://0.0.0.0:12345/all/www.foxnews.com
HTTP/1.0 200 OK
Content-Type: application/json
Content-Length: 385
Server: Werkzeug/0.11.15 Python/2.7.10
Date: Tue, 02 Oct 2018 08:23:53 GMT
{
"results": [
"Error (The Perma.cc Archive): An API Key is required ",
"http://archive.is/ukads",
"https://web.archive.org/web/20181002082007/http://www.foxnews.com",
"Error (Megalodon.jp): We can not obtain this page because the time limit has been reached or for technical ... ",
"http://www.webcitation.org/72rbKsX8B"
]
}
Example 8
~~~~~~~~~
Because an API Key is required by Perma.cc, the HTTP request should be as follows:
.. code-block:: bash
$ curl -i http://127.0.0.1:12345/all/https://nypost.com/?cc_api_key=$Your-Perma-CC-API-Key
Or use only Perma.cc:
.. code-block:: bash
$ curl -i http://127.0.0.1:12345/cc/https://nypost.com/?cc_api_key=$Your-Perma-CC-API-Key
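Note that unquoted `?` and `$` characters may be interpreted by the shell before curl sees them, so quoting the URL is safer. The same request URL can also be built from Python with the standard library; the key value below is a placeholder, not a real API key:

```python
from urllib.parse import urlencode

api_key = "YOUR-PERMA-CC-API-KEY"  # placeholder, not a real key
target = "https://nypost.com/"

# The service expects /{archive-id}/{URI}, with extra arguments as a query string.
url = "http://127.0.0.1:12345/cc/" + target + "?" + urlencode({"cc_api_key": api_key})
print(url)
```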
Running as a Docker Container
-----------------------------
.. code-block:: bash
$ docker image pull oduwsdl/archivenow
There are different ways to run **archivenow**. To show the help message:
.. code-block:: bash
$ docker container run -it --rm oduwsdl/archivenow -h
Accessible at 127.0.0.1:12345:
.. code-block:: bash
$ docker container run -p 12345:12345 -it --rm oduwsdl/archivenow --server --host 0.0.0.0
Accessible at 127.0.0.1:22222:
.. code-block:: bash
$ docker container run -p 22222:11111 -it --rm oduwsdl/archivenow --server --port 11111 --host 0.0.0.0
.. image:: http://www.cs.odu.edu/~maturban/archivenow-6-archives.gif
:width: 10pt
To save the web page (http://www.cnn.com) in The Internet Archive:
.. code-block:: bash
$ docker container run -it --rm oduwsdl/archivenow --ia http://www.cnn.com
Python Usage
------------
.. code-block:: python
>>> from archivenow import archivenow
Example 9
~~~~~~~~~~
To save the web page (www.foxnews.com) in all configured archives:
.. code-block:: python
>>> archivenow.push("www.foxnews.com","all")
['https://web.archive.org/web/20170209145930/http://www.foxnews.com','http://archive.is/oAjuM','http://www.webcitation.org/6o9LcQoVV','Error (The Perma.cc Archive): An API KEY is required']
Example 10
~~~~~~~~~~
To save the web page (www.foxnews.com) in Perma.cc:
.. code-block:: python
>>> archivenow.push("www.foxnews.com","cc",{"cc_api_key":"$YOUR-Perma-cc-API-KEY"})
['https://perma.cc/8YYC-C7RM']
Example 11
~~~~~~~~~~
To start the server from Python, do the following. The host and/or port number can be passed as arguments (e.g., start(port=1111, host='localhost')):
.. code-block:: python
>>> archivenow.start()
2017-02-09 15:02:37
Running on http://127.0.0.1:12345
(Press CTRL+C to quit)
Configuring a new archive or removing an existing one
-----------------------------------------------------
Additional archives may be added by creating a handler file in the "handlers" directory.
For example, if I want to add a new archive named "My Archive", I would create a file "ma_handler.py" and store it in the folder "handlers". The "ma" will be the archive identifier, so to push a web page (e.g., www.cnn.com) to this archive through the Python code, I should write:
.. code-block:: python
archivenow.push("www.cnn.com","ma")
In the file "ma_handler.py", the name of the class must be "MA_handler". This class must have at least one function called "push", whose first argument is the URI to be pushed. See the existing `handler files`_ for examples of how to organize a newly configured archive handler.
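A minimal skeleton for the hypothetical "ma_handler.py" might look like the following. The attributes mirror the existing handlers, the push body is a placeholder, and the ``**kwargs`` absorbs the extra keyword arguments (such as a requests session) that **archivenow** passes to handlers:

```python
class MA_handler(object):
    def __init__(self):
        # Attributes archivenow expects, mirroring the existing handlers.
        self.enabled = True        # set to False to disable this archive
        self.name = 'My Archive'
        self.api_required = False  # True if push() needs an ma_api_key

    def push(self, uri_org, p_args={}, **kwargs):
        # Submit uri_org to the archive here. On success, return the URI of
        # the archived copy; on failure, return an "Error (...)" string.
        return 'Error (' + self.name + '): not implemented'
```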
Removing an archive can be done by one of the following options:
- Removing the archive handler file from the folder "handlers"
- Renaming the archive handler file to a name that does not end with "_handler.py"
- Setting the variable "enabled" to "False" inside the handler file
Notes
-----
The Internet Archive (IA) sets a time gap of at least two minutes between creating different copies of the "same" resource.
For example, if you send a request to IA to capture (www.cnn.com) at 10:00pm, IA will create a new copy (*C*) of this URI. IA will then return *C* for all requests to the archive for this URI received until 10:02pm. Using this same submission procedure for Archive.is requires a time gap of five minutes.
.. _handler files: https://github.com/oduwsdl/archivenow/tree/master/archivenow/handlers
Citing Project
--------------
.. code-block:: latex
@INPROCEEDINGS{archivenow-jcdl2018,
AUTHOR = {Mohamed Aturban and
Mat Kelly and
Sawood Alam and
John A. Berlin and
Michael L. Nelson and
Michele C. Weigle},
TITLE = {{ArchiveNow}: Simplified, Extensible, Multi-Archive Preservation},
BOOKTITLE = {Proceedings of the 18th {ACM/IEEE-CS} Joint Conference on Digital Libraries},
SERIES = {{JCDL} '18},
PAGES = {321--322},
MONTH = {June},
YEAR = {2018},
ADDRESS = {Fort Worth, Texas, USA},
URL = {https://doi.org/10.1145/3197026.3203880},
DOI = {10.1145/3197026.3203880}
}
================================================
FILE: archivenow/__init__.py
================================================
__version__ = '2020.7.18.12.19.44'
================================================
FILE: archivenow/archivenow.py
================================================
#!/usr/bin/env python
import os
import re
import sys
import uuid
import glob
import json
import importlib
import argparse
import string
import requests
from threading import Thread
from flask import request, Flask, jsonify, render_template
from pathlib import Path
#from __init__ import __version__ as archiveNowVersion
archiveNowVersion = '2020.7.18.12.19.44'
# archive handlers path
PATH = Path(os.path.dirname(os.path.abspath(__file__)))
PATH_HANDLER = PATH / 'handlers'
# for the web app
app = Flask(__name__)
# create handlers for enabled archives
global handlers
handlers = {}
# default values for server/port
SERVER_IP = '0.0.0.0'
SERVER_PORT = 12345
def bad_request(error=None):
message = {
'status': 400,
'message': 'Error in processing the request',
}
resp = jsonify(message)
resp.status_code = 400
return resp
# def getServer_IP_PORT():
# u = str(SERVER_IP)
# if str(SERVER_PORT) != '80':
# u = u + ":" + str(SERVER_PORT)
# if 'http' != u[0:4]:
# u = 'http://' + u
# return u
def listArchives_server(handlers):
uri_args = ''
if 'cc' in handlers:
if handlers['cc'].enabled and handlers['cc'].api_required:
uri_args = '?cc_api_key={Your-Perma.cc-API-Key}'
li = {"archives": [{ # getServer_IP_PORT() +
"id": "all", "GET":'/all/' + '{URI}'+uri_args,
"archive-name": "All enabled archives"}]}
for handler in handlers:
if handlers[handler].enabled:
uri_args2 = ''
if handler == 'cc':
uri_args2 = uri_args
li["archives"].append({ #getServer_IP_PORT() +
"id": handler, "archive-name": handlers[handler].name,
"GET": '/' + handler + '/' + '{URI}'+uri_args2})
return li
@app.route('/', defaults={'path': ''}, methods=['GET'])
@app.route('/<path:path>', methods=['GET'])
def pushit(path):
# no path; render the main page
if path == '':
#resp = jsonify(listArchives_server(handlers))
#resp.status_code = 200
return render_template('index.html')
#return resp
# get request with path
elif (path == 'api'):
resp = jsonify(listArchives_server(handlers))
resp.status_code = 200
return resp
elif (path == "ajax-loader.gif"):
return render_template('ajax-loader.gif')
else:
try:
# get the args passed to push function like API KEY if provided
PUSH_ARGS = {}
for k in request.args.keys():
PUSH_ARGS[k] = request.args[k]
s = str(path).split('/', 1)
arc_id = s[0]
URI = request.url.split('/', 4)[4] # include query params, too
if 'herokuapp.com' in request.host:
PUSH_ARGS['from_heroku'] = True
# To push into archives
resp = {"results": push(URI, arc_id, PUSH_ARGS)}
if len(resp["results"]) == 0:
return bad_request()
else:
# what to return
resp = jsonify(resp)
resp.status_code = 200
return resp
except Exception as e:
pass
return bad_request()
res_uris = {}
def push_proxy(hdlr, URIproxy, p_args_proxy, res_uris_idx, session=requests.Session()):
global res_uris
try:
res = hdlr.push( URIproxy , p_args_proxy, session=session)
print ( res )
res_uris[res_uris_idx].append(res)
except:
pass
def push(URI, arc_id, p_args={}, session=requests.Session()):
global handlers
global res_uris
try:
# push to all possible archives
res_uris_idx = str(uuid.uuid4())
res_uris[res_uris_idx] = []
### if arc_id == 'all':
### for handler in handlers:
### if (handlers[handler].api_required):
# pass args like key API
### res.append(handlers[handler].push(str(URI), p_args))
### else:
### res.append(handlers[handler].push(str(URI)))
### else:
# push to the chosen archives
threads = []
for handler in handlers:
if (arc_id == handler) or (arc_id == 'all'):
### if (arc_id == handler): ### and (handlers[handler].api_required):
#res.append(handlers[handler].push(str(URI), p_args))
#push_proxy( handlers[handler], str(URI), p_args, res_uris_idx)
threads.append(
Thread(
target=push_proxy,
args=(handlers[handler], str(URI), p_args, res_uris_idx, ),
kwargs={'session': session}))
### elif (arc_id == handler):
### res.append(handlers[handler].push(str(URI)))
for th in threads:
th.start()
for th in threads:
th.join()
res = res_uris[res_uris_idx]
del res_uris[res_uris_idx]
return res
except:
del res_uris[res_uris_idx]
pass
return ["bad request"]
def start(port=SERVER_PORT, host=SERVER_IP):
global SERVER_PORT
global SERVER_IP
SERVER_PORT = port
SERVER_IP = host
app.run(
host=host,
port=port,
threaded=True,
debug=True,
use_reloader=False)
def load_handlers():
global handlers
handlers = {}
# add the path of the handlers to the system so they can be imported
sys.path.append(str(PATH_HANDLER))
# create a list of handlers.
for file in PATH_HANDLER.glob('*_handler.py'):
name = file.stem
prefix = name.replace('_handler', '')
mod = importlib.import_module(name)
mod_class = getattr(mod, prefix.upper() + '_handler')
# finally an object is created
handlers[prefix] = mod_class()
# exclude all disabled archives
for handler in list(handlers): # handlers.keys():
if not handlers[handler].enabled:
del handlers[handler]
def args_parser():
global SERVER_PORT
global SERVER_IP
# parsing arguments
class MyParser(argparse.ArgumentParser):
def error(self, message):
sys.stderr.write('error: %s\n' % message)
self.print_help()
sys.exit(2)
def printm(self):
sys.stderr.write('')
self.print_help()
sys.exit(2)
parser = MyParser()
# arc_handler = 0
for handler in handlers:
# add archives identifiers to the list of options
# arc_handler += 1
if handler == 'warc':
parser.add_argument('--' + handler, nargs='?',
help=handlers[handler].name)
else:
parser.add_argument('--' + handler, action='store_true', default=False,
help='Use ' + handlers[handler].name)
if (handlers[handler].api_required):
parser.add_argument(
'--' +
handler +
'_api_key',
nargs='?',
help='An API KEY is required by ' +
handlers[handler].name)
parser.add_argument(
'-v',
'--version',
help='Report the version of archivenow',
action='version',
version='ArchiveNow ' +
archiveNowVersion)
if len(handlers) > 0:
parser.add_argument('--all', action='store_true', default=False,
help='Use all possible archives ')
parser.add_argument('--server', action='store_true', default=False,
help='Run archiveNow as a Web Service ')
parser.add_argument('URI', nargs='?', help='URI of a web resource')
parser.add_argument('--host', nargs='?', help='A server address')
if 'warc' in handlers.keys():
parser.add_argument('--agent', nargs='?', help='Use "wget" or "squidwarc" for WARC generation')
parser.add_argument(
'--port',
nargs='?',
help='A port number to run a Web Service')
args = parser.parse_args()
else:
print ('\n Error: No enabled archive handler found\n')
sys.exit(0)
arc_opt = 0
# start the server
if getattr(args, 'server'):
if getattr(args, 'port'):
SERVER_PORT = int(args.port)
if getattr(args, 'host'):
SERVER_IP = str(args.host)
start(port=SERVER_PORT, host=SERVER_IP)
else:
if not getattr(args, 'URI'):
print (parser.error('too few arguments'))
res = []
# get the args passed to push function like API KEY if provided
PUSH_ARGS = {}
for handler in handlers:
if (handlers[handler].api_required):
if getattr(args, handler + '_api_key'):
PUSH_ARGS[
handler +
'_api_key'] = getattr(
args,
handler +
'_api_key')
else:
if getattr(args, handler):
print (
parser.error(
'An API Key is required by ' +
handlers[handler].name))
orginal_warc_value = getattr(args, 'warc')
if handler == 'warc':
PUSH_ARGS['warc'] = getattr(args, 'warc')
if PUSH_ARGS['warc'] == None:
valid_chars = "-_.()/ %s%s" % (string.ascii_letters, string.digits)
PUSH_ARGS['warc'] = ''.join(c for c in str(args.URI).strip() if c in valid_chars)
PUSH_ARGS['warc'] = PUSH_ARGS['warc'].replace(' ','_').replace('/','_').replace('__','_') # I don't like spaces in filenames.
PUSH_ARGS['warc'] = PUSH_ARGS['warc']+'_'+str(uuid.uuid4())[:8]
if PUSH_ARGS['warc'][-1] == '_':
PUSH_ARGS['warc'] = PUSH_ARGS['warc'][:-1]
agent = 'wget'
tmp_agent = getattr(args, 'agent')
if tmp_agent == 'squidwarc':
agent = tmp_agent
PUSH_ARGS['agent'] = agent
# sys.exit(0)
# push to all possible archives
if getattr(args, 'all'):
arc_opt = 1
res = push(str(args.URI).strip(), 'all', PUSH_ARGS)
else:
# push to the chosen archives
for handler in handlers:
if getattr(args, handler):
arc_opt += 1
for i in push(str(args.URI).strip(), handler, PUSH_ARGS):
res.append(i)
# push to the default archive
if (len(handlers) > 0) and (arc_opt == 0):
# set the default: 'ia' if available, otherwise the first
# archive in the list
if 'ia' in handlers:
res = push(str(args.URI).strip(), 'ia', PUSH_ARGS)
else:
res = push(str(args.URI).strip(),
           list(handlers)[0], PUSH_ARGS)
# print (parser.printm())
# else:
# for rs in res:
# print (rs)
load_handlers()
if __name__ == '__main__':
args_parser()
================================================
FILE: archivenow/handlers/cc_handler.py
================================================
import requests
import json
class CC_handler(object):
def __init__(self):
self.enabled = True
self.name = 'The Perma.cc Archive'
self.api_required = True
def push(self, uri_org, p_args=[], session=requests.Session()):
msg = ''
try:
APIKEY = p_args['cc_api_key']
r = session.post('https://api.perma.cc/v1/archives/?api_key='+APIKEY, timeout=120,
data=json.dumps({"url":uri_org}),
headers={'Content-type': 'application/json'},
allow_redirects=True)
r.raise_for_status()
if 'Location' in r.headers:
return 'https://perma.cc/'+r.headers['Location'].rsplit('/',1)[1]
else:
for r2 in r.history:
if 'Location' in r2.headers:
return 'https://perma.cc/'+r2.headers['Location'].rsplit('/',1)[1]
entity_json = r.json()
if 'guid' in entity_json:
return str('https://perma.cc/'+entity_json['guid'])
msg = "Error ("+self.name+ "): No HTTP Location header is returned in the response"
except Exception as e:
if (msg == '') and ('_api_key' in str(e)):
msg = "Error (" + self.name+ "): " + 'An API Key is required '
elif (msg == ''):
msg = "Error (" + self.name+ "): " + str(e)
pass;
return msg
================================================
FILE: archivenow/handlers/ia_handler.py
================================================
import requests
class IA_handler(object):
def __init__(self):
self.enabled = True
self.name = 'The Internet Archive'
self.api_required = False
def push(self, uri_org, p_args=[], session=requests.Session()):
msg = ''
try:
uri = 'https://web.archive.org/save/' + uri_org
archiveTodayUserAgent = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"
}
# push into the archive
# r = session.get(uri, timeout=120, allow_redirects=True, headers=archiveTodayUserAgent)
if ('user-agent' in session.headers) and (not session.headers['User-Agent'].lower().startswith('python-requests/')):
r = session.get(uri, timeout=120, allow_redirects=True)
else:
r = session.get(uri, timeout=120, allow_redirects=True, headers=archiveTodayUserAgent)
r.raise_for_status()
# extract the link to the archived copy
if (r != None):
if "Location" in r.headers:
return r.headers["Location"]
elif "Content-Location" in r.headers:
if (r.headers["Content-Location"]).startswith("/web/"):
return "https://web.archive.org"+r.headers["Content-Location"]
else:
try:
uri_from_content = "https://web.archive.org" + r.text.split('var redirUrl = "',1)[1].split('"',1)[0]
except:
uri_from_content = r.headers["Content-Location"]
#pass;
return uri_from_content
else:
for r2 in r.history:
if 'Location' in r2.headers:
return r.url
#return r2.headers['Location']
if 'Content-Location' in r2.headers:
return r.url
#return r2.headers['Content-Location']
msg = "Error ("+self.name+ "): No HTTP Location/Content-Location header is returned in the response"
except Exception as e:
if msg == '':
msg = "Error (" + self.name+ "): " + str(e)
pass
return msg
================================================
FILE: archivenow/handlers/is_handler.py
================================================
import os
import requests
import sys
from selenium.webdriver.firefox.options import Options
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
class IS_handler(object):
def __init__(self):
self.enabled = True
self.name = 'The Archive.is'
self.api_required = False
def push(self, uri_org, p_args=[], session=requests.Session()):
msg = ""
try:
options = Options()
options.headless = True # Run in background
driver = webdriver.Firefox(options = options)
driver.get("https://archive.is")
elem = driver.find_element_by_id("url") # Find the form to place a URL to be archived
elem.send_keys(uri_org) # Place the URL in the input box
saveButton = driver.find_element_by_xpath("/html/body/center/div/form[1]/div[3]/input") # Find the submit button
saveButton.click() # Click the submit button
# After clicking submit, there may be an additional page that pops up and asks if you are sure you want
# to archive that page since it was archived X amount of time ago. We need to wait for that page to
# load and click submit again.
delay = 30 # seconds
try:
nextSaveButton = WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.XPATH, "/html/body/center/div[4]/center/div/div[2]/div/form/div/input")))
nextSaveButton.click()
except TimeoutException:
pass
# The page takes a while to archive, so keep checking if the loading page is still displayed.
loading = True
while loading:
if not 'wip' in driver.current_url and not 'submit' in driver.current_url:
loading = False
# After the loading screen is gone and the page is archived, the current URL
# will be the URL to the archived page.
msg = driver.current_url;
driver.quit()
except:
'''
exc_type, exc_obj, exc_tb = sys.exc_info()
fname = os.path.split(exc_tb.tb_frame.f_code.co_filename)[1]
print((fname, exc_tb.tb_lineno, sys.exc_info() ))
'''
msg = "Unable to complete request."
return msg
================================================
FILE: archivenow/handlers/mg_handler.py
================================================
# encoding: utf-8
import os
import requests
import sys
from selenium.webdriver.firefox.options import Options
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
class MG_handler(object):
def __init__(self):
self.enabled = True
self.name = 'Megalodon.jp'
self.api_required = False
def push(self, uri_org, p_args=[], session=requests.Session()):
msg = ""
options = Options()
options.headless = True # Run in background
driver = webdriver.Firefox(options = options)
driver.get("https://megalodon.jp/?url=" + uri_org)
try:
addButton = driver.find_element_by_xpath("/html/body/div[2]/div[2]/div[8]/form/div[1]/input[2]")
addButton.click() # Click the add button
except :
print("Unable to archive this page at this time.")
raise
stillOnPage = True
while stillOnPage:
try:
button = driver.find_element_by_xpath("/html/body/div[2]/div[2]/div[1]/div/h3")
except:
stillOnPage = False
try:
error = driver.find_element_by_xpath("/html/body/div[2]/div[2]/div[3]/div/a/h3")
msg = "We apologize for the inconvenience. Currently, acquisitions that are considered \"robots\" in the acquisition of certain conditions are prohibited."
raise
sys.exit()
except:
pass
# The page takes a while to archive, so keep checking if the loading page is still displayed.
loading = True
while loading:
try:
loadingPage = driver.find_element_by_xpath("/html/body/div[2]/div/div[1]/a/img")
loading = False
except:
loading = True
# After the loading screen is gone and the page is archived, the current URL
# will be the URL to the archived page.
if msg == "":
    msg = driver.current_url
driver.quit()
return msg
================================================
FILE: archivenow/handlers/warc_handler.py
================================================
import requests
import os.path
import distutils.spawn
class WARC_handler(object):
def __init__(self):
self.enabled = True
self.name = 'Generate WARC file'
self.api_required = False
def push(self, uri_org, p_args=[], session=requests.Session()):
msg = ''
if p_args['agent'] == 'squidwarc':
# squidwarc
#if not distutils.spawn.find_executable("squidwarc"):
# return 'wget is not installed!'
os.system('python ~/squidwarc_one_page/generte_warcs.py 9222 "'+uri_org+'" '+p_args['warc']+'.warc &> /dev/null')
if os.path.exists(p_args['warc']):
return p_args['warc']
elif os.path.exists(p_args['warc']+'.warc'):
return p_args['warc']+'.warc'
else:
return 'squidwarc failed to generate the WARC file'
else:
if not distutils.spawn.find_executable("wget"):
return 'wget is not installed!'
# wget
os.system('wget -E -H -k -p -q --delete-after --no-warc-compression --warc-file="'+p_args['warc']+'" "'+uri_org+'"')
if os.path.exists(p_args['warc']):
return p_args['warc']
elif os.path.exists(p_args['warc']+'.warc'):
return p_args['warc']+'.warc'
else:
return 'wget failed to generate the WARC file'
================================================
FILE: archivenow/templates/api.txt
================================================
<!-- <h4 id="archivenow_api">Archive Now API</h4>
<h5 id="archivenow_api1">To push a web page into particular web archive, use the following URL:</h5>
<pre>
http://{server}:{port}/{archive-id}/{URI}
</pre>
<h5 id="archivenow_api2">Archive identifier (use "all" for all archives):</h5>
<table style="width:30%">
<tr>
<th>Archive</th>
<th>Identifier</th>
</tr>
<tr>
<td>Internet Archive</td>
<td>ia</td>
</tr>
<tr>
<td>Archive.is</td>
<td>is</td>
</tr>
<tr>
<td>Perma.cc</td>
<td>cc</td>
</tr>
</table> -->
<!-- <h5 id="archivenow_api3">Example, capture http://www.example.com by Internet Archive: </h5>
<pre>
curl -i http://127.0.0.1:12345/ia/http://www.example.com
HTTP/1.0 200 OK
Content-Type: application/json
Content-Length: 95
Server: Werkzeug/0.11.15 Python/2.7.10
Date: Fri, 10 Nov 2017 22:36:26 GMT
{
"results": [
"https://web.archive.org/web/20171110223626/http://www.example.com"
]
}
</pre>
<h5 id="archivenow_api4">Example, capture http://www.example.com by all four archive (An API KEY is required by Perma.cc): </h5>
<pre>
curl -i 127.0.0.1:12345/all/http://www.example.com?cc_api_key=8r820...
HTTP/1.0 200 OK
Content-Type: application/json
Content-Length: 207
Server: Werkzeug/0.11.15 Python/2.7.10
Date: Fri, 10 Nov 2017 22:42:08 GMT
{
"results": [
"https://perma.cc/QX65-CFDD",
"https://web.archive.org/web/20171110223626/http://www.example.com",
"http://archive.is/ff17A",
"http://www.webcitation.org/6uschXwlI"
]
}
</pre> -->
================================================
FILE: archivenow/templates/index.html
================================================
<html>
<head>
<style>
.reveal-if-active {
opacity: 0;
max-height: 0;
overflow: hidden;
font-size: 14px;
-webkit-transform: scale(0.8);
transform: scale(0.8);
-webkit-transition: 0.5s;
transition: 0.5s;
}
.reveal-if-active label {
margin: 0 0 3px 22px;
display: block;
font-size: smaller;
}
.reveal-if-active input[type=text] {
width: 300px;
}
input[id="choice-archive4"]:checked ~ .reveal-if-active {
opacity: 1;
max-height: 120px;
padding: 0px 0px;
-webkit-transform: scale(1);
transform: scale(1);
overflow: visible;
}
table {
margin: 14px auto;
opacity: 0;
}
table, th, td {
border-collapse: collapse;
}
th, td {
padding: 1px;
text-align: left;
font-family: "My Custom Font", Verdana, Tahoma;
font-size: 12px;
}
tr{
border-bottom: 1px solid #ccc;
border-top: 1px solid #ccc;
}
#title {
display: block;
text-align: center;
padding: 22px 0 0 0
}
.url{
display: block;
text-align: center;
}
#text_url{
width:333px;
font-size: 12.5px;
}
#select_label{
padding: 0px 270px 0 0;
text-align: center;
margin-bottom: 8px;
}
#choices{
text-align: center;
padding: 0px 20px 0px 0px;
margin-left: 133px;
}
#choices2{
text-align: left;
display: inline-block;
}
#perma_cc_api{
margin: -2px 93px 0px 21px;
}
#submitdiv{
text-align: center;
padding: 20px 242px 0 0;
margin: 0 0 0 38px;
}
input[type=submit] {
width: 5em;
height: 2em;
font-size: 12px;
background-color: gainsboro;
margin: 0px 0px 0px 13px;
}
#errors{
font-size: smaller;
color: brown;
padding: 6px 0px 3px 104px;
}
.img1{
width: 13px;
opacity: 0;
}
.img2{
width: 13px;
opacity: 0;
}
.img3{
width: 13px;
opacity: 0;
}
.img5{
width: 13px;
opacity: 0;
}
.img6{
width: 13px;
opacity: 0;
}
.img4{
width: 13px;
opacity: 0;
}
#apilink{
font-size: smaller;
padding-top: 39px;
}
</style>
</head>
<body>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js" type="text/javascript"></script>
<h3 id="title"> Preserve a web page in web archives </h3>
<div class="url">
<label for="text_url" id="label_url">URL</label>
<input type="text" id="text_url" required>
</div>
<div>
<p id="select_label">Select archives:</p>
<div id="choices">
<div id="choices2">
<input type="checkbox" id="choice-archive1" checked > Internet Archive <img src={{ url_for('static', filename = "ajax-loader.gif") }} class="img1" id="img1"> <br>
<input type="checkbox" id="choice-archive2" checked > Archive.is <img src={{ url_for('static', filename = "ajax-loader.gif") }} class="img2" id="img2"> <br>
<input type="checkbox" id="choice-archive6" checked > Megalodon.jp <img src={{ url_for('static', filename = "ajax-loader.gif") }} class="img6" id="img6"> <br>
<input type="checkbox" id="choice-archive4" > Perma.cc <img src={{ url_for('static', filename = "ajax-loader.gif") }} class="img4" id="img4">
<div class="reveal-if-active">
<label for="perma_cc_api">Perma.cc requires <a href="https://perma.cc/settings/tools" target="_blank"> an API Key </a></label>
<input type="text" id="perma_cc_api">
</div>
</div>
</div>
</div>
<div id="submitdiv">
<input type="submit" value="Submit" onClick="push_archive();">
<input type="submit" value="Reset" onClick="reset();">
<div id ="errors"></div>
</div>
<table id="results" width="600">
<thead>
<tr>
<th scope="col" width="130">Archive</th>
<th scope="col" width="450">Link to the archived page</th>
</tr>
</thead>
</table>
<div id="apilink"><a href="/api" target="_blank">Archive Now API</a></div>
<script type="text/javascript">
document.getElementById('perma_cc_api').value = localStorage.getItem("permaccapikey") || "";
if (localStorage.getItem("check_archive_1") !== null){
if (localStorage.getItem("check_archive_1") == 'true'){
document.getElementById('choice-archive1').checked = true
}else{
document.getElementById('choice-archive1').checked = false
}
}
if (localStorage.getItem("check_archive_2") !== null){
if (localStorage.getItem("check_archive_2") == 'true'){
document.getElementById('choice-archive2').checked = true
}else{
document.getElementById('choice-archive2').checked = false
}
}
if (localStorage.getItem("check_archive_6") !== null){
if (localStorage.getItem("check_archive_6") == 'true'){
document.getElementById('choice-archive6').checked = true
}else{
document.getElementById('choice-archive6').checked = false
}
}
if (localStorage.getItem("check_archive_4") !== null){
if (localStorage.getItem("check_archive_4") == 'true'){
document.getElementById('choice-archive4').checked = true
}else{
document.getElementById('choice-archive4').checked = false
}
}
function reset() {
window.location.reload();
}
function push_archive() {
document.getElementById('errors').innerHTML="";
localStorage.setItem("check_archive_1", false);
localStorage.setItem("check_archive_2", false);
localStorage.setItem("check_archive_6", false);
localStorage.setItem("check_archive_4", false);
var arr = []
var table = document.getElementById('results');
for (var r = 1, n = table.rows.length; r < n; r++) {
if(table.rows[r].cells[0].innerHTML.indexOf("https://archive.org") !== -1){
arr.push("ia");
}
if(table.rows[r].cells[0].innerHTML.indexOf("https://archive.is") !== -1){
arr.push("is");
}
if(table.rows[r].cells[0].innerHTML.indexOf("https://megalodon.jp") !== -1){
arr.push("mg");
}
if(table.rows[r].cells[0].innerHTML.indexOf("https://www.webcitation.org") !== -1){
arr.push("wc");
}
if(table.rows[r].cells[0].innerHTML.indexOf("https://perma.cc") !== -1){
arr.push("cc");
}
}
function validateURL(textval) {
var urlregex = /^(https?|ftp):\/\/(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*@)?(((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?)(:\d*)?)(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)*)*)?)?(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)|\/|\?)*)?$/i;
return urlregex.test(textval);
}
if (validateURL(document.getElementById('text_url').value) == false){
document.getElementById('text_url').focus();
document.getElementById('errors').innerHTML="*Enter a correct URL*";
return;
}
if (document.getElementById('choice-archive4').checked == true){ // perma.cc
if(document.getElementById('perma_cc_api').value.trim() == ""){
document.getElementById('perma_cc_api').focus();
document.getElementById('errors').innerHTML="*Enter your Perma.cc API Key*";
return;
}
}
var selected_archives = 0;
if (document.getElementById('choice-archive1').checked == true){
selected_archives = selected_archives + 1;
if(arr.indexOf("ia") == -1){
document.getElementById('img1').style.opacity = 1
$.ajax({
type: "GET",
url: "ia/"+document.getElementById('text_url').value,
success: function(json) {
if (validateURL(json['results'][0]) == true){
var table=document.getElementById("results");
var row=table.insertRow(-1);
var cell1=row.insertCell(0);
var cell2=row.insertCell(1);
cell1.innerHTML='<a href="https://archive.org" target="_blank"> Internet Archive </a>'
cell2.innerHTML='<a href="'+json['results'][0]+'" target="_blank"> '+json['results'][0]+' </a>'
document.getElementById('results').style.opacity = 1
document.getElementById('img1').style.opacity = 0
}
},
complete: function(){
document.getElementById('img1').style.opacity = 0
}
});
}
localStorage.setItem("check_archive_1", true);
}
if (document.getElementById('choice-archive2').checked == true){
selected_archives = selected_archives + 1;
if(arr.indexOf("is") == -1){
document.getElementById('img2').style.opacity = 1
$.ajax({
type: "GET",
url: "is/"+document.getElementById('text_url').value,
success: function(json) {
if (validateURL(json['results'][0]) == true){
var table=document.getElementById("results");
var row=table.insertRow(-1);
var cell1=row.insertCell(0);
var cell2=row.insertCell(1);
cell1.innerHTML='<a href="https://archive.is" target="_blank"> Archive.is </a>'
cell2.innerHTML='<a href="'+json['results'][0]+'" target="_blank"> '+json['results'][0]+' </a>'
document.getElementById('results').style.opacity = 1
document.getElementById('img2').style.opacity = 0
}
},
complete: function(){
document.getElementById('img2').style.opacity = 0
}
});
}
localStorage.setItem("check_archive_2", true);
}
if (document.getElementById('choice-archive6').checked == true){
selected_archives = selected_archives + 1;
if(arr.indexOf("mg") == -1){
document.getElementById('img6').style.opacity = 1
$.ajax({
type: "GET",
url: "mg/"+document.getElementById('text_url').value,
success: function(json) {
if (validateURL(json['results'][0]) == true){
var table=document.getElementById("results");
var row=table.insertRow(-1);
var cell1=row.insertCell(0);
var cell2=row.insertCell(1);
cell1.innerHTML='<a href="https://megalodon.jp" target="_blank"> Megalodon.jp </a>'
cell2.innerHTML='<a href="'+json['results'][0]+'" target="_blank"> '+json['results'][0]+' </a>'
document.getElementById('results').style.opacity = 1
document.getElementById('img6').style.opacity = 0
}
},
complete: function(){
document.getElementById('img6').style.opacity = 0
}
});
}
localStorage.setItem("check_archive_6", true);
}
if (document.getElementById('choice-archive4').checked == true){
selected_archives = selected_archives + 1;
if(arr.indexOf("cc") == -1){
document.getElementById('img4').style.opacity = 1
$.ajax({
type: "GET",
url: "cc/"+document.getElementById('text_url').value+'?cc_api_key='+document.getElementById('perma_cc_api').value,
success: function(json) {
if (validateURL(json['results'][0]) == true){
var table=document.getElementById("results");
var row=table.insertRow(-1);
var cell1=row.insertCell(0);
var cell2=row.insertCell(1);
cell1.innerHTML='<a href="https://perma.cc" target="_blank"> Perma.cc </a>'
cell2.innerHTML='<a href="'+json['results'][0]+'" target="_blank"> '+json['results'][0]+' </a>'
document.getElementById('results').style.opacity = 1
document.getElementById('img4').style.opacity = 0
}
},
complete: function(){
document.getElementById('img4').style.opacity = 0
}
});
}
localStorage.setItem("permaccapikey", document.getElementById('perma_cc_api').value);
localStorage.setItem("check_archive_4", true);
}
if (selected_archives == 0){
document.getElementById('errors').innerHTML="*Select at least one archive*";
return;
}
}
</script>
</body>
</html>
================================================
FILE: requirements.txt
================================================
flask
requests
pathlib
selenium
================================================
FILE: setup.py
================================================
#!/usr/bin/env python
from setuptools import setup, find_packages
from archivenow import __version__
long_description = open('README.rst').read()
desc = """A Python library to push web resources into public web archives"""
setup(
name='archivenow',
version=__version__,
description=desc,
long_description=long_description,
author='Mohamed Aturban',
author_email='maturban@cs.odu.edu',
url='https://github.com/maturban/archivenow',
packages=find_packages(),
license="MIT",
classifiers=[
'Development Status :: 5 - Production/Stable',
'Programming Language :: Python',
'Programming Language :: Python :: 2.7',
'Programming Language :: Python :: 3',
'Programming Language :: Python :: 3.4',
'Programming Language :: Python :: 3.5',
'Programming Language :: Python :: 3.6',
'License :: OSI Approved :: MIT License'
],
install_requires=[
'flask',
'requests'
],
package_data={
'archivenow': [
'handlers/*.*',
'templates/*.*',
'static/*.*'
]
},
entry_points='''
[console_scripts]
archivenow=archivenow.archivenow:args_parser
'''
)