Repository: oduwsdl/archivenow Branch: master Commit: dbc688f4f238 Files: 16 Total size: 52.6 KB Directory structure: gitextract_3tv8j6jl/ ├── .dockerignore ├── .gitignore ├── Dockerfile ├── LICENSE ├── README.rst ├── archivenow/ │ ├── __init__.py │ ├── archivenow.py │ ├── handlers/ │ │ ├── cc_handler.py │ │ ├── ia_handler.py │ │ ├── is_handler.py │ │ ├── mg_handler.py │ │ └── warc_handler.py │ └── templates/ │ ├── api.txt │ └── index.html ├── requirements.txt └── setup.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: .dockerignore ================================================ .git .gitignore LICENSE Dockerfile ================================================ FILE: .gitignore ================================================ .DS_Store archivenow.egg-info/ build/ dist/ __pycache__ ================================================ FILE: Dockerfile ================================================ ARG PYTAG=latest FROM python:${PYTAG} LABEL maintainer "Mohamed Aturban " WORKDIR /app COPY requirements.txt ./ RUN pip install --no-cache-dir -r requirements.txt COPY . 
./ RUN chmod a+x ./archivenow/archivenow.py ENTRYPOINT ["./archivenow/archivenow.py"] ================================================ FILE: LICENSE ================================================ MIT License Copyright (c) 2017 ODU Web Science / Digital Libraries Research Group Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ================================================ FILE: README.rst ================================================ Archive Now (archivenow) ============================= A Tool To Push Web Resources Into Web Archives ---------------------------------------------- Archive Now (**archivenow**) is currently configured to push resources into four public web archives. You can easily add more archives by writing a new archive handler (e.g., myarchive_handler.py) and placing it inside the "handlers" folder. Update January 2021 ~~~~~~~~~~~~~~~~~~~ Originally, **archivenow** was configured to push to six different public web archives. The two removed web archives are `WebCite `_ and `archive.st `_.
WebCite was removed from **archivenow** because it no longer accepts archiving requests. Archive.st was removed from **archivenow** because it presents a Captcha when an archiving request is submitted. In addition to removing those two archives, the method for pushing to `archive.today `_ and `megalodon.jp `_ from **archivenow** has been updated: pushing to `archive.today `_ and `megalodon.jp `_ now uses `Selenium `_. As explained below, this library can be used through: - Command Line Interface (CLI) - A Web Service - A Docker Container - Python Installing ---------- The latest release of **archivenow** can be installed using pip: .. code-block:: bash $ pip install archivenow The latest development version containing changes not yet released can be installed from source: .. code-block:: bash $ git clone git@github.com:oduwsdl/archivenow.git $ cd archivenow $ pip install -r requirements.txt $ pip install ./ In order to push to `archive.today `_ and `megalodon.jp `_, **archivenow** uses `Selenium `_, which is already listed in requirements.txt. However, Selenium also needs a driver to interface with the chosen browser. It is recommended to use Selenium and **archivenow** with the latest versions of `Firefox `_ and Firefox's corresponding `GeckoDriver `_. After installing the driver, you can push to `archive.today `_ and `megalodon.jp `_ from **archivenow**. CLI USAGE --------- Usage information for any **archivenow** sub-command is available via the `-h` or `--help` flag: ..
code-block:: bash $ archivenow -h usage: archivenow.py [-h] [--mg] [--cc] [--cc_api_key [CC_API_KEY]] [--is] [--ia] [--warc [WARC]] [-v] [--all] [--server] [--host [HOST]] [--agent [AGENT]] [--port [PORT]] [URI] positional arguments: URI URI of a web resource optional arguments: -h, --help show this help message and exit --mg Use Megalodon.jp --cc Use The Perma.cc Archive --cc_api_key [CC_API_KEY] An API KEY is required by The Perma.cc Archive --is Use The Archive.is --ia Use The Internet Archive --warc [WARC] Generate WARC file -v, --version Report the version of archivenow --all Use all possible archives --server Run archiveNow as a Web Service --host [HOST] A server address --agent [AGENT] Use "wget" or "squidwarc" for WARC generation --port [PORT] A port number to run a Web Service Examples -------- Example 1 ~~~~~~~~~ To save the web page (www.foxnews.com) in the Internet Archive: .. code-block:: bash $ archivenow --ia www.foxnews.com https://web.archive.org/web/20170209135625/http://www.foxnews.com Example 2 ~~~~~~~~~ By default, the web page (e.g., www.foxnews.com) will be saved in the Internet Archive if no optional arguments are provided: .. code-block:: bash $ archivenow www.foxnews.com https://web.archive.org/web/20170215164835/http://www.foxnews.com Example 3 ~~~~~~~~~ To save the web page (www.foxnews.com) in the Internet Archive (archive.org) and Archive.is: .. code-block:: bash $ archivenow --ia --is www.foxnews.com https://web.archive.org/web/20170209140345/http://www.foxnews.com http://archive.is/fPVyc Example 4 ~~~~~~~~~ To save the web page (https://nypost.com/) in all configured web archives and, in addition, create a WARC file locally: ..
code-block:: bash $ archivenow --all https://nypost.com/ --cc_api_key $Your-Perma-CC-API-Key http://archive.is/dcnan https://perma.cc/53CC-5ST8 https://web.archive.org/web/20181002081445/https://nypost.com/ https://megalodon.jp/2018-1002-1714-24/https://nypost.com:443/ https_nypost.com__96ec2300.warc Example 5 ~~~~~~~~~ To download the web page (https://nypost.com/) and create a WARC file: .. code-block:: bash $ archivenow --warc=mypage --agent=wget https://nypost.com/ mypage.warc Server ------ You can run **archivenow** as a web service. You can specify the server address and/or the port number (e.g., --host localhost --port 12345). .. code-block:: bash $ archivenow --server Running on http://0.0.0.0:12345/ (Press CTRL+C to quit) Example 6 ~~~~~~~~~ To save the web page (www.foxnews.com) in The Internet Archive through the web service: .. code-block:: bash $ curl -i http://0.0.0.0:12345/ia/www.foxnews.com HTTP/1.0 200 OK Content-Type: application/json Content-Length: 95 Server: Werkzeug/0.11.15 Python/2.7.10 Date: Tue, 02 Oct 2018 08:20:18 GMT { "results": [ "https://web.archive.org/web/20181002082007/http://www.foxnews.com" ] } Example 7 ~~~~~~~~~ To save the web page (www.foxnews.com) in all configured archives through the web service: .. code-block:: bash $ curl -i http://0.0.0.0:12345/all/www.foxnews.com HTTP/1.0 200 OK Content-Type: application/json Content-Length: 385 Server: Werkzeug/0.11.15 Python/2.7.10 Date: Tue, 02 Oct 2018 08:23:53 GMT { "results": [ "Error (The Perma.cc Archive): An API Key is required ", "http://archive.is/ukads", "https://web.archive.org/web/20181002082007/http://www.foxnews.com", "Error (Megalodon.jp): We can not obtain this page because the time limit has been reached or for technical ... ", "http://www.webcitation.org/72rbKsX8B" ] } Example 8 ~~~~~~~~~ Because an API Key is required by Perma.cc, the HTTP request should be as follows: ..
code-block:: bash $ curl -i http://127.0.0.1:12345/all/https://nypost.com/?cc_api_key=$Your-Perma-CC-API-Key Or use only Perma.cc: .. code-block:: bash $ curl -i http://127.0.0.1:12345/cc/https://nypost.com/?cc_api_key=$Your-Perma-CC-API-Key Running as a Docker Container ----------------------------- .. code-block:: bash $ docker image pull oduwsdl/archivenow Different ways to run **archivenow**: .. code-block:: bash $ docker container run -it --rm oduwsdl/archivenow -h Accessible at 127.0.0.1:12345: .. code-block:: bash $ docker container run -p 12345:12345 -it --rm oduwsdl/archivenow --server --host 0.0.0.0 Accessible at 127.0.0.1:22222: .. code-block:: bash $ docker container run -p 22222:11111 -it --rm oduwsdl/archivenow --server --port 11111 --host 0.0.0.0 .. image:: http://www.cs.odu.edu/~maturban/archivenow-6-archives.gif :width: 10pt To save the web page (http://www.cnn.com) in The Internet Archive: .. code-block:: bash $ docker container run -it --rm oduwsdl/archivenow --ia http://www.cnn.com Python Usage ------------ .. code-block:: python >>> from archivenow import archivenow Example 9 ~~~~~~~~~~ To save the web page (www.foxnews.com) in all configured archives: .. code-block:: python >>> archivenow.push("www.foxnews.com","all") ['https://web.archive.org/web/20170209145930/http://www.foxnews.com','http://archive.is/oAjuM','http://www.webcitation.org/6o9LcQoVV','Error (The Perma.cc Archive): An API KEY is required'] Example 10 ~~~~~~~~~~ To save the web page (www.foxnews.com) in Perma.cc: .. code-block:: python >>> archivenow.push("www.foxnews.com","cc",{"cc_api_key":"$YOUR-Perma-cc-API-KEY"}) ['https://perma.cc/8YYC-C7RM'] Example 11 ~~~~~~~~~~ To start the server from Python, do the following. The server host/port can be passed (e.g., start(port=1111, host='localhost')): ..
code-block:: python >>> archivenow.start() 2017-02-09 15:02:37 Running on http://127.0.0.1:12345 (Press CTRL+C to quit) Configuring a new archive or removing an existing one ----------------------------------------------------- Additional archives may be added by creating a handler file in the "handlers" directory. For example, to add a new archive named "My Archive", create a file "ma_handler.py" and store it in the folder "handlers". The "ma" will be the archive identifier, so to push a web page (e.g., www.cnn.com) to this archive from Python code, write: .. code-block:: python archivenow.push("www.cnn.com","ma") In the file "ma_handler.py", the name of the class must be "MA_handler". This class must have at least one function called "push" whose first argument is the URI to archive. See the existing `handler files`_ for examples of how to organize a new archive handler. Removing an archive can be done by one of the following options: - Removing the archive handler file from the folder "handlers" - Renaming the archive handler file to a name that does not end with "_handler.py" - Setting the variable "enabled" to "False" inside the handler file Notes ----- The Internet Archive (IA) sets a time gap of at least two minutes between creating different copies of the "same" resource. For example, if you send a request to IA to capture (www.cnn.com) at 10:00pm, IA will create a new copy (*C*) of this URI. IA will then return *C* for all requests to the archive for this URI received until 10:02pm. Using this same submission procedure for Archive.is requires a time gap of five minutes. .. _handler files: https://github.com/oduwsdl/archivenow/tree/master/archivenow/handlers Citing Project -------------- .. code-block:: latex @INPROCEEDINGS{archivenow-jcdl2018, AUTHOR = {Mohamed Aturban and Mat Kelly and Sawood Alam and John A. Berlin and Michael L. Nelson and Michele C.
Weigle}, TITLE = {{ArchiveNow}: Simplified, Extensible, Multi-Archive Preservation}, BOOKTITLE = {Proceedings of the 18th {ACM/IEEE-CS} Joint Conference on Digital Libraries}, SERIES = {{JCDL} '18}, PAGES = {321--322}, MONTH = {June}, YEAR = {2018}, ADDRESS = {Fort Worth, Texas, USA}, URL = {https://doi.org/10.1145/3197026.3203880}, DOI = {10.1145/3197026.3203880} } ================================================ FILE: archivenow/__init__.py ================================================ __version__ = '2020.7.18.12.19.44' ================================================ FILE: archivenow/archivenow.py ================================================ #!/usr/bin/env python import os import re import sys import uuid import glob import json import importlib import argparse import string import requests from threading import Thread from flask import request, Flask, jsonify, render_template from pathlib import Path #from __init__ import __version__ as archiveNowVersion archiveNowVersion = '2020.7.18.12.19.44' # archive handlers path PATH = Path(os.path.dirname(os.path.abspath(__file__))) PATH_HANDLER = PATH / 'handlers' # for the web app app = Flask(__name__) # create handlers for enabled archives global handlers handlers = {} # defult value for server/port SERVER_IP = '0.0.0.0' SERVER_PORT = 12345 def bad_request(error=None): message = { 'status': 400, 'message': 'Error in processing the request', } resp = jsonify(message) resp.status_code = 400 return resp # def getServer_IP_PORT(): # u = str(SERVER_IP) # if str(SERVER_PORT) != '80': # u = u + ":" + str(SERVER_PORT) # if 'http' != u[0:4]: # u = 'http://' + u # return u def listArchives_server(handlers): uri_args = '' if 'cc' in handlers: if handlers['cc'].enabled and handlers['cc'].api_required: uri_args = '?cc_api_key={Your-Perma.cc-API-Key}' li = {"archives": [{ # getServer_IP_PORT() + "id": "all", "GET":'/all/' + '{URI}'+uri_args, "archive-name": "All enabled archives"}]} for handler in handlers: if 
handlers[handler].enabled: uri_args2 = '' if handler == 'cc': uri_args2 = uri_args li["archives"].append({ #getServer_IP_PORT() + "id": handler, "archive-name": handlers[handler].name, "GET": '/' + handler + '/' + '{URI}'+uri_args2}) return li @app.route('/', defaults={'path': ''}, methods=['GET']) @app.route('/', methods=['GET']) def pushit(path): # no path; return a list of avaliable archives if path == '': #resp = jsonify(listArchives_server(handlers)) #resp.status_code = 200 return render_template('index.html') #return resp # get request with path elif (path == 'api'): resp = jsonify(listArchives_server(handlers)) resp.status_code = 200 return resp elif (path == "ajax-loader.gif"): return render_template('ajax-loader.gif') else: try: # get the args passed to push function like API KEY if provided PUSH_ARGS = {} for k in request.args.keys(): PUSH_ARGS[k] = request.args[k] s = str(path).split('/', 1) arc_id = s[0] URI = request.url.split('/', 4)[4] # include query params, too if 'herokuapp.com' in request.host: PUSH_ARGS['from_heroku'] = True # To push into archives resp = {"results": push(URI, arc_id, PUSH_ARGS)} if len(resp["results"]) == 0: return bad_request() else: # what to return resp = jsonify(resp) resp.status_code = 200 return resp except Exception as e: pass return bad_request() res_uris = {} def push_proxy(hdlr, URIproxy, p_args_proxy, res_uris_idx, session=requests.Session()): global res_uris try: res = hdlr.push( URIproxy , p_args_proxy, session=session) print ( res ) res_uris[res_uris_idx].append(res) except: pass def push(URI, arc_id, p_args={}, session=requests.Session()): global handlers global res_uris try: # push to all possible archives res_uris_idx = str(uuid.uuid4()) res_uris[res_uris_idx] = [] ### if arc_id == 'all': ### for handler in handlers: ### if (handlers[handler].api_required): # pass args like key API ### res.append(handlers[handler].push(str(URI), p_args)) ### else: ### res.append(handlers[handler].push(str(URI))) ### else: # 
push to the chosen archives threads = [] for handler in handlers: if (arc_id == handler) or (arc_id == 'all'): ### if (arc_id == handler): ### and (handlers[handler].api_required): #res.append(handlers[handler].push(str(URI), p_args)) #push_proxy( handlers[handler], str(URI), p_args, res_uris_idx) threads.append( Thread( target=push_proxy, args=(handlers[handler], str(URI), p_args, res_uris_idx, ), kwargs={'session': session})) ### elif (arc_id == handler): ### res.append(handlers[handler].push(str(URI))) for th in threads: th.start() for th in threads: th.join() res = res_uris[res_uris_idx] del res_uris[res_uris_idx] return res except: del res_uris[res_uris_idx] pass return ["bad request"] def start(port=SERVER_PORT, host=SERVER_IP): global SERVER_PORT global SERVER_IP SERVER_PORT = port SERVER_IP = host app.run( host=host, port=port, threaded=True, debug=True, use_reloader=False) def load_handlers(): global handlers handlers = {} # add the path of the handlers to the system so they can be imported sys.path.append(str(PATH_HANDLER)) # create a list of handlers. 
for file in PATH_HANDLER.glob('*_handler.py'): name = file.stem prefix = name.replace('_handler', '') mod = importlib.import_module(name) mod_class = getattr(mod, prefix.upper() + '_handler') # finally an object is created handlers[prefix] = mod_class() # exclude all disabled archives for handler in list(handlers): # handlers.keys(): if not handlers[handler].enabled: del handlers[handler] def args_parser(): global SERVER_PORT global SERVER_IP # parsing arguments class MyParser(argparse.ArgumentParser): def error(self, message): sys.stderr.write('error: %s\n' % message) self.print_help() sys.exit(2) def printm(self): sys.stderr.write('') self.print_help() sys.exit(2) parser = MyParser() # arc_handler = 0 for handler in handlers: # add archives identifiers to the list of options # arc_handler += 1 if handler == 'warc': parser.add_argument('--' + handler, nargs='?', help=handlers[handler].name) else: parser.add_argument('--' + handler, action='store_true', default=False, help='Use ' + handlers[handler].name) if (handlers[handler].api_required): parser.add_argument( '--' + handler + '_api_key', nargs='?', help='An API KEY is required by ' + handlers[handler].name) parser.add_argument( '-v', '--version', help='Report the version of archivenow', action='version', version='ArchiveNow ' + archiveNowVersion) if len(handlers) > 0: parser.add_argument('--all', action='store_true', default=False, help='Use all possible archives ') parser.add_argument('--server', action='store_true', default=False, help='Run archiveNow as a Web Service ') parser.add_argument('URI', nargs='?', help='URI of a web resource') parser.add_argument('--host', nargs='?', help='A server address') if 'warc' in handlers.keys(): parser.add_argument('--agent', nargs='?', help='Use "wget" or "squidwarc" for WARC generation') parser.add_argument( '--port', nargs='?', help='A port number to run a Web Service') args = parser.parse_args() else: print ('\n Error: No enabled archive handler found\n') sys.exit(0) 
arc_opt = 0 # start the server if getattr(args, 'server'): if getattr(args, 'port'): SERVER_PORT = int(args.port) if getattr(args, 'host'): SERVER_IP = str(args.host) start(port=SERVER_PORT, host=SERVER_IP) else: if not getattr(args, 'URI'): print (parser.error('too few arguments')) res = [] # get the args passed to push function like API KEY if provided PUSH_ARGS = {} for handler in handlers: if (handlers[handler].api_required): if getattr(args, handler + '_api_key'): PUSH_ARGS[ handler + '_api_key'] = getattr( args, handler + '_api_key') else: if getattr(args, handler): print ( parser.error( 'An API Key is required by ' + handlers[handler].name)) orginal_warc_value = getattr(args, 'warc') if handler == 'warc': PUSH_ARGS['warc'] = getattr(args, 'warc') if PUSH_ARGS['warc'] == None: valid_chars = "-_.()/ %s%s" % (string.ascii_letters, string.digits) PUSH_ARGS['warc'] = ''.join(c for c in str(args.URI).strip() if c in valid_chars) PUSH_ARGS['warc'] = PUSH_ARGS['warc'].replace(' ','_').replace('/','_').replace('__','_') # I don't like spaces in filenames. 
PUSH_ARGS['warc'] = PUSH_ARGS['warc']+'_'+str(uuid.uuid4())[:8] if PUSH_ARGS['warc'][-1] == '_': PUSH_ARGS['warc'] = PUSH_ARGS['warc'][:-1] agent = 'wget' tmp_agent = getattr(args, 'agent') if tmp_agent == 'squidwarc': agent = tmp_agent PUSH_ARGS['agent'] = agent # sys.exit(0) # push to all possible archives if getattr(args, 'all'): arc_opt = 1 res = push(str(args.URI).strip(), 'all', PUSH_ARGS) else: # push to the chosen archives for handler in handlers: if getattr(args, handler): arc_opt += 1 for i in push(str(args.URI).strip(), handler, PUSH_ARGS): res.append(i) # push to the default archive if (len(handlers) > 0) and (arc_opt == 0): # set the default: 'ia' if available, otherwise the first archive in the list if 'ia' in handlers: res = push(str(args.URI).strip(), 'ia', PUSH_ARGS) else: res = push(str(args.URI).strip(), list(handlers.keys())[0], PUSH_ARGS) # print (parser.printm()) # else: # for rs in res: # print (rs) load_handlers() if __name__ == '__main__': args_parser() ================================================ FILE: archivenow/handlers/cc_handler.py ================================================ import requests import json class CC_handler(object): def __init__(self): self.enabled = True self.name = 'The Perma.cc Archive' self.api_required = True def push(self, uri_org, p_args=[], session=requests.Session()): msg = '' try: APIKEY = p_args['cc_api_key'] r = session.post('https://api.perma.cc/v1/archives/?api_key='+APIKEY, timeout=120, data=json.dumps({"url":uri_org}), headers={'Content-type': 'application/json'}, allow_redirects=True) r.raise_for_status() if 'Location' in r.headers: return 'https://perma.cc/'+r.headers['Location'].rsplit('/',1)[1] else: for r2 in r.history: if 'Location' in r2.headers: return 'https://perma.cc/'+r2.headers['Location'].rsplit('/',1)[1] entity_json = r.json() if 'guid' in entity_json: return str('https://perma.cc/'+entity_json['guid']) msg = "Error ("+self.name+ "): No HTTP Location header is returned in the response"
except Exception as e: if (msg == '') and ('_api_key' in str(e)): msg = "Error (" + self.name+ "): " + 'An API Key is required ' elif (msg == ''): msg = "Error (" + self.name+ "): " + str(e) pass; return msg ================================================ FILE: archivenow/handlers/ia_handler.py ================================================ import requests class IA_handler(object): def __init__(self): self.enabled = True self.name = 'The Internet Archive' self.api_required = False def push(self, uri_org, p_args=[], session=requests.Session()): msg = '' try: uri = 'https://web.archive.org/save/' + uri_org archiveTodayUserAgent = { "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36" } # push into the archive # r = session.get(uri, timeout=120, allow_redirects=True, headers=archiveTodayUserAgent) if ('user-agent' in session.headers) and (not session.headers['User-Agent'].lower().startswith('python-requests/')): r = session.get(uri, timeout=120, allow_redirects=True) else: r = session.get(uri, timeout=120, allow_redirects=True, headers=archiveTodayUserAgent) r.raise_for_status() # extract the link to the archived copy if (r != None): if "Location" in r.headers: return r.headers["Location"] elif "Content-Location" in r.headers: if (r.headers["Content-Location"]).startswith("/web/"): return "https://web.archive.org"+r.headers["Content-Location"] else: try: uri_from_content = "https://web.archive.org" + r.text.split('var redirUrl = "',1)[1].split('"',1)[0] except: uri_from_content = r.headers["Content-Location"] #pass; return uri_from_content else: for r2 in r.history: if 'Location' in r2.headers: return r.url #return r2.headers['Location'] if 'Content-Location' in r2.headers: return r.url #return r2.headers['Content-Location'] msg = "("+self.name+ "): No HTTP Location/Content-Location header is returned in the response" except Exception as e: if msg == '': msg = "Error (" + 
self.name+ "): " + str(e) pass return msg ================================================ FILE: archivenow/handlers/is_handler.py ================================================ import os import requests import sys from selenium.webdriver.firefox.options import Options from selenium import webdriver from selenium.webdriver.common.keys import Keys from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.support import expected_conditions as EC from selenium.webdriver.common.by import By from selenium.common.exceptions import TimeoutException class IS_handler(object): def __init__(self): self.enabled = True self.name = 'The Archive.is' self.api_required = False def push(self, uri_org, p_args=[], session=requests.Session()): msg = "" try: options = Options() options.headless = True # Run in background driver = webdriver.Firefox(options = options) driver.get("https://archive.is") elem = driver.find_element_by_id("url") # Find the form to place a URL to be archived elem.send_keys(uri_org) # Place the URL in the input box saveButton = driver.find_element_by_xpath("/html/body/center/div/form[1]/div[3]/input") # Find the submit button saveButton.click() # Click the submit button # After clicking submit, there may be an additional page that pops up and asks if you are sure you want # to archive that page since it was archived X amount of time ago. We need to wait for that page to # load and click submit again. delay = 30 # seconds try: nextSaveButton = WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.XPATH, "/html/body/center/div[4]/center/div/div[2]/div/form/div/input"))) nextSaveButton.click() except TimeoutException: pass # The page takes a while to archive, so keep checking if the loading page is still displayed. 
loading = True while loading: if not 'wip' in driver.current_url and not 'submit' in driver.current_url: loading = False # After the loading screen is gone and the page is archived, the current URL # will be the URL to the archived page. msg = driver.current_url; driver.quit() except: ''' exc_type, exc_obj, exc_tb = sys.exc_info() fname = os.path.split(exc_tb.tb_frame.f_code.co_filename)[1] print((fname, exc_tb.tb_lineno, sys.exc_info() )) ''' msg = "Unable to complete request." return msg ================================================ FILE: archivenow/handlers/mg_handler.py ================================================ # encoding: utf-8 import os import requests import sys from selenium.webdriver.firefox.options import Options from selenium import webdriver from selenium.webdriver.common.keys import Keys from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.support import expected_conditions as EC from selenium.webdriver.common.by import By from selenium.common.exceptions import TimeoutException class MG_handler(object): def __init__(self): self.enabled = True self.name = 'Megalodon.jp' self.api_required = False def push(self, uri_org, p_args=[], session=requests.Session()): msg = "" options = Options() options.headless = True # Run in background driver = webdriver.Firefox(options = options) driver.get("https://megalodon.jp/?url=" + uri_org) try: addButton = driver.find_element_by_xpath("/html/body/div[2]/div[2]/div[8]/form/div[1]/input[2]") addButton.click() # Click the add button except : print("Unable to archive this page at this time.") raise stillOnPage = True while stillOnPage: try: button = driver.find_element_by_xpath("/html/body/div[2]/div[2]/div[1]/div/h3") except: stillOnPage = False try: error = driver.find_element_by_xpath("/html/body/div[2]/div[2]/div[3]/div/a/h3") msg = "We apologize for the inconvenience. Currently, acquisitions that are considered \"robots\" in the acquisition of certain conditions are prohibited." 
raise sys.exit() except: pass # The page takes a while to archive, so keep checking if the loading page is still displayed. loading = True while loading: try: loadingPage = driver.find_element_by_xpath("/html/body/div[2]/div/div[1]/a/img") loading = False except: loading = True # After the loading screen is gone and the page is archived, the current URL # will be the URL to the archived page. if msg == "": print(driver.current_url) return msg ================================================ FILE: archivenow/handlers/warc_handler.py ================================================ import requests import os.path import distutils.spawn class WARC_handler(object): def __init__(self): self.enabled = True self.name = 'Generate WARC file' self.api_required = False def push(self, uri_org, p_args=[], session=requests.Session()): msg = '' if p_args['agent'] == 'squidwarc': # squidwarc #if not distutils.spawn.find_executable("squidwarc"): # return 'wget is not installed!' os.system('python ~/squidwarc_one_page/generte_warcs.py 9222 "'+uri_org+'" '+p_args['warc']+'.warc &> /dev/null') if os.path.exists(p_args['warc']): return p_args['warc'] elif os.path.exists(p_args['warc']+'.warc'): return p_args['warc']+'.warc' else: return 'squidwarc failed to generate the WARC file' else: if not distutils.spawn.find_executable("wget"): return 'wget is not installed!' # wget os.system('wget -E -H -k -p -q --delete-after --no-warc-compression --warc-file="'+p_args['warc']+'" "'+uri_org+'"') if os.path.exists(p_args['warc']): return p_args['warc'] elif os.path.exists(p_args['warc']+'.warc'): return p_args['warc']+'.warc' else: return 'wget failed to generate the WARC file' ================================================ FILE: archivenow/templates/api.txt ================================================ ================================================ FILE: archivenow/templates/index.html ================================================

Preserve a web page in web archives

Select archives:

Internet Archive
Archive.is
Megalodon.jp
Perma.cc
Archive Link to the archived page
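The handler convention described in the README (a file ending in "_handler.py" whose class name is the uppercased identifier plus "_handler", exposing enabled, name, api_required, and a push function) can be sketched as a minimal example. The archive name "My Archive" and the save endpoint below are hypothetical, not a real service:

```python
# ma_handler.py -- a minimal, hypothetical archive handler following the
# archivenow convention: the file name ends in "_handler.py" and the class
# is named after the uppercased identifier ("ma" -> MA_handler).
import requests


class MA_handler(object):
    def __init__(self):
        self.enabled = True          # set to False to disable this archive
        self.name = 'My Archive'     # human-readable name used in messages
        self.api_required = False    # True if push() expects an API key in p_args

    def push(self, uri_org, p_args=[], session=requests.Session()):
        # Submit the URI to the (made-up) archive endpoint and return
        # either the URI of the archived copy or an error string, as the
        # bundled handlers do.
        try:
            r = session.get('https://myarchive.example/save/' + uri_org,
                            timeout=120, allow_redirects=True)
            r.raise_for_status()
            return r.url
        except Exception as e:
            return 'Error (' + self.name + '): ' + str(e)
```

Dropping such a file into archivenow/handlers/ would make "ma" a valid identifier for push() and add a --ma CLI flag, per load_handlers() in archivenow.py.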
================================================ FILE: requirements.txt ================================================ flask requests pathlib selenium ================================================ FILE: setup.py ================================================ #!/usr/bin/env python from setuptools import setup, find_packages from archivenow import __version__ long_description = open('README.rst').read() desc = """A Python library to push web resources into public web archives""" setup( name='archivenow', version=__version__, description=desc, long_description=long_description, author='Mohamed Aturban', author_email='maturban@cs.odu.edu', url='https://github.com/maturban/archivenow', packages=find_packages(), license="MIT", classifiers=[ 'Development Status :: 5 - Production/Stable', 'Programming Language :: Python', 'Programming Language :: Python :: 2.7', 'Programming Language :: Python :: 3', 'Programming Language :: Python :: 3.4', 'Programming Language :: Python :: 3.5', 'Programming Language :: Python :: 3.6', 'License :: OSI Approved :: MIT License' ], install_requires=[ 'flask', 'requests' ], package_data={ 'archivenow': [ 'handlers/*.*', 'templates/*.*', 'static/*.*' ] }, entry_points=''' [console_scripts] archivenow=archivenow.archivenow:args_parser ''' )
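The push() function in archivenow.py fans out one thread per matching handler and collects results in a shared dict keyed by a per-request uuid, so concurrent web-service requests do not mix their results. That pattern can be sketched in isolation; the fake handler functions below are stand-ins rather than real archive calls:

```python
# Standalone sketch of the threaded fan-out used by push() in
# archivenow.py: one worker thread per handler, each appending its
# result under a per-request uuid key in a shared dict.
import uuid
from threading import Thread

results = {}  # shared result store, keyed by request id


def worker(handler_fn, uri, key):
    try:
        results[key].append(handler_fn(uri))  # list.append is thread-safe
    except Exception:
        pass  # a failing handler must not break the whole request


def push_all(handler_fns, uri):
    key = str(uuid.uuid4())  # unique key isolates concurrent requests
    results[key] = []
    threads = [Thread(target=worker, args=(fn, uri, key)) for fn in handler_fns]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results.pop(key)


# Hypothetical handlers that just format a string instead of contacting an archive:
fake_handlers = [lambda u: 'archive-a:' + u, lambda u: 'archive-b:' + u]
print(sorted(push_all(fake_handlers, 'example.com')))
# ['archive-a:example.com', 'archive-b:example.com']
```

The uuid key is what lets several simultaneous /all/ requests share one global dict without stepping on each other, at the cost of having to delete the entry once the request completes.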