Repository: mobarski/ask-my-pdf
Branch: main
Commit: 470d31969331
Files: 16
Total size: 40.5 KB
Directory structure:
gitextract_zi78g5yh/
├── .gitignore
├── LICENSE
├── README.md
├── requirements.txt
└── src/
├── ai.py
├── cache.py
├── css.py
├── feedback.py
├── gui.py
├── model.py
├── pdf.py
├── prompts.py
├── run.bat
├── run.sh
├── stats.py
└── storage.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
**/__pycache__
**/*secret*.*
data/**
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2023 Maciej Obarski
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
# Ask my PDF
Thank you for your interest in my application. Please be aware that this is only a **Proof of Concept system** and may contain bugs or unfinished features. If you like this app you can ❤️ [follow me](https://twitter.com/KerbalFPV) on Twitter for news and updates.
### Ask my PDF - Question answering system built on top of GPT3
🎲 The primary use case for this app is to assist users in answering questions about board game rules based on the instruction manual. While the app can be used for other tasks, helping users with board game rules is particularly meaningful to me since I'm an avid fan of board games myself. Additionally, this use case is relatively harmless, even in cases where the model may experience hallucinations.
🌐 The app can be accessed on the Streamlit Community Cloud at https://ask-my-pdf.streamlit.app/. 🔑 However, to use the app, you will need your own [OpenAI's API key](https://platform.openai.com/account/api-keys).
📄 The app implements the following academic papers:
- [In-Context Retrieval-Augmented Language Models](https://arxiv.org/abs/2302.00083) aka **RALM**
- [Precise Zero-Shot Dense Retrieval without Relevance Labels](https://arxiv.org/abs/2212.10496) aka **HyDE** (Hypothetical Document Embeddings)
### Installation
1. Clone the repo:
`git clone https://github.com/mobarski/ask-my-pdf`
2. Install dependencies:
`pip install -r ask-my-pdf/requirements.txt`
3. Run the app:
`cd ask-my-pdf/src`
`run.sh` or `run.bat`
### High-level documentation
#### RALM + HyDE

#### RALM + HyDE + context

### Environment variables used for configuration
##### General configuration:
- **STORAGE_SALT** - cryptograpic salt used when deriving user/folder name and encryption key from API key, hexadecimal notation, 2-16 characters
- **STORAGE_MODE** - index storage mode: S3, LOCAL, DICT (default)
- **STATS_MODE** - usage stats storage mode: REDIS, DICT (default)
- **FEEDBACK_MODE** - user feedback storage mode: REDIS, NONE (default)
- **CACHE_MODE** - embeddings cache mode: S3, DISK, NONE (default)
##### Local filesystem configuration (storage / cache):
- **STORAGE_PATH** - directory path for index storage
- **CACHE_PATH** - directory path for embeddings cache
##### S3 configuration (storage / cache):
- **S3_REGION** - region code
- **S3_BUCKET** - bucket name (storage)
- **S3_SECRET** - secret key
- **S3_KEY** - access key
- **S3_URL** - URL
- **S3_PREFIX** - object name prefix
- **S3_CACHE_BUCKET** - bucket name (cache)
- **S3_CACHE_PREFIX** - object name prefix (cache)
##### Redis configuration (for persistent usage statistics / user feedback):
- **REDIS_URL** - Redis DB URL (redis[s]://:password@host:port/[db])
##### Community version related options:
- **OPENAI_KEY** - API key used for the default user
- **COMMUNITY_DAILY_USD** - default user's daily budget
- **COMMUNITY_USER** - default user's code
================================================
FILE: requirements.txt
================================================
git+https://github.com/mobarski/ai-bricks.git
streamlit
pypdf
scikit-learn
numpy
pycryptodome
boto3
redis
retry
================================================
FILE: src/ai.py
================================================
from ai_bricks.api import openai
import stats
import os
DEFAULT_USER = os.getenv('COMMUNITY_USER','')
def use_key(key):
openai.use_key(key)
usage_stats = stats.get_stats(user=DEFAULT_USER)
def set_user(user):
global usage_stats
usage_stats = stats.get_stats(user=user)
openai.set_global('user', user)
openai.add_callback('after', stats_callback)
def complete(text, **kw):
model = kw.get('model','gpt-3.5-turbo')
llm = openai.model(model)
llm.config['pre_prompt'] = 'output only in raw text' # for chat models
resp = llm.complete(text, **kw)
resp['model'] = model
return resp
def embedding(text, **kw):
model = kw.get('model','text-embedding-ada-002')
llm = openai.model(model)
resp = llm.embed(text, **kw)
resp['model'] = model
return resp
def embeddings(texts, **kw):
model = kw.get('model','text-embedding-ada-002')
llm = openai.model(model)
resp = llm.embed_many(texts, **kw)
resp['model'] = model
return resp
tokenizer_model = openai.model('text-davinci-003')
def get_token_count(text):
return tokenizer_model.token_count(text)
def stats_callback(out, resp, self):
model = self.config['model']
usage = resp['usage']
usage['call_cnt'] = 1
if 'text' in out:
usage['completion_chars'] = len(out['text'])
elif 'texts' in out:
usage['completion_chars'] = sum([len(text) for text in out['texts']])
# TODO: prompt_chars
# TODO: total_chars
if 'rtt' in out:
usage['rtt'] = out['rtt']
usage['rtt_cnt'] = 1
usage_stats.incr(f'usage:v4:[date]:[user]', {f'{k}:{model}':v for k,v in usage.items()})
usage_stats.incr(f'hourly:v4:[date]', {f'{k}:{model}:[hour]':v for k,v in usage.items()})
#print('STATS_CALLBACK', usage, flush=True) # XXX
def get_community_usage_cost():
data = usage_stats.get(f'usage:v4:[date]:{DEFAULT_USER}')
used = 0.0
used += 0.04 * data.get('total_tokens:gpt-4',0) / 1000 # prompt_price=0.03 but output_price=0.06
used += 0.02 * data.get('total_tokens:text-davinci-003',0) / 1000
used += 0.002 * data.get('total_tokens:text-curie-001',0) / 1000
used += 0.002 * data.get('total_tokens:gpt-3.5-turbo',0) / 1000
used += 0.0004 * data.get('total_tokens:text-embedding-ada-002',0) / 1000
return used
================================================
FILE: src/cache.py
================================================
from retry import retry
from binascii import hexlify,unhexlify
import pickle
import zlib
import io
import os
# pip install boto3
import boto3
import botocore
class Cache:
"Dummy / Base Cache"
def __init__(self):
pass
def put(self, key, obj):
pass
def get(self, key):
return None
def has(self, key):
return False
def delete(self, key):
pass
def serialize(self, obj):
pickled = pickle.dumps(obj)
compressed = self.compress(pickled)
return compressed
def deserialize(self, data):
pickled = self.decompress(data)
obj = pickle.loads(pickled)
return obj
def compress(self, data):
return zlib.compress(data)
def decompress(self, data):
return zlib.decompress(data)
def encode(self, name):
return hexlify(name.encode('utf8')).decode('utf8')
def decode(self, name):
return unhexlify(name).decode('utf8')
def call(self, key, fun, *a, **kw):
if self.has(key):
return self.get(key)
else:
resp = fun(*a, **kw)
self.put(key, resp)
return resp
class DiskCache(Cache):
"Local disk based cache"
def __init__(self, root):
self.root = root
def path(self, key):
return os.path.join(self.root, self.encode(key))
def put(self, key, obj):
path = self.path(key)
data = self.serialize(obj)
with open(path, 'wb') as f:
f.write(data)
def get(self, key):
path = self.path(key)
with open(path, 'rb') as f:
data = f.read()
obj = self.deserialize(data)
return obj
def has(self, key):
path = self.path(key)
return os.path.exists(path)
def delete(self, key):
path = self.path(key)
os.remove(path)
class S3Cache(Cache):
"S3 based cache"
def __init__(self, **kw):
bucket = kw.get('bucket') or os.getenv('S3_CACHE_BUCKET','ask-my-pdf')
prefix = kw.get('prefix') or os.getenv('S3_CACHE_PREFIX','cache/x1')
region = kw.get('region') or os.getenv('S3_REGION','sfo3')
url = kw.get('url') or os.getenv('S3_URL',f'https://{region}.digitaloceanspaces.com')
key = os.getenv('S3_KEY','')
secret = os.getenv('S3_SECRET','')
#
if not key or not secret:
raise Exception("No S3 credentials in environment variables!")
#
self.session = boto3.session.Session()
self.s3 = self.session.client('s3',
config=botocore.config.Config(s3={'addressing_style': 'virtual'}),
region_name=region,
endpoint_url=url,
aws_access_key_id=key,
aws_secret_access_key=secret,
)
self.bucket = bucket
self.prefix = prefix
def get_s3_key(self, key):
return f'{self.prefix}/{key}'
def put(self, key, obj):
s3_key = self.get_s3_key(key)
data = self.serialize(obj)
f = io.BytesIO(data)
self.s3.upload_fileobj(f, self.bucket, s3_key)
def get(self, key, default=None):
s3_key = self.get_s3_key(key)
f = io.BytesIO()
try:
self.s3.download_fileobj(self.bucket, s3_key, f)
except:
f.close()
return default
f.seek(0)
data = f.read()
obj = self.deserialize(data)
return obj
def has(self, key):
s3_key = self.get_s3_key(key)
try:
self.s3.head_object(Bucket=self.bucket, Key=s3_key)
return True
except:
return False
def delete(self, key):
self.s3.delete_object(
Bucket = self.bucket,
Key = self.get_s3_key(key))
def get_cache(**kw):
mode = os.getenv('CACHE_MODE','').upper()
path = os.getenv('CACHE_PATH','')
if mode == 'DISK':
return DiskCache(path)
elif mode == 'S3':
return S3Cache(**kw)
else:
return Cache()
if __name__=="__main__":
#cache = DiskCache('__pycache__')
cache = S3Cache()
cache.put('xxx',{'a':1,'b':22})
print('get xxx', cache.get('xxx'))
print('has xxx', cache.has('xxx'))
print('has yyy', cache.has('yyy'))
print('delete xxx', cache.delete('xxx'))
print('has xxx', cache.has('xxx'))
print('get xxx', cache.get('xxx'))
#
================================================
FILE: src/css.py
================================================
v1 = """
/* feedback checkbox */
.css-18fuwiq {
position: relative;
padding-top: 6px;
}
.css-949r0i {
position: relative;
padding-top: 6px;
}
"""
================================================
FILE: src/feedback.py
================================================
import datetime
import hashlib
import redis
import os
from retry import retry
def hexdigest(text):
return hashlib.md5(text.encode('utf8')).hexdigest()
def as_int(x):
return int(x) if x is not None else None
class Feedback:
"Dummy feedback adapter"
def __init__(self, user):
...
def send(self, score, ctx, details=False):
...
def get_score(self):
return 0
class RedisFeedback(Feedback):
"Redis feedback adapter"
def __init__(self, user):
REDIS_URL = os.getenv('REDIS_URL')
if not REDIS_URL:
raise Exception('No Redis configuration in environment variables!')
super().__init__(user)
self.db = redis.Redis.from_url(REDIS_URL)
self.user = user
@retry(tries=5, delay=0.1)
def send(self, score, ctx, details=False):
p = self.db.pipeline()
dist_list = ctx.get('debug',{}).get('model.query.resp',{}).get('dist_list',[])
# feedback
index = ctx.get('index',{})
data = {}
data['user'] = self.user
data['task-prompt-version'] = ctx.get('task_name')
data['model'] = ctx.get('model')
data['model-embeddings'] = ctx.get('model_embed')
data['task-prompt'] = ctx.get('task')
data['temperature'] = ctx.get('temperature')
data['frag-size'] = ctx.get('frag_size')
data['frag-cnt'] = ctx.get('max_frags')
data['frag-n-before'] = ctx.get('n_frag_before')
data['frag-n-after'] = ctx.get('n_frag_after')
data['filename'] = ctx.get('filename')
data['filehash'] = index.get('hash') or index.get('filehash')
data['filesize'] = index.get('filesize')
data['n-pages'] = index.get('n_pages')
data['n-texts'] = index.get('n_texts')
data['use-hyde'] = as_int(ctx.get('use_hyde'))
data['use-hyde-summary'] = as_int(ctx.get('use_hyde_summary'))
data['question'] = ctx.get('question')
data['answer'] = ctx.get('answer')
data['hyde-summary'] = index.get('summary')
data['resp-dist-list'] = '|'.join([f"{x:0.3f}" for x in dist_list])
fb_hash = hexdigest(str(list(sorted(data.items()))))
#
data['score'] = score
data['datetime'] = str(datetime.datetime.now())
key1 = f'feedback:v2:{fb_hash}'
if not details:
for k in ['question','answer','hyde-summary']:
data[k] = ''
p.hset(key1, mapping=data)
# feedback-daily
date = datetime.date.today()
key2 = f'feedback-daily:v1:{date}:{"positive" if score > 0 else "negative"}'
p.sadd(key2, fb_hash)
# feedback-score
key3 = f'feedback-score:v2:{self.user}'
p.sadd(key3, fb_hash)
p.execute()
@retry(tries=5, delay=0.1)
def get_score(self):
key = f'feedback-score:v2:{self.user}'
return self.db.scard(key)
def get_feedback_adapter(user):
MODE = os.getenv('FEEDBACK_MODE','').upper()
if MODE=='REDIS':
return RedisFeedback(user)
else:
return Feedback(user)
================================================
FILE: src/gui.py
================================================
__version__ = "0.4.8.3"
app_name = "Ask my PDF"
# BOILERPLATE
import streamlit as st
st.set_page_config(layout='centered', page_title=f'{app_name} {__version__}')
ss = st.session_state
if 'debug' not in ss: ss['debug'] = {}
import css
st.write(f'<style>{css.v1}</style>', unsafe_allow_html=True)
header1 = st.empty() # for errors / messages
header2 = st.empty() # for errors / messages
header3 = st.empty() # for errors / messages
# IMPORTS
import prompts
import model
import storage
import feedback
import cache
import os
from time import time as now
# HANDLERS
def on_api_key_change():
api_key = ss.get('api_key') or os.getenv('OPENAI_KEY')
model.use_key(api_key) # TODO: empty api_key
#
if 'data_dict' not in ss: ss['data_dict'] = {} # used only with DictStorage
ss['storage'] = storage.get_storage(api_key, data_dict=ss['data_dict'])
ss['cache'] = cache.get_cache()
ss['user'] = ss['storage'].folder # TODO: refactor user 'calculation' from get_storage
model.set_user(ss['user'])
ss['feedback'] = feedback.get_feedback_adapter(ss['user'])
ss['feedback_score'] = ss['feedback'].get_score()
#
ss['debug']['storage.folder'] = ss['storage'].folder
ss['debug']['storage.class'] = ss['storage'].__class__.__name__
ss['community_user'] = os.getenv('COMMUNITY_USER')
if 'user' not in ss and ss['community_user']:
on_api_key_change() # use community key
# COMPONENTS
def ui_spacer(n=2, line=False, next_n=0):
for _ in range(n):
st.write('')
if line:
st.tabs([' '])
for _ in range(next_n):
st.write('')
def ui_info():
st.markdown(f"""
# Ask my PDF
version {__version__}
Question answering system built on top of GPT3.
""")
ui_spacer(1)
st.write("Made by [Maciej Obarski](https://www.linkedin.com/in/mobarski/).", unsafe_allow_html=True)
ui_spacer(1)
st.markdown("""
Thank you for your interest in my application.
Please be aware that this is only a Proof of Concept system
and may contain bugs or unfinished features.
If you like this app you can ❤️ [follow me](https://twitter.com/KerbalFPV)
on Twitter for news and updates.
""")
ui_spacer(1)
st.markdown('Source code can be found [here](https://github.com/mobarski/ask-my-pdf).')
def ui_api_key():
if ss['community_user']:
st.write('## 1. Optional - enter your OpenAI API key')
t1,t2 = st.tabs(['community version','enter your own API key'])
with t1:
pct = model.community_tokens_available_pct()
st.write(f'Community tokens available: :{"green" if pct else "red"}[{int(pct)}%]')
st.progress(pct/100)
st.write('Refresh in: ' + model.community_tokens_refresh_in())
st.write('You can sign up to OpenAI and/or create your API key [here](https://platform.openai.com/account/api-keys)')
ss['community_pct'] = pct
ss['debug']['community_pct'] = pct
with t2:
st.text_input('OpenAI API key', type='password', key='api_key', on_change=on_api_key_change, label_visibility="collapsed")
else:
st.write('## 1. Enter your OpenAI API key')
st.text_input('OpenAI API key', type='password', key='api_key', on_change=on_api_key_change, label_visibility="collapsed")
def index_pdf_file():
if ss['pdf_file']:
ss['filename'] = ss['pdf_file'].name
if ss['filename'] != ss.get('fielname_done'): # UGLY
with st.spinner(f'indexing {ss["filename"]}'):
index = model.index_file(ss['pdf_file'], ss['filename'], fix_text=ss['fix_text'], frag_size=ss['frag_size'], cache=ss['cache'])
ss['index'] = index
debug_index()
ss['filename_done'] = ss['filename'] # UGLY
def debug_index():
index = ss['index']
d = {}
d['hash'] = index['hash']
d['frag_size'] = index['frag_size']
d['n_pages'] = len(index['pages'])
d['n_texts'] = len(index['texts'])
d['summary'] = index['summary']
d['pages'] = index['pages']
d['texts'] = index['texts']
d['time'] = index.get('time',{})
ss['debug']['index'] = d
def ui_pdf_file():
st.write('## 2. Upload or select your PDF file')
disabled = not ss.get('user') or (not ss.get('api_key') and not ss.get('community_pct',0))
t1,t2 = st.tabs(['UPLOAD','SELECT'])
with t1:
st.file_uploader('pdf file', type='pdf', key='pdf_file', disabled=disabled, on_change=index_pdf_file, label_visibility="collapsed")
b_save()
with t2:
filenames = ['']
if ss.get('storage'):
filenames += ss['storage'].list()
def on_change():
name = ss['selected_file']
if name and ss.get('storage'):
with ss['spin_select_file']:
with st.spinner('loading index'):
t0 = now()
index = ss['storage'].get(name)
ss['debug']['storage_get_time'] = now()-t0
ss['filename'] = name # XXX
ss['index'] = index
debug_index()
else:
#ss['index'] = {}
pass
st.selectbox('select file', filenames, on_change=on_change, key='selected_file', label_visibility="collapsed", disabled=disabled)
b_delete()
ss['spin_select_file'] = st.empty()
def ui_show_debug():
st.checkbox('show debug section', key='show_debug')
def ui_fix_text():
st.checkbox('fix common PDF problems', value=True, key='fix_text')
def ui_temperature():
#st.slider('temperature', 0.0, 1.0, 0.0, 0.1, key='temperature', format='%0.1f')
ss['temperature'] = 0.0
def ui_fragments():
#st.number_input('fragment size', 0,2000,200, step=100, key='frag_size')
st.selectbox('fragment size (characters)', [0,200,300,400,500,600,700,800,900,1000], index=3, key='frag_size')
b_reindex()
st.number_input('max fragments', 1, 10, 4, key='max_frags')
st.number_input('fragments before', 0, 3, 1, key='n_frag_before') # TODO: pass to model
st.number_input('fragments after', 0, 3, 1, key='n_frag_after') # TODO: pass to model
def ui_model():
models = ['gpt-3.5-turbo','gpt-4','text-davinci-003','text-curie-001']
st.selectbox('main model', models, key='model', disabled=not ss.get('api_key'))
st.selectbox('embedding model', ['text-embedding-ada-002'], key='model_embed') # FOR FUTURE USE
def ui_hyde():
st.checkbox('use HyDE', value=True, key='use_hyde')
def ui_hyde_summary():
st.checkbox('use summary in HyDE', value=True, key='use_hyde_summary')
def ui_task_template():
st.selectbox('task prompt template', prompts.TASK.keys(), key='task_name')
def ui_task():
x = ss['task_name']
st.text_area('task prompt', prompts.TASK[x], key='task')
def ui_hyde_prompt():
st.text_area('HyDE prompt', prompts.HYDE, key='hyde_prompt')
def ui_question():
st.write('## 3. Ask questions'+(f' to {ss["filename"]}' if ss.get('filename') else ''))
disabled = False
st.text_area('question', key='question', height=100, placeholder='Enter question here', help='', label_visibility="collapsed", disabled=disabled)
# REF: Hypotetical Document Embeddings
def ui_hyde_answer():
# TODO: enter or generate
pass
def ui_output():
output = ss.get('output','')
st.markdown(output)
def ui_debug():
if ss.get('show_debug'):
st.write('### debug')
st.write(ss.get('debug',{}))
def b_ask():
c1,c2,c3,c4,c5 = st.columns([2,1,1,2,2])
if c2.button('👍', use_container_width=True, disabled=not ss.get('output')):
ss['feedback'].send(+1, ss, details=ss['send_details'])
ss['feedback_score'] = ss['feedback'].get_score()
if c3.button('👎', use_container_width=True, disabled=not ss.get('output')):
ss['feedback'].send(-1, ss, details=ss['send_details'])
ss['feedback_score'] = ss['feedback'].get_score()
score = ss.get('feedback_score',0)
c5.write(f'feedback score: {score}')
c4.checkbox('send details', True, key='send_details',
help='allow question and the answer to be stored in the ask-my-pdf feedback database')
#c1,c2,c3 = st.columns([1,3,1])
#c2.radio('zzz',['👍',r'...',r'👎'],horizontal=True,label_visibility="collapsed")
#
disabled = (not ss.get('api_key') and not ss.get('community_pct',0)) or not ss.get('index')
if c1.button('get answer', disabled=disabled, type='primary', use_container_width=True):
question = ss.get('question','')
temperature = ss.get('temperature', 0.0)
hyde = ss.get('use_hyde')
hyde_prompt = ss.get('hyde_prompt')
if ss.get('use_hyde_summary'):
summary = ss['index']['summary']
hyde_prompt += f" Context: {summary}\n\n"
task = ss.get('task')
max_frags = ss.get('max_frags',1)
n_before = ss.get('n_frag_before',0)
n_after = ss.get('n_frag_after',0)
index = ss.get('index',{})
with st.spinner('preparing answer'):
resp = model.query(question, index,
task=task,
temperature=temperature,
hyde=hyde,
hyde_prompt=hyde_prompt,
max_frags=max_frags,
limit=max_frags+2,
n_before=n_before,
n_after=n_after,
model=ss['model'],
)
usage = resp.get('usage',{})
usage['cnt'] = 1
ss['debug']['model.query.resp'] = resp
ss['debug']['resp.usage'] = usage
ss['debug']['model.vector_query_time'] = resp['vector_query_time']
q = question.strip()
a = resp['text'].strip()
ss['answer'] = a
output_add(q,a)
st.experimental_rerun() # to enable the feedback buttons
def b_clear():
if st.button('clear output'):
ss['output'] = ''
def b_reindex():
# TODO: disabled
if st.button('reindex'):
index_pdf_file()
def b_reload():
if st.button('reload prompts'):
import importlib
importlib.reload(prompts)
def b_save():
db = ss.get('storage')
index = ss.get('index')
name = ss.get('filename')
api_key = ss.get('api_key')
disabled = not api_key or not db or not index or not name
help = "The file will be stored for about 90 days. Available only when using your own API key."
if st.button('save encrypted index in ask-my-pdf', disabled=disabled, help=help):
with st.spinner('saving to ask-my-pdf'):
db.put(name, index)
def b_delete():
db = ss.get('storage')
name = ss.get('selected_file')
# TODO: confirm delete
if st.button('delete from ask-my-pdf', disabled=not db or not name):
with st.spinner('deleting from ask-my-pdf'):
db.delete(name)
#st.experimental_rerun()
def output_add(q,a):
if 'output' not in ss: ss['output'] = ''
q = q.replace('$',r'\$')
a = a.replace('$',r'\$')
new = f'#### {q}\n{a}\n\n'
ss['output'] = new + ss['output']
# LAYOUT
with st.sidebar:
ui_info()
ui_spacer(2)
with st.expander('advanced'):
ui_show_debug()
b_clear()
ui_model()
ui_fragments()
ui_fix_text()
ui_hyde()
ui_hyde_summary()
ui_temperature()
b_reload()
ui_task_template()
ui_task()
ui_hyde_prompt()
ui_api_key()
ui_pdf_file()
ui_question()
ui_hyde_answer()
b_ask()
ui_output()
ui_debug()
================================================
FILE: src/model.py
================================================
from sklearn.metrics.pairwise import cosine_distances
import datetime
from collections import Counter
from time import time as now
import hashlib
import re
import io
import os
import pdf
import ai
def use_key(api_key):
ai.use_key(api_key)
def set_user(user):
ai.set_user(user)
def query_by_vector(vector, index, limit=None):
"return (ids, distances and texts) sorted by cosine distance"
vectors = index['vectors']
texts = index['texts']
#
sim = cosine_distances([vector], vectors)[0]
#
id_dist_list = list(enumerate(sim))
id_dist_list.sort(key=lambda x:x[1])
id_list = [x[0] for x in id_dist_list][:limit]
dist_list = [x[1] for x in id_dist_list][:limit]
text_list = [texts[x] for x in id_list] if texts else ['ERROR']*len(id_list)
return id_list, dist_list, text_list
def get_vectors(text_list):
"transform texts into embedding vectors"
batch_size = 128
vectors = []
usage = Counter()
for i,texts in enumerate(batch(text_list, batch_size)):
resp = ai.embeddings(texts)
v = resp['vectors']
u = resp['usage']
u['call_cnt'] = 1
usage.update(u)
vectors.extend(v)
return {'vectors':vectors, 'usage':dict(usage), 'model':resp['model']}
def index_file(f, filename, fix_text=False, frag_size=0, cache=None):
"return vector index (dictionary) for a given PDF file"
# calc md5
h = hashlib.md5()
h.update(f.read())
md5 = h.hexdigest()
filesize = f.tell()
f.seek(0)
#
t0 = now()
pages = pdf.pdf_to_pages(f)
t1 = now()
if fix_text:
for i in range(len(pages)):
pages[i] = fix_text_problems(pages[i])
texts = split_pages_into_fragments(pages, frag_size)
t2 = now()
if cache:
cache_key = f'get_vectors:{md5}:{frag_size}:{fix_text}'
resp = cache.call(cache_key, get_vectors, texts)
else:
resp = get_vectors(texts)
t3 = now()
vectors = resp['vectors']
summary_prompt = f"{texts[0]}\n\nDescribe the document from which the fragment is extracted. Omit any details.\n\n" # TODO: move to prompts.py
summary = ai.complete(summary_prompt)
t4 = now()
usage = resp['usage']
out = {}
out['frag_size'] = frag_size
out['n_pages'] = len(pages)
out['n_texts'] = len(texts)
out['texts'] = texts
out['pages'] = pages
out['vectors'] = vectors
out['summary'] = summary['text']
out['filename'] = filename
out['filehash'] = f'md5:{md5}'
out['filesize'] = filesize
out['usage'] = usage
out['model'] = resp['model']
out['time'] = {'pdf_to_pages':t1-t0, 'split_pages':t2-t1, 'get_vectors':t3-t2, 'summary':t4-t3}
out['size'] = len(texts) # DEPRECATED -> filesize
out['hash'] = f'md5:{md5}' # DEPRECATED -> filehash
return out
def split_pages_into_fragments(pages, frag_size):
"split pages (list of texts) into smaller fragments (list of texts)"
page_offset = [0]
for p,page in enumerate(pages):
page_offset += [page_offset[-1]+len(page)+1]
# TODO: del page_offset[-1] ???
if frag_size:
text = ' '.join(pages)
return text_to_fragments(text, frag_size, page_offset)
else:
return pages
def text_to_fragments(text, size, page_offset):
"split single text into smaller fragments (list of texts)"
if size and len(text)>size:
out = []
pos = 0
page = 1
p_off = page_offset.copy()[1:]
eos = find_eos(text)
if len(text) not in eos:
eos += [len(text)]
for i in range(len(eos)):
if eos[i]-pos>size:
text_fragment = f'PAGE({page}):\n'+text[pos:eos[i]]
out += [text_fragment]
pos = eos[i]
if eos[i]>p_off[0]:
page += 1
del p_off[0]
# ugly: last iter
text_fragment = f'PAGE({page}):\n'+text[pos:eos[i]]
out += [text_fragment]
#
out = [x for x in out if x]
return out
else:
return [text]
def find_eos(text):
"return list of all end-of-sentence offsets"
return [x.span()[1] for x in re.finditer('[.!?。]\s+',text)]
###############################################################################
def fix_text_problems(text):
"fix common text problems"
text = re.sub('\s+[-]\s+','',text) # word continuation in the next line
return text
def query(text, index, task=None, temperature=0.0, max_frags=1, hyde=False, hyde_prompt=None, limit=None, n_before=1, n_after=1, model=None):
"get dictionary with the answer for the given question (text)."
out = {}
if hyde:
# TODO: model param
out['hyde'] = hypotetical_answer(text, index, hyde_prompt=hyde_prompt, temperature=temperature)
# TODO: usage
# RANK FRAGMENTS
if hyde:
resp = ai.embedding(out['hyde']['text'])
# TODO: usage
else:
resp = ai.embedding(text)
# TODO: usage
v = resp['vector']
t0 = now()
id_list, dist_list, text_list = query_by_vector(v, index, limit=limit)
dt0 = now()-t0
# BUILD PROMPT
# select fragments
N_BEFORE = 1 # TODO: param
N_AFTER = 1 # TODO: param
selected = {} # text id -> rank
for rank,id in enumerate(id_list):
for x in range(id-n_before, id+1+n_after):
if x not in selected and x>=0 and x<index['size']:
selected[x] = rank
selected2 = [(id,rank) for id,rank in selected.items()]
selected2.sort(key=lambda x:(x[1],x[0]))
# build context
SEPARATOR = '\n---\n'
context = ''
context_len = 0
frag_list = []
for id,rank in selected2:
frag = index['texts'][id]
frag_len = ai.get_token_count(frag)
if context_len+frag_len <= 3000: # TODO: remove hardcode
context += SEPARATOR + frag # add separator and text fragment
frag_list += [frag]
context_len = ai.get_token_count(context)
out['context_len'] = context_len
prompt = f"""
{task or 'Task: Answer question based on context.'}
Context:
{context}
Question: {text}
Answer:""" # TODO: move to prompts.py
# GET ANSWER
resp2 = ai.complete(prompt, temperature=temperature, model=model)
answer = resp2['text']
usage = resp2['usage']
# OUTPUT
out['vector_query_time'] = dt0
out['id_list'] = id_list
out['dist_list'] = dist_list
out['selected'] = selected
out['selected2'] = selected2
out['frag_list'] = frag_list
#out['query.vector'] = resp['vector']
out['usage'] = usage
out['prompt'] = prompt
out['model'] = resp2['model']
# CORE
out['text'] = answer
return out
def hypotetical_answer(text, index, hyde_prompt=None, temperature=0.0):
"get hypotethical answer for the question (text)"
hyde_prompt = hyde_prompt or 'Write document that answers the question.'
prompt = f"""
{hyde_prompt}
Question: "{text}"
Document:""" # TODO: move to prompts.py
resp = ai.complete(prompt, temperature=temperature)
return resp
def community_tokens_available_pct():
used = ai.get_community_usage_cost()
limit = float(os.getenv('COMMUNITY_DAILY_USD',0))
pct = (100.0 * (limit-used) / limit) if limit else 0
pct = max(0, pct)
pct = min(100, pct)
return pct
def community_tokens_refresh_in():
x = datetime.datetime.now()
dt = (x.replace(hour=23, minute=59, second=59) - x).seconds
h = dt // 3600
m = dt % 3600 // 60
return f"{h} h {m} min"
# util
def batch(data, n):
for i in range(0, len(data), n):
yield data[i:i+n]
if __name__=="__main__":
print(text_to_fragments("to jest. test tego. programu", size=3, page_offset=[0,5,10,15,20]))
================================================
FILE: src/pdf.py
================================================
"PDF adapter"
import pypdf
def pdf_to_pages(file):
"extract text (pages) from pdf file"
pages = []
pdf = pypdf.PdfReader(file)
for p in range(len(pdf.pages)):
page = pdf.pages[p]
text = page.extract_text()
pages += [text]
return pages
================================================
FILE: src/prompts.py
================================================
# INFO: some prompts are still in model.py
# TODO: Ignore OCR problems in the text below.
TASK = {
'v6': (
"Answer the question truthfully based on the text below. "
"Include verbatim quote and a comment where to find it in the text (page number). "
#"After the quote write a step by step explanation in a new paragraph. "
"After the quote write a step by step explanation. "
"Use bullet points. "
#"After that try to rephrase the original question so it might give better results. "
),
'v5': (
"Answer the question truthfully based on the text below. "
"Include at least one verbatim quote (marked with quotation marks) and a comment where to find it in the text (ie name of the section and page number). "
"Use ellipsis in the quote to omit irrelevant parts of the quote. "
"After the quote write (in the new paragraph) a step by step explanation to be sure we have the right answer "
"(use bullet-points in separate lines)" #, adjust the language for a young reader). "
"After the explanation check if the Answer is consistent with the Context and doesn't require external knowledge. "
"In a new line write 'SELF-CHECK OK' if the check was successful and 'SELF-CHECK FAILED' if it failed. "
),
'v4':
"Answer the question truthfully based on the text below. " \
"Include verbatim quote and a comment where to find it in the text (ie name of the section and page number). " \
"After the quote write an explanation (in the new paragraph) for a young reader.",
'v3': 'Answer the question truthfully based on the text below. Include verbatim quote and a comment where to find it in the text (ie name of the section and page number).',
'v2': 'Answer question based on context. The answers sould be elaborate and based only on the context.',
'v1': 'Answer question based on context.',
# 'v5':
# "Generate a comprehensive and informative answer for a given question solely based on the provided document fragments. " \
# "You must only use information from the provided fragments. Use an unbiased and journalistic tone. Combine fragments together into coherent answer. " \
# "Do not repeat text. Cite fragments using [${number}] notation. Only cite the most relevant fragments that answer the question accurately. " \
# "If different fragments refer to different entities with the same name, write separate answer for each entity.",
}
HYDE = "Write an example answer to the following question. Don't write generic answer, just assume everything that is not known."
# TODO
SUMMARY = {
'v2':'Describe the document from which the fragment is extracted. Omit any details.',
'v1':'Describe the document from which the fragment is extracted. Do not describe the fragment, focus on figuring out what kind document it is.',
}
================================================
FILE: src/run.bat
================================================
set STORAGE_MODE=LOCAL
set CACHE_MODE=DISK
set STORAGE_PATH=../data/storage
set CACHE_PATH=../data/cache
streamlit run gui.py
================================================
FILE: src/run.sh
================================================
STORAGE_MODE=LOCAL
CACHE_MODE=DISK
STORAGE_PATH=../data/storage
CACHE_PATH=../data/cache
streamlit run gui.py
================================================
FILE: src/stats.py
================================================
import redis
from time import strftime
import os
from retry import retry
class Stats:
def __init__(self):
self.config = {}
def render(self, key):
variables = dict(
date = strftime('%Y-%m-%d'),
hour = strftime('%H'),
)
variables.update(self.config)
for k,v in variables.items():
key = key.replace('['+k+']',v)
return key
class DictStats(Stats):
def __init__(self, data_dict):
self.data = data_dict
self.config = {}
def incr(self, key, kv_dict):
data = self.data
key = self.render(key)
if key not in data:
data[key] = {}
for member,val in kv_dict.items():
member = self.render(member)
data[key][member] = data[key].get(member,0) + val
def get(self, key):
key = self.render(key)
return self.data.get(key, {})
class RedisStats(Stats):
def __init__(self):
REDIS_URL = os.getenv('REDIS_URL')
if not REDIS_URL:
raise Exception('No Redis configuration in environment variables!')
self.db = redis.Redis.from_url(REDIS_URL)
self.config = {}
@retry(tries=5, delay=0.1)
def incr(self, key, kv_dict):
# TODO: non critical code -> safe exceptions
key = self.render(key)
p = self.db.pipeline()
for member,val in kv_dict.items():
member = self.render(member)
self.db.zincrby(key, val, member)
p.execute()
@retry(tries=5, delay=0.1)
def get(self, key):
# TODO: non critical code -> safe exceptions
key = self.render(key)
items = self.db.zscan_iter(key)
return {k.decode('utf8'):v for k,v in items}
stats_data_dict = {}
def get_stats(**kw):
MODE = os.getenv('STATS_MODE','').upper()
if MODE=='REDIS':
stats = RedisStats()
else:
stats = DictStats(stats_data_dict)
stats.config.update(kw)
return stats
if __name__=="__main__":
s1 = get_stats(user='maciek')
s1.incr('aaa:[date]:[user]', dict(a=1,b=2))
s1.incr('aaa:[date]:[user]', dict(a=1,b=2))
print(s1.data)
print(s1.get('aaa:[date]:[user]'))
#
s2 = get_stats(user='kerbal')
s2.incr('aaa:[date]:[user]', dict(a=1,b=2))
s2.incr('aaa:[date]:[user]', dict(a=1,b=2))
print(s2.data)
print(s2.get('aaa:[date]:[user]'))
================================================
FILE: src/storage.py
================================================
"Storage adapter - one folder for each user / api_key"
# pip install pycryptodome
# REF: https://www.pycryptodome.org/src/cipher/aes
from Crypto.Cipher import AES
from Crypto.Util.Padding import pad,unpad
from retry import retry
from binascii import hexlify,unhexlify
import hashlib
import pickle
import zlib
import os
import io
# pip install boto3
import boto3
import botocore
SALT = unhexlify(os.getenv('STORAGE_SALT','00'))
class Storage:
"Encrypted object storage (base class)"
def __init__(self, secret_key):
k = secret_key.encode()
self.folder = hashlib.blake2s(k, salt=SALT, person=b'folder', digest_size=8).hexdigest()
self.passwd = hashlib.blake2s(k, salt=SALT, person=b'passwd', digest_size=32).hexdigest()
self.AES_MODE = AES.MODE_ECB # TODO: better AES mode ???
self.AES_BLOCK_SIZE = 16
def get(self, name, default=None):
"get one object from the folder"
safe_name = self.encode(name)
data = self._get(safe_name)
obj = self.deserialize(data)
return obj
def put(self, name, obj):
"put the object into the folder"
safe_name = self.encode(name)
data = self.serialize(obj)
self._put(safe_name, data)
return data
def list(self):
"list object names from the folder"
return [self.decode(name) for name in self._list()]
def delete(self, name):
"delete the object from the folder"
safe_name = self.encode(name)
self._delete(safe_name)
# IMPLEMENTED IN SUBCLASSES
def _put(self, name, data):
...
def _get(self, name):
...
def _delete(self, name):
pass
def _list(self):
...
# # #
def serialize(self, obj):
raw = pickle.dumps(obj)
compressed = self.compress(raw)
encrypted = self.encrypt(compressed)
return encrypted
def deserialize(self, encrypted):
compressed = self.decrypt(encrypted)
raw = self.decompress(compressed)
obj = pickle.loads(raw)
return obj
def encrypt(self, raw):
cipher = AES.new(unhexlify(self.passwd), self.AES_MODE)
return cipher.encrypt(pad(raw, self.AES_BLOCK_SIZE))
def decrypt(self, encrypted):
cipher = AES.new(unhexlify(self.passwd), self.AES_MODE)
return unpad(cipher.decrypt(encrypted), self.AES_BLOCK_SIZE)
def compress(self, data):
return zlib.compress(data)
def decompress(self, data):
return zlib.decompress(data)
def encode(self, name):
return hexlify(name.encode('utf8')).decode('utf8')
def decode(self, name):
return unhexlify(name).decode('utf8')
class DictStorage(Storage):
"Dictionary based storage"
def __init__(self, secret_key, data_dict):
super().__init__(secret_key)
self.data = data_dict
def _put(self, name, data):
if self.folder not in self.data:
self.data[self.folder] = {}
self.data[self.folder][name] = data
def _get(self, name):
return self.data[self.folder][name]
def _list(self):
# TODO: sort by modification time (reverse=True)
return list(self.data.get(self.folder,{}).keys())
def _delete(self, name):
del self.data[self.folder][name]
class LocalStorage(Storage):
"Local filesystem based storage"
def __init__(self, secret_key, path):
if not path:
raise Exception('No storage path in environment variables!')
super().__init__(secret_key)
self.path = os.path.join(path, self.folder)
if not os.path.exists(self.path):
os.makedirs(self.path)
def _put(self, name, data):
with open(os.path.join(self.path, name), 'wb') as f:
f.write(data)
def _get(self, name):
with open(os.path.join(self.path, name), 'rb') as f:
data = f.read()
return data
def _list(self):
# TODO: sort by modification time (reverse=True)
return os.listdir(self.path)
def _delete(self, name):
os.remove(os.path.join(self.path, name))
class S3Storage(Storage):
"S3 based encrypted storage"
def __init__(self, secret_key, **kw):
prefix = kw.get('prefix') or os.getenv('S3_PREFIX','index/x1')
region = kw.get('region') or os.getenv('S3_REGION','sfo3')
bucket = kw.get('bucket') or os.getenv('S3_BUCKET','ask-my-pdf')
url = kw.get('url') or os.getenv('S3_URL',f'https://{region}.digitaloceanspaces.com')
key = os.getenv('S3_KEY','')
secret = os.getenv('S3_SECRET','')
#
if not key or not secret:
raise Exception("No S3 credentials in environment variables!")
#
super().__init__(secret_key)
self.session = boto3.session.Session()
self.s3 = self.session.client('s3',
config=botocore.config.Config(s3={'addressing_style': 'virtual'}),
region_name=region,
endpoint_url=url,
aws_access_key_id=key,
aws_secret_access_key=secret,
)
self.bucket = bucket
self.prefix = prefix
def get_key(self, name):
return f'{self.prefix}/{self.folder}/{name}'
def _put(self, name, data):
key = self.get_key(name)
f = io.BytesIO(data)
self.s3.upload_fileobj(f, self.bucket, key)
def _get(self, name):
key = self.get_key(name)
f = io.BytesIO()
self.s3.download_fileobj(self.bucket, key, f)
f.seek(0)
return f.read()
def _list(self):
resp = self.s3.list_objects(
Bucket=self.bucket,
Prefix=self.get_key('')
)
contents = resp.get('Contents',[])
contents.sort(key=lambda x:x['LastModified'], reverse=True)
keys = [x['Key'] for x in contents]
names = [x.split('/')[-1] for x in keys]
return names
def _delete(self, name):
self.s3.delete_object(
Bucket=self.bucket,
Key=self.get_key(name)
)
def get_storage(api_key, data_dict):
"get storage adapter configured in environment variables"
mode = os.getenv('STORAGE_MODE','').upper()
path = os.getenv('STORAGE_PATH','')
if mode=='S3':
storage = S3Storage(api_key)
elif mode=='LOCAL':
storage = LocalStorage(api_key, path)
else:
storage = DictStorage(api_key, data_dict)
return storage
gitextract_zi78g5yh/
├── .gitignore
├── LICENSE
├── README.md
├── requirements.txt
└── src/
├── ai.py
├── cache.py
├── css.py
├── feedback.py
├── gui.py
├── model.py
├── pdf.py
├── prompts.py
├── run.bat
├── run.sh
├── stats.py
└── storage.py
SYMBOL INDEX (140 symbols across 8 files)
FILE: src/ai.py
function use_key (line 7) | def use_key(key):
function set_user (line 11) | def set_user(user):
function complete (line 17) | def complete(text, **kw):
function embedding (line 25) | def embedding(text, **kw):
function embeddings (line 32) | def embeddings(texts, **kw):
function get_token_count (line 40) | def get_token_count(text):
function stats_callback (line 43) | def stats_callback(out, resp, self):
function get_community_usage_cost (line 60) | def get_community_usage_cost():
FILE: src/cache.py
class Cache (line 13) | class Cache:
method __init__ (line 15) | def __init__(self):
method put (line 18) | def put(self, key, obj):
method get (line 21) | def get(self, key):
method has (line 24) | def has(self, key):
method delete (line 27) | def delete(self, key):
method serialize (line 30) | def serialize(self, obj):
method deserialize (line 35) | def deserialize(self, data):
method compress (line 40) | def compress(self, data):
method decompress (line 43) | def decompress(self, data):
method encode (line 46) | def encode(self, name):
method decode (line 49) | def decode(self, name):
method call (line 52) | def call(self, key, fun, *a, **kw):
class DiskCache (line 61) | class DiskCache(Cache):
method __init__ (line 64) | def __init__(self, root):
method path (line 67) | def path(self, key):
method put (line 70) | def put(self, key, obj):
method get (line 76) | def get(self, key):
method has (line 83) | def has(self, key):
method delete (line 87) | def delete(self, key):
class S3Cache (line 92) | class S3Cache(Cache):
method __init__ (line 95) | def __init__(self, **kw):
method get_s3_key (line 117) | def get_s3_key(self, key):
method put (line 120) | def put(self, key, obj):
method get (line 126) | def get(self, key, default=None):
method has (line 139) | def has(self, key):
method delete (line 147) | def delete(self, key):
function get_cache (line 153) | def get_cache(**kw):
FILE: src/feedback.py
function hexdigest (line 7) | def hexdigest(text):
function as_int (line 10) | def as_int(x):
class Feedback (line 13) | class Feedback:
method __init__ (line 15) | def __init__(self, user):
method send (line 17) | def send(self, score, ctx, details=False):
method get_score (line 19) | def get_score(self):
class RedisFeedback (line 22) | class RedisFeedback(Feedback):
method __init__ (line 24) | def __init__(self, user):
method send (line 33) | def send(self, score, ctx, details=False):
method get_score (line 79) | def get_score(self):
function get_feedback_adapter (line 84) | def get_feedback_adapter(user):
FILE: src/gui.py
function on_api_key_change (line 30) | def on_api_key_change():
function ui_spacer (line 53) | def ui_spacer(n=2, line=False, next_n=0):
function ui_info (line 61) | def ui_info():
function ui_api_key (line 81) | def ui_api_key():
function index_pdf_file (line 99) | def index_pdf_file():
function debug_index (line 109) | def debug_index():
function ui_pdf_file (line 122) | def ui_pdf_file():
function ui_show_debug (line 151) | def ui_show_debug():
function ui_fix_text (line 154) | def ui_fix_text():
function ui_temperature (line 157) | def ui_temperature():
function ui_fragments (line 161) | def ui_fragments():
function ui_model (line 169) | def ui_model():
function ui_hyde (line 174) | def ui_hyde():
function ui_hyde_summary (line 177) | def ui_hyde_summary():
function ui_task_template (line 180) | def ui_task_template():
function ui_task (line 183) | def ui_task():
function ui_hyde_prompt (line 187) | def ui_hyde_prompt():
function ui_question (line 190) | def ui_question():
function ui_hyde_answer (line 196) | def ui_hyde_answer():
function ui_output (line 200) | def ui_output():
function ui_debug (line 204) | def ui_debug():
function b_ask (line 210) | def b_ask():
function b_clear (line 263) | def b_clear():
function b_reindex (line 267) | def b_reindex():
function b_reload (line 272) | def b_reload():
function b_save (line 277) | def b_save():
function b_delete (line 288) | def b_delete():
function output_add (line 297) | def output_add(q,a):
FILE: src/model.py
function use_key (line 14) | def use_key(api_key):
function set_user (line 17) | def set_user(user):
function query_by_vector (line 20) | def query_by_vector(vector, index, limit=None):
function get_vectors (line 34) | def get_vectors(text_list):
function index_file (line 48) | def index_file(f, filename, fix_text=False, frag_size=0, cache=None):
function split_pages_into_fragments (line 96) | def split_pages_into_fragments(pages, frag_size):
function text_to_fragments (line 108) | def text_to_fragments(text, size, page_offset):
function find_eos (line 135) | def find_eos(text):
function fix_text_problems (line 141) | def fix_text_problems(text):
function query (line 146) | def query(text, index, task=None, temperature=0.0, max_frags=1, hyde=Fal...
function hypotetical_answer (line 223) | def hypotetical_answer(text, index, hyde_prompt=None, temperature=0.0):
function community_tokens_available_pct (line 234) | def community_tokens_available_pct():
function community_tokens_refresh_in (line 243) | def community_tokens_refresh_in():
function batch (line 251) | def batch(data, n):
FILE: src/pdf.py
function pdf_to_pages (line 5) | def pdf_to_pages(file):
FILE: src/stats.py
class Stats (line 6) | class Stats:
method __init__ (line 7) | def __init__(self):
method render (line 10) | def render(self, key):
class DictStats (line 21) | class DictStats(Stats):
method __init__ (line 22) | def __init__(self, data_dict):
method incr (line 26) | def incr(self, key, kv_dict):
method get (line 35) | def get(self, key):
class RedisStats (line 40) | class RedisStats(Stats):
method __init__ (line 41) | def __init__(self):
method incr (line 49) | def incr(self, key, kv_dict):
method get (line 59) | def get(self, key):
function get_stats (line 67) | def get_stats(**kw):
FILE: src/storage.py
class Storage (line 23) | class Storage:
method __init__ (line 26) | def __init__(self, secret_key):
method get (line 33) | def get(self, name, default=None):
method put (line 40) | def put(self, name, obj):
method list (line 47) | def list(self):
method delete (line 51) | def delete(self, name):
method _put (line 57) | def _put(self, name, data):
method _get (line 59) | def _get(self, name):
method _delete (line 61) | def _delete(self, name):
method _list (line 63) | def _list(self):
method serialize (line 68) | def serialize(self, obj):
method deserialize (line 74) | def deserialize(self, encrypted):
method encrypt (line 80) | def encrypt(self, raw):
method decrypt (line 84) | def decrypt(self, encrypted):
method compress (line 88) | def compress(self, data):
method decompress (line 91) | def decompress(self, data):
method encode (line 94) | def encode(self, name):
method decode (line 97) | def decode(self, name):
class DictStorage (line 101) | class DictStorage(Storage):
method __init__ (line 104) | def __init__(self, secret_key, data_dict):
method _put (line 108) | def _put(self, name, data):
method _get (line 113) | def _get(self, name):
method _list (line 116) | def _list(self):
method _delete (line 120) | def _delete(self, name):
class LocalStorage (line 124) | class LocalStorage(Storage):
method __init__ (line 127) | def __init__(self, secret_key, path):
method _put (line 135) | def _put(self, name, data):
method _get (line 139) | def _get(self, name):
method _list (line 144) | def _list(self):
method _delete (line 148) | def _delete(self, name):
class S3Storage (line 152) | class S3Storage(Storage):
method __init__ (line 155) | def __init__(self, secret_key, **kw):
method get_key (line 178) | def get_key(self, name):
method _put (line 181) | def _put(self, name, data):
method _get (line 186) | def _get(self, name):
method _list (line 193) | def _list(self):
method _delete (line 204) | def _delete(self, name):
function get_storage (line 210) | def get_storage(api_key, data_dict):
Condensed preview — 16 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (45K chars).
[
{
"path": ".gitignore",
"chars": 37,
"preview": "**/__pycache__\n**/*secret*.*\ndata/**\n"
},
{
"path": "LICENSE",
"chars": 1071,
"preview": "MIT License\n\nCopyright (c) 2023 Maciej Obarski\n\nPermission is hereby granted, free of charge, to any person obtaining a "
},
{
"path": "README.md",
"chars": 3047,
"preview": "# Ask my PDF\n\n\n\nThank you for your interest in my application. Please be aware that this is only a **Proof of Concept sy"
},
{
"path": "requirements.txt",
"chars": 112,
"preview": "git+https://github.com/mobarski/ai-bricks.git\nstreamlit\npypdf\nscikit-learn\nnumpy\npycryptodome\nboto3\nredis\nretry\n"
},
{
"path": "src/ai.py",
"chars": 2179,
"preview": "from ai_bricks.api import openai\nimport stats\nimport os\n\nDEFAULT_USER = os.getenv('COMMUNITY_USER','')\n\ndef use_key(key)"
},
{
"path": "src/cache.py",
"chars": 3745,
"preview": "from retry import retry\n\nfrom binascii import hexlify,unhexlify\nimport pickle\nimport zlib\nimport io\nimport os\n\n# pip ins"
},
{
"path": "src/css.py",
"chars": 151,
"preview": "v1 = \"\"\"\n/* feedback checkbox */\n.css-18fuwiq {\n position: relative;\n padding-top: 6px;\n}\n.css-949r0i {\n position: relat"
},
{
"path": "src/feedback.py",
"chars": 2695,
"preview": "import datetime\nimport hashlib\nimport redis\nimport os\nfrom retry import retry\n\ndef hexdigest(text):\n\treturn hashlib.md5("
},
{
"path": "src/gui.py",
"chars": 10331,
"preview": "__version__ = \"0.4.8.3\"\napp_name = \"Ask my PDF\"\n\n\n# BOILERPLATE\n\nimport streamlit as st\nst.set_page_config(layout='cente"
},
{
"path": "src/model.py",
"chars": 7042,
"preview": "from sklearn.metrics.pairwise import cosine_distances\n\nimport datetime\nfrom collections import Counter\nfrom time import "
},
{
"path": "src/pdf.py",
"chars": 248,
"preview": "\"PDF adapter\"\n\nimport pypdf\n\ndef pdf_to_pages(file):\n\t\"extract text (pages) from pdf file\"\n\tpages = []\n\tpdf = pypdf.PdfR"
},
{
"path": "src/prompts.py",
"chars": 2785,
"preview": "# INFO: some prompts are still in model.py\n\n# TODO: Ignore OCR problems in the text below.\n\nTASK = {\n\t'v6': (\n\t\t\t\"Answer"
},
{
"path": "src/run.bat",
"chars": 128,
"preview": "set STORAGE_MODE=LOCAL\nset CACHE_MODE=DISK\n\nset STORAGE_PATH=../data/storage\nset CACHE_PATH=../data/cache\n\nstreamlit run"
},
{
"path": "src/run.sh",
"chars": 112,
"preview": "STORAGE_MODE=LOCAL\nCACHE_MODE=DISK\n\nSTORAGE_PATH=../data/storage\nCACHE_PATH=../data/cache\n\nstreamlit run gui.py\n"
},
{
"path": "src/stats.py",
"chars": 2078,
"preview": "import redis\nfrom time import strftime\nimport os\nfrom retry import retry\n\nclass Stats:\n\tdef __init__(self):\n\t\tself.confi"
},
{
"path": "src/storage.py",
"chars": 5689,
"preview": "\"Storage adapter - one folder for each user / api_key\"\n\n# pip install pycryptodome\n# REF: https://www.pycryptodome.org/s"
}
]
About this extraction
This page contains the full source code of the mobarski/ask-my-pdf GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 16 files (40.5 KB), approximately 12.1k tokens, and a symbol index with 140 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.