Repository: rberenguel/glancer
Branch: main
Commit: 303d7663c6ae
Files: 15
Total size: 4.9 MB

Directory structure:
gitextract_0swtfc8v/
├── .gitattributes
├── .gitignore
├── LICENSE
├── README.md
├── Setup.hs
├── example/
│   └── internals-pyspark-arrow.html
├── glancer.cabal
├── src/
│   ├── Captions.hs
│   ├── Html.hs
│   ├── Main.hs
│   ├── Parser.hs
│   └── Process.hs
├── stack.yaml
├── test/
│   └── Spec.hs
└── this.code-workspace

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitattributes
================================================
* linguist-vendored
*.hs linguist-vendored=false

================================================
FILE: .gitignore
================================================
.stack-work
.hie

================================================
FILE: LICENSE
================================================
Copyright Ruben Berenguel (c) 2021

All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

    * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

    * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

    * Neither the name of Author name here nor the names of other contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

================================================
FILE: README.md
================================================
# Glancer

> **glancer**:
> NOUN _informal_ a person who glances
> **glance**:
> VERB If you glance at something or someone, you look at them very quickly and then look away again immediately.
> VERB If you glance through or at a newspaper, report, or book, you **spend a short time looking at it without reading it very carefully**.

---

[Regularly updated examples](https://github.com/rberenguel/glances)

---

- [Glancer](#glancer)
  - [Installation](#installation)
  - [Usage](#usage)
  - [Notes and TODOs](#notes-and-todos)

---

The number of online conferences has skyrocketed lately, I wonder why. This has caused my _Pending to watch_ list to balloon from 30-ish pending technical videos (which are already a lot) to more than 100. There are then 2 problems:

1. I have too many techie conference videos to watch
2. In a lot of cases I realise halfway through that the subject wasn't that interesting or that I already know the area to be covered.

For a long while I have had a similar problem with written articles. I solved it by:

1. Forcing myself to read a substantial amount by [writing a weekly list of the best ones](https://mostlymaths.net/tags/readings/)
2. Brutally abandoning any article that is not good enough to _possibly be_ in that list.

This is easy in writing: you can quickly scan the text and decide if it looks interesting enough for a deep dive in a few seconds (tech article reads range from a few minutes to around half an hour, depending on how technical it may get). But there is no way of doing it in videos! You need to watch maybe 15-20 minutes to then realise "meh".

_Glancer_ should help with this. Given a YouTube url, it will:

- Download the corresponding video (to a temporary folder),
- Download the auto-generated subtitles (assumes English, hardcoded),
- Capture images from the video every N=30 seconds (hardcoded for the moment),
- Convert the images to base64,
- Create a standalone webpage with the screenshots on the left and the corresponding text on the right.

The goal is to be able to glance at the talk to decide if you really want to watch it or not.

The _standalone_ part of the created webpage is to make it easier to "watch"/"share" to my iPad/iPhone without having to move a folder full of images. The whole talk becomes just a 5-15 MB HTML file.

A couple of additional neat (for me at least) features:

- Clicking/tapping on the image will enlarge it, in case you want to see some code block larger (I wanted hover, but it was too tricky on mobile).
- Clicking on the arrow on the lower-right of the slide block will open the video on YouTube, at that moment in time.

## Installation

```bash
git clone https://github.com/rberenguel/glancer
cd glancer
stack install
```

You will need to have installed/available in the path:

- The `base64` executable (should be in all *nix systems by default)
- `cat` in `/bin/cat` (likewise)
- [`yt-dlp`](https://github.com/yt-dlp/yt-dlp) installed. Note the _p_: there is currently a bug in plain `youtube-dl` with downloading auto-generated subtitles (again, this also happened a long time ago).
- The [stack](https://docs.haskellstack.org/en/stable/install_and_upgrade/) Haskell build tool

## Usage

```
Usage: glancer URL FILEPATH
  Glancer

Available options:
  URL                      Youtube URL
  FILEPATH                 HTML file name (don't add extension)
  -h,--help                Show this help text
```

In other words, `glancer https://www.youtube.com/watch?v=JWQxd3YKWhs internals-pyspark-arrow` would create the webpage `internals-pyspark-arrow.html` in the current folder, after processing the talk I gave at Spark Summit 2019. You can see the generated file [here](https://www.mostlymaths.net/glancer/example/internals-pyspark-arrow.html).

Sometimes `yt-dlp` won't be able to find the embedded YouTube video (I've seen this happen randomly in Spark Summit North America 2020 videos in databricks.com); in this case the process will fail. Try to feed it YouTube URLs directly.

## Notes and TODOs

- [ ] Making the time between images customizable via the CLI (if I find out 30 is not good enough in general).
- [ ] Add a test suite to harden subtitle parsing. I always think parsers will be small enough and that it will be "obvious" they work. It's never the case; at least I did it [right](https://github.com/rberenguel/haskset/blob/master/test/Spec.hs) [twice](https://github.com/rberenguel/bear-note-graph/blob/master/tests/test_parser.py).
- [ ] Make the still images video-dependent (so several `glancer` commands can run concurrently, even if it's a bad idea)
- [x] ~Some additional tweaks to the HTML/CSS (possibly adding some JS as well)~

## Similar projects

### [natural-language-youtube-search](https://github.com/haltakov/natural-language-youtube-search)

This project downloads the YouTube video, extracts every N-th frame and uses neural networks to classify the content of each slide, allowing you to search by text. Impressive!

---

_Note_: This README is long and winding on purpose.
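
As a rough illustration of two ideas described earlier in this README (a still every N=30 seconds, and a link that opens the video at that moment), here is a minimal Haskell sketch. It is not the code in `src/`: `captureOffsets` and `timestampedUrl` are hypothetical names, and building the link with a `t=<seconds>s` query parameter is an assumption about how such a URL could be formed.

```haskell
-- Illustrative sketch only; names are hypothetical and do not come from src/.

-- | Offsets (in seconds) at which a still would be captured, given a
-- fixed interval and the video duration. A non-positive interval yields
-- no offsets rather than looping forever.
captureOffsets :: Int -> Int -> [Int]
captureOffsets interval duration
  | interval <= 0 = []
  | otherwise = takeWhile (< duration) [0, interval ..]

-- | Link to a YouTube video starting at the given offset, assuming the
-- @t@ query parameter with a seconds suffix.
timestampedUrl :: String -> Int -> String
timestampedUrl videoId offset =
  "https://www.youtube.com/watch?v=" ++ videoId ++ "&t=" ++ show offset ++ "s"
```

Loaded in GHCi, `map (timestampedUrl "JWQxd3YKWhs") (captureOffsets 30 95)` would give one link per captured still. Keeping the interval as an argument instead of a constant is what the first TODO item would need.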

================================================
FILE: Setup.hs
================================================
import Distribution.Simple
main = defaultMain

================================================
FILE: example/internals-pyspark-arrow.html
================================================

Internals of Speeding up PySpark with Arrow - Ruben Berenguel (Consultant)

Created with glancer

a lot of people here are you're working with Spark right your hands yeah it's nice so I will start with what is pandas but if you are working with Python you probably already know this one then what is Arrow this may be new or maybe it's not new then a bit of how Spark works I mean we are in Spark Summit so probably you all know this but it's always good to remind how the pieces fit together
then I explain how PySpark works because maybe you don't know that one it's useful and then finally I'll cover how Arrow is making PySpark way way faster so I'm Ruben a mathematician I work as a consultant for an advertising company and I've been writing Python for something like 15 years I mean I don't know NumPy didn't exist back then then I moved to other languages but I still write a lot of Python day-to-day Scala and Python
now so the PySpark code base it's kind of a sweet spot because I have a bit of Python a bit of Java Scala it's a cool place to be so what is pandas I mean you may know it already it's a Python data analysis library if you see a job offer that says Python data pandas is involved in the code base somewhere and it's efficient why is it efficient because it's columnar and it has a C and Cython backend so
the mechanism is not written specifically in Python I mean the high-level API is but the rest is written in a very efficient implementation columnar appears a lot in modern databases modern storage systems and what's the point of something being columnar say we have a table with a lot of columns and we want to sum one of the columns if you Resta curve you - you have a cpu and you have a row you have a cache line inside of the CPU
if you're thinking in rows and you want to sum one column you need to go row by row loading the data and a little bit to the point where you can actually sum it when something is stored column wise you don't need to do it slowly you can just get everything from the column bring it into the cache and sum that's a bit faster if you were in the previous talk ie you kick that's really hard to pronounce but he
was doing the integration of SparkR with Arrow and he also explained how SIMD instructions make this fast so pandas is columnar how does it work internally at a high level because I don't want to get into the details because they are kind of complicated you have a data frame if you use pandas you know what it is it's basically a table with some names for the columns you have different types internally pandas is going to group
these types in memory by column index s of up to one place and then you bring together all the floats put them sideways in the fitting this way and bring them in memory same for Strings objects anything that is not an internal data type for pandas it's stored as an object block and integers there are several pandas data types that are
stored differently in memory all together the blocks are handled by a block manager and when you request something being done to a column you tell the block manager give me this column and run this operation now we move on quickly to Arrow what is Arrow it's a library so you are unlikely to be using it unless you are writing your own library I mean you are just a client through pandas or through Spark to the
Arrow library it's cross-language it's columnar it's in memory the key point is that it's cross-language so you can use the same internal structure in different languages and pass from one to the other and it's a bit optimized I mean that's the whole point of it it is written in C and there are implementations in Rust Go C Java obviously R there are not a lot of projects that are using it internally but pandas is one of
them Spark Parquet uses it for something I don't remember for what Dask which is a distributed computation framework kind of similar to Spark but written entirely in Python Ray which is basically something similar to Dask there is a POC of data processing written in Rust using only Arrow there's a connector from Java to Scala with Python that is still experimental and it's using Arrow as
well and integration between Arrow and pandas is seamless this is by design I mean Arrow was created by the same developers sorry that started pandas so the internal memory layouts are similar enough that the operations are really fast how does it look like in Arrow again you have a table this time I didn't bother
with the types and each set of rows is called a record batch I mean you can always call a row a record in database lingo so rows records and you bring all the columns put them sideways and add some metadata and consider that a record batch that's done for a certain number of rows and this can be done in streaming basically the metadata says up to this point is this column up to this point is
that other column and you can essentially skip the first if you only want the last one so you can do it in streaming by just getting blocks of rows one after the other so at the high level you can think of an Arrow table as a set of record batches it's not exactly that internally it's more complicated than this but from a high level this is good enough to understand how Arrow looks like well Arrow looks
if you think about the abstraction so we're using record batches on one side so blocks of rows on the other side we have a block manager that is handling vertical blocks they don't look exactly the same but in the end you can think of them as the same you can convert an Arrow table into a pandas dataframe really easily there's a method called to_pandas I mean if you think of it that way it's super easy but internally if you look at the implementation it's like you
get something from one side and shove it around the idea is that if you squint hard enough these are just vertical things that have similar colors that's more or less the idea you can convert record batches to block managers and you can convert block managers to record batches now further what is Spark we're in a Spark Summit so you know this I'll do this quicker than the
previous one which was looking off it's a distributed computation framework open source it's somewhat easy to use and scale horizontally and vertically how does it work you have a cluster manager that is going to give you computational resources you have some kind of distributed storage can be Hadoop in general HDFS in general your
driver is going to request resources from the cluster manager to compute some tasks so you have tasks you request resources and then some executors appear in the resources that you have been given but this doesn't really tell us much about Spark the important part that made Spark Spark is the RDD it's the ability to recompute things when they
fail that before in MapReduce wasn't as easy so here we have a happy RDD and when you have an RDD you have basically five properties or if you check out the class at the class level you have five but three required methods that you need to provide for an RDD to be not abstract basically one is partitions this is
where the data is or how do they make sense this one is optional it's the preferred locations so when you have these data you have here this can be I don't know you have loaded a file or you have created something in memory so it gets partitioned by executor preferred location is when you have some kind of sort or some operation that it's closer to the loaded data or the processed data and basically it says where
you can find it because sometimes you want to compute something on a group it's better if you do the computation in the machine that holds that group you need a compute method to bring partitions to other partitions after you have the computation for each partition the result of compute is a new RDD that has a computation being applied to the previous RDD you have the dependencies
this is the key point of the RDD when one of the partitions disappears because the machine that was holding it disappears because of the internet being large you can recompute only the partition because you can trace back through compute and the dependencies where this was coming from so eventually you can go back to the first original data source and compute just only that until you make it to the lost machine and dependencies it's basically RDDs tracing back in time via compute
until you go through some data source or data provider and finally you have a partitioner Spark is going to do something sensible if you don't provide it it needs to be a one-to-one method that can distribute the partitions across the different machines so going back to PySpark I'll refer to PySpark as the Python API on top of Spark otherwise I will say just Spark or
Scala Spark PySpark is a Python API for the core of Spark which is written in Scala as you already know how is it working inside on one side you are writing Python Python does not have a JVM I mean it runs in the Python virtual machine using a library known as Py4J which as the J can tell you it's
something to do with Java every time you create some object in PySpark this gets associated with an object in the JVM so when you start the Python driver in PySpark this happens at some point I wanted to show the code because this super short thing it's basically telling the JVM start Spark and keep listening and that's it and it's super super short because this is doing a lot of work basically the
starting Spark it's connecting back to your Python driver and it's keeping a reference in memory to that JVM so you can keep track of whatever you're doing down there now we have a happy RDD in Python like we've had RDDs as classes in Scala we have a class called RDD in Python you may have
inside of it you can look it up by creating your I mean you can open a Jupyter notebook create an RDD and check this yourself it is going to have a private in Python style underscore jrdd this is a reference to the object in the JVM so you can always trace back the RDD that you created in Python you didn't really create it in Python you created it in the JVM and it had a reference to
and well these J's appear everywhere in a lot of private properties and methods in the Python API and they stand for Java I mean I'm not a huge fan of JVM languages I like Scala and I like Python and that's it and Python it's not in the JVM of course the connection between these two is through this gateway this gateway object it's created at the beginning and it's passed through every time you create a new RDD I mean it's the main connection that
gets you back from Python to whatever is in Spark in PySpark there are two entry points to create this so you create an RDD that is it you just type blah blah blah parallelize whatever and then you have PipelinedRDD this one it's kind of internal so every time you want to map the partitions of an RDD in Python a PipelinedRDD is going to be created and PipelinedRDD is the one that is
actually doing the work so it's a bit like this matryoshka animated gif because you are creating things in Python and they are inside other things that are inside other things inside of the JVM it gets a bit confusing in the end you have PythonRDDs in the Scala side and PipelinedRDDs in the Python side this is transparent for the PySpark user you just write rdd.map(f) and it works but it's important to know what is
inside this matryoshka thing because when Arrow gets into the picture it will explain why it's faster so we have an angry RDD because it's written in Python we want to apply a map f f is a method in Python whatever this creates a PipelinedRDD it's internal the f
it's put into an internal property called func and we get a pointer to the previous RDD and we have a _jrdd of course this _jrdd it's where the magic is happening we have a PythonRDD in the JVM this is written in Scala obviously it has some dependencies because all RDDs need to have a
starting point this dependency goes back to the RDD we had originally because we are going to map it with an f in Python whatever it is still an RDD in Scala so you need to have dependencies and we have a compute method because it needs to be there the compute method needs to do something with f even if it's in Python you need to compute it it doesn't matter that's where the magic is happening in the end all the complication it's how do you compute something in the JVM that it's actually
in Python well you want to run something in the executors and that's going to be done by something called PythonRunner PythonRunner is going to start a worker it basically creates a Python process connects to it via a socket sends through the socket a lot of the data it makes the worker load all the libraries that you need and I had
drawings for this one but otherwise it makes no sense we are in Scala this time PythonRunner is inside of call it's created by compute so as soon as you want to materialize the computation this creates a Python worker in the executor and data goes through a socket so the f is being applied in the Python worker and the data is being sent back to the JVM as you can imagine there's a
lot of serialization going both ways and as you can imagine serializing is expensive always I mean basically you are sending pickles down and pickles up in Python and pickling and unpickling is slow very slow so the worker processes the streams of data connects back to the JVM loads all the libraries the libraries need to match the driver deserializes pickled things and applies the function and sends it all back yeah
do it to back this was covering RDDs but I guess you are not using RDDs directly anymore I'm not I use data frames because then I can optimize my well I don't need to optimize it I mean Catalyst is going to optimize the code for me I define a DAG and the DAG gets optimized so if you're using data frames that's when you get the advantages of
using Spark after 2.0 basically a plan is generated so you write all the transformations you want eventually you write or you count or you basically create an action this plan will convert into a logical plan optimized logical plan physical plan when you get to the physical plan something gets executed and all the compute methods start running because underlying the data frames there are still RDDs the plan is optimized by Catalyst which is basically something that removes branches from trees if
possible so it gets this logical plan and when you have a filter at the end it has to push it down as much as possible because the less data you move around the less data you're computing now depending on what you did in Python Catalyst is going to choose either the Python UDF runner or the Python Arrow runner that's the good one the one that uses Arrow the other one it's the old version which is using all
these pickling things so you have the UDF runner serialize deserialize and send back this is basically the same as the Python runner before and the only thing is that this is the data frame implementation and the first one was the RDD implementation but when you're using the other one I mean we just change the names we are sending record batches so internally in the worker we're converting to pandas remember this is super fast because Arrow and
pandas look the same internally we apply f to the column it is fast because it's columnar so everything it's kind of optimized for speed and we send the data back that's again kind of cheap because Arrow has a serializing format so all this movement and the computation in the Python side are faster already and it's like well according to the kind of official /
unofficial benchmarks you can get from 3x to 100x optimization just by changing a setting which is enabling Arrow and using UDFs defined as pandas UDFs either scalar or grouped maps or group what there are forth and you can basically take advantage of what you can do in pandas by converting to Arrow I want to show a couple of examples so you see what you can do and what speed
ups you get basic example is just toPandas basically if you do have toPandas in general I mean what I'm doing here is just creating a range a dataset with a power of two amount of data so here it's two to the 20 with a random column and just converting it to pandas converting to pandas brings all the data to the driver so it's always kind of a bad idea but sometimes you
need to do something in pandas I need to do it in the driver and that's what you get I'm just doing this this is with Arrow disabled and I want to compare how much faster it is when I just change that to enabled that's the only thing you need to do if you want to take advantage of Arrow aside from having Spark at least two well I timed this for different powers
of two it was in my local machine with local runners easiest benchmark possible and this gets to 18 seconds I couldn't do 23 or 24 because the driver was dying and then I did the Arrow version the Arrow version is consistently 10 times faster just in the local thing super easy just changing one setting so if you ever do toPandas just do this you get an awesome
speed up and it works for larger datasets because of the compression of how Arrow handles things now a little bit of a more complicated one which is a grouped map you may know that doing group by is always well it's not always a bad idea if you're using the Scala API group by had a lot of improvement in 2.2 2.4 when the aggregations are
computed they are computed incrementally so you never store the whole group in memory one problem that this still has is that it needs to hold the whole group in memory so it's a bit problematic but anyway what we want to do here is that I have created kind of some fake transactions with some spent amounts and I want to I don't know do some fraud analysis I compare each transaction with the mean for the same
user because maybe if you are spending a lot more than the mean it's like hey you've been hacked and you are doing something you shouldn't do this is very easy with the new way you can write these pandas aggregations with Arrow enabled I define a pandas UDF it's a grouped map because for each group you want to map and the map is basically a pandas method that it's assigning something so in this case it's just the
mean what's this doing so here we have some executors with a driver and we are grouping and aggregating we have some data the data started in the driver because it was created in the driver it is distributed because when you do a group by operation you need to problem everything is grouped in memory so each of the groups needs to
fit in the executors there's kind of ongoing work or there should be ongoing work soon about fixing this so it's incremental and not needing to use all the memory the thing is the design in the workers these are rows in DFS are they are row are parallel rows so it's being done by Arrow it's computed in pandas and it's sent back and since in the example the last thing was a toPandas to the driver what you send to the driver this is already computed it's on per group
you're finished so I wrote a similar version without using a pandas UDF I wrote this it's kind of bad it could be done a bit better but the problem is that if you want to do it well it gets really really messy so this is just group everything collect with a collect_list so you have for each group or so
for each user all the transactions for that user and then you convert everything to pandas pass through numpy and get the mean the problem is that the group toPandas it's applied to the groups that are coming with a collect_list and collect_list if it works it is slow and when it's too large it blows up executors well yeah in this case we are
still grouping so we are at least in the data as before we are sending the computation the problem is that right now we are doing the group in each machine as before this time it's incremental so you don't need to fit everything in memory it's fine but when you get here you are sending all the data to the driver and there's still a lot of work to be done in the driver because you just have grouped you still need to
do the mean for each group so this needs to be sent to pandas to the worker and it actually blows up I did a quick benchmark of this one material in graphene because it's kind of a slow let me go back because it was in the so this implementation for two to the twenty five fails it can't do it and for twenty million records which is what's written here it takes around
three minutes the other implementation with pandas was taking a couple of minutes for two to the twenty five and just took one minute for twenty million it's not a huge speed up it's like three x but code wise it's much better and actually works because you can do two to the twenty five which you couldn't do in
this case so the too long didn't read or too long didn't watch is that you just need to use Arrow write pandas UDFs usually the problem is that you need to write your UDFs as pandas UDFs that may not be straightforward I mean that may be a problem but if you are doing toPandas just enable Arrow I mean that's a team win you can get ten times improvements just for that part it
depends on your use case I mean if you are using PySpark for long enough you can kind of figure out how to do this otherwise I mean there are other solutions but I think using PySpark it's getting better and better really really fast so I mean when I started with Spark that was like three and something years ago I used Scala because back then it was like Python it's too slow even if I was a seasoned Python developer if I had to
choose now I would use PySpark because why not it's almost as fast as Scala well that some reason you can find some resources in the presentation I recommend that you become a contributor to Spark if you are not one already I mean you learn a lot of the internals and what's nice what's not I think it's a it's a good way to to be part of the community and I think it's
time for questions thank you so much for this session I had a question so when using pandas UDF it would use arrow for its optimization for the serialization and deserialization of data but in the actual execution plan of a spark job is there any way to validate that it is
actually using Apache Arrow because when you for example do toPandas or create a Spark data frame from a pandas data frame it actually states in the execution plan that it's using ArrowEvalPython but not while in the pandas UDF that I've seen I don't remember for the scalar one but for the grouped map that's FlatMapGroupsInPandas that's the physical execution node okay and the
flat map group it basically suggests that it is using Apache arrow I think so maybe it has all robots all in the name because because when I turned the basically configuration off for the arrow execution enabled it still uses the same in the execution plan the flat map group yeah that that's a good question I didn't check that I didn't try to check the plan I just check the plan when I was using a robot either and check what it's doing I mean yeah this
was something that I was working on then yeah yeah I don't know if there's a way to confirm that if you are using it from the plan only okay there should but maybe there's not sure all right okay let's let's check later maybe we can open a pull request or something yeah hi spark and Spark DataFrames and not pandas or maybe the working class is there any advantage to Arrow or is Arrow only if you trying to put anything into
pandas Koalas is going to use Arrow if you enable it and it may be enabling it by default I don't know if it's by default but internally the Koalas code has the Arrow settings for the tests at least so it went to be an advantage here because it's doing toPandas at some point any other questions I think for the walkthrough to the basic of
PySpark was very useful one of the things we found using pandas UDFs even enabling Arrow is the problem supporting complex data types as you move between Scala and Python even more I know the current workaround is put everything with JSON serializes as fast as possible do you have any updates or any clue other is moving in the error I've seen some JIRA tickets for Spark 3 that maybe are working around that but I haven't checked in a few
months so I'm not sure but yeah I mean data types I mean our data type has had a lot of JIRA tickets on its own so yeah that's a problematic one definitely so you're using not pandas what improvement you can get there I mean if you are not using the Py part of Spark it shouldn't affect you so much because it never get into Python so it's just in
================================================
FILE: glancer.cabal
================================================
cabal-version: 2.2
name: glancer
version: 0.1.0.0
-- synopsis:
-- description:
homepage: https://github.com/rberenguel/glancer#readme
license: BSD-3-Clause
license-file: LICENSE
author: Ruben Berenguel
copyright: 2021 Ruben Berenguel
category: Misc
build-type: Simple
extra-source-files: README.md

common deps
  build-depends: base >= 4.7 && < 5, text, bytestring, raw-strings-qq, megaparsec, directory, filepath, mtl, html-entities
  default-language: Haskell2010
  ghc-options: -fwarn-incomplete-patterns -fwarn-unused-imports -Werror=incomplete-patterns -fwrite-ide-info -hiedir=.hie

executable glancer
  import: deps
  hs-source-dirs: src
  main-is: Main.hs
  other-modules: Parser, Process, Html, Captions
  build-depends: optparse-applicative, hspec, containers, process, temporary,random, process-extras

test-suite tests
  import: deps
  type: exitcode-stdio-1.0
  hs-source-dirs: test, src
  main-is: Spec.hs
  build-depends: hspec, QuickCheck, hspec-megaparsec

================================================
FILE: src/Captions.hs
================================================
{-# LANGUAGE OverloadedStrings #-}

module Captions where

import Data.Coerce (coerce)
import Data.Foldable (Foldable (toList))
import Data.Sequence (fromList, mapWithIndex)
import Data.Text (intercalate, strip)
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8)
import Html (embody, heading)
import Parser (Caption (block, txt), TimeBlock (..), TimeDef (..), inBlock, secs)
import Process
  ( Dir (..),
    Url (..),
    Video (..),
    deleteImages,
  )
import System.IO
  ( hPutStrLn,
    stderr,
  )
import qualified System.Process.ByteString as B
import Text.Printf (printf)

convertToHTML :: Show a => Video -> Dir -> Either a [Caption] -> IO T.Text
convertToHTML video dir parsed = case parsed of
  Right captions -> do
    html <- captionsToHTML video dir captions
    hPutStrLn stderr "Finished"
    return html
  Left err ->
    return (T.pack (show err))

captionsToHTML :: Video -> Dir -> [Caption] -> IO T.Text
captionsToHTML video dir captions = do
  caps <- formatCaptions captions (coerce $ url video) dir
  deleteImages dir
  let embodied = embody video caps
  return (heading <> embodied)

formatCaptions :: [Caption] -> Url -> Dir -> IO T.Text
formatCaptions captions (Url url) (Dir dir) = do
  let toMap = fromList $ captionsPerSlide captions
  let listy = toList (mapWithIndex (imgCaps url (T.pack dir)) toMap)
  intercalate "\n" <$> sequence listy

imgCaps :: Integral a => T.Text -> T.Text -> a -> [Caption] -> IO T.Text
imgCaps url dir ind captions = do
  --let next = toInteger ind + 1
  img <- slideBlock url dir (toInteger ind) --next
  let toVideo = toVideoBlock url (toInteger ind)
  return (img <> caps captions <> toVideo <> "
") slideBlock :: T.Text -> T.Text -> Integer -> IO T.Text slideBlock url dir shot = do let imgPath = T.unpack (dir <> "/glancer-img" <> T.pack (printf "%04d.jpg" shot)) (_, jpg, _) <- B.readProcessWithExitCode "/bin/cat" [imgPath] "" (_, base64, _) <- B.readProcessWithExitCode "base64" [] jpg let slide = "
\n" let div = "\t
\n" let img = "\t\t\n" let close = "\t
\n" let formatted = slide <> div <> img <> close return formatted toVideoBlock :: T.Text -> Integer -> T.Text toVideoBlock url shot = do let when = T.pack $ show (shotSeconds shot 30) let title = "title='Go to video at timestamp " <> when <> "s'" let diva = "
title <> "href='" <> url <> "&t=" <> when <> "s'>⇰
" diva caps :: [Caption] -> T.Text caps captions = "\t
\n" <> intercalate "\n" (stripped captions) <> "\n\t
" where stripped captions = map (("\t\t" <>) . strip . txt) captions captionsPerSlide :: [Caption] -> [[Caption]] captionsPerSlide captions = map (capsForShot (filtering captions) 30) (shots (filtering captions)) where filtering lst = filter (not . (T.isInfixOf "" . txt)) lst shots captions = [1 .. numShots captions 30 + 1] shotSeconds :: Integer -> Integer -> Integer shotSeconds shotNumber secsPerShot = shotNumber * secsPerShot hourMinuteSeconds :: Integer -> TimeDef hourMinuteSeconds seconds = TimeDef h m s 0 where h = seconds `div` 3600 m = seconds `mod` 3600 `div` 60 s = seconds `mod` 3600 `mod` 60 shotTime :: Integer -> Integer -> TimeBlock shotTime shotNumber secsPerShot = TimeBlock startTime endTime where startTime = hourMinuteSeconds start_ endTime = hourMinuteSeconds end_ start_ = shotSeconds (shotNumber - 1) secsPerShot end_ = shotSeconds shotNumber secsPerShot numShots :: [Caption] -> Integer -> Integer numShots caps secsPerShot = secs (end . block . last $ caps) `div` secsPerShot capsForShot :: [Caption] -> Integer -> Integer -> [Caption] capsForShot caps secsPerShot shotNumber = filter (inTimeBlock shotBlock) caps where shotBlock = shotTime shotNumber secsPerShot inTimeBlock blck cap = inBlock (start . block $ cap) blck && inBlock (end . block $ cap) blck ================================================ FILE: src/Html.hs ================================================ {-# LANGUAGE OverloadedStrings #-} {-# LANGUAGE QuasiQuotes #-} module Html where import qualified Data.Text as T import qualified HTMLEntities.Text as H import Process (Title (..), Url (..), Video (..)) import Text.RawString.QQ (r) heading :: T.Text heading = T.pack [r| |] embody :: Video -> T.Text -> T.Text embody (Video (Url url) (Title title) _) body = T.intercalate "\n" [ "\t", "\t\t
", "\t\t\t

" <> H.text title <> "

", "\t\t\t

Created with glancer

", body, "", "" ] ================================================ FILE: src/Main.hs ================================================ {-# LANGUAGE OverloadedStrings #-} module Main where import Captions (convertToHTML) import Data.Coerce (coerce) import Data.Text (pack) import qualified Data.Text as T import GHC.IO.Encoding (utf8) import qualified Options.Applicative as Ap import Parser (subsP) import Process ( Dir (..), Filename (..), Url (..), Video (..), processURL, ) import System.Directory (getHomeDirectory) import System.FilePath (joinPath, splitPath, (-<.>), ()) import System.IO ( IOMode (ReadMode), hGetContents, hPutStrLn, hSetEncoding, openFile, stderr, ) import Text.Megaparsec (parse) import Prelude data CLIConfig = CLIConfig { _url :: String, _filename :: String } cliConfig :: Ap.Parser CLIConfig cliConfig = CLIConfig <$> Ap.strArgument (Ap.metavar "URL" <> Ap.help "Youtube URL") <*> Ap.strArgument (Ap.metavar "FILEPATH" <> Ap.help "HTML file name (don't add extension)") getFullPath :: FilePath -> IO FilePath getFullPath s = case splitPath s of "~/" : t -> joinPath . 
(: t) <$> getHomeDirectory
  _ -> return s

start :: CLIConfig -> IO ()
start (CLIConfig url filename) = do
  hPutStrLn stderr ("Looking for video in " ++ url)
  (dir_, video) <- processURL (Url $ T.pack url)
  let videoName = coerce (file video)
  let dir = coerce dir_
  capsPath <- getFullPath (dir </> videoName -<.> "en.vtt")
  handle <- openFile capsPath ReadMode
  hSetEncoding handle utf8
  contents <- hGetContents handle
  let parsed = parse subsP "" (pack contents)
  html <- convertToHTML video dir_ parsed
  destinationPath <- getFullPath (filename -<.> "html")
  hPutStrLn stderr ("Writing html to " ++ destinationPath)
  writeFile destinationPath (T.unpack html)
  hPutStrLn stderr ("Data written to " ++ destinationPath)

main :: IO ()
main = do
  Main.start =<< Ap.execParser opts
  where
    opts =
      Ap.info
        (cliConfig Ap.<**> Ap.helper)
        ( Ap.fullDesc
            <> Ap.progDesc "Glancer"
            <> Ap.header "Why not"
        )

================================================
FILE: src/Parser.hs
================================================
{-# LANGUAGE OverloadedStrings #-}

module Parser where

import Control.Monad (void)
import qualified Data.Text as T
import Data.Void (Void)
import Text.Megaparsec
  ( MonadParsec (eof, lookAhead, try),
    Parsec,
    anySingle,
    manyTill,
    (<|>),
  )
import Text.Megaparsec.Char
  ( char,
    eol,
    numberChar,
    space,
    spaceChar,
    string,
  )

type Parser = Parsec Void T.Text

data TimeDef = TimeDef
  { hours :: Integer,
    minutes :: Integer,
    seconds :: Integer,
    cents :: Integer
  }
  deriving (Show, Eq)

secs :: TimeDef -> Integer
secs td = 3600 * hours td + 60 * minutes td + seconds td + 1

instance Ord TimeDef where
  compare a b = compare (secs a) (secs b)

data TimeBlock = TimeBlock
  { start :: TimeDef,
    end :: TimeDef
  }
  deriving (Show, Eq)

inBlock :: TimeDef -> TimeBlock -> Bool
inBlock td tb = (secs . start $ tb) <= secstd && secstd <= (secs .
  end $ tb)
  where
    secstd = secs td

data Caption = Caption
  { block :: TimeBlock,
    txt :: T.Text
  }
  deriving (Show, Eq)

hourBlockP :: Parser TimeDef
hourBlockP = do
  hours <- colonSeparated
  minutes <- colonSeparated
  seconds <- read <$> manyTill numberChar (char '.')
  millis <- read <$> manyTill numberChar (void spaceChar <|> try (lookAhead (void eol)))
  return (TimeDef hours minutes seconds millis)
  where
    colonSeparated = read <$> manyTill numberChar (char ':')

arrowP :: Parser T.Text
arrowP = do
  string "-->"

anyChar :: Parser Char
anyChar = anySingle

timeLineP :: Parser TimeBlock
timeLineP = do
  start <- hourBlockP
  arrowP
  space
  end <- hourBlockP
  manyTill anyChar eol
  return (TimeBlock start end)

captionP :: Parser Caption
captionP = do
  space
  block <- timeLineP
  caption <- T.pack <$> manyTill anyChar (try $ lookAhead (void timeLineP <|> void eof))
  return (Caption block caption)

subsP :: Parser [Caption]
subsP = do
  string "WEBVTT"
  manyTill anyChar (try $ lookAhead timeLineP)
  space
  manyTill captionP eof

================================================
FILE: src/Process.hs
================================================
{-# LANGUAGE OverloadedStrings #-}

module Process where

import Control.Monad (replicateM)
import qualified Data.ByteString.Char8 as B
import Data.Coerce (coerce)
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import System.Directory (getTemporaryDirectory)
import System.FilePath ((-<.>), (</>))
import System.IO
  ( hPutStrLn,
    stderr,
  )
import System.Process (callCommand, callProcess, readProcess)
import System.Random (Random (randomRIO))

newtype Url = Url T.Text

newtype Title = Title T.Text

newtype Id = Id T.Text

newtype Dir = Dir String

newtype Filename = Filename String

data Video = Video
  { url :: Url,
    title :: Title,
    file :: Filename
  }

getTitle :: Url -> IO Title
getTitle (Url url) = do
  title <- readProcess "yt-dlp" ["-e", "--no-warnings", "--no-playlist", T.unpack url] []
  return (Title $ TE.decodeUtf8 $ B.pack title) -- This deletes badly encoded characters, which is better than having them but worse than properly encoding them. Help appreciated

getId :: Url -> IO Id
getId (Url url) = do
  id <- readProcess "yt-dlp" ["--get-id", "--no-warnings", "--no-playlist", T.unpack url] []
  return (Id $ T.pack id)

youtubeURL :: Id -> Url
youtubeURL (Id id) = Url ("https://www.youtube.com/watch?v=" <> id)

generateVideo :: Video -> Dir -> IO ()
generateVideo (Video (Url url) _ (Filename videoName)) (Dir dir) = do
  callProcess command arguments
  where
    command = "yt-dlp"
    arguments = ["-q", "--no-playlist", "-f mp4", coerce ("-o" <> dir </> videoName -<.> "mp4"), "--sub-langs", "en", "--write-auto-sub", "--write-sub", "--no-warnings", "-k", "--no-cache-dir", T.unpack url]

args :: Dir -> Filename -> [String] -> String -> [String]
args (Dir dir) (Filename videoName) selector suffix =
  [ "-i", coerce (dir </> videoName -<.> "mp4")]
    ++ selector
    ++ [ coerce (dir </> "glancer-img" <> suffix -<.> "jpg"),
         "-hide_banner",
         "-loglevel",
         "panic"
       ]

generateShots :: Dir -> Filename -> IO ()
generateShots dir video = do
  callProcess "ffmpeg" (args dir video ["-vf", "fps=1/30"] "%04d")
  callProcess "ffmpeg" (args dir video ["-vframes", "1", "-ss", "3"] "0000" )

deleteVideo :: Dir -> Filename -> IO ()
deleteVideo (Dir dir) (Filename videoName) = callProcess "rm" [coerce (dir </> videoName -<.> "mp4")]

deleteImages :: Dir -> IO ()
deleteImages (Dir dir) = callCommand $ T.unpack ("rm " <> T.pack (dir </> "glancer-img*"))

processURL :: Url -> IO (Dir, Video)
processURL url = do
  dir <- Dir <$> getTemporaryDirectory
  videoName <- Filename <$> replicateM 10 (randomRIO ('a', 'z'))
  title <- getTitle url
  hPutStrLn stderr $ T.unpack ("The video is titled '" <> T.strip (coerce title) <> "'")
  id <- getId url
  let yourl = youtubeURL id
  hPutStrLn stderr $ T.unpack (T.strip ("Seems like the video is in " <> coerce yourl))
  let video = Video yourl title videoName
  hPutStrLn stderr "Downloading video (this may take a while)"
  generateVideo video dir
  hPutStrLn stderr (T.unpack ("Downloaded video to " <> (T.pack . coerce) dir <> (T.pack . coerce) videoName <> "(.mp4|en.vtt)"))
  hPutStrLn stderr "Generating still images from video (this may take a while)"
  generateShots dir videoName
  hPutStrLn stderr "Generated images"
  deleteVideo dir videoName
  return (dir, video)

================================================
FILE: stack.yaml
================================================
# This file was automatically generated by 'stack init'
#
# Some commonly used options have been documented as comments in this file.
# For advanced use and comprehensive documentation of the format, please see:
# https://docs.haskellstack.org/en/stable/yaml_configuration/

# Resolver to choose a 'specific' stackage snapshot or a compiler version.
# A snapshot resolver dictates the compiler version and the set of packages
# to be used for project dependencies. For example:
#
# resolver: lts-3.5
# resolver: nightly-2015-09-21
# resolver: ghc-7.10.2
#
# The location of a snapshot can be provided as a file or url. Stack assumes
# a snapshot provided as a file might change, whereas a url resource does not.
#
# resolver: ./custom-snapshot.yaml
# resolver: https://example.com/snapshots/2018-01-01.yaml
resolver: lts-16.31

# User packages to be built.
# Various formats can be used as shown in the example below.
#
# packages:
# - some-directory
# - https://example.com/foo/bar/baz-0.0.2.tar.gz
#   subdirs:
#   - auto-update
#   - wai
packages:
- .
# Dependency packages to be pulled from upstream that are not in the resolver.
# These entries can reference officially published versions as well as
# forks / in-progress versions pinned to a git hash. For example:
#
# extra-deps:
# - acme-missiles-0.3
# - git: https://github.com/commercialhaskell/stack.git
#   commit: e7b331f14bcffb8367cd58fbfc8b40ec7642100a
#
# extra-deps: []

# Override default flag values for local packages and extra-deps
# flags: {}

# Extra package databases containing global packages
# extra-package-dbs: []

# Control whether we use the GHC we find on the path
# system-ghc: true
#
# Require a specific version of stack, using version ranges
# require-stack-version: -any # Default
# require-stack-version: ">=2.3"
#
# Override the architecture used by stack, especially useful on Windows
# arch: i386
# arch: x86_64
#
# Extra directories used by stack for building
# extra-include-dirs: [/path/to/dir]
# extra-lib-dirs: [/path/to/dir]
#
# Allow a newer minor version of GHC than the snapshot specifies
# compiler-check: newer-minor

================================================
FILE: test/Spec.hs
================================================
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE QuasiQuotes #-}

import Data.Text (pack)
import qualified Parser as P
import Test.Hspec ( hspec, describe, it, Spec )
import Test.Hspec.Megaparsec ( shouldParse )
import Text.Megaparsec ( parse )
import Text.RawString.QQ ( r )

main :: IO ()
main = hspec spec

hourBlock = "01:07:46.029 "

parsedHourBlock = P.TimeDef 1 7 46 29

timeBlock = pack [r|01:07:46.029 --> 01:07:52.319 align:start position:0%
|]

parsedTimeBlock = P.TimeBlock parsedHourBlock (P.TimeDef 1 7 52 319)

captionBlock = pack [r|
01:07:46.029 --> 01:07:52.319 align:start position:0%
really look into but we have considered
|]

parsedCaptionBlock = P.Caption parsedTimeBlock "really look into but we have considered\n"

spec :: Spec
spec = do
  describe "arrowP" $ do
    it "parses a time arrow" $ parse P.arrowP "" "-->" `shouldParse` "-->"
  describe "hourBlockP" $ do
    it "parses a hourBlock" $ parse P.hourBlockP "" hourBlock `shouldParse` parsedHourBlock
  describe "timeBlockP" $ do
    it "parses a timeBlock" $ parse
      P.timeLineP "" timeBlock `shouldParse` parsedTimeBlock
  describe "captionP" $ do
    it "parses a captionBlock" $ parse P.captionP "" captionBlock `shouldParse` parsedCaptionBlock

================================================
FILE: this.code-workspace
================================================
{
  "folders": [
    {
      "path": "."
    }
  ],
  "settings": {
    "files.watcherExclude": {
      "**/target": true
    }
  }
}