Full Code of icy/google-group-crawler for AI

Repository: icy/google-group-crawler
Branch: master
Commit: bc1d37a5a6f9
Files: 7
Total size: 23.5 KB

Directory structure:
google-group-crawler/

├── .travis.yml
├── CHANGELOG.md
├── README.md
├── contrib/
│   └── README.md
├── crawler.sh
└── tests/
    ├── curl-options.txt.enc
    └── tests.sh

================================================
FILE CONTENTS
================================================

================================================
FILE: .travis.yml
================================================
sudo: required
language:
- bash
script:
- sudo apt-get install shellcheck
- shellcheck *.sh
- ( cd tests/ && openssl aes-256-cbc -K $encrypted_4d6c5775c90a_key -iv $encrypted_4d6c5775c90a_iv -in curl-options.txt.enc -out curl-options.txt -d ;)
- ./tests/tests.sh


================================================
FILE: CHANGELOG.md
================================================
## v2.0.0

* Using `curl` instead of `wget`
* Fix #36 (unable to read cookie file)
* Fix #34 (`413 Request Entity Too Large`)

## v1.2.2

* Loop detection: #24.
* Add test cases.
* Update documentation (Cookie issue.)
* Minor code improvements.
* Group with category support (#28, Thanks @LeeKevin)

## v1.2.1

* Fix bugs: #6 (compatibility issue),
    #13 (so large group),
    #16 (email exporting and third-party license issue)
* Fix script shebang.
* Google organization support.
* Ensure group name is in lowercase.
* Minor scripting improvements.

## v1.2.0

* Drop the use of the `lynx` program. `wget` handles all downloads now.
* Accept `_WGET_OPTIONS` environment to control `wget` commands.
* Can work with private groups thanks to `_WGET_OPTIONS` environment.
* Rename script (`craw.sh` becomes `crawler.sh`.)
* Output important variables to the output script.
* Update documentation (`README.md`.)

## v1.0.1

* Provide fancy agent to `wget` and `lynx` command.
* Fix wrong URL of `rss` feed.
* Use `set -u` to avoid unbound variable.
* Fix display charset of `lynx` program. See #3.

## v1.0.0

* The first public version.


================================================
FILE: README.md
================================================
WARNING: This project doesn't work anymore and is deprecated.
**Reason:** Google has completely deprecated the Ajax crawling support this script relies on.
  See also https://github.com/icy/google-group-crawler/issues/42#issuecomment-889013487

[![Build Status](https://travis-ci.org/icy/google-group-crawler.svg?branch=master)](https://travis-ci.org/icy/google-group-crawler)

## Download all messages from Google Group archive

`google-group-crawler` is a `Bash-4` script to download all (original)
messages from a Google Group archive.
Private groups require a cookie string or file.
Groups with adult content are not yet supported.

* [Installation](#installation)
* [Usage](#usage)
  * [The first run](#the-first-run)
  * [Update your local archive thanks to rss feed](#update-your-local-archive-thanks-to-rss-feed)
  * [Private group or Group hosted by an organization](#private-group-or-group-hosted-by-an-organization)
  * [The hook](#the-hook)
  * [What to do with your local archive](#what-to-do-with-your-local-archive)
  * [Rescan the whole local archive](#rescan-the-whole-local-archive)
  * [Known problems](#known-problems)
* [Contributions](#contributions)
* [Similar projects](#similar-projects)
* [License](#license)
* [Author](#author)
* [For script hackers](#for-script-hackers)

## Installation

The script requires `bash-4`, `sort`, `curl`, `sed` and `awk`.

Make the script executable with `chmod 755` and put it somewhere in your path
(e.g., `/usr/local/bin/`).
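
For example, a typical setup looks like this (a sketch; the target directory
is an assumption, use any directory in your `PATH`):

    git clone https://github.com/icy/google-group-crawler
    cd google-group-crawler/
    chmod 755 crawler.sh
    cp crawler.sh /usr/local/bin/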

The script may not work in a `Windows` environment, as reported in
https://github.com/icy/google-group-crawler/issues/26.

## Usage

### The first run

For a private group, please
[prepare your cookies file](#private-group-or-group-hosted-by-an-organization) first.

    # export _CURL_OPTIONS="-v"       # use curl options to provide e.g, cookies
    # export _HOOK_FILE="/some/path"  # provide a hook file, see in #the-hook

    # export _ORG="your.company"      # required, if you are using Gsuite
    export _GROUP="mygroup"           # specify your group
    ./crawler.sh -sh                  # first run for testing
    ./crawler.sh -sh > curl.sh        # save your script
    bash curl.sh                      # downloading mbox files

You can execute the `curl.sh` script multiple times: `curl` quickly skips
any files that have already been fully downloaded.

### Update your local archive thanks to RSS feed

After you have an archive from the first run you only need to add the latest
messages as shown in the feed. You can do that with `-rss` option and the
additional `_RSS_NUM` environment variable:

    export _RSS_NUM=50                # (optional. See Tips & Tricks.)
    ./crawler.sh -rss > update.sh     # using rss feed for updating
    bash update.sh                    # download the latest posts

Running this regularly is a convenient way to keep your local archive up to date.
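
For example, a nightly cron entry along these lines keeps the archive fresh
(a sketch; the paths, group name and schedule are assumptions to adapt to your setup):

    # run every night at 02:00; adjust paths and group name to your setup
    0 2 * * * cd /path/to/archives && _GROUP="mygroup" /usr/local/bin/crawler.sh -rss > update.sh && bash update.sh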

### Private group or Group hosted by an organization

To download messages from a private group or a group hosted by your organization,
you need to provide some cookie information to the script. In the past,
the script used `wget` and the Netscape cookie file format;
it now uses `curl` with a cookie string and a configuration file.

0. Open Firefox, press F12 to enable Debug mode and select Network tab
   from the Debug console of Firefox. (You may find a similar way for
   your favorite browser.)
1. Log in to your (testing) Google account and access your group.
   For example
     https://groups.google.com/forum/?_escaped_fragment_=categories/google-group-crawler-public
   (replace `google-group-crawler-public` with your group name).
   Make sure you can read some content at your own group URI.
2. Now, from the Network tab in the Debug console, select the request
   and choose `Copy -> Copy Request Headers`. The result contains a lot of
   things; paste it into your text editor
   and keep only the `Cookie` part.
3. Now prepare a file `curl-options.txt` as below

        user-agent = "Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0"
        header = "Cookie: <snip>"

   Of course, replace the `<snip>` part with your own cookie strings.
   See `man curl` for more details of the file format.

4. Specify your cookie file via `_CURL_OPTIONS`:

        export _CURL_OPTIONS="-K /path/to/curl-options.txt"

   Now every hidden group can be downloaded :)
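
Putting it all together, a first run against a group hosted by an organization
might look like this (`your.company`, `mygroup` and the option-file path are placeholders):

    export _ORG="your.company"
    export _GROUP="mygroup"
    export _CURL_OPTIONS="-K /path/to/curl-options.txt"
    ./crawler.sh -sh > curl.sh
    bash curl.sh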

### The hook

If you want to execute a `hook` command after an `mbox` file is downloaded,
you can do so as described below.

1. Prepare a Bash script file that contains a definition of the `__curl_hook`
   function. The first argument specifies the output filename, and the
   second argument specifies a URL. For example, here is a simple hook:

        # $1: output file
        # $2: url (https://groups.google.com/forum/message/raw?msg=foobar/topicID/msgID)
        __curl_hook() {
          if [[ "$(stat -c %b "$1")" == 0 ]]; then
            echo >&2 ":: Warning: empty output '$1'"
          fi
        }

    In this example, the `hook` will check if the output file is empty,
    and send a warning to the standard error device.

2. Set the environment variable `_HOOK_FILE` to the path
   of your hook file. For example,

        export _GROUP=archlinuxvn
        export _HOOK_FILE=$HOME/bin/curl.hook.sh

   The hook file will now be included in the future output of the
   `crawler.sh -sh` and `crawler.sh -rss` commands.
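
With both variables exported, the workflow from the first run stays the same;
the generated script simply calls your `__curl_hook` after each newly downloaded file:

    ./crawler.sh -sh > curl.sh
    bash curl.sh                      # __curl_hook runs after every new mbox file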

### What to do with your local archive

The downloaded messages are found under `$_GROUP/mbox/*`.

They are in `RFC 822` format (possibly with obfuscated email addresses)
and can easily be converted to `mbox` format before being imported
into your email client (`Thunderbird`, `claws-mail`, etc.)

You can also use the [mhonarc](https://www.mhonarc.org/) utility to convert
the downloaded messages to `HTML` files.

See also

* https://github.com/icy/google-group-crawler/issues/15#issuecomment-221018338
* https://github.com/icy/google-group-crawler/issues/35#issuecomment-580659966
* My script https://github.com/icy/bashy/blob/master/libs/raw2mbox.sh
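
Roughly speaking, an `mbox` file is just the raw messages concatenated, each
preceded by a `From ` separator line. Here is a minimal sketch of the idea only;
the `raw2mbox.sh` script linked above handles the details properly:

    # a minimal sketch: a real converter (like raw2mbox.sh above) also escapes
    # body lines starting with "From " and takes the envelope date from each
    # message's own Date: header
    : > archive.mbox
    for f in "$_GROUP"/mbox/m.*; do
      printf 'From google-group-crawler %s\n' "$(date)" >> archive.mbox
      cat "$f" >> archive.mbox
      printf '\n' >> archive.mbox
    done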

### Rescan the whole local archive

Sometimes you may need to rescan or re-download all messages.
This can be done by removing all temporary files:

    rm -fv $_GROUP/threads/t.*    # this is a must
    rm -fv $_GROUP/msgs/m.*       # see also Tips & Tricks

or you can use the `_FORCE` option:

    _FORCE="true" ./crawler.sh -sh

Another option is to delete all files under the `$_GROUP/` directory.
As usual, remember to back up before you delete anything.

### Known problems

1. Fails on groups with adult content (https://github.com/icy/google-group-crawler/issues/14).
2. This script may not recover original emails from public groups.
   When you use valid cookies, you may see the original emails
   if you are a manager of the group. See also https://github.com/icy/google-group-crawler/issues/16.
3. When cookies are used, the original emails may be recovered,
   and you must filter them before making your archive public.
4. The script can't fetch from a group whose name contains special characters (e.g., `+`).
   See also https://github.com/icy/google-group-crawler/issues/30.

## Contributions

1. `parallel` support: @Pikrass has a script to download messages in parallel.
  It's discussed in the ticket https://github.com/icy/google-group-crawler/issues/32.
  The script: https://gist.github.com/Pikrass/f8462ff8a9af18f97f08d2a90533af31
2. `raw access denied`: @alexivkin mentioned he could use the `print` function
  to work around the issue. See it here:
  https://github.com/icy/google-group-crawler/issues/29#issuecomment-468810786

## Similar projects

* (website) [Google Takeout - Download all info for any groups you own](https://takeout.google.com/)
* (Shell/curl) [ggscrape - Download emails from a Google Group. Rescue your archives](https://git.scuttlebot.io/%25nkOkiGF0Dd321GmNqs6aW%2BWHaH9Uunq4m8dVfJuU%2Bps%3D.sha256)
* (Python/Webdriver) [scrape_google_groups.py  - A simple script to scrape a google group](https://gist.github.com/punchagan/7947337)
* (Python/webscraping.webkit) [gg-scrape - Liberate you data from google groups](https://github.com/jrholliday/gg-scrape)
* (Python/urllib) [gg_scraper](https://gitlab.com/mcepl/gg_scraper)
* (PHP/libcurl) [scraping-google-groups](http://saturnboy.com/2010/03/scraping-google-groups/)

## License

This work is released under the terms of the MIT license.

## Author

This script is written by Anh K. Huynh.

He wrote this script because he couldn't solve the problem with
`nodejs`, `phantomjs` or `Watir`.

New web technology just makes life harder, doesn't it?

## For script hackers

Please skip this section unless you really know how to work with `Bash` and shells.

0. If you clean up your files _(as below)_, you may notice that re-downloading
   all files is very slow. Consider using
   the `-rss` option instead. This option fetches data from an `rss` link.

   It's recommended to use the `-rss` option for daily updates. By default,
   the number of items is 50. You can change it via the `_RSS_NUM` variable.
   However, don't use a very big number, because Google will ignore it.

1. Because Topics is a FIFO list, you only need to remove the last file.
   The script will re-download the last item, and if there is a new page,
   that page will be fetched.

        # print the last (highest-numbered) page file of each topic
        ls $_GROUP/msgs/m.* \
        | sed -e 's#\.[0-9]\+$##g' \
        | sort -u \
        | while read f; do
            last_item="$f.$( \
              ls $f.* \
              | sed -e 's#^.*\.\([0-9]\+\)#\1#g' \
              | sort -n \
              | tail -1 \
            )";
            echo $last_item;
          done

2. The list of threads is a LIFO list. If you want to rescan your list,
   you will need to delete all files under `$_D_OUTPUT/threads/`
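
   For example, reusing the command from the
   [rescan section](#rescan-the-whole-local-archive) above:

        rm -fv "$_D_OUTPUT"/threads/t.*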

3. You can set the timestamps of the `mbox` output files from their `Date:` headers, as below

        ls $_GROUP/mbox/m.* \
        | while read -r FILE; do
            date="$( \
              grep '^Date:' "$FILE" \
              | head -1 \
              | sed -e 's#^Date: ##g' \
            )";
            touch -d "$date" "$FILE";
          done

    This will be very useful, for example, when you want to use the
    `mbox` files with `mhonarc`.


================================================
FILE: contrib/README.md
================================================

## Fix dot in email addresses

By default, emails exported by the tool are not original, because
Google's anti-spam mechanism removes some characters from them. For example:

    this.is.my.email@example.net    --> this.....@example.net

The `discourse` project has a great script to fix this problem, as seen at

https://github.com/discourse/discourse/blob/648bcb6432ee1fbca0fc9d45c25c3d114f2a0892/script/import_scripts/mbox.rb

This script was imported into the `google-group-crawler` project, but it
was removed on Apr 24th, 2017 due to a license problem, as described here:

https://github.com/icy/google-group-crawler/issues/16#issuecomment-292509711

Removing it is the best way to avoid duplication and future confusion.


================================================
FILE: crawler.sh
================================================
#!/usr/bin/env bash
#
# Purpose: Make a backup of Google Group [Google Group Crawler]
# Author : Anh K. Huynh
# Date   : 2013 Sep 22nd
# License: MIT license
#
# Copyright (c) 2013 - 2020 Ky-Anh Huynh <kyanh@viettug.org>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.

# For your hack ;)
#
# Forum, list of all threads (topics), LIFO
#   https://groups.google.com/forum/?_escaped_fragment_=forum/archlinuxvn
#
# Topic, list of all messages in a thread (topic), FIFO
#   https://groups.google.com/forum/?_escaped_fragment_=topic/archlinuxvn/wXRTQFqBtlA
#
# Raw, a MH mail message:
#   https://groups.google.com/forum/message/raw?msg=archlinuxvn/_atKwaIFVGw/rnwjMJsA4ZYJ
#
# Specification:
#
#   1. https://developers.google.com/search/docs/ajax-crawling/docs/specification
#   2. (Deprecation notice) https://webmasters.googleblog.com/2015/10/deprecating-our-ajax-crawling-scheme.html
#
# Atom link
#
#   https://groups.google.com/forum/feed/archlinuxvn/msgs/atom.xml?num=100
#   https://groups.google.com/forum/feed/archlinuxvn/topics/atom.xml?num=50
#
#   Don't use a very big `num`. Google knows that and changes to 16.
#   The bad thing is that Google doesn't provide a link to a post.
#   It only provides link to a topic. Hence for two links above you
#   would get the same result: links to your topics...
#
# Rss link
#
#   https://groups.google.com/forum/feed/archlinuxvn/msgs/rss.xml?num=50
#   https://groups.google.com/forum/feed/archlinuxvn/topics/rss.xml?num=50
#
#   Rss link contains link to topic. That's great.
#

# Strip the Google Groups base URL (including the optional organization part
# and the `?_escaped_fragment_=` prefix), leaving a short form for log messages.
_short_url() {
  printf '%s\n' "${*//https:\/\/groups.google.com${_ORG:+\/a\/$_ORG}\/forum\/\?_escaped_fragment_=/}"
}

# Fetch a page and print all http(s) links found in it, one per line, de-duplicated.
_links_dump() {
  # shellcheck disable=2086
  curl \
    --user-agent "$_USER_AGENT" \
    $_CURL_OPTIONS \
    -Lso- "$@" \
  | sed -e "s#['\"]#\\"$'\n#g' \
  | grep -E '^https?://' \
  | sort -u
}

# Download a paged listing: fetch the page at $2, save its links to "$1.0",
# then follow the "next page" link into "$1.1", "$1.2", ... until no next page is found.
# $1: output file [/path/to/directory/prefix]
# $2: url
_download_page() {
  local _f_output
  local _url="$2"
  local _surl=
  local __

  _surl="$(_short_url "$_url")"
  __=0
  while :; do
    _f_output="$1.${__}"
    if [[ -f "$_f_output" ]]; then
      if [[ -n "${_FORCE:-}" ]]; then
        echo >&2 ":: Updating '$_f_output' with '${_surl}'"
      else
        echo >&2 ":: Skipping '$_f_output' (downloaded with '${_surl}')"
        if ! _url="$(grep -E -- "_escaped_fragment_=((forum)|(topic)|(categories))/$_GROUP" "$_f_output")"; then
          break
        fi
        (( __ ++ ))
        continue
      fi
    else
      echo >&2 ":: Creating '$_f_output' with '${_surl}'"
    fi

    {
      echo >&2 ":: Fetching data from '$_url'..."
      _links_dump "$_url"
    } \
    | grep "https://" \
    | grep "/$_GROUP" \
    | awk '{print $NF}' \
    > "$_f_output"

    # Loop detection. See also
    #   https://github.com/icy/google-group-crawler/issues/24
    # FIXME: 2020/04: This isn't necessary after Google has changed something
    if [[ $__ -ge 1 ]]; then
      if diff "$_f_output" "$1.$(( __ - 1 ))" >/dev/null 2>&1; then
        echo >&2 ":: =================================================="
        echo >&2 ":: Loop detected. Your cookie may not work correctly."
        echo >&2 ":: You may want to generate new cookie file"
        echo >&2 ":: and/or remove all '#HttpOnly_' strings from it."
        echo >&2 ":: =================================================="
        exit 125
      fi
    fi

    if ! _url="$(grep -E -- "_escaped_fragment_=((forum)|(topic)|(categories))/$_GROUP" "$_f_output")"; then
      break
    fi

    (( __ ++ ))
  done
}

# Main routine
_main() {
  mkdir -pv "$_D_OUTPUT"/{threads,msgs,mbox}/ 1>&2 || exit 1

  echo >&2 ":: Downloading all topics (thread) pages..."
  # Each page contains a bunch of
  # topics sorted by time (the latest updated topic comes first.)
  #
  #  t.0 the first page   (the latest update)
  #  t.1 the second page
  #  (and so on)
  #
  _download_page "$_D_OUTPUT/threads/t" \
    "https://groups.google.com${_ORG:+/a/$_ORG}/forum/?_escaped_fragment_=categories/$_GROUP"

  echo >&2 ":: Downloading list of all messages..."
  #
  # Each thread (topic) file (`t.<number>`) contains a list of messages
  # sorted by time (the latest updated message comes first.)
  #
  #   t.0
  #     msg/m.{topic_id}.0  (the latest update)
  #     msg/m.{topic_id}.1
  #     (and so on)
  #
  #   t.1
  #     msg/m.{topic_id}.0  (the latest update [in this topic])
  #     msg/m.{topic_id}.1
  #     (and so on)
  #
  find "$_D_OUTPUT"/threads/ -type f -iname "t.[0-9]*" -exec cat {} \; \
  | grep '^https://' \
  | grep "/d/topic/$_GROUP" \
  | sort -u \
  | sed -e 's#/d/topic/#/forum/?_escaped_fragment_=topic/#g' \
  | while read -r _url; do
      _topic_id="${_url##*/}"
      _download_page "$_D_OUTPUT/msgs/m.${_topic_id}" "$_url"
      #                                 <--+------->
    done #                                 |
  #                                       /
  # FIXME: Sorting issue here -----------'

  echo >&2 ":: Gnerating command to download raw messages..."
  find "$_D_OUTPUT"/msgs/ -type f -iname "m.*" -exec cat {} \; \
  | grep '^https://' \
  | grep '/d/msg/' \
  | sort -u \
  | sed -e 's#/d/msg/#/forum/message/raw?msg=#g' \
  | while read -r _url; do
      _id="$(echo "$_url"| sed -e "s#.*=$_GROUP/##g" -e 's#/#.#g')"
      echo "__curl__ \"$_D_OUTPUT/mbox/m.${_id}\" \"$_url\""
    done
}

# Generate download commands for the latest messages via the group's RSS feed.
_rss() {
  mkdir -pv "$_D_OUTPUT"/{threads,msgs,mbox}/ 1>&2 || exit 1

  {
    echo >&2 ":: Fetching RSS data..."
    # shellcheck disable=2086
    curl \
      --user-agent "$_USER_AGENT" \
      $_CURL_OPTIONS \
      -Lso- "https://groups.google.com${_ORG:+/a/$_ORG}/forum/feed/$_GROUP/msgs/rss.xml?num=${_RSS_NUM}"
  } \
  | grep '<link>' \
  | grep 'd/msg/' \
  | sort -u \
  | sed \
      -e 's#<link>##g' \
      -e 's#</link>##g' \
  | while read -r _url; do
      # shellcheck disable=SC2001
      _id_origin="$(sed -e "s#.*$_GROUP/##g" <<<"$_url")"
      _url="https://groups.google.com${_ORG:+/a/$_ORG}/forum/message/raw?msg=$_GROUP/$_id_origin"
      _id="${_id_origin//\//.}"
      echo "__curl__ \"$_D_OUTPUT/mbox/m.${_id}\" \"$_url\""
    done
}

# $1: Output File
# $2: The URL
__curl__() {
  if [[ ! -f "$1" ]]; then
    >&2 echo ":: Downloading '$1'..."
    # shellcheck disable=2086
    curl -Ls \
      -A "$_USER_AGENT" \
      $_CURL_OPTIONS \
      "$2" -o "$1"
    __curl_hook "$1" "$2"
  else
    >&2 echo ":: Skipping '$1'..."
  fi
}

# $1: Output File
# $2: The URL
__curl_hook() {
  :
}

__sourcing_hook() {
  # shellcheck disable=1090
  source "$1" \
  || {
    echo >&2 ":: Error occurred when loading hook file '$1'"
    exit 1
  }
}

# Print the header of the generated script: exported variables, the __curl_hook
# definition (plus the optional user hook file), and the __curl__ helper.
_ship_hook() {
  echo "#!/usr/bin/env bash"
  echo ""
  echo "export _ORG=\"\${_ORG:-$_ORG}\""
  echo "export _GROUP=\"\${_GROUP:-$_GROUP}\""
  echo "export _D_OUTPUT=\"\${_D_OUTPUT:-$_D_OUTPUT}\""
  echo "export _USER_AGENT=\"\${_USER_AGENT:-$_USER_AGENT}\""
  echo "export _CURL_OPTIONS=\"\${_CURL_OPTIONS:-$_CURL_OPTIONS}\""
  echo ""
  declare -f __curl_hook

  if [[ -f "${_HOOK_FILE:-}" ]]; then
    declare -f __sourcing_hook
    echo "__sourcing_hook $_HOOK_FILE"
  elif [[ -n "${_HOOK_FILE:-}" ]]; then
    echo >&2 ":: ${FUNCNAME[0]}: _HOOK_FILE ($_HOOK_FILE) does not exist."
    exit 1
  fi

  declare -f __curl__
}

_help() {
  echo "Please visit https://github.com/icy/google-group-crawler for details."
}

_has_command() {
  # well, this is exactly `for cmd in "$@"; do`
  for cmd do
    command -v "$cmd" >/dev/null 2>&1 || return 1
  done
}

_check() {
  local _requirements=
  _requirements="curl sort awk sed diff"
  # shellcheck disable=2086
  _has_command $_requirements \
  || {
    echo >&2 ":: Some program is missing. Please make sure you have $_requirements."
    return 1
  }

  if [[ -z "$_GROUP" ]]; then
    echo >&2 ":: Please use _GROUP environment variable to specify your google group"
    return 1
  fi
}

# An empty function. Can you tell me why it is here?
__main__() { :; }

set -u

_ORG="${_ORG:-}"
_GROUP="${_GROUP:-}"
_D_OUTPUT="${_D_OUTPUT:-./${_ORG:+${_ORG}-}${_GROUP}/}"
# _GROUP="${_GROUP//+/%2B}"
_USER_AGENT="${_USER_AGENT:-Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0}"
_CURL_OPTIONS="${_CURL_OPTIONS:-}"
_RSS_NUM="${_RSS_NUM:-50}"

export _ORG _GROUP _D_OUTPUT _USER_AGENT _CURL_OPTIONS _RSS_NUM

_check || exit

case ${1:-} in
"-h"|"--help")    _help;;
"-sh"|"--bash")   _ship_hook; _main;;
"-rss")           _ship_hook; _rss;;
*)                echo >&2 ":: Use '-h' or '--help' for more details";;
esac


================================================
FILE: tests/tests.sh
================================================
#!/usr/bin/env bash

_test_public_1() {
  export _GROUP="${_GROUP:-google-group-crawler-public}"
  export _D_OUTPUT="${_D_OUTPUT:-./${_ORG:+${_ORG}-}${_GROUP}/}"
  export _F_OUTPUT="${_F_OUTPUT:-./${_ORG:+${_ORG}-}${_GROUP}.sh}"
  export _GREP_MESSAGE="${_GREP_MESSAGE:-CICD passed}"

  echo >&2 ""
  echo >&2 ":: --> Testing Public Group $_GROUP (ORG: ${_ORG:-<empty>}) <--"
  echo >&2 ":: --> _CURL_OPTIONS: ${_CURL_OPTIONS:-<empty>}"
  echo >&2 ""
  echo >&2 ":: Removing $PWD/$_D_OUTPUT"
  rm -rf "$PWD/$_D_OUTPUT/"
  echo >&2 ":: Generating $_F_OUTPUT..."
  crawler.sh -sh > "$_F_OUTPUT" || return 1
  bash -n "$_F_OUTPUT" || return 1
  echo >&2 ":: Executing $_F_OUTPUT..."
  bash -x "$_F_OUTPUT" || return 1
  crawler.sh -rss || return 1

  grep -Ri "Message-Id:" "$_D_OUTPUT/mbox/" \
  || {
    echo >&2 ":: Unable to find any mail messages from $_D_OUTPUT/mbox/"
    return 1
  }

  grep -Ri "$_GREP_MESSAGE" "$_D_OUTPUT/mbox/" \
  || {
    echo >&2 ":: Unable to find string 'CICD passed' from $_D_OUTPUT/mbox/"
    return 1
  }
}

_test_reset() {
  unset _ORG
  unset _D_OUTPUT
  unset _F_OUTPUT
  unset _GREP_MESSAGE
  unset _CURL_OPTIONS
}

_test_public_1_with_cat() {
  (
    _test_reset
    export _GROUP="google-group-crawler-public2"
    _test_public_1
  )
}

_test_public_2_loop_detection() {
  (
    _test_reset
    export _ORG="viettug.org"
    export _GROUP="google-group-crawler-public2"
    _test_public_1
    [[ $? == 125 ]] \
    || {
      echo >&2 ":: Unable to detect a loop."
      return 1
    }
    echo >&2 ":: Loop detected when no cookie is provided. Test passed."
  )
}

_test_public_2_with_cookie() {
  (
    _test_reset
    export _ORG="viettug.org"
    export _GROUP="google-group-crawler-public2"
    export _CURL_OPTIONS="--config curl-options.txt"
    export _GREP_MESSAGE="This is a public group from a private organization"
    _test_public_1
  )
}

_test_private_1() {
  (
    _test_reset
    export _GROUP="google-group-crawler-private"
    export _CURL_OPTIONS="--config curl-options.txt"
    _test_public_1
  )
}

_main() { :; }

set -u

cd "$(dirname "${BASH_SOURCE[0]:-.}")/../tests/" || exit 1
export PATH="$PATH:$(pwd -P)/../"

_test_public_1 || exit 1
_test_public_1_with_cat || exit 1
#_test_public_2_loop_detection || exit 1
_test_public_2_with_cookie || exit 2
_test_private_1 || exit 3