[
  {
    "path": ".gitignore",
    "content": "HELP.md\ntarget/\n!.mvn/wrapper/maven-wrapper.jar\n!**/src/main/**/target/\n!**/src/test/**/target/\n\n### STS ###\n.apt_generated\n.classpath\n.factorypath\n.project\n.settings\n.springBeans\n.sts4-cache\n\n### IntelliJ IDEA ###\n.idea\n*.iws\n*.iml\n*.ipr\n\n### NetBeans ###\n/nbproject/private/\n/nbbuild/\n/dist/\n/log/\n/nbdist/\n/.nb-gradle/\nbuild/\n!**/src/main/**/build/\n!**/src/test/**/build/\n\n### VS Code ###\n.vscode/\n\n\n# Logs\nlogs\n*.log\nnpm-debug.log*\nyarn-debug.log*\nyarn-error.log*\n\n# Runtime data\npids\n*.pid\n*.seed\n*.pid.lock\n\n# Directory for instrumented libs generated by jscoverage/JSCover\nlib-cov\n\n# Coverage directory used by tools like istanbul\ncoverage\n\n# nyc test coverage\n.nyc_output\n\n# Grunt intermediate storage (https://gruntjs.com/creating-plugins#storing-task-files)\n.grunt\n\n# Bower dependency directory (https://bower.io/)\nbower_components\n\n# node-waf configuration\n.lock-wscript\n\n# Compiled binary addons (https://nodejs.org/api/addons.html)\nbuild/Release\n\n# Dependency directories\nnode_modules/\njspm_packages/\n\n# TypeScript v1 declaration files\ntypings/\n\n# Optional npm cache directory\n.npm\n\n# Optional eslint cache\n.eslintcache\n\n# Optional REPL history\n.node_repl_history\n\n# Output of 'npm pack'\n*.tgz\n\n# Yarn Integrity file\n.yarn-integrity\n\n# dotenv environment variables file\n.env\n.env.test\n\n# parcel-bundler cache (https://parceljs.org/)\n.cache\n\n# next.js build output\n.next\n\n# nuxt.js build output\n.nuxt\n\n# vuepress build output\n.vuepress/dist\n\n# Serverless directories\n.serverless/\n\n# FuseBox cache\n.fusebox/\n\n# DynamoDB Local files\n.dynamodb/\n\ntarget\nout/\n.myeclipse\n\n\n.DS_Store\nnode_modules\n\n\n\n# local env files\n.env.local\n.env.*.local\n\n# Log files\npnpm-debug.log*\n\n# Editor directories and files\n.vscode\n*.suo\n*.ntvs*\n*.njsproj\n*.sln\n*.sw?\npackage-lock.json\nwork/tomcat*\n/src/dist/\n/src/hf-mirror-cli.spec\n"
  },
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2024 冰点\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
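  {
    "path": "README.md",
    "content": "### About hf-mirror-cli\nHugging Face repositories are hosted overseas, so downloading datasets and models is very slow. I therefore built a tool that quickly pulls data from the Hugging Face platform in a Windows development environment.\n\nIt uses a domestic (China) mirror of Hugging Face to download models quickly.\n\nIts usage is compatible with `huggingface-cli`.\n### Features\n- Environment checks covering the network, disk space, and whether the mirror address is reachable\n- Network fault tolerance: when the connection drops, requests are retried 3 times by default\n- Concurrent downloads, with a default maximum concurrency of 10\n- Resumable downloads\n- Falls back to the official site when the domestic mirror is unavailable\n- The packaged executable bundles its runtime, so there is no need to set up a Python environment or install dependencies\n\n## 1. Usage\n#### 1. First method\na. Install with `pip install hf-cli`\n```shell\npip install hf-cli\n```\nb. Run it directly\n```shell\nhf-cli Intel/dynamic_tinybert\n```\nor\n```shell\nhf-cli --model-id Intel/dynamic_tinybert\n```\n\nc. For a model that requires authorization\n```shell\nhf-cli google/gemma-2b-it --token YOUR_HF_TOKEN --username YOUR_USERNAME\n```\n\nd. Result\n   ![image](https://github.com/wangshuai67/hf-mirror-cli/assets/13214849/1dd10ad6-5f5e-467a-9d6b-e8eabbdc53f3)\n\n\n## 2. Default domestic mirror address\n  No configuration is needed by default. To use a custom mirror, set the environment variable HF_ENDPOINT=\"mirror address\"\n  \n  The default is https://hf-mirror.com/   \n  \n  Maintained by [@padeoe](https://github.com/padeoe)\n\n## 3. FAQ\n- If you see this error\n```shell\n严重告警：状态码401,模型model_id：google/gemma-2b-it未授权访问或模型ID不存在，请使用参数--token和--username\n```\n> This error (status code 401: unauthorized access or the model ID does not exist) means either the model ID was typed incorrectly or a username and token are required.\n\nModels that require login authorization can only be downloaded with an access token, passed via `--token` together with `--username`. Get one from the official site: [Access Token](https://huggingface.co/settings/tokens)\n```shell\nhf-mirror-cli google/gemma-2b-it --token YOUR_HF_TOKEN --username YOUR_USERNAME\n```\n\nor\n\n```shell\npython .\\hf-mirror-cli.py google/gemma-2b-it --token YOUR_HF_TOKEN --username YOUR_USERNAME\n```\n\n## 4. Download result\n   ![image](https://github.com/wangshuai67/hf-mirror-cli/assets/13214849/2fb4e410-0e34-4226-8f7d-52275895f10c)\n\n\n\n### Discussion group\n![WeChat discussion group](https://padeoe.com/wp-content/uploads/2023/11/%E5%9B%BE%E7%89%87_20231107095902.jpg)\n"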
  },
  {
    "path": "requirements.txt",
    "content": "requests\ngitpython\ntqdm\ntransformers\nurllib3\n"
  },
  {
    "path": "src/hf-mirror-cli.py",
    "content": "\"\"\"\n@author 冰点\n@date 2024-3-3 17:15:08\n@desc Quickly download models in a Windows environment\n\"\"\"\nimport errno\nimport os\nimport subprocess\nimport sys\nimport requests\nfrom git import Repo\nfrom tqdm import tqdm\nfrom transformers import file_utils\nfrom pathlib import Path\nimport concurrent.futures\nfrom urllib3.util.retry import Retry\nfrom requests.adapters import HTTPAdapter\nimport threading\nimport argparse\n\n# Some configuration\nMAX_CACHE_SIZE = 2**27  # 128 MB: cap used when measuring the remote size without a Content-Length header\n\n# Set environment variables\nHF_OFFICIAL_URL = 'https://huggingface.co'\n# Honor a user-supplied HF_ENDPOINT (as the README describes); otherwise use the default mirror\nHF_MIRROR_URL = (os.environ.get('HF_ENDPOINT') or 'https://hf-mirror.com').rstrip('/')\nos.environ[\"GIT_LFS_SKIP_SMUDGE\"] = \"1\"\nos.environ[\"HF_ENDPOINT\"] = HF_MIRROR_URL\n\n\"\"\"\nCheck whether git and git-lfs are installed\n\"\"\"\n\n\ndef check_git_installation():\n    def is_tool_installed(name):\n        try:\n            # subprocess.DEVNULL avoids leaking an open file handle\n            subprocess.Popen([name], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL).communicate()\n        except OSError as e:\n            if e.errno == errno.ENOENT:\n                return False\n        return True\n\n    # Check whether Git is installed\n    if not is_tool_installed(\"git\"):\n        print(\"警告：当前操作系统未安装git \"\n              \"可使用以下命令安装\"\n              \"'sudo apt install git' (for Ubuntu) \"\n              \"'brew install git' (for MacOS) \")\n\n    # Check whether Git LFS is installed\n    if not is_tool_installed(\"git-lfs\"):\n        print(\"警告：当前操作系统未安装 Git LFS \"\n              \"可使用以下命令安装\"\n              \"-------'sudo apt install git-lfs' (for Ubuntu) \"\n              \"-------'brew install git-lfs' (for MacOS) \")\n        sys.exit(1)\n\n\n\"\"\"\nCheck whether the requests and git packages are importable\n\"\"\"\n\n\ndef check_tool_availability():\n    try:\n        import requests, git\n    except ImportError as e:\n        print(f\"Required Python package is missing: {e.name}. 
Please install it first.\")\n        sys.exit(1)\n\n\n\"\"\"\nCheck whether the mirror site is reachable; if not, fall back to the official address\n\"\"\"\n\n\ndef check_hfmirror_unavailable_url():\n    error_msg = f\"警告： HF-mirror镜像网站异常=【{HF_MIRROR_URL}】，切换为huggingface官网地址[{HF_OFFICIAL_URL}]\"\n    try:\n        response = requests.get(HF_MIRROR_URL)\n        if response.status_code != 200:\n            print(error_msg)\n            os.environ[\"HF_ENDPOINT\"] = HF_OFFICIAL_URL\n            print(f\"--->检查是官网huggingface.co否可用\")\n            check_huggingface_unavailable_url()\n    except requests.exceptions.RequestException:\n        print(error_msg)\n        print(f\"--->检查是官网huggingface.co否可用\")\n        check_huggingface_unavailable_url()\n\n\n\"\"\"\nCheck whether the official huggingface site is reachable\n\"\"\"\n\n\ndef check_huggingface_unavailable_url():\n    error_msg = f\"警告：huggingface官网地址访问异常[{HF_OFFICIAL_URL}]，请检查网络或者代理是否正常\"\n    try:\n        response = requests.get(HF_OFFICIAL_URL)\n        if response.status_code != 200:\n            print(error_msg)\n            sys.exit(1)\n        else:\n            os.environ[\"HF_ENDPOINT\"] = HF_OFFICIAL_URL\n    except requests.exceptions.RequestException:\n        print(error_msg)\n        sys.exit(1)\n\n\n\"\"\"\nGet the size of the file on the server\n\"\"\"\n\n\ndef get_remote_file_size(url):\n    session = get_requests_retry_session()\n    try:\n        response = session.head(url, allow_redirects=False)\n        if response.status_code == 401:\n            print(f\"\\033[91m严重告警：状态码401,模型model_id：{model_id}未授权访问或模型ID不存在，请使用参数--token和--username\\033[0m\")\n            sys.exit(1)\n        if response.status_code in (301, 302):\n            # Follow the redirect manually and read the size from the target\n            redirect_url = response.headers['Location']\n            redirect_response = session.head(redirect_url)\n            return int(redirect_response.headers['Content-Length'])\n        else:\n            return int(response.headers['Content-Length'])\n    except KeyError:\n        print(\"No content-length key. 
We need to use the session to calculate the size of the content, \"\n              \"but we only allow content that is less than 128 MB.\")\n        session = get_requests_retry_session()\n        response = session.get(url, stream=True, timeout=60)\n        size = 0\n        for chunk in response.iter_content(8192):\n            if chunk:\n                if size <= MAX_CACHE_SIZE:\n                    size += len(chunk)\n                else:\n                    return size\n        return size\n    except Exception:\n        # Any other failure: report the size as unknown\n        return -1\n\n\n\"\"\"\nCheck free disk space\n\"\"\"\n\n\ndef check_disk_space(file_size, filename, url):\n    dir_path = os.getcwd()\n    one_gb = 1 * 1024 * 1024 * 1024\n    if os.name == 'posix':\n        stat = os.statvfs(dir_path)\n        free_space = stat.f_bavail * stat.f_frsize\n        free_space_mb = free_space / (1024 * 1024)\n        if free_space > 0 and free_space - file_size < one_gb:\n            print(f\"警告: 磁盘空间不足1GB，无法安全下载文件。fileName:{filename},url:{url},free_space:{free_space_mb}MB\")\n            sys.exit(1)\n        else:\n            print(f\"--->磁盘空间正常下载文件。剩余：{free_space_mb}MB\")\n\n    elif os.name == 'nt':\n        # Windows is assumed to be a development environment, so free disk space is not checked\n        return\n    else:\n        print(\"\\n 未检测到操作系统类型，不做磁盘空闲容量检查\")\n        return\n\n\n\"\"\"\nBuild a requests session that supports retries (3 by default)\n\"\"\"\n\n\ndef get_requests_retry_session(\n        retries=3,\n        backoff_factor=0.3,\n        status_forcelist=(500, 502, 504, 404),\n        session=None,\n):\n    session = session or requests.Session()\n    retry = Retry(\n        total=retries,\n        read=retries,\n        connect=retries,\n        backoff_factor=backoff_factor,\n        status_forcelist=status_forcelist,\n    )\n    if HF_TOKEN:\n        # Never print the token itself; it is a credential\n        print(f\"downloading with username:{HF_USERNAME}\")\n        headers = {'Authorization': f'Bearer {HF_TOKEN}'}\n        session.headers.update(headers)\n    adapter = HTTPAdapter(max_retries=retry)\n    session.mount('http://', adapter)\n    session.mount('https://', adapter)\n    return session\n\n\n\"\"\"\nResumable download\n\"\"\"\n\n\ndef download_file_with_range(url, filename, start_byte, remote_file_size=None):\n    if remote_file_size is not None:\n        check_disk_space(remote_file_size, filename, url)\n    thread_name = threading.current_thread().name.replace(\"ThreadPoolExecutor-\", \"\")\n    print(f\"\\n线程-{thread_name}-下载-{url}\")\n    if remote_file_size is not None:\n        print(f\"\\n支持端点续传 {filename}，本地文件大小：{start_byte}，服务端文件大小：{remote_file_size}\")\n    headers = {'Range': f'bytes={start_byte}-'}\n    # A 1-minute timeout, so unstable networks are tolerated too\n    session = get_requests_retry_session()\n    response = session.get(url, headers=headers, stream=True, timeout=60)\n    print(\"get response {}\".format(response.status_code))\n    progress_bar_file_name = os.path.basename(filename)\n    # A 200 means the server ignored the Range header and is sending the whole file, so overwrite instead of appending\n    mode = 'wb' if response.status_code == 200 else 'ab'\n    with open(filename, mode) as f:\n        total_size = int(response.headers.get('content-length', 0))\n        progress_bar = tqdm(total=total_size, unit='B', unit_scale=True, ncols=120, ascii=True, desc=f\"<--- downloading {progress_bar_file_name}\")\n\n        for chunk in response.iter_content(chunk_size=8192):\n            if chunk:\n                f.write(chunk)\n                progress_bar.update(len(chunk))\n\n    progress_bar.close()\n    print(f\"完成下载 {filename}\")\n\n\n\"\"\"\nSimple download, used when content-length is unavailable\n\"\"\"\n\n\ndef download_file_simple(url, filename):\n    thread_name = threading.current_thread().name\n    print(f\"线程-{thread_name} download_file_simple 开始下载-{url} \")\n    session = get_requests_retry_session()\n    response = session.get(url, stream=True, timeout=60)\n    check_disk_space(0, filename, url)\n    progress_bar_file_name = os.path.basename(filename)\n    with open(filename, 'wb') as f:\n        total_size = int(response.headers.get('content-length', 0))\n\n        if total_size != 0:\n            progress_bar = tqdm(total=total_size, unit='B', unit_scale=True, ncols=120, 
ascii=True, desc=f\"<--- downloading {progress_bar_file_name}\")\n        else:\n            progress_bar = tqdm(unit='B', unit_scale=True, ncols=120, ascii=True, desc=f\"<--- downloading {progress_bar_file_name}\")\n        for chunk in response.iter_content(chunk_size=8192):\n            if chunk:\n                f.write(chunk)\n                progress_bar.update(len(chunk))\n\n    progress_bar.close()\n    print(f\"完成下载 {filename}\")\n\n\n\"\"\"\nGet the directory where hfd stores downloaded models\n\"\"\"\n\n\ndef get_hfd_file_path():\n    default_cache_path = file_utils.default_cache_path\n    cache_path = Path(default_cache_path) / 'hfd'\n    cache_path.mkdir(parents=True, exist_ok=True)\n    print(f\"--->当前huggingface模型的下载地址为{cache_path}\")\n    return cache_path\n\n\n\"\"\"\nDecide whether concurrent download is needed\n\"\"\"\n\n\ndef should_use_concurrency(files):\n    return len(files) > 1\n\n\n\"\"\"\nRun download tasks in parallel\n\"\"\"\n# Create the thread pool up front, with 10 worker threads\nexecutor = concurrent.futures.ThreadPoolExecutor(max_workers=10)\n\n\"\"\"\nSubmit a task to the thread pool for asynchronous execution\n\"\"\"\n\n\ndef execute_task(task, *args, **kwargs):\n    executor.submit(task, *args, **kwargs)\n\n\n\"\"\"\nDownload a model\n\"\"\"\n\ndef download_model(model_id: str):\n    hf_endpoint = os.environ.get('HF_ENDPOINT', 'https://huggingface.co')\n    model_dir = model_id.split('/')[-1]\n    repo_url = f\"{hf_endpoint}/{model_id}\"\n    if not os.path.isdir(f\"{model_dir}/.git\"):  # Check if the repo has already been cloned\n        print(f\"--->开始 clone repo from {repo_url}\")\n        # Probe the repo first to find out whether it requires authentication\n        session = get_requests_retry_session()\n        response = session.get(f\"{repo_url}/info/refs?service=git-upload-pack\")\n        if response.status_code in (401, 403):\n            if HF_TOKEN is None or HF_USERNAME is None:\n                print(f\"HTTP Status Code: {response.status_code}.\\nThe repository requires authentication, but --token and --username were not passed. 
Please get token from https://huggingface.co/settings/tokens.\\nExiting.\")\n                return\n            hf_domain = hf_endpoint.split(\"//\")[1]\n            repo_url = f\"https://{HF_USERNAME}:{HF_TOKEN}@{hf_domain}/{model_id}\"\n            # Log the URL without the embedded credentials\n            print(f\"--->开始 clone repo from {hf_endpoint}/{model_id}\")\n        elif response.status_code != 200:\n            print(f\"Unexpected HTTP status code: {response.status_code}. Exiting.\")\n            return\n        Repo.clone_from(repo_url, model_dir)\n        print(f\"--->完成 clone repo from {hf_endpoint}/{model_id}\")\n    else:\n        print(f\"--->已经存在 repo_url={repo_url},检测断点续传\")\n        repo = Repo(model_dir)\n        origin = repo.remote(name='origin')\n        origin.pull()\n    os.chdir(model_dir)\n    print(f\"model_dir : {model_dir}\")\n    download_dir = os.getcwd()\n    if not os.path.exists(download_dir):\n        os.makedirs(download_dir)\n    print(f\"模型下载目录：{download_dir}\")\n    repo = Repo('.')\n    print(\"--->启动并行下载大文件......\")\n    # List the files tracked by Git LFS; the last column of each line is the file name\n    lfs_files_cmd_result = repo.git.lfs('ls-files')\n    lines = lfs_files_cmd_result.split('\\n')\n    file_names = [line.split()[-1] for line in lines if line]\n    print(f\"--->大文件 文件数量{len(file_names)},file_names : {file_names}\")\n    download_url = f\"{hf_endpoint}/{model_id}\"\n    for index, filename in enumerate(file_names):\n        url = f\"{download_url}/resolve/main/{filename}\"\n        print(f\"------>开始下载第{index + 1}个文件: {filename}，url: {url}\")\n        if filename == \"\":\n            print(\"LFS file name is empty, skipping\")\n            continue\n        download_path = os.path.join(download_dir, filename)\n        if os.path.exists(download_path):\n            local_file_size = os.path.getsize(download_path)\n            remote_file_size = get_remote_file_size(url)\n            if local_file_size < remote_file_size:\n                print(f\"\\nFile {filename} local_file_size={local_file_size}，remote_file_size={remote_file_size}\")\n                print(f\"\\nFile {filename} 
exists but is incomplete. Continuing download...\")\n                execute_task(download_file_with_range, url, download_path, local_file_size, remote_file_size)\n            elif remote_file_size == -1:\n                # Remote size unknown: fall back to a plain full download\n                execute_task(download_file_simple, url, download_path)\n                continue\n            elif remote_file_size < local_file_size:\n                print(f\"\\nFile {filename}'s local_file_size is greater than the remote size\")\n                if local_file_size > MAX_CACHE_SIZE:\n                    print(f\"{filename}'s local_file_size exceeds MAX_CACHE_SIZE ({MAX_CACHE_SIZE}); trying to resume the download from the break point\")\n                    execute_task(download_file_with_range, url, download_path, local_file_size, remote_file_size=None)\n                else:\n                    print(f\"Unknown error. Please check the remote size of {filename} on the web. The local size is: {local_file_size}\")\n            elif local_file_size == remote_file_size:\n                print(f\"File {filename} exists and matches the size from the remote.\")\n            else:\n                print(f\"Download {filename} failed, unknown error\")\n\n\nprint(\"--->start-开始检查环境和网络\")\nprint(\"--->检查当前环境是否安装了git和git-lfs\")\ncheck_git_installation()\ncheck_tool_availability()\nparser = argparse.ArgumentParser()\nparser.add_argument(\"--token\", type=str, default=None)\nparser.add_argument(\"--username\", type=str, default=None)\nparser.add_argument(\"--model-id\", type=str, default=None, help=\"the id of the model, example: Intel/dynamic_tinybert\")\nparser.add_argument(\"modelId\", type=str, nargs='?', default=None)\nargs = parser.parse_args()\n\ntoken = args.token\nusername = args.username\n\n# If --model-id is not provided, use the positional argument modelId\nif args.model_id is None:\n    model_id = args.modelId\nelse:\n    model_id = args.model_id\n\nif model_id is None:\n    print(\"正确用法: 
hf-mirror-cli.exe --model-id <modelId> 或 hf-mirror-cli.exe <modelId> \\n示例: hf-mirror-cli.exe Intel/dynamic_tinybert\")\n    sys.exit(1)\n\n\n# Local testing\n# model_id = \"google/gemma-2b-it\"\n# # hf-mirror-cli bigscience/bloom-560m\n# token = \"hf_xxx\"\n# username = \"ssss\"\nmodel_id = model_id.strip()  # Strip leading and trailing whitespace\nHF_TOKEN = os.environ.get('HF_TOKEN', token)\nHF_USERNAME = os.environ.get(\"HF_USERNAME\", username)\nif HF_TOKEN:\n    HF_TOKEN = HF_TOKEN.strip()\nif HF_USERNAME:\n    HF_USERNAME = HF_USERNAME.strip()\nbase_path = os.path.abspath(os.path.dirname(__file__))\nmodel_dir = os.path.join(base_path, model_id.split('/')[-1])\nmodel_cache_local_path = get_hfd_file_path()\nos.chdir(model_cache_local_path)\nprint(\"----->end-环境检查完毕正常\")\nprint(\"--->开始拉起下载模型数据并发任务\")\ndownload_model(model_id)\n# Wait for all queued downloads to finish before reporting completion\nexecutor.shutdown(wait=True)\nprint(f\"model:{model_id} 下载完成后存放路径[{model_cache_local_path}]\")\n"
  },
  {
    "path": "依赖版本检查.txt",
    "content": "# If running the script locally fails, check your dependency versions against this list\n# Not needed for the Windows executable, which already bundles everything\n# pyinstaller --onefile  hf-mirror-cli.py\n\ncertifi-2024.2.2\ncharset-normalizer-3.3.2\ncolorama-0.4.6\nfilelock-3.13.1\nfsspec-2024.2.0\ngitdb-4.0.11\ngitpython-3.1.42\nhuggingface-hub-0.21.3\nidna-3.6\nnumpy-1.26.4\npackaging-23.2\npyyaml-6.0.1\nregex-2023.12.25\nrequests-2.31.0\nsafetensors-0.4.2\nsmmap-5.0.1\ntokenizers-0.15.2\ntqdm-4.66.2\ntransformers-4.38.2\ntyping-extensions-4.10.0\nurllib3-2.2.1\n"
  }
]