Repository: newren/git-filter-repo
Branch: main
Commit: c1d8461ee34c
Files: 73
Total size: 736.0 KB
Directory structure:
gitextract_9o8dsut3/
├── .gitattributes
├── .github/
│ ├── dependabot.yml
│ └── workflows/
│ └── test.yml
├── .gitignore
├── COPYING
├── COPYING.gpl
├── COPYING.mit
├── Documentation/
│ ├── Contributing.md
│ ├── FAQ.md
│ ├── converting-from-bfg-repo-cleaner.md
│ ├── converting-from-filter-branch.md
│ ├── examples-from-user-filed-issues.md
│ └── git-filter-repo.txt
├── INSTALL.md
├── Makefile
├── README.md
├── contrib/
│ └── filter-repo-demos/
│ ├── README.md
│ ├── barebones-example
│ ├── bfg-ish
│ ├── clean-ignore
│ ├── convert-svnexternals
│ ├── filter-lamely
│ ├── insert-beginning
│ ├── lint-history
│ └── signed-off-by
├── git-filter-repo
├── pyproject.toml
└── t/
├── run_coverage
├── run_tests
├── t9390/
│ ├── basic
│ ├── basic-filename
│ ├── basic-mailmap
│ ├── basic-message
│ ├── basic-numbers
│ ├── basic-replace
│ ├── basic-ten
│ ├── basic-twenty
│ ├── degenerate
│ ├── degenerate-evil-merge
│ ├── degenerate-globme
│ ├── degenerate-keepme
│ ├── degenerate-keepme-noff
│ ├── degenerate-moduleA
│ ├── empty
│ ├── empty-keepme
│ ├── less-empty-keepme
│ ├── more-empty-keepme
│ ├── sample-mailmap
│ ├── sample-message
│ ├── sample-replace
│ ├── unusual
│ ├── unusual-filtered
│ └── unusual-mailmap
├── t9390-filter-repo-basics.sh
├── t9391/
│ ├── commit_info.py
│ ├── create_fast_export_output.py
│ ├── emoji-repo
│ ├── erroneous.py
│ ├── file_filter.py
│ ├── print_progress.py
│ ├── rename-master-to-develop.py
│ ├── splice_repos.py
│ ├── strip-cvs-keywords.py
│ └── unusual.py
├── t9391-filter-repo-lib-usage.sh
├── t9392-filter-repo-python-callback.sh
├── t9393/
│ ├── lfs
│ └── simple
├── t9393-filter-repo-rerun.sh
├── t9394/
│ └── date-order
├── t9394-filter-repo-sanity-checks-and-bigger-repo-setup.sh
├── test-lib-functions.sh
└── test-lib.sh
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitattributes
================================================
*.sh eol=lf
*.py eol=lf
/git-filter-repo eol=lf
/contrib/filter-repo-demos/[a-z]* eol=lf
/t/t9*/* eol=lf
================================================
FILE: .github/dependabot.yml
================================================
---
version: 2
updates:
- package-ecosystem: "github-actions"
directory: "/"
schedule:
interval: "monthly"
================================================
FILE: .github/workflows/test.yml
================================================
name: Run tests
on: [push, pull_request]
jobs:
run-tests:
strategy:
matrix:
os: [ 'windows', 'ubuntu', 'macos' ]
fail-fast: false
runs-on: ${{ matrix.os }}-latest
steps:
- uses: actions/checkout@v4
- name: Setup python
uses: actions/setup-python@v5
with:
python-version: 3.x
- name: test
shell: bash
run: |
# setup-python puts `python` into the `PATH`, not `python3`, yet
# `git-filter-repo` expects `python3` in the `PATH`. Let's add
# a shim.
printf '#!/bin/sh\n\nexec python "$@"\n' >python3 &&
export PATH=$PWD:$PATH &&
if ! t/run_tests -q -v -x
then
mkdir failed &&
tar czf failed/failed.tar.gz t
exit 1
fi
- name: upload failed tests' directories
if: failure()
uses: actions/upload-artifact@v4
with:
name: failed-${{ matrix.os }}
path: failed
================================================
FILE: .gitignore
================================================
/Documentation/html/
/Documentation/man1/
/t/test-results
/t/trash directory*
/__pycache__/
================================================
FILE: COPYING
================================================
git-filter-repo itself and most of the files in this repository (exceptions
noted below) are provided under the MIT license (see COPYING.mit).
The usage of the MIT license probably makes filter-repo compatible with
everything, but just in case, these files can also be used under whatever
open source license[1] that git.git or libgit2 use now or in the future
(currently GPL[2] and GPL-with-linking-exception[3]). Further, the
examples (in contrib/filter-repo-demos/ and t/t9391/) can also be used
under the same license that libgit2 provides their examples under (CC0,
currently[4]).
Exceptions:
- The test harness (t/test-lib.sh, t/test-lib-functions.sh) is a slightly
modified copy of git.git's test harness (the difference being that my
copy doesn't require a built version of 'git' to be present). These
are thus GPL2 (see COPYING.gpl), and are individually marked as such.
[1] ...as defined by the Open Source Initiative (https://opensource.org/)
[2] https://git.kernel.org/pub/scm/git/git.git/tree/COPYING
[3] https://github.com/libgit2/libgit2/blob/master/COPYING
[4] https://github.com/libgit2/libgit2/blob/master/examples/COPYING
================================================
FILE: COPYING.gpl
================================================
GNU GENERAL PUBLIC LICENSE
Version 2, June 1991
Copyright (C) 1989, 1991 Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
Preamble
The licenses for most software are designed to take away your
freedom to share and change it. By contrast, the GNU General Public
License is intended to guarantee your freedom to share and change free
software--to make sure the software is free for all its users. This
General Public License applies to most of the Free Software
Foundation's software and to any other program whose authors commit to
using it. (Some other Free Software Foundation software is covered by
the GNU Lesser General Public License instead.) You can apply it to
your programs, too.
When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
this service if you wish), that you receive source code or can get it
if you want it, that you can change the software or use pieces of it
in new free programs; and that you know you can do these things.
To protect your rights, we need to make restrictions that forbid
anyone to deny you these rights or to ask you to surrender the rights.
These restrictions translate to certain responsibilities for you if you
distribute copies of the software, or if you modify it.
For example, if you distribute copies of such a program, whether
gratis or for a fee, you must give the recipients all the rights that
you have. You must make sure that they, too, receive or can get the
source code. And you must show them these terms so they know their
rights.
We protect your rights with two steps: (1) copyright the software, and
(2) offer you this license which gives you legal permission to copy,
distribute and/or modify the software.
Also, for each author's protection and ours, we want to make certain
that everyone understands that there is no warranty for this free
software. If the software is modified by someone else and passed on, we
want its recipients to know that what they have is not the original, so
that any problems introduced by others will not reflect on the original
authors' reputations.
Finally, any free program is threatened constantly by software
patents. We wish to avoid the danger that redistributors of a free
program will individually obtain patent licenses, in effect making the
program proprietary. To prevent this, we have made it clear that any
patent must be licensed for everyone's free use or not licensed at all.
The precise terms and conditions for copying, distribution and
modification follow.
GNU GENERAL PUBLIC LICENSE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
0. This License applies to any program or other work which contains
a notice placed by the copyright holder saying it may be distributed
under the terms of this General Public License. The "Program", below,
refers to any such program or work, and a "work based on the Program"
means either the Program or any derivative work under copyright law:
that is to say, a work containing the Program or a portion of it,
either verbatim or with modifications and/or translated into another
language. (Hereinafter, translation is included without limitation in
the term "modification".) Each licensee is addressed as "you".
Activities other than copying, distribution and modification are not
covered by this License; they are outside its scope. The act of
running the Program is not restricted, and the output from the Program
is covered only if its contents constitute a work based on the
Program (independent of having been made by running the Program).
Whether that is true depends on what the Program does.
1. You may copy and distribute verbatim copies of the Program's
source code as you receive it, in any medium, provided that you
conspicuously and appropriately publish on each copy an appropriate
copyright notice and disclaimer of warranty; keep intact all the
notices that refer to this License and to the absence of any warranty;
and give any other recipients of the Program a copy of this License
along with the Program.
You may charge a fee for the physical act of transferring a copy, and
you may at your option offer warranty protection in exchange for a fee.
2. You may modify your copy or copies of the Program or any portion
of it, thus forming a work based on the Program, and copy and
distribute such modifications or work under the terms of Section 1
above, provided that you also meet all of these conditions:
a) You must cause the modified files to carry prominent notices
stating that you changed the files and the date of any change.
b) You must cause any work that you distribute or publish, that in
whole or in part contains or is derived from the Program or any
part thereof, to be licensed as a whole at no charge to all third
parties under the terms of this License.
c) If the modified program normally reads commands interactively
when run, you must cause it, when started running for such
interactive use in the most ordinary way, to print or display an
announcement including an appropriate copyright notice and a
notice that there is no warranty (or else, saying that you provide
a warranty) and that users may redistribute the program under
these conditions, and telling the user how to view a copy of this
License. (Exception: if the Program itself is interactive but
does not normally print such an announcement, your work based on
the Program is not required to print an announcement.)
These requirements apply to the modified work as a whole. If
identifiable sections of that work are not derived from the Program,
and can be reasonably considered independent and separate works in
themselves, then this License, and its terms, do not apply to those
sections when you distribute them as separate works. But when you
distribute the same sections as part of a whole which is a work based
on the Program, the distribution of the whole must be on the terms of
this License, whose permissions for other licensees extend to the
entire whole, and thus to each and every part regardless of who wrote it.
Thus, it is not the intent of this section to claim rights or contest
your rights to work written entirely by you; rather, the intent is to
exercise the right to control the distribution of derivative or
collective works based on the Program.
In addition, mere aggregation of another work not based on the Program
with the Program (or with a work based on the Program) on a volume of
a storage or distribution medium does not bring the other work under
the scope of this License.
3. You may copy and distribute the Program (or a work based on it,
under Section 2) in object code or executable form under the terms of
Sections 1 and 2 above provided that you also do one of the following:
a) Accompany it with the complete corresponding machine-readable
source code, which must be distributed under the terms of Sections
1 and 2 above on a medium customarily used for software interchange; or,
b) Accompany it with a written offer, valid for at least three
years, to give any third party, for a charge no more than your
cost of physically performing source distribution, a complete
machine-readable copy of the corresponding source code, to be
distributed under the terms of Sections 1 and 2 above on a medium
customarily used for software interchange; or,
c) Accompany it with the information you received as to the offer
to distribute corresponding source code. (This alternative is
allowed only for noncommercial distribution and only if you
received the program in object code or executable form with such
an offer, in accord with Subsection b above.)
The source code for a work means the preferred form of the work for
making modifications to it. For an executable work, complete source
code means all the source code for all modules it contains, plus any
associated interface definition files, plus the scripts used to
control compilation and installation of the executable. However, as a
special exception, the source code distributed need not include
anything that is normally distributed (in either source or binary
form) with the major components (compiler, kernel, and so on) of the
operating system on which the executable runs, unless that component
itself accompanies the executable.
If distribution of executable or object code is made by offering
access to copy from a designated place, then offering equivalent
access to copy the source code from the same place counts as
distribution of the source code, even though third parties are not
compelled to copy the source along with the object code.
4. You may not copy, modify, sublicense, or distribute the Program
except as expressly provided under this License. Any attempt
otherwise to copy, modify, sublicense or distribute the Program is
void, and will automatically terminate your rights under this License.
However, parties who have received copies, or rights, from you under
this License will not have their licenses terminated so long as such
parties remain in full compliance.
5. You are not required to accept this License, since you have not
signed it. However, nothing else grants you permission to modify or
distribute the Program or its derivative works. These actions are
prohibited by law if you do not accept this License. Therefore, by
modifying or distributing the Program (or any work based on the
Program), you indicate your acceptance of this License to do so, and
all its terms and conditions for copying, distributing or modifying
the Program or works based on it.
6. Each time you redistribute the Program (or any work based on the
Program), the recipient automatically receives a license from the
original licensor to copy, distribute or modify the Program subject to
these terms and conditions. You may not impose any further
restrictions on the recipients' exercise of the rights granted herein.
You are not responsible for enforcing compliance by third parties to
this License.
7. If, as a consequence of a court judgment or allegation of patent
infringement or for any other reason (not limited to patent issues),
conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot
distribute so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you
may not distribute the Program at all. For example, if a patent
license would not permit royalty-free redistribution of the Program by
all those who receive copies directly or indirectly through you, then
the only way you could satisfy both it and this License would be to
refrain entirely from distribution of the Program.
If any portion of this section is held invalid or unenforceable under
any particular circumstance, the balance of the section is intended to
apply and the section as a whole is intended to apply in other
circumstances.
It is not the purpose of this section to induce you to infringe any
patents or other property right claims or to contest validity of any
such claims; this section has the sole purpose of protecting the
integrity of the free software distribution system, which is
implemented by public license practices. Many people have made
generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that
system; it is up to the author/donor to decide if he or she is willing
to distribute software through any other system and a licensee cannot
impose that choice.
This section is intended to make thoroughly clear what is believed to
be a consequence of the rest of this License.
8. If the distribution and/or use of the Program is restricted in
certain countries either by patents or by copyrighted interfaces, the
original copyright holder who places the Program under this License
may add an explicit geographical distribution limitation excluding
those countries, so that distribution is permitted only in or among
countries not thus excluded. In such case, this License incorporates
the limitation as if written in the body of this License.
9. The Free Software Foundation may publish revised and/or new versions
of the General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the Program
specifies a version number of this License which applies to it and "any
later version", you have the option of following the terms and conditions
either of that version or of any later version published by the Free
Software Foundation. If the Program does not specify a version number of
this License, you may choose any version ever published by the Free Software
Foundation.
10. If you wish to incorporate parts of the Program into other free
programs whose distribution conditions are different, write to the author
to ask for permission. For software which is copyrighted by the Free
Software Foundation, write to the Free Software Foundation; we sometimes
make exceptions for this. Our decision will be guided by the two goals
of preserving the free status of all derivatives of our free software and
of promoting the sharing and reuse of software generally.
NO WARRANTY
11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
REPAIR OR CORRECTION.
12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.
END OF TERMS AND CONDITIONS
How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
convey the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
Also add information on how to contact you by electronic and paper mail.
If the program is interactive, make it output a short notice like this
when it starts in an interactive mode:
Gnomovision version 69, Copyright (C) year name of author
Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
This is free software, and you are welcome to redistribute it
under certain conditions; type `show c' for details.
The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License. Of course, the commands you use may
be called something other than `show w' and `show c'; they could even be
mouse-clicks or menu items--whatever suits your program.
You should also get your employer (if you work as a programmer) or your
school, if any, to sign a "copyright disclaimer" for the program, if
necessary. Here is a sample; alter the names:
Yoyodyne, Inc., hereby disclaims all copyright interest in the program
`Gnomovision' (which makes passes at compilers) written by James Hacker.
<signature of Ty Coon>, 1 April 1989
Ty Coon, President of Vice
This General Public License does not permit incorporating your program into
proprietary programs. If your program is a subroutine library, you may
consider it more useful to permit linking proprietary applications with the
library. If this is what you want to do, use the GNU Lesser General
Public License instead of this License.
================================================
FILE: COPYING.mit
================================================
Copyright (c) 2009, 2018-2019
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: Documentation/Contributing.md
================================================
Welcome to the community!
Contributions need to meet the bar for inclusion in git.git. Although
filter-repo is not part of the git.git repository, I want to leave the
option open for it to be merged in the future. As such, any
contributions need to follow the same [guidelines for contribution to
git.git](https://git.kernel.org/pub/scm/git/git.git/tree/Documentation/SubmittingPatches),
with a few exceptions:
* While I
[hate](https://public-inbox.org/git/CABPp-BG2SkH0GrRYpHLfp2Wey91ThwQoTgf9UmPa9f5Szn+v3Q@mail.gmail.com/)
[GitHub](https://public-inbox.org/git/CABPp-BEcpasV4vBTm0uxQ4Vzm88MQAX-ArDG4e9QU8tEoNsZWw@mail.gmail.com/)
[PRs](https://public-inbox.org/git/CABPp-BEHy8c3raHwf9aFXvXN0smf_WwCcNiYxQBwh7W6An60qQ@mail.gmail.com/)
(as others point out, [it's mind-boggling in a bad way that
web-based Git hosting and code review systems do such a poor
job](http://nhaehnle.blogspot.com/2020/06/they-want-to-be-small-they-want-to-be.html)),
git-format-patch and git-send-email can be a beast and I have not
yet found time to modify Dscho's excellent
[GitGitGadget](https://github.com/gitgitgadget/gitgitgadget) to
work with git-filter-repo. As such:
* For very short single-commit changes, feel free to open GitHub PRs.
* For more involved changes, if format-patch or send-email give you
too much trouble, go ahead and open a GitHub PR and just mention
that email didn't work out.
* If emailing patches to the git list:
* Include "filter-repo" at the start of the subject,
e.g. "[filter-repo PATCH] Add packaging scripts for uploading to PyPI"
instead of just "[PATCH] Add packaging scripts for uploading to PyPI"
* CC me instead of the git maintainer
* Git's [CodingGuidelines for python
code](https://github.com/git/git/blob/v2.24.0/Documentation/CodingGuidelines#L482-L494)
are only partially applicable:
* python3 is a hard requirement; python2 is/was EOL at the end of
2019 and should not be used. (Commit 4d0264ab723c
("filter-repo: workaround python<2.7.9 exec bug", 2019-04-30)
was the last version of filter-repo that worked with python2).
* You can depend on anything in python 3.6 or earlier. I may bump
this minimum version over time, but do want to generally work
with the python3 version found in current enterprise Linux
distributions.
* In filter-repo, it's not just OK to use bytestrings, you are
expected to use them a lot. Using unicode strings results in
lots of ugly errors since input comes from filesystem names,
commit messages, file contents, etc., none of which are
guaranteed to be unicode. (Plus unicode strings require lots of
effort to verify, encode, and decode -- slowing the filtering
process down). I tried to work with unicode strings more
broadly in the code base multiple times; but it's just a bad
idea to use an abstraction that doesn't fit the data.
* I generally like [PEP
8](https://www.python.org/dev/peps/pep-0008/), but used
two-space indents for years before learning of it and have just
continued that habit. For consistency, contributions should also
use two-space indents and otherwise generally follow PEP 8.
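As a small illustration of the bytestring point above, the sketch below shows a filename that is a perfectly legal path for Git but is not valid UTF-8. The filename and the helper `rename_under` are made up for this example and are not part of filter-repo.

```python
# A Latin-1 encoded filename: fine as a path, but not valid UTF-8.
raw_name = b"caf\xe9.txt"

# Decoding to a unicode string fails for this input...
try:
  raw_name.decode("utf-8")
  decoded_ok = True
except UnicodeDecodeError:
  decoded_ok = False

# ...while plain byte operations need no decoding at all.
def rename_under(prefix: bytes, path: bytes) -> bytes:
  """Hypothetical helper: move `path` under the directory `prefix`."""
  return prefix + b"/" + path

print(decoded_ok)                       # False
print(rename_under(b"docs", raw_name))  # b'docs/caf\xe9.txt'
```

Keeping the data as bytes end to end avoids both the decode failure and the cost of encoding and decoding every path.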
There are a few extra things I would like folks to keep in mind:
* Please test line coverage if you add or modify code
* `make test` will run the testsuite under
[coverage3](https://pypi.org/project/coverage/) (which you will
need to install), and report on line coverage. Line coverage of
git-filter-repo needs to remain at 100%; line coverage of
contrib and test scripts can be ignored.
* Please do not be intimidated by detailed feedback:
* I have been contributing to the git community for years and
have had hundreds of patches accepted, yet even when I try to make
patches perfect, I am not surprised when fixing them up after
submission takes as much time as writing them did, or more.
Git folks tend to do thorough reviews, which has taught
me a lot, and I try to do the same for filter-repo. Plus, as
noted above, I want contributions from others to be acceptable
in git.git itself.
================================================
FILE: Documentation/FAQ.md
================================================
# Frequently Answered Questions
## Table of Contents
* [Why did `git-filter-repo` rewrite commit hashes?](#why-did-git-filter-repo-rewrite-commit-hashes)
* [Why did `git-filter-repo` rewrite more commit hashes than I expected?](#why-did-git-filter-repo-rewrite-more-commit-hashes-than-i-expected)
* [Why did `git-filter-repo` rewrite other branches too?](#why-did-git-filter-repo-rewrite-other-branches-too)
* [How should paths be specified?](#How-should-paths-be-specified)
* [Help! Can I recover or undo the filtering?](#help-can-i-recover-or-undo-the-filtering)
* [Can you change `git-filter-repo` to allow future folks to recover from `--force`'d rewrites?](#can-you-change-git-filter-repo-to-allow-future-folks-to-recover-from---forced-rewrites)
* [Can I use `git-filter-repo` to fix a repository with corruption?](#Can-I-use-git-filter-repo-to-fix-a-repository-with-corruption)
* [What kinds of problems does `git-filter-repo` not try to solve?](#What-kinds-of-problems-does-git-filter-repo-not-try-to-solve)
* [Filtering history but magically keeping the same commit IDs](#Filtering-history-but-magically-keeping-the-same-commit-IDs)
* [Bidirectional development between a filtered and unfiltered repository](#Bidirectional-development-between-a-filtered-and-unfiltered-repository)
* [Removing specific commits, or filtering based on the difference (a.k.a. patch or change) between commits](#Removing-specific-commits-or-filtering-based-on-the-difference-aka-patch-or-change-between-commits)
* [Filtering two different clones of the same repository and getting the same new commit IDs](#Filtering-two-different-clones-of-the-same-repository-and-getting-the-same-new-commit-IDs)
## Why did `git-filter-repo` rewrite commit hashes?
This is fundamental to how Git operates. In more detail...
Each commit in Git is a hash of its contents. Those contents include
the commit message, the author (name, email, and time authored), the
committer (name, email and time committed), the toplevel tree hash,
and the parent(s) of the commit. This means that if any of the commit
fields change, including the tree hash or the hash of the parent(s) of
the commit, then the hash for the commit will change.
(The same is true for files ("blobs") and trees stored in git as well;
each is a hash of its contents, so if anything in them changes, their
hash changes, and so does the hash of every tree and commit that
references them.)
If you attempt to write a commit (or tree or blob) object with an
incorrect hash, Git will reject it as corrupt.
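To make this concrete, here is a minimal Python sketch of how an object ID is derived under Git's default SHA-1 object format: the hash covers a header of the form `<type> <size>\0` followed by the object's bytes. The helper name `git_object_id` and the sample commit body (names, emails, timestamps) are illustrative assumptions.

```python
import hashlib

def git_object_id(obj_type: bytes, body: bytes) -> str:
  """Return the SHA-1 object ID for `body` stored as an object of `obj_type`."""
  header = obj_type + b" " + str(len(body)).encode("ascii") + b"\0"
  return hashlib.sha1(header + body).hexdigest()

# The well-known ID of the empty blob.
print(git_object_id(b"blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# Two made-up commit bodies that differ only in the commit message.
commit_a = (b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
            b"author A U Thor <author@example.com> 1112911993 -0700\n"
            b"committer C O Mitter <committer@example.com> 1112911993 -0700\n"
            b"\n"
            b"Initial commit\n")
commit_b = commit_a.replace(b"Initial commit", b"Initial commit (reworded)")

# Any change to any field yields a different commit ID.
print(git_object_id(b"commit", commit_a) != git_object_id(b"commit", commit_b))  # True
```

The same function applied to a changed tree or a changed parent line gives the same result: a new ID.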
## Why did `git-filter-repo` rewrite more commit hashes than I expected?
There are two aspects to this, or two possible underlying questions users
might be asking here:
* Why did commits newer than the ones I expected have their hash change?
* Why did commits older than the ones I expected have their hash change?
For the first question, see [why filter-repo rewrites commit
hashes](#why-did-git-filter-repo-rewrite-commit-hashes), and note that
if you modify some old commit, perhaps to remove a file, then obviously
that commit's hash must change. Further, since that commit will have a
new hash, any other commit with that commit as a parent will need to
have a new hash. That will need to chain all the way to the most recent
commits in history. This is fundamental to Git and there is nothing you
can do to change this.
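The chaining effect can be seen in a toy model. The sketch below is not Git's real commit format; `toy_commit_id` simply hashes the parent ID together with a description of the commit, which is enough to show why one change to an old commit ripples forward to every descendant.

```python
import hashlib

def toy_commit_id(parent_id: str, content: str) -> str:
  """Toy stand-in for a commit hash: covers the parent ID and the content."""
  return hashlib.sha1(f"{parent_id}\n{content}".encode()).hexdigest()

def chain_ids(contents):
  """Return the IDs of a linear history, oldest commit first."""
  ids, parent = [], ""
  for content in contents:
    parent = toy_commit_id(parent, content)
    ids.append(parent)
  return ids

original = chain_ids(["add README", "add secret.txt", "fix typo", "add tests"])
# Rewrite only the second commit; later commits keep their own content.
rewritten = chain_ids(["add README", "add notes.txt", "fix typo", "add tests"])

changed = [old != new for old, new in zip(original, rewritten)]
print(changed)  # [False, True, True, True]
```

The first commit keeps its ID because nothing it depends on changed; every commit from the rewritten one onward gets a new ID.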
For the second question, there are two causes: (1) the filter you
specified applies to the older commits too, or (2) git-fast-export and
git-fast-import (both of which git-filter-repo uses) canonicalize
history in various ways. The second cause means that even if you have
no filter, these tools sometimes change commit hashes. This can happen
in any of these cases:
* If you have signed commits, the signatures will be stripped
* If you have commits with extended headers, the extended headers will
be stripped (signed commits are actually a special case of this)
* If you have commits in an encoding other than UTF-8, they will by
default be re-encoded into UTF-8
* If you have a commit without an author, one will be added that
matches the committer.
* If you have trees that are not canonical (e.g. incorrect sorting
order), they will be canonicalized
If this affects you and you really only want to rewrite newer commits in
history, you can use the `--refs` argument to git-filter-repo to specify
a range of history that you want rewritten.
(For those attempting to be clever and use `--refs` for the first
question: Note that if you attempt to only rewrite a few old commits,
then all you'll succeed in is adding new commits that won't be part of
any branch and will be subject to garbage collection. The branches will
still hold on to the unrewritten versions of the commits. Thus, you
have to rewrite all the way to the branch tip for the rewrite to be
meaningful. Said another way, the `--refs` trick is only useful for
restricting the rewrite to newer commits, never for restricting the
rewrite to older commits.)
## Why did `git-filter-repo` rewrite other branches too?
git-filter-repo's name is git-filter-**_repo_**. Obviously it is going
to rewrite all branches by default.
`git-filter-repo` can restrict its rewriting to a subset of history,
such as a single branch, using the `--refs` option. However, using that
comes with the risk that one branch now has a different version of some
commits than other branches do; usually, when you rewrite history, you
want all branches that depend on what you are rewriting to be updated.
## How should paths be specified?
Arguments to `--path` should be paths as Git would report them, when run
from the toplevel of the git repository (explained more below after some
examples).
**Good** path examples:
* `README.md`
* `Documentation/README.md`
* `src/modules/flux/capacitor.rs`
You can find examples of valid path names from your repository by
running either `git diff --no-relative --name-only` or `git log
--no-relative --name-only --format=""`.
The following are basic rules about paths the way that Git reports and uses
them:
* do not use absolute paths
* always treat paths as relative to the toplevel of the repository
(do not add a leading slash, and do not specify paths relative to some
subdirectory of the repository even if that is your current working
directory)
* do not use the special directories `.` or `..` anywhere in your path
* do not use `\`, the Windows path separator, between directories and
files; always use `/` regardless of platform.
**Erroneous** path examples (do **_NOT_** use any of these styles):
* `/absolute/path/to/src/modules/program.c`
* `/src/modules/program.c`
* `src/docs/../modules/main.java`
* `scripts/config/./update.sh`
* `./tests/fixtures/image.jpg`
* `../src/main.rs`
* `C:\absolute\path\to\src\modules\program.c`
* `src\modules\program.c`
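The rules above can be turned into a quick sanity check before passing a value to `--path`. This is only a sketch: `path_problems` is a hypothetical helper, not something git-filter-repo provides, and it deliberately treats every backslash as a mistake even though Git itself permits backslashes in filenames on non-Windows platforms.

```python
import re

def path_problems(path: str) -> list:
    """Return reasons why `path` is not in the form Git reports (empty if fine)."""
    problems = []
    # A leading slash or a Windows drive prefix means an absolute path.
    if path.startswith("/") or re.match(r"^[A-Za-z]:[\\/]", path):
        problems.append("absolute path")
    if "\\" in path:
        problems.append("backslash separator")
    # Split on either separator so '.' and '..' are caught in both styles.
    if any(part in (".", "..") for part in re.split(r"[\\/]", path)):
        problems.append("'.' or '..' component")
    return problems

for candidate in ["Documentation/README.md",
                  "/src/modules/program.c",
                  "src/docs/../modules/main.java",
                  r"src\modules\program.c"]:
    print(candidate, "->", path_problems(candidate) or "looks fine")
```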
## Help! Can I recover or undo the filtering?
Sure, _if_ you followed the instructions. The instructions told you to
make a fresh clone before running git-filter-repo. If you did that (and
didn't force push your rewritten history back over the original), you
can just throw away your clone with the flubbed rewrite, and make a new
clone.
If you didn't make a fresh clone, and you didn't run with `--force`, you
would have seen the following warning:
```
Aborting: Refusing to destructively overwrite repo history since
this does not look like a fresh clone.
[...]
Please operate on a fresh clone instead. If you want to proceed
anyway, use --force.
```
If you then added `--force`, well, you were warned.
If you didn't make a fresh clone, and you started with `--force`, and you
didn't think to read the description of the `--force` option:
```
Ignore fresh clone checks and rewrite history (an irreversible
operation, especially since it by default ends with an
immediate pruning of reflogs and old objects).
```
and you didn't read even the beginning of the manual
```
git-filter-repo destructively rewrites history
```
and you think it's okay to run a command with `--force` in it on
something you don't have a backup of, then now is the time to reassess
your life choices. `--force` should be a pretty clear warning sign.
(If someone on the internet suggested `--force`, you can complain at
_them_, but either way you should learn to carefully vet commands
suggested by others on the internet. Sadly, even sites like Stack
Overflow, where someone really ought to be able to correct bad guidance,
still have a fair amount of this bad advice.)
See also the next question.
## Can you change `git-filter-repo` to allow future folks to recover from --force'd rewrites?
This will never be supported.
* Providing an alternate method to restore would require storing both
the original history and the new history, meaning that those who are
trying to shrink their repository size instead see it grow and have to
figure out extra steps to expunge the old history to see the actual
size savings. Experience with other tools showed that this was
frustrating and difficult to figure out for many users.
* Providing an alternate method to restore would mean that users who are
trying to purge sensitive data from their repository still find the
sensitive data after the rewrite because it hasn't actually been
purged. In order to actually purge it, they have to take extra steps.
As with the previous bullet point, experience has shown that extra
steps to purge the extra information are difficult and error-prone.
This extra difficulty is particularly problematic when you're trying
to expunge sensitive data.
* Providing an alternate method to restore would also mean trying to
figure out what should be backed up and how. The obvious choices used
by previous tools only actually provided partial backups (reflogs
would be ignored for example, as would uncommitted changes whether
staged or not). The more you try to carefully back up everything, the
more difficult the restoration from backup will be. The only backup
mechanism I've found that seems reasonable is making a separate
clone. That's expensive to do automatically for the user (especially
if the filtering is done via multiple invocations of the tool). Plus,
it's not clear where the clone should be stored, especially to avoid
the previous problems for size-reduction and sensitive-data-removal
folks.
* Providing an alternate method to restore would also mean providing
documentation on how to restore. Past methods by other tools in the
history rewriting space suggested that it was rather difficult for
users to figure out. Difficult enough, in fact, that users simply
didn't ever use them. They instead made a separate clone before
rewriting history and if they didn't like the rewrite, then they just
blew it away and made a new clone to work with. Since that was
observed to be the easy restoration method, I simply enforced it with
this tool, requiring users who look like they might not be operating
on a fresh clone to use the --force flag.
But more than all that, if there were an alternate method to restore,
why would you have needed to specify the --force flag? Doesn't its
existence (and the wording of its documentation) make it pretty clear on
its own that there isn't going to be a way to restore?
## Can I use `git-filter-repo` to fix a repository with corruption?
Some kinds of corruption can be fixed, in conjunction with `git
replace`. If `git fsck` reports warnings/errors for certain objects,
you can often [replace them and rewrite
history](examples-from-user-filed-issues.md#Handling-repository-corruption).
## What kinds of problems does `git-filter-repo` not try to solve?
This question is often asked in the form of "How do I..." or even
written as a statement such as "I found a bug with `git-filter-repo`;
the behavior I got was different than I expected..." But if you're
trying to do one of the things below, then `git-filter-repo` is behaving
as designed and either there is no solution to your problem, or you need
to use a different tool to solve your problem. The following subsections
address some of these common requests:
### Filtering history but magically keeping the same commit IDs
This is impossible. If you modify commits, or the files contained in
them, then you change their commit IDs; this is [fundamental to
Git](#why-did-git-filter-repo-rewrite-commit-hashes).
However, _if_ you don't need to modify commits, but just don't want to
download everything, then look into one of the following:
* [partial clones](https://git-scm.com/docs/partial-clone)
* the ugly hack known as [shallow clones](https://git-scm.com/docs/shallow)
* a massive hack like [cheap fake
clones](https://github.com/newren/sequester-old-big-blobs) that at
least let you put your evil overlord laugh to use
### Bidirectional development between a filtered and unfiltered repository
Some folks want to extract a subset of a repository, do development work
on it, then bring those changes back to the original repository, and
send further changes in both directions. Such a tool can be written
using fast-export and fast-import, but would need to make very different
design decisions than `git-filter-repo` did. Such a tool would be
capable of supporting this kind of development, but would lose the ability
["to write arbitrary filters using a scripting
language"](https://josh-project.github.io/josh/#concept) and other
features that `git-filter-repo` has.
Such a tool exists; it's called [Josh](https://github.com/josh-project/josh).
Use it if this is your use case.
### Removing specific commits, or filtering based on the difference (a.k.a. patch or change) between commits
You are probably looking for `git rebase`. `git rebase` operates on the
difference between commits ("diff"), allowing you to e.g. drop or modify
the diff, but then runs the risk of conflicts as it attempts to apply
future diffs. If you tweak one diff in the middle, since it just applies
more diffs for the remaining patches, you'll still see your changes at
the end.
filter-repo, by contrast, uses fast-export and fast-import. Those tools
treat every commit not as a diff but as a "use the same versions of most
files from the parent commit, but make these five files have these exact
contents". Since you don't have either the diff or ready access to the
version of files from the parent commit, that makes it hard to "undo"
part of the changes to some file. Further, if you attempt to drop an
entire commit or tweak the contents of those new files in that commit,
those changes will be reverted by the next commit in the stream that
mentions that file because handling the next commit does not involve
applying a diff but a "make this file have these exact contents". So,
filter-repo works well for things like removing a file entirely, but if
you want to make any tweaks to any files you have to make the exact same
tweak over and over for every single commit that touches that file.
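The following toy model illustrates why dropping a commit from the stream does not remove its changes. It is not the real fast-export format: each "commit" is just a dictionary of files whose exact contents are set, and `replay` is a hypothetical stand-in for what fast-import does with them.

```python
def replay(commits):
    """Apply each commit's 'make these files have these exact contents' in order."""
    tree = {}
    history = []
    for changes in commits:
        tree = {**tree, **changes}  # unchanged files carry over from the parent
        history.append(tree)
    return history

commits = [
    {"notes.txt": "line 1\n"},                  # commit A
    {"notes.txt": "line 1\nSECRET\n"},          # commit B adds a line
    {"notes.txt": "line 1\nSECRET\nline 3\n"},  # commit C adds another line
]

# "Dropping" commit B just removes one snapshot; commit C still carries
# the full contents of the file, including the line B introduced.
without_b = [commits[0], commits[2]]
final = replay(without_b)[-1]["notes.txt"]
print("SECRET" in final)
```

With `git rebase`, dropping B would instead replay C's diff ("add line 3") on top of A, so the unwanted line would not reappear.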
In short, `git rebase` is the tool you want for removing specific
commits or otherwise operating on the diff between commits.
### Filtering two different clones of the same repository and getting the same new commit IDs
Sometimes two co-workers have a clone of the same repository and they
run the same `git-filter-repo` command, and they expect to get the same
new commit IDs. Often they do get the same new commit IDs, but
sometimes they don't.
When people get the same commit IDs, it is only by luck, not by design.
There are three reasons this is unsupported and will never be reliable:
* Different Git versions used could cause differences in filtering.
Since `git fast-export` and `git fast-import` do various
canonicalizations of history, and these could change over time,
having different versions of Git installed can result in differences
in filtering.
* Different git-filter-repo versions used could cause differences in
filtering.
Over time, `git-filter-repo` may include new filterings by default,
or fix existing filterings, or make any other number of changes. As
such, having different versions of `git-filter-repo` installed can
result in differences in filtering.
* Different amounts of the repository cloned or differences in
local-only commits can cause differences in filtering.
If the clones weren't made at the same time, one clone may have more
commits than the other. Also, both may have made local commits the
other doesn't have. These additional commits could cause history to
be traversed in a different order, and filtering rules are allowed
to be order-dependent. Further, filtering rules are allowed to depend
upon what history exists in your clone. As one example, filter-repo's
default behavior of updating commit messages that refer to other
commits by abbreviated hash may be unable to find those commits in
your clone but find them in your coworker's clone. Relatedly,
filter-repo's update of abbreviated hashes in commit messages only
works for commits that have already been filtered, and thus depends
on the order in which fast-export traverses the history.
`git-filter-repo` is designed as a _one_-shot history rewriting tool.
Once you have filtered one clone of the repository, you should not be
using it to filter other clones. All other clones of the repository
should either be discarded and recloned, or [have all their history
rebased on top of the rewritten
history](https://htmlpreview.github.io/?https://github.com/newren/git-filter-repo/blob/docs/html/git-filter-repo.html#_make_sure_other_copies_are_cleaned_up_clones_of_colleagues).
<!--
## How do I see what was removed?
Run `git rev-list --objects --all` in both a separate fresh clone from
before the rewrite and in the repo where the rewrite was done. Then
find the objects that exist in the old but not the new.
-->
================================================
FILE: Documentation/converting-from-bfg-repo-cleaner.md
================================================
# Cheat Sheet: Converting from BFG Repo Cleaner
This document is aimed at folks who are familiar with BFG Repo Cleaner
and want to learn how to convert over to using filter-repo.
## Table of Contents
* [Half-hearted conversions](#half-hearted-conversions)
* [Intention of "equivalent" commands](#intention-of-equivalent-commands)
* [Basic Differences](#basic-differences)
* [Cheat Sheet: Conversion of Examples from BFG](#cheat-sheet-conversion-of-examples-from-bfg)
## Half-hearted conversions
You can switch most any BFG command to use filter-repo under the
covers by just replacing the `java -jar bfg.jar` part of the command
with [`bfg-ish`](../contrib/filter-repo-demos/bfg-ish).
bfg-ish is a reasonable tool, and provides a number of bug fixes and
features on top of bfg, but most of my focus is naturally on
filter-repo which has a number of capabilities lacking in bfg-ish.
## Intention of "equivalent" commands
BFG and filter-repo have a few differences, highlighted in the Basic
Differences section below, that make it hard to get commands that
behave identically. Rather than focusing on matching BFG output as
exactly as possible, I treat the BFG examples as idiomatic ways to
solve a certain type of problem with BFG, and express how one would
idiomatically solve the same problem in filter-repo. Sometimes that
means the results are not identical, but they are largely the same in
each case.
## Basic Differences
BFG operates directly on tree objects, which have no notion of their
leading path. Thus, it has no way of differentiating between
'README.md' at the toplevel versus in some subdirectory. You simply
operate on the basename of files and directories. This precludes
doing things like renaming files and directories or other bigger
restructures. By directly operating on trees, it also runs into
problems with loose vs. packed objects, loose vs. packed refs, not
understanding replace refs or grafts, and not understanding the index
and working tree as another data source.
With `git filter-repo`, you are essentially given an editing tool to
operate on the [fast-export](https://git-scm.com/docs/git-fast-export)
serialization of a repo, which operates on filenames including their
full paths from the toplevel of the repo. Directories are not
separately specified, so any directory-related filtering is done by
checking the leading path of each file. Further, you aren't limited
to the pre-defined filtering types; python callbacks which operate on
the data structures from the fast-export stream can be provided to do
just about anything you want. By leveraging fast-export and
fast-import, filter-repo gains automatic handling of objects and refs
whether they are packed or not, automatic handling of replace refs and
grafts, and future features that may appear. It also tries hard to
provide a full rewrite solution, so it takes care of additional
important concerns such as updating the index and working tree and
running an automatic gc for the user afterwards.
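The difference between matching on basenames and matching on full paths can be sketched in a few lines. The helper `in_directory` and the sample paths are assumptions made for illustration; filenames are shown as bytes because that is how filter-repo callbacks elsewhere in these documents handle them.

```python
import os

def in_directory(path: bytes, directory: bytes) -> bool:
    """True if `path` (relative to the repo toplevel) lies inside `directory`."""
    prefix = directory.rstrip(b"/") + b"/"
    return path.startswith(prefix)

paths = [b"README.md", b"docs/README.md", b"src/docs/guide.md", b"docs-old/a.md"]

# Basename matching cannot tell the toplevel README.md from docs/README.md.
by_basename = [p for p in paths if os.path.basename(p) == b"README.md"]

# Checking the leading path selects exactly one directory's contents.
by_directory = [p for p in paths if in_directory(p, b"docs")]

print(by_basename)
print(by_directory)
```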
The "protection" and "privacy" defaults in BFG are something I
fundamentally disagreed with for a variety of reasons; see the
comments at the top of the
[bfg-ish](../contrib/filter-repo-demos/bfg-ish) script if you want
details. The bfg-ish script implemented these protection and privacy
options since it was designed to act like BFG, but still flipped the
default to the opposite of what BFG chose. I left the "protection"
and "non-private" features out of filter-repo entirely. This means a
number of things with filter-repo:
* any filters you specify will also be applied to HEAD, so that you
don't have a weird disconnect from your history transformations
only being applied to most commits
* `[formerly OLDHASH]` references are not munged into commit
messages; the replace refs that filter-repo adds are a much
cleaner way of looking up commits by old commit hashes.
* `Former-commit-id:` footers are not added to commit messages; the
replace refs that filter-repo adds are a much cleaner way of
looking up commits by old commit hashes.
* History is not littered with `<filename>.REMOVED.git-id` files.
BFG expects you to specify the repository to rewrite as its final
argument, whereas filter-repo expects you to cd into the repo and then
run filter-repo.
## Cheat Sheet: Conversion of Examples from BFG
### Stripping big blobs
```shell
java -jar bfg.jar --strip-blobs-bigger-than 100M some-big-repo.git
```
becomes
```shell
git filter-repo --strip-blobs-bigger-than 100M
```
### Deleting files
```shell
java -jar bfg.jar --delete-files id_{dsa,rsa} my-repo.git
```
becomes
```shell
git filter-repo --use-base-name --path id_dsa --path id_rsa --invert-paths
```
### Removing sensitive content
```shell
java -jar bfg.jar --replace-text passwords.txt my-repo.git
```
becomes
```shell
git filter-repo --replace-text passwords.txt
```
The `--replace-text` option was a really clever idea that the BFG came up
with and I just implemented mostly as-is within filter-repo. Sadly,
BFG didn't document the format of files passed to `--replace-text` very
well, but I added more detail in the filter-repo documentation.
There is one small but important difference between the two tools: if
you use both "regex:" and "==>" on a single line to specify a regex
search and replace, then filter-repo will use "\1", "\2", "\3",
etc. for replacement strings whereas BFG used "$1", "$2", "$3", etc.
The reason for this difference is simply that python used backslashes
in its regex format while scala used dollar signs, and both tools
wanted to just pass along the strings unmodified to the underlying
language. (Since bfg-ish attempts to emulate the BFG, it accepts
"$1", "$2" and so forth and translates them to "\1", "\2", etc. so
that filter-repo/python will understand it.)
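A rough sketch of that translation, and of what a Python-style backreference does, is shown below. The function `dollar_to_backslash` is a hypothetical helper written for this illustration; it is not claimed to be how bfg-ish implements the conversion, and the pattern and sample text are made up.

```python
import re

def dollar_to_backslash(replacement: bytes) -> bytes:
    """Translate BFG-style $N group references into Python-style backslash-N ones."""
    return re.sub(rb"\$(\d+)", rb"\\\1", replacement)

pattern = re.compile(rb"password=(\w+);user=(\w+)")

bfg_style = rb"user=$2;password=***"
python_style = dollar_to_backslash(bfg_style)

# Python's re module expands \2 to the text captured by the second group.
result = pattern.sub(python_style, b"password=hunter2;user=alice")

print(python_style)
print(result)
```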
### Removing files and folders with a certain name
```shell
java -jar bfg.jar --delete-folders .git --delete-files .git --no-blob-protection my-repo.git
```
becomes
```shell
git filter-repo --invert-paths --path-glob '*/.git' --path .git
```
Yes, that glob will handle .git directories one or more directories
deep; it's a git-style glob rather than a shell-style glob. Also, the
`--path .git` was added because `--path-glob '*/.git'` won't match a
directory named .git in the toplevel directory since it has a '/'
character in the glob expression (though I would hope the repository
doesn't have a tracked .git toplevel directory in its history).
================================================
FILE: Documentation/converting-from-filter-branch.md
================================================
# Cheat Sheet: Converting from filter-branch
This document is aimed at folks who are familiar with filter-branch and want
to learn how to convert over to using filter-repo.
## Table of Contents
* [Half-hearted conversions](#half-hearted-conversions)
* [Intention of "equivalent" commands](#intention-of-equivalent-commands)
* [Basic Differences](#basic-differences)
* [Cheat Sheet: Conversion of Examples from the filter-branch manpage](#cheat-sheet-conversion-of-examples-from-the-filter-branch-manpage)
* [Cheat Sheet: Additional conversion examples](#cheat-sheet-additional-conversion-examples)
## Half-hearted conversions
You can switch nearly any `git filter-branch` command to use
filter-repo under the covers by just replacing the `git filter-branch`
part of the command with
[`filter-lamely`](../contrib/filter-repo-demos/filter-lamely). The
git.git regression testsuite passes when I swap out the filter-branch
script with filter-lamely, for example. (However, the filter-branch
tests are not very comprehensive, so don't rely on that too much.)
Doing a half-hearted conversion has nearly all of the drawbacks of
filter-branch and nearly none of the benefits of filter-repo, but it
will make your command run a few times faster and makes for a very
simple conversion.
You'll get a lot more performance, safety, and features by just
switching to direct filter-repo commands.
## Intention of "equivalent" commands
filter-branch and filter-repo have different defaults, as highlighted
in the Basic Differences section below. As such, getting a command
which behaves identically is not possible. Also, sometimes the
filter-branch manpage lies, e.g. it says "suppose you want to...from
all commits" and then uses a command line like "git filter-branch
... HEAD", which only operates on commits in the current branch rather
than on all commits.
Rather than focusing on matching filter-branch output as exactly as
possible, I treat the filter-branch examples as idiomatic ways to
solve a certain type of problem with filter-branch, and express how
one would idiomatically solve the same problem in filter-repo.
Sometimes that means the results are not identical, but they are
largely the same in each case.
## Basic Differences
With `git filter-branch`, you have a git repository where every single
commit (within the branches or revisions you specify) is checked out
and then you run one or more shell commands to transform the working
copy into your desired end state.
With `git filter-repo`, you are essentially given an editing tool to
operate on the [fast-export](https://git-scm.com/docs/git-fast-export)
serialization of a repo. That means there is an input stream of all
the contents of the repository, and rather than specifying filters in
the form of commands to run, you usually employ a number of common
pre-defined filters that provide various ways to slice, dice, or
modify the repo based on its components (such as pathnames, file
content, user names or emails, etc.). That makes common operations
easier, even if it's not as versatile as shell callbacks. For cases
where more complexity or special casing is needed, filter-repo
provides python callbacks that can operate on the data structures
populated from the fast-export stream to do just about anything you
want.
filter-branch defaults to working on a subset of the repository, and
requires you to specify a branch or branches, meaning you need to
specify `-- --all` to modify all commits. filter-repo by contrast
defaults to rewriting everything, and you need to specify `--refs
<rev-list-args>` if you want to limit to just a certain set of
branches or range of commits. (Though any `<rev-list-args>` that
begin with a hyphen are not accepted by filter-repo as they look like
the start of different options.)
filter-repo also takes care of additional concerns automatically, like
rewriting commit messages that reference old commit IDs to instead
reference the rewritten commit IDs, pruning commits which do not start
empty but become empty due to the specified filters, and automatically
shrinking and gc'ing the repo at the end of the filtering operation.
## Cheat Sheet: Conversion of Examples from the filter-branch manpage
### Removing a file
The filter-branch manual provided three different examples of removing
a single file, based on different levels of ease vs. carefulness and
performance:
```shell
git filter-branch --tree-filter 'rm filename' HEAD
```
```shell
git filter-branch --tree-filter 'rm -f filename' HEAD
```
```shell
git filter-branch --index-filter 'git rm --cached --ignore-unmatch filename' HEAD
```
All of these just become
```shell
git filter-repo --invert-paths --path filename
```
### Extracting a subdirectory
Extracting a subdirectory via
```shell
git filter-branch --subdirectory-filter foodir -- --all
```
is one of the easiest commands to convert; it just becomes
```shell
git filter-repo --subdirectory-filter foodir
```
### Moving the whole tree into a subdirectory
Keeping all files but placing them in a new subdirectory via
```shell
git filter-branch --index-filter \
'git ls-files -s | sed "s-\t\"*-&newsubdir/-" |
GIT_INDEX_FILE=$GIT_INDEX_FILE.new \
git update-index --index-info &&
mv "$GIT_INDEX_FILE.new" "$GIT_INDEX_FILE"' HEAD
```
(which happens to be GNU-specific and will fail with BSD userland in
very subtle ways) becomes
```shell
git filter-repo --to-subdirectory-filter newsubdir
```
(which works fine regardless of GNU vs BSD userland differences.)
### Re-grafting history
The filter-branch manual provided one example with three different
commands that could be used to achieve it, though the first of them
had limited applicability (only when the repo had a single initial
commit). These three examples were:
```shell
git filter-branch --parent-filter 'sed "s/^\$/-p <graft-id>/"' HEAD
```
```shell
git filter-branch --parent-filter \
'test $GIT_COMMIT = <commit-id> && echo "-p <graft-id>" || cat' HEAD
```
```shell
git replace --graft $commit-id $graft-id
git filter-branch $graft-id..HEAD
```
git-replace did not exist when the original two examples were written,
but it is clear that the last example is far easier to understand. As
such, filter-repo just uses the same mechanism:
```shell
git replace --graft $commit-id $graft-id
git filter-repo --proceed
```
NOTE: --proceed is needed here because filter-repo errors out if no
arguments are specified (doing so is usually an error).
### Removing commits by a certain author
WARNING: This is a BAD example for BOTH filter-branch and filter-repo.
It does not remove the changes the user made from the repo, it just
removes the commit in question while smashing the changes from it into
any subsequent commits as though the subsequent authors had been
responsible for those changes as well. `git rebase` is likely to be a
better fit for what you really want if you are looking at this
example. (See also [this explanation of the differences between
rebase and
filter-repo](https://github.com/newren/git-filter-repo/issues/62#issuecomment-597725502))
This filter-branch example
```shell
git filter-branch --commit-filter '
if [ "$GIT_AUTHOR_NAME" = "Darl McBribe" ];
then
skip_commit "$@";
else
git commit-tree "$@";
fi' HEAD
```
becomes
```shell
git filter-repo --commit-callback '
if commit.author_name == b"Darl McBribe":
  commit.skip()
'
```
### Rewriting commit messages -- removing text
Removing git-svn-id: lines from commit messages via
```shell
git filter-branch --msg-filter '
sed -e "/^git-svn-id:/d"
'
```
becomes
```shell
git filter-repo --message-callback '
return re.sub(b"^git-svn-id:.*\n", b"", message, flags=re.MULTILINE)
'
```
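To try the callback body on a sample message before rewriting any history, it can be wrapped in an ordinary function. The wrapper name `strip_git_svn_id` and the sample commit message are assumptions for illustration; the `re.sub` call is the one from the callback above, operating on bytes just as the callback does.

```python
import re

def strip_git_svn_id(message: bytes) -> bytes:
    """Body of the message callback above, wrapped in a plain function."""
    return re.sub(b"^git-svn-id:.*\n", b"", message, flags=re.MULTILINE)

original = (b"Fix overflow in parser\n"
            b"\n"
            b"git-svn-id: https://svn.example.com/repo/trunk@1234 0000-uuid\n")
cleaned = strip_git_svn_id(original)

print(cleaned)
```

`re.MULTILINE` is what lets `^` match at the start of every line rather than only at the start of the message.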
### Rewriting commit messages -- adding text
Adding Acked-by lines to the last ten commits via
```shell
git filter-branch --msg-filter '
cat &&
echo "Acked-by: Bugs Bunny <bunny@bugzilla.org>"
' master~10..master
```
becomes
```shell
git filter-repo --message-callback '
return message + b"Acked-by: Bugs Bunny <bunny@bugzilla.org>\n"
' --refs master~10..master
```
### Changing author/committer(/tagger?) information
```shell
git filter-branch --env-filter '
if test "$GIT_AUTHOR_EMAIL" = "root@localhost"
then
GIT_AUTHOR_EMAIL=john@example.com
fi
if test "$GIT_COMMITTER_EMAIL" = "root@localhost"
then
GIT_COMMITTER_EMAIL=john@example.com
fi
' -- --all
```
becomes either
```shell
# Ensure '<john@example.com> <root@localhost>' is a line in .mailmap, then:
git filter-repo --use-mailmap
```
or
```shell
git filter-repo --email-callback '
return email if email != b"root@localhost" else b"john@example.com"
'
```
(and as a bonus both filter-repo alternatives will fix tagger emails
too, unlike the filter-branch example)
### Restricting to a range
The partial examples
```shell
git filter-branch ... C..H
```
```shell
git filter-branch ... C..H ^D
```
```shell
git filter-branch ... D..H ^C
```
become
```shell
git filter-repo ... --refs C..H
```
```shell
git filter-repo ... --refs C..H ^D
```
```shell
git filter-repo ... --refs D..H ^C
```
Note that filter-branch accepts `--not` among the revision specifiers,
but that appears to python to be a flag name which breaks parsing.
So, instead of e.g. `--not C` as we might use with filter-branch, we
can specify `^C` to filter-repo.
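The parsing problem can be reproduced with a small stand-in parser. This sketch assumes an `argparse` option named `--refs` that takes one or more values; it is not filter-repo's actual option table, and `parses` is a hypothetical helper.

```python
import argparse

parser = argparse.ArgumentParser(prog="sketch")
parser.add_argument("--refs", nargs="+")

def parses(argv):
    """Return True if argparse accepts argv, False if it exits with a usage error."""
    try:
        parser.parse_args(argv)
        return True
    except SystemExit:
        return False

# '^D' does not start with a hyphen, so it is collected as a value of --refs.
ok = parser.parse_args(["--refs", "C..H", "^D"])
print(ok.refs)

# '--not' looks like an option name, so argparse rejects it (usage goes to stderr).
print(parses(["--refs", "C..H", "--not", "D"]))
```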
## Cheat Sheet: Additional conversion examples
### Running a code formatter or linter on each file with some extension
Running some program on a subset of files is relatively natural in
filter-branch:
```shell
git filter-branch --tree-filter '
git ls-files -z "*.c" \
| xargs -0 -n 1 clang-format -style=file -i
'
```
though it has the disadvantage of running on every C file for every
commit in history, even if some commits do not modify any C files. This
means this kind of command can be excruciatingly slow.
The same functionality is slightly more involved in filter-repo for
two reasons:
- fast-export and fast-import split file contents and file names into
completely different data structures that aren't normally available
together
- to run a program on a file, you'll need to write the contents to
a file, execute the program on that file, and then read the contents
of the file back in
```shell
git filter-repo --file-info-callback '
if not filename.endswith(b".c"):
  return (filename, mode, blob_id)  # no changes
# Write the blob to a scratch file so the formatter can edit it in place
contents = value.get_contents_by_identifier(blob_id)
tmpfile = os.path.basename(filename)
with open(tmpfile, "wb") as f:
  f.write(contents)
subprocess.check_call(["clang-format", "-style=file", "-i", tmpfile])
with open(tmpfile, "rb") as f:
  contents = f.read()
# Store the reformatted contents as a new blob and point the file at it
new_blob_id = value.insert_file_with_contents(contents)
return (filename, mode, new_blob_id)
'
```
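The write, run, and read-back steps can also be factored into a small standalone helper that cleans up after itself. This is a sketch under stated assumptions: `run_in_place_tool` is a hypothetical name, and a tiny Python one-liner stands in for the formatter so the example runs without `clang-format` installed. Note that a tool such as `clang-format -style=file` looks for its configuration relative to the file's location, so where the temporary file lives may matter in practice.

```python
import os
import subprocess
import sys
import tempfile

def run_in_place_tool(contents: bytes, command: list, suffix: str = "") -> bytes:
    """Write contents to a temp file, run `command + [path]`, return the new contents."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(contents)
        subprocess.check_call(command + [path])
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)  # always remove the scratch file, even on failure

# Stand-in for a formatter: a program that upper-cases a file in place.
fake_formatter = [sys.executable, "-c",
                  "import sys; p = sys.argv[1]; d = open(p, 'rb').read(); "
                  "open(p, 'wb').write(d.upper())"]

formatted = run_in_place_tool(b"int main() { return 0; }\n", fake_formatter, suffix=".c")
print(formatted)
```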
However, one can write a script that uses filter-repo as a library to
simplify this, while also gaining filter-repo's automatic handling of
other concerns like rewriting commit IDs in commit messages or pruning
commits that become empty. In fact, one of the [contrib
demos](../contrib/filter-repo-demos),
[lint-history](../contrib/filter-repo-demos/lint-history), was
specifically written to make this kind of case really easy:
```shell
lint-history --relevant 'return filename.endswith(b".c")' \
clang-format -style=file -i
```
================================================
FILE: Documentation/examples-from-user-filed-issues.md
================================================
# Examples from user-filed issues
Lots of people have filed issues against git-filter-repo, and many times their
issue boils down to questions of "How do I?" or "Why doesn't this work?"
Below is a collection of example repository filterings in answer to their
questions, which may be of interest to others.
## Table of Contents
* [Adding files to root commits](#adding-files-to-root-commits)
* [Purge a large list of files](#purge-a-large-list-of-files)
* [Extracting a library from a repo](#Extracting-a-libary-from-a-repo)
* [Replace words in all commit messages](#Replace-words-in-all-commit-messages)
* [Only keep files from two branches](#Only-keep-files-from-two-branches)
* [Renormalize end-of-line characters and add a .gitattributes](#Renormalize-end-of-line-characters-and-add-a-gitattributes)
* [Remove spaces at the end of lines](#Remove-spaces-at-the-end-of-lines)
* [Having both exclude and include rules for filenames](#Having-both-exclude-and-include-rules-for-filenames)
* [Removing paths with a certain extension](#Removing-paths-with-a-certain-extension)
* [Removing a directory](#Removing-a-directory)
* [Convert from NFD filenames to NFC](#Convert-from-NFD-filenames-to-NFC)
* [Set the committer of the last few commits to myself](#Set-the-committer-of-the-last-few-commits-to-myself)
* [Handling special characters, e.g. accents and umlauts in names](#Handling-special-characters-eg-accents-and-umlauts-in-names)
* [Handling repository corruption](#Handling-repository-corruption)
* [Removing all files with a backslash in them](#Removing-all-files-with-a-backslash-in-them)
* [Replace a binary blob in history](#Replace-a-binary-blob-in-history)
* [Remove commits older than N days](#Remove-commits-older-than-N-days)
* [Replacing pngs with compressed alternative](#Replacing-pngs-with-compressed-alternative)
* [Updating submodule hashes](#Updating-submodule-hashes)
* [Using multi-line strings in callbacks](#Using-multi-line-strings-in-callbacks)
## Adding files to root commits
<!-- https://github.com/newren/git-filter-repo/issues/21 -->
Here's an example that will take `/path/to/existing/README.md` and
store it as `README.md` in the repository, and take
`/home/myusers/mymodule.gitignore` and store it as `src/.gitignore` in
the repository:
```
git filter-repo --commit-callback "if not commit.parents: commit.file_changes += [
FileChange(b'M', b'README.md', b'$(git hash-object -w '/path/to/existing/README.md')', b'100644'),
FileChange(b'M', b'src/.gitignore', b'$(git hash-object -w '/home/myusers/mymodule.gitignore')', b'100644')]"
```
Alternatively, you could also use the [insert-beginning](../contrib/filter-repo-demos/insert-beginning) contrib script:
```
mv /path/to/existing/README.md README.md
mv /home/myusers/mymodule.gitignore src/.gitignore
insert-beginning --file README.md
insert-beginning --file src/.gitignore
```
## Purge a large list of files
<!-- https://github.com/newren/git-filter-repo/issues/63 -->
List all the filenames in some file (one per line),
e.g. `../DELETED_FILENAMES.txt`, and then run
```
git filter-repo --invert-paths --paths-from-file ../DELETED_FILENAMES.txt
```
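If the list of filenames is generated by a script, note that the manual's
description of `--paths-from-file` says lines starting with `#` are
ignored and that `literal:`, `glob:`, and `regex:` are treated as
prefixes. A minimal sketch for writing such a file safely follows;
`paths_file_lines` and `write_paths_file` are hypothetical helper names,
and filenames containing `==>` or newlines are not handled here.

```python
SPECIAL_PREFIXES = ("literal:", "glob:", "regex:", "#")

def paths_file_lines(filenames):
    """Return one --paths-from-file line per filename, escaping special prefixes."""
    lines = []
    for name in filenames:
        # A name that looks like a directive or a comment must be marked literal
        if name.startswith(SPECIAL_PREFIXES):
            name = "literal:" + name
        lines.append(name)
    return lines

def write_paths_file(filenames, out_path):
    """Write the escaped lines to out_path, one per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for line in paths_file_lines(filenames):
            f.write(line + "\n")
```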
## Extracting a libary from a repo
<!-- https://github.com/newren/git-filter-repo/issues/80 -->
If you want to keep only some subdirectory
(e.g. `src/some-folder/some-feature/`), but don't want it moved to the
repository root (so that --subdirectory-filter isn't applicable) and
instead want it to become some other higher-level directory
(e.g. `src/`), run:
```
git filter-repo \
--path src/some-folder/some-feature/ \
--path-rename src/some-folder/some-feature/:src/
```
## Replace words in all commit messages
<!-- https://github.com/newren/git-filter-repo/issues/83 -->
Replace "stuff" in any commit message with "task".
```
git filter-repo --message-callback 'return message.replace(b"stuff", b"task")'
```
## Only keep files from two branches
<!-- https://github.com/newren/git-filter-repo/issues/91 -->
Let's say you know that the files currently present on two branches
are the only files that matter. Files that used to exist in either of
these branches, or files that only exist on some other branch, should
all be deleted from all versions of history. This can be accomplished
by getting a list of files from each branch, combining them, sorting
the list and picking out just the unique entries, then passing the
result to `--paths-from-file`:
```
git ls-tree -r --name-only ${BRANCH1} >../my-files
git ls-tree -r --name-only ${BRANCH2} >>../my-files
sort ../my-files | uniq >../my-relevant-files
git filter-repo --paths-from-file ../my-relevant-files
```
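The same list can be built from Python, which avoids problems with
unusual filenames by asking git for NUL-separated names. This is only a
sketch: `tracked_paths` and `merge_path_lists` are hypothetical helpers,
and `tracked_paths` assumes it is run from inside the repository. Note
that `--paths-from-file` expects bare paths, hence `--name-only`.

```python
import subprocess

def merge_path_lists(*path_lists):
    """Return the sorted union of several lists of paths."""
    return sorted(set().union(*path_lists))

def tracked_paths(branch):
    """List the paths present on a branch (run inside the repository)."""
    out = subprocess.check_output(
        ["git", "ls-tree", "-r", "--name-only", "-z", branch])
    # -z separates (and terminates) names with NUL bytes
    return [p for p in out.split(b"\0") if p]
```

The merged list could then be written out one path per line and passed
to `--paths-from-file`.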
## Renormalize end-of-line characters and add a .gitattributes
<!-- https://github.com/newren/git-filter-repo/issues/122 -->
```
contrib/filter-repo-demos/lint-history dos2unix
[edit .gitattributes]
contrib/filter-repo-demos/insert-beginning .gitattributes
```
## Remove spaces at the end of lines
<!-- https://github.com/newren/git-filter-repo/issues/145 -->
Removing all spaces at the end of lines of non-binary files, including
converting CRLF to LF:
```
git filter-repo --replace-text <(echo 'regex:[\r\t ]+(\n|$)==>\n')
```
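To preview what that expression does before rewriting history, the same
pattern can be tried with Python's `re` module on a byte string. This
sketch assumes filter-repo applies the expression with ordinary Python
regex semantics on bytes; `strip_trailing_whitespace` is a hypothetical
name used only here.

```python
import re

# Same pattern as in the --replace-text expression above, applied to bytes
TRAILING_WS = re.compile(rb"[\r\t ]+(\n|$)")

def strip_trailing_whitespace(data):
    """Replace trailing spaces, tabs, and carriage returns with a bare newline."""
    return TRAILING_WS.sub(b"\n", data)
```

Under these semantics, a final line that ends in whitespace but has no
newline gains one, since the `$` alternative is also replaced by `\n`.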
## Having both exclude and include rules for filenames
<!-- https://github.com/newren/git-filter-repo/issues/230 -->
If you want to have rules to both include and exclude filenames, you
can simply invoke `git filter-repo` multiple times. Alternatively,
you can do it in one run if you dispense with `--path` arguments and
instead use the more generic `--filename-callback`. For example to
include all files under `src/` except for `src/README.md`:
```
git filter-repo --filename-callback '
  if filename == b"src/README.md":
    return None
  if filename.startswith(b"src/"):
    return filename
  return None'
```
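Since the callback text is compiled into the body of a function that
receives `filename` as a bytestring, the logic can be checked on a few
sample paths before running the real rewrite. A minimal sketch, wrapping
the same body in an ordinary function (the name `filename_callback` is
chosen here for illustration):

```python
def filename_callback(filename):
    # Body copied from the --filename-callback example above
    if filename == b"src/README.md":
        return None          # excluded explicitly
    if filename.startswith(b"src/"):
        return filename      # kept unchanged
    return None              # everything outside src/ is dropped

for path in (b"src/main.c", b"src/README.md", b"docs/index.md"):
    print(path, "->", filename_callback(path))
```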
## Removing paths with a certain extension
<!-- https://github.com/newren/git-filter-repo/issues/274 -->
```
git filter-repo --invert-paths --path-glob '*.xsa'
```
or
```
git filter-repo --filename-callback '
  if filename.endswith(b".xsa"):
    return None
  return filename'
```
## Removing a directory
<!-- https://github.com/newren/git-filter-repo/issues/278 -->
```
git filter-repo --path node_modules/electron/dist/ --invert-paths
```
## Convert from NFD filenames to NFC
<!-- https://github.com/newren/git-filter-repo/issues/296 -->
Given that Mac does utf-8 normalization of filenames, and has
historically switched which kind of normalization it does, users may
have committed files with alternative normalizations to their
repository. If someone wants to convert filenames in NFD form to NFC,
they could run
```
git filter-repo --filename-callback '
  try:
    return subprocess.check_output("iconv -f utf-8-mac -t utf-8".split(),
                                   input=filename)
  except:
    return filename
'
```
or, instead of relying on the system iconv utility and spawning separate
processes, do it within Python:
```
git filter-repo --filename-callback '
  import unicodedata
  try:
    return unicodedata.normalize("NFC", filename.decode("utf-8")).encode("utf-8")
  except UnicodeDecodeError:
    return filename
'
```
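The effect of the normalization can be seen on a single name. In this
sketch, `to_nfc` is a hypothetical stand-alone version of the callback
body; the sample name is "e" followed by a combining acute accent (NFD),
which NFC turns into the single precomposed character.

```python
import unicodedata

def to_nfc(filename):
    """Return the NFC form of a UTF-8 filename; leave non-UTF-8 names alone."""
    try:
        return unicodedata.normalize("NFC", filename.decode("utf-8")).encode("utf-8")
    except UnicodeDecodeError:
        return filename

nfd_name = b"e\xcc\x81.txt"   # "e" + U+0301 COMBINING ACUTE ACCENT
print(to_nfc(nfd_name))       # the two code points collapse into U+00E9
```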
## Set the committer of the last few commits to myself
<!-- https://github.com/newren/git-filter-repo/issues/379 -->
```
git filter-repo --refs main~5..main --commit-callback '
  commit.committer_name = b"My Wonderful Self"
  commit.committer_email = b"my@self.org"
'
```
## Handling special characters, e.g. accents and umlauts in names
<!-- https://github.com/newren/git-filter-repo/issues/383 -->
Since characters like ë and á are multi-byte characters and python
won't allow you to directly place those in a bytestring
(e.g. `b"Raphaël González"` would result in a `SyntaxError: bytes can
only contain ASCII literal characters` error from Python), you just
need to make a normal (UTF-8) string and then convert to a bytestring
to handle these. For example, changing the author name and email
where the author email is currently `example@test.com`:
```
git filter-repo --refs main~5..main --commit-callback '
  if commit.author_email == b"example@test.com":
    commit.author_name = "Raphaël González".encode()
    commit.author_email = b"rgonzalez@test.com"
'
```
## Handling repository corruption
<!-- https://github.com/newren/git-filter-repo/issues/420 -->
First, run fsck to get a list of the corrupt objects, e.g.:
```
$ git fsck --full
error in commit 166f57b3fbe31257100361ecaf735f305b533b21: missingSpaceBeforeDate: invalid author/committer line - missing space before date
error in tree c15680eae81cc8539af7e7de766a8a7c13bd27df: duplicateEntries: contains duplicate file entries
Checking object directories: 100% (256/256), done.
```
Odds are you'll only see one type of corruption, but if you see
multiple, you can either do multiple filterings, or create replacement
objects for all the corrupt objects (both commits and trees), and then
do the filtering. Since the method for handling corrupt commits and
corrupt trees is slightly different, I'll give examples below for each.
### Handling repository corruption -- commit objects
Print out the corrupt object literally to a temporary file:
```
$ git cat-file -p 166f57b3fbe31257100361ecaf735f305b533b21 >tmp
```
Taking a look at the file would show, for example:
```
$ cat tmp
tree e1d871155fce791680ec899fe7869067f2b4ffd2
author My Name <my@email.com>1673287380 -0800
committer My Name <my@email.com> 1673287380 -0800

Initial
```
Edit that file to fix the error (in this case, the missing space
between author email and author date). In this case, it would look
like this after editing:
```
tree e1d871155fce791680ec899fe7869067f2b4ffd2
author My Name <my@email.com> 1673287380 -0800
committer My Name <my@email.com> 1673287380 -0800

Initial
```
Save the updated file, then use `git replace` to make a replace reference
for it.
```
$ git replace -f 166f57b3fbe31257100361ecaf735f305b533b21 $(git hash-object -t commit -w tmp)
```
Then remove the temporary file `tmp` and run `filter-repo` to consume
the replace reference and make it permanent:
```
$ rm tmp
$ git filter-repo --proceed
```
Note that if you have multiple corrupt objects, you need to create
replacements for all of them, and then run filter-repo. Leaving any
corrupt object without a replacement is likely to cause the filter-repo run
to fail.
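If many commits show the same `missingSpaceBeforeDate` error, the manual
edit described above could be scripted. The following is only a sketch:
`fix_missing_space_before_date` is a hypothetical helper, it handles
just this one kind of corruption, and it assumes the object text has the
usual layout of header lines, a blank line, and then the message.

```python
import re

# An author/committer line whose date directly follows the closing ">"
MISSING_SPACE = re.compile(
    rb"^((?:author|committer) .*>)(\d+ [+-]\d{4})$", re.MULTILINE)

def fix_missing_space_before_date(raw_commit):
    """Insert the missing space between the email and the date in the headers."""
    # Only touch the header block; leave the commit message untouched
    header, sep, message = raw_commit.partition(b"\n\n")
    return MISSING_SPACE.sub(rb"\1 \2", header) + sep + message
```

The result would be written to a file and passed to
`git hash-object -t commit -w` and `git replace -f` exactly as in the
steps above.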
### Handling repository corruption -- tree objects
<!-- GitHub customer example -->
Print out the corrupt object literally to a temporary file:
```
$ git cat-file -p c15680eae81cc8539af7e7de766a8a7c13bd27df >tmp
```
Taking a look at the file would show, for example:
```
$ cat tmp
100644 blob cd5ded43e86f80bfd384702e3f4cc7ce42de49f9 .gitignore
100644 blob 226febfcc91ec2c166a5a06834fb47c3553ec469 README.md
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 src
040000 tree df2b8fc99e1c1d4dbc0a854d9f72157f1d6ea078 src
040000 tree 99d732476808176bb9d73bcbfe2505e43d65cb4f t
```
Edit that file to fix the error (in this case, removing either the `src`
file (blob) or the `src` directory (tree)). In this case, it might look
like this after editing:
```
100644 blob cd5ded43e86f80bfd384702e3f4cc7ce42de49f9 .gitignore
100644 blob 226febfcc91ec2c166a5a06834fb47c3553ec469 README.md
040000 tree df2b8fc99e1c1d4dbc0a854d9f72157f1d6ea078 src
040000 tree 99d732476808176bb9d73bcbfe2505e43d65cb4f t
```
Save the updated file, then use `git mktree` to turn it into an actual
tree object:
```
$ git mktree <tmp
ace04f50a5d13b43e94c12802d3d8a6c66a35b1d
```
Now use the output of that command to create a replacement object for
the original corrupt object:
```
git replace -f c15680eae81cc8539af7e7de766a8a7c13bd27df ace04f50a5d13b43e94c12802d3d8a6c66a35b1d
```
Then remove the temporary file `tmp` and run `filter-repo` to consume
the replace reference and make it permanent:
```
$ rm tmp
$ git filter-repo --proceed
```
As mentioned with corrupt commit objects, if you have multiple corrupt
objects, as long as you create all the replacements for those objects
first, you only need to run filter-repo once.
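For large trees, spotting the duplicate entry by eye can be tedious. A
small sketch for finding repeated names in the `git cat-file -p` output
is shown below; `duplicate_tree_entries` is a hypothetical helper, and it
assumes each line has the form `<mode> <type> <hash> <name>` with
whitespace between the fields.

```python
from collections import Counter

def duplicate_tree_entries(listing):
    """Return names that occur more than once in `git cat-file -p <tree>` output."""
    names = []
    for line in listing.splitlines():
        if not line.strip():
            continue
        # Split off mode, type, and hash; the remainder is the entry name
        mode, objtype, objhash, name = line.split(None, 3)
        names.append(name)
    return [name for name, count in Counter(names).items() if count > 1]
```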
## Removing all files with a backslash in them
<!-- https://github.com/newren/git-filter-repo/issues/427 -->
```
git filter-repo --filename-callback 'return None if b"\\" in filename else filename'
```
## Replace a binary blob in history
<!-- https://github.com/newren/git-filter-repo/issues/436 -->
Let's say you committed a binary blob, perhaps an image file, with
sensitive data, and never modified it. You want to replace it with
the contents of some alternate file, currently found at
`../alternative-file.jpg` (it can have a different filename than what
is stored in the repository). Let's also say the hash of the old file
was `f4ede2e944868b9a08401dafeb2b944c7166fd0a`. You can replace it
with either
```
git filter-repo --blob-callback '
  if blob.original_id == b"f4ede2e944868b9a08401dafeb2b944c7166fd0a":
    blob.data = open("../alternative-file.jpg", "rb").read()
'
```
or
```
git replace -f f4ede2e944868b9a08401dafeb2b944c7166fd0a $(git hash-object -w ../alternative-file.jpg)
git filter-repo --proceed
```
## Remove commits older than N days
<!-- https://github.com/newren/git-filter-repo/issues/300 -->
This is such a bad use case. I'm tempted to leave it out, but it has
come up multiple times, and there are people who are totally fine with
changing every commit hash in their repository and throwing away
history periodically. First, identify an ${OLD_COMMIT} that you want
to be a new root commit, then run:
```
git replace --graft ${OLD_COMMIT}
git filter-repo --proceed
```
(The trick here is that `git replace --graft` takes a commit to replace, and
a list of new parents for the commit. Since ${OLD_COMMIT} is the final
positional argument, it means the list of new parents is an empty list, i.e.
we are turning it into a new root commit.)
## Replacing pngs with compressed alternative
<!-- https://github.com/newren/git-filter-repo/issues/492 -->
Let's say you committed thousands of pngs that were poorly compressed,
but later aggressively recompressed the pngs and committed and pushed.
Unfortunately, clones are slow because they still contain the poorly
compressed pngs and you'd like to rewrite history to pretend that the
aggressively compressed versions were used when the files were first
introduced.
First, take a look at the commit that aggressively recompressed the pngs:
```
git log -1 --raw --no-abbrev ${COMMIT_WHERE_YOU_COMPRESSED_PNGS}
```
that will show output like
```
:100755 100755 edf570fde099c0705432a389b96cb86489beda09 9cce52ae0806d695956dcf662cd74b497eaa7b12 M resources/foo.png
:100755 100755 644f7c55e1a88a29779dc86b9ff92f512bf9bc11 88b02e9e45c0a62db2f1751b6c065b0c2e538820 M resources/bar.png
```
Use that to make a --file-info-callback to fix up the original versions:
```
git filter-repo --file-info-callback '
  if filename == b"resources/foo.png" and blob_id == b"edf570fde099c0705432a389b96cb86489beda09":
    blob_id = b"9cce52ae0806d695956dcf662cd74b497eaa7b12"
  if filename == b"resources/bar.png" and blob_id == b"644f7c55e1a88a29779dc86b9ff92f512bf9bc11":
    blob_id = b"88b02e9e45c0a62db2f1751b6c065b0c2e538820"
  return (filename, mode, blob_id)
'
```
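With thousands of pngs, writing one `if` per file by hand is not
practical. The `--raw` output shown above can instead be turned into a
lookup table. This is a sketch only: `blob_replacements` is a
hypothetical helper, it considers only plain modifications (status `M`),
and it assumes paths are not quoted by git.

```python
def blob_replacements(raw_output):
    """Map (path, old blob hash) to new blob hash from `git log --raw --no-abbrev`."""
    mapping = {}
    for line in raw_output.splitlines():
        if not line.startswith(":"):
            continue  # skip the commit header and message lines
        old_mode, new_mode, old_hash, new_hash, status, path = line[1:].split(None, 5)
        if status == "M":
            mapping[(path.encode(), old_hash.encode())] = new_hash.encode()
    return mapping
```

Inside a callback, the table would then be consulted with something like
`blob_id = mapping.get((filename, blob_id), blob_id)`.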
## Updating submodule hashes
<!-- https://github.com/newren/git-filter-repo/issues/537 -->
Let's say you have a repo with a submodule at src/my-submodule, and
that you feel the wrong commit hashes of the submodule were committed
within your project and you want them updated according to the
following table:
```
old new
edf570fde099c0705432a389b96cb86489beda09 9cce52ae0806d695956dcf662cd74b497eaa7b12
644f7c55e1a88a29779dc86b9ff92f512bf9bc11 88b02e9e45c0a62db2f1751b6c065b0c2e538820
```
You could do this as follows:
```
git filter-repo --file-info-callback '
  if filename == b"src/my-submodule" and blob_id == b"edf570fde099c0705432a389b96cb86489beda09":
    blob_id = b"9cce52ae0806d695956dcf662cd74b497eaa7b12"
  if filename == b"src/my-submodule" and blob_id == b"644f7c55e1a88a29779dc86b9ff92f512bf9bc11":
    blob_id = b"88b02e9e45c0a62db2f1751b6c065b0c2e538820"
  return (filename, mode, blob_id)
'
```
Yes, `blob_id` is kind of a misnomer here since the file's hash
actually refers to a commit from the sub-project. But `blob_id` is
the name of the parameter passed to the --file-info-callback, so that
is what must be used.
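When the table of old and new hashes grows, a dictionary lookup reads
better than a chain of `if` statements. The sketch below expresses the
same logic as a plain function so it can be tested on its own;
`SUBMODULE_PATH`, `NEW_HASH`, and the function name are illustrative,
and the real callback is given further names (such as `value`) that are
not needed here.

```python
SUBMODULE_PATH = b"src/my-submodule"

# old submodule commit hash -> new submodule commit hash
NEW_HASH = {
    b"edf570fde099c0705432a389b96cb86489beda09":
        b"9cce52ae0806d695956dcf662cd74b497eaa7b12",
    b"644f7c55e1a88a29779dc86b9ff92f512bf9bc11":
        b"88b02e9e45c0a62db2f1751b6c065b0c2e538820",
}

def file_info_callback(filename, mode, blob_id):
    """Swap known old submodule hashes for new ones; leave everything else alone."""
    if filename == SUBMODULE_PATH:
        blob_id = NEW_HASH.get(blob_id, blob_id)
    return (filename, mode, blob_id)
```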
## Using multi-line strings in callbacks
<!-- https://lore.kernel.org/git/CABPp-BFqbiS8xsbLouNB41QTc5p0hEOy-EoV0Sjnp=xJEShkTw@mail.gmail.com/ -->
Since the text for callbacks has spaces inserted at the front of every
line, multi-line strings are normally munged. For example, the command
```
git filter-repo --blob-callback '
  blob.data = bytes("""\
This is the new
file that I am
replacing every blob
with. It is great.\n""", "utf-8")
'
```
would result in a file with extra spaces at the front of every line:
```
  This is the new
  file that I am
  replacing every blob
  with. It is great.
```
The two spaces at the beginning of every line were inserted into every
line of the callback when trying to compile it as a function.
However, you can use textwrap.dedent to fix this; in fact, using it
will even allow you to add more leading space so that it looks nicely
indented. For example:
```
git filter-repo --blob-callback '
  import textwrap
  blob.data = bytes(textwrap.dedent("""\
    This is the new
    file that I am
    replacing every blob
    with. It is great.\n"""), "utf-8")
'
```
That will result in a file with contents
```
This is the new
file that I am
replacing every blob
with. It is great.
```
which has no leading spaces on any lines.
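The interaction between the inserted spaces and `textwrap.dedent` can be
reproduced in plain Python. This is a simulation only: `as_seen_by_python`
is a hypothetical helper that mimics the two-space prefix described
above, not the code filter-repo itself uses.

```python
import textwrap

def as_seen_by_python(text, inserted="  "):
    """Prefix every line, as happens to the callback text before compiling."""
    return "".join(inserted + line for line in text.splitlines(keepends=True))

# String contents as typed in the callback, indented by four spaces
original = "    This is the new\n    file that I am\n"

seen = as_seen_by_python(original)   # every line now has six leading spaces
cleaned = textwrap.dedent(seen)      # the common leading whitespace is removed
```

Because `dedent` strips whatever leading whitespace all lines share, it
does not matter how many spaces were typed or inserted.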
================================================
FILE: Documentation/git-filter-repo.txt
================================================
// This file is NOT the documentation; it's the *source code* for it.
// Please follow the "user manual" link under
// https://github.com/newren/git-filter-repo#how-do-i-use-it
// to access the actual documentation, or view another site that
// has compiled versions available, such as:
// https://www.mankier.com/1/git-filter-repo
git-filter-repo(1)
==================
NAME
----
git-filter-repo - Rewrite repository history
SYNOPSIS
--------
[verse]
'git filter-repo' --analyze
'git filter-repo' [<path_filtering_options>] [<content_filtering_options>]
[<ref_renaming_options>] [<commit_message_filtering_options>]
[<name_or_email_filtering_options>] [<parent_rewriting_options>]
[<generic_callback_options>] [<miscellaneous_options>]
DESCRIPTION
-----------
Rapidly rewrite entire repository history using user-specified filters.
This is a destructive operation which should not be used lightly; it
writes new commits, trees, tags, and blobs corresponding to (but
filtered from) the original objects in the repository, then deletes the
original history and leaves only the new. See <<DISCUSSION>> for more
details on the ramifications of using this tool. Several different
types of history rewrites are possible; examples include (but are not
limited to):
* stripping large files (or large directories or large extensions)
* stripping unwanted files by path
* extracting wanted paths and their history (stripping everything else)
* restructuring the file layout (such as moving all files into a
subdirectory in preparation for merging with another repo, making a
subdirectory become the new toplevel directory, or merging two
directories with independent filenames into one directory)
* renaming tags (also often in preparation for merging with another repo)
* replacing or removing sensitive text such as passwords
* making mailmap rewriting of user names or emails permanent
* making grafts or replacement refs permanent
* rewriting commit messages
Additionally, several concerns are handled automatically (many of these
can be overridden, but they are all on by default):
* rewriting (possibly abbreviated) hashes in commit messages to
refer to the new post-rewrite commit hashes
* pruning commits which become empty due to the above filters (also
handles edge cases like pruning of merge commits which become
degenerate and empty)
* rewriting stashes
* baking the changes made by refs/replace/ refs into the permanent
history and removing the replace refs
* stripping of original history to avoid mixing old and new history
* repacking the repository post-rewrite to shrink the repo for the
user
And additional facilities are available via a config option
* creating replace-refs (see linkgit:git-replace[1]) for old commit
hashes, which if manually pushed and fetched will allow users to
continue to refer to new commits using (unabbreviated) old commit
IDs
Also, it's worth noting that there is an important safety mechanism:
* abort if run from a repo that is not a fresh clone (to prevent
accidental data loss from rewriting local history that doesn't
exist anywhere else). See <<FRESHCLONE>>.
For those who know that there is large unwanted stuff in their history
and want help finding it, this command also
* provides an option to analyze a repository and generate reports that
can be useful in determining what to filter (or in determining
whether a separate filtering command was successful).
See also <<VERSATILITY>>, <<DISCUSSION>>, <<EXAMPLES>>, and
<<INTERNALS>>.
OPTIONS
-------
Analysis Options
~~~~~~~~~~~~~~~~
--analyze::
Analyze repository history and create a report that may be
useful in determining what to filter in a subsequent run (or
in determining if a previous filtering command did what you
wanted). Will not modify your repo.
Filtering based on paths (see also --filename-callback)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
These options specify the paths to select. Note that much like git
itself, renames are NOT followed so you may need to specify multiple
paths, e.g. `--path olddir/ --path newdir/`
--invert-paths::
Invert the selection of files from the specified
--path-{match,glob,regex} options below, i.e. only select
files matching none of those options.
--path-match <dir_or_file>::
--path <dir_or_file>::
Exact paths (files or directories) to include in filtered
history. Multiple --path options can be specified to get a
union of paths.
--path-glob <glob>::
Glob of paths to include in filtered history. Multiple
--path-glob options can be specified to get a union of paths.
--path-regex <regex>::
Regex of paths to include in filtered history. Multiple
--path-regex options can be specified to get a union of paths.
--use-base-name::
Match on file base name instead of full path from the top of
the repo. Incompatible with --path-rename, and incompatible
with matching against directory names.
Renaming based on paths (see also --filename-callback)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Note: if you combine path filtering with path renaming, be aware that
a rename directive does not select paths, it only says how to
rename paths that are selected with the filters.
--path-rename <old_name:new_name>::
--path-rename-match <old_name:new_name>::
Path to rename; if filename or directory matches <old_name>
rename to <new_name>. Multiple --path-rename options can be
specified.
Path shortcuts
~~~~~~~~~~~~~~
--paths-from-file <filename>::
Specify several path filtering and renaming directives, one
per line. Lines with `==>` in them specify path renames, and
lines can begin with `literal:` (the default), `glob:`, or
`regex:` to specify different matching styles. Blank lines
and lines starting with a `#` are ignored (if you have a
filename that you want to filter on that starts with
`literal:`, `#`, `glob:`, or `regex:`, then prefix the line
with 'literal:').
--subdirectory-filter <directory>::
Only look at history that touches the given subdirectory and
treat that directory as the project root. Equivalent to using
`--path <directory>/ --path-rename <directory>/:`
--to-subdirectory-filter <directory>::
Treat the project root as if it were under
<directory>. Equivalent to using `--path-rename :<directory>/`
Content editing filters (see also --blob-callback)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--replace-text <expressions_file>::
A file with expressions that, if found, will be replaced. By
default, each expression is treated as literal text, but
`regex:` and `glob:` prefixes are supported. You can end the
line with `==>` and some replacement text to choose a
replacement other than the default of `***REMOVED***`.
--strip-blobs-bigger-than <size>::
Strip blobs (files) bigger than specified size (e.g. `5M`,
`2G`, etc)
--strip-blobs-with-ids <blob_id_filename>::
Read git object ids from each line of the given file, and
strip all of them from history
Renaming of refs (see also --refname-callback)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--tag-rename <old:new>::
Rename tags starting with <old> to start with <new>. For example,
--tag-rename foo:bar will rename tag foo-1.2.3 to bar-1.2.3;
either <old> or <new> can be empty.
Filtering of commit messages (see also --message-callback)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--replace-message <expressions_file>::
A file with expressions that, if found in commit or tag
messages, will be replaced. This file uses the same syntax as
--replace-text.
--preserve-commit-hashes::
By default, since commits are rewritten and thus gain new
hashes, references to old commit hashes in commit messages are
replaced with new commit hashes (abbreviated to the same
length as the old reference). Use this flag to turn off
updating commit hashes in commit messages.
--preserve-commit-encoding::
Do not reencode commit messages into UTF-8. By default, if the
commit object specifies an encoding for the commit message,
the message is re-encoded into UTF-8.
Filtering of names & emails (see also --name-callback and --email-callback)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--mailmap <filename>::
Use specified mailmap file (see linkgit:git-shortlog[1] for details
on the format) when rewriting author, committer, and tagger names
and emails. If the specified file is part of git history,
historical versions of the file will be ignored; only the current
contents are consulted.
--use-mailmap::
Same as: '--mailmap .mailmap'
Parent rewriting
~~~~~~~~~~~~~~~~
--replace-refs {delete-no-add, delete-and-add, update-no-add, update-or-add, update-and-add, old-default}::
How to handle replace refs (see git-replace(1)). Replace refs
can be added during the history rewrite as a way to allow
users to pass old commit IDs (from before git-filter-repo was
run) to git commands and have git know how to translate those
old commit IDs to the new (post-rewrite) commit IDs. Also,
replace refs that existed before the rewrite can either be
deleted or updated. The choices to pass to --replace-refs
thus need to specify both what to do with existing refs and
what to do with commit rewrites. Thus 'update-and-add' means
to update existing replace refs, and for any commit rewrite
(even if already pointed at by a replace ref) add a new
refs/replace/ reference to map from the old commit ID to the
new commit ID. The default is update-no-add, meaning update
existing replace refs but do not add any new ones. There is
also a special 'old-default' option for picking the default
used in versions prior to git-filter-repo-2.45, namely
'update-and-add' upon the first run of git-filter-repo in a
repository and 'update-or-add' if running git-filter-repo
again on a repository.
--prune-empty {always, auto, never}::
Whether to prune empty commits. 'auto' (the default) means
only prune commits which become empty (not commits which were
empty in the original repo, unless their parent was
pruned). When the parent of a commit is pruned, the first
non-pruned ancestor becomes the new parent.
--prune-degenerate {always, auto, never}::
Since merge commits are needed for history topology, they are
typically exempt from pruning. However, they can become
degenerate with the pruning of other commits (having fewer
than two parents, having one commit serve as both parents, or
having one parent as the ancestor of the other.) If such merge
commits have no file changes, they can be pruned. The default
('auto') is to only prune empty merge commits which become
degenerate (not which started as such).
--no-ff::
Even if the first parent is or becomes an ancestor of another
parent, do not prune it. This modifies how --prune-degenerate
behaves, and may be useful in projects that always use merge
--no-ff.
Generic callback code snippets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--filename-callback <function_body>::
Python code body for processing filenames; see <<CALLBACKS>>.
--message-callback <function_body>::
Python code body for processing messages (both commit messages and
tag messages); see <<CALLBACKS>>.
--name-callback <function_body>::
Python code body for processing names of people; see <<CALLBACKS>>.
--email-callback <function_body>::
Python code body for processing email addresses; see
<<CALLBACKS>>.
--refname-callback <function_body>::
Python code body for processing refnames; see <<CALLBACKS>>.
--file-info-callback <function_body>::
Python code body for processing the combination of filename, mode,
and associated file contents; see <<CALLBACKS>>. Note that when
--file-info-callback is specified, any replacements specified by
--replace-text will not be automatically applied; instead, you
have control within the --file-info-callback to choose which files
to apply those transformations to.
--blob-callback <function_body>::
Python code body for processing blob objects; see <<CALLBACKS>>.
--commit-callback <function_body>::
Python code body for processing commit objects; see <<CALLBACKS>>.
--tag-callback <function_body>::
Python code body for processing tag objects; see <<CALLBACKS>>.
Note that lightweight tags have no tag object and thus are not
handled by this callback. The only thing you really could do with a
lightweight tag is rename it, but for that you should see
--refname-callback instead.
--reset-callback <function_body>::
Python code body for processing reset objects; see <<CALLBACKS>>.
Sensitive Data Removal
~~~~~~~~~~~~~~~~~~~~~~
--sensitive-data-removal::
--sdr::
This rewrite is intended to remove sensitive data from a repository.
Gather extra information from the rewrite needed to provide
additional instructions on how to clean up other copies. This
includes:
- Fetching all refs, so that if refs outside of branches and tags
also reference the sensitive data, they can be cleaned up too
Note that if you have any local-only changes (i.e. un-pushed
changes) in your repository, on any branch or ref, this fetch step
may discard them. Working in a fresh clone avoids this problem;
see also the --no-fetch option if you don't want to work with a
fresh clone and you have important local-only changes.
- Tracking and reporting on the first changed commit(s)
- Tracking and reporting whether any LFS objects become orphaned by
the rewrite, so they can be removed
- Providing additional instructions at the end on how to clean up
the repository you cloned from, and other clones of the repo
--no-fetch::
Avoid the "fetch all refs" step with --sensitive-data-removal, and
thus avoid overwriting local-only changes in the repository, but at
the risk of leaving the sensitive data in other refs in the source
repository. This option is implied by --partial or any flag that
implies --partial.
Location to filter from/to
~~~~~~~~~~~~~~~~~~~~~~~~~~
NOTE: Specifying alternate source or target locations implies
--partial. However, unlike normal uses of --partial, this doesn't
risk mixing old and new history since the old and new histories are in
different repositories.
--source <source>::
Git repository to read from
--target <target>::
Git repository to overwrite with filtered history
Miscellaneous options
~~~~~~~~~~~~~~~~~~~~~
--help::
-h::
Show a help message and exit.
--force::
-f::
Ignore fresh clone checks and rewrite history (an irreversible
operation, especially since it by default ends with an
immediate pruning of reflogs and old objects). See
<<FRESHCLONE>>. Note that when cloning repos on a local
filesystem, it is better to pass `--no-local` to git clone
than passing `--force` to git-filter-repo.
--partial::
Do a partial history rewrite, resulting in a mixture of old and
new history. This disables rewriting refs/remotes/origin/* to
refs/heads/*, disables removing of the 'origin' remote, disables
removing unexported refs, disables expiring the reflog, and
disables the automatic post-filter gc. Also, this modifies
--tag-rename and --refname-callback options such that instead of
replacing old refs with new refnames, it will instead create new
refs and keep the old ones around. Use with caution.
--refs <refs+>::
Limit history rewriting to the specified refs. Implies --partial.
In addition to the normal caveats of --partial (mixing old and new
history, no automatic remapping of refs/remotes/origin/* to
refs/heads/*, etc.), this also may cause problems for pruning of
degenerate empty merge commits when negative revisions are
specified.
--dry-run::
Do not change the repository. Run `git fast-export` and filter its
output, and save both the original and the filtered version for
comparison. This also disables rewriting commit messages due to
not knowing new commit IDs and disables filtering of some empty
commits due to inability to query the fast-import backend.
--debug::
Print additional information about operations being performed and
commands being run. (If used together with --dry-run, shows
extra information about what would be run).
--stdin::
Instead of running `git fast-export` and filtering its output,
filter the fast-export stream from stdin. The stdin must be in
the expected input format (e.g. it needs to include original-oid
directives).
--quiet::
Pass --quiet to other git commands called.
OUTPUT
------
Every time filter-repo is run, files are created in the `.git/filter-repo/`
directory. These files are updated or overwritten on every run.
Commit map
~~~~~~~~~~
The `$GIT_DIR/filter-repo/commit-map` file contains a mapping of how all
commits were (or were not) changed.
* A header is the first line with the text "old" and "new"
* Commit mappings are in no particular order
* All commits in range of the rewrite will be listed, even commits
that are unchanged (e.g. because the commit pre-dated when the files
the filtering operation is removing were introduced to the repo).
* An all-zeros hash, or null SHA, represents a non-existent object.
When in the "new" column, this means the commit was removed
entirely.
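As a rough illustration of how the commit-map could be consumed, the following Python sketch parses text in the format described above (a header line followed by whitespace-separated old/new pairs) and lists the commits that were removed. The helper names `parse_commit_map`, `is_null` and `removed_commits` are made up for this example, and the sample hashes are fabricated placeholders rather than real output.

```python
def parse_commit_map(text):
    """Return {old_hash: new_hash} from commit-map text, skipping the header."""
    mapping = {}
    for line in text.splitlines()[1:]:   # the first line is the "old new" header
        fields = line.split()
        if len(fields) == 2:             # ignore blank or malformed lines
            mapping[fields[0]] = fields[1]
    return mapping


def is_null(sha):
    """True for an all-zeros hash, whatever its length."""
    return bool(sha) and set(sha) == {"0"}


def removed_commits(mapping):
    """Old hashes whose "new" column is the null SHA, i.e. removed commits."""
    return sorted(old for old, new in mapping.items() if is_null(new))


# Fabricated sample data, for illustration only.
SAMPLE = "\n".join([
    "old" + " " * 38 + "new",
    "a" * 40 + " " + "b" * 40,   # rewritten
    "c" * 40 + " " + "c" * 40,   # unchanged
    "d" * 40 + " " + "0" * 40,   # removed
])

print(removed_commits(parse_commit_map(SAMPLE)))
```

Because the mappings are in no particular order, the sketch loads them into a dictionary instead of relying on line position.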
Reference map
~~~~~~~~~~~~~
The `$GIT_DIR/filter-repo/ref-map` file contains a mapping of which local
references were (or were not) changed.
* A header is the first line with the text "old", "new" and "ref"
* Reference mappings are sorted by ref
* An all-zeros hash, or null SHA, represents a non-existent object.
When in the "new" column, this means the ref was removed entirely.
Changed References
~~~~~~~~~~~~~~~~~~
The `$GIT_DIR/filter-repo/changed-refs` file contains a list of refs that
were changed.
* No header is provided
* Lists the subset of refs from ref-map for which old != new
* While unnecessary since this provides no new information over ref-map,
it does make it easier to quickly determine which refs were changed by
the rewrite.
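To make the relationship between ref-map and changed-refs concrete, here is a minimal Python sketch that derives the changed refs from ref-map text. It assumes three whitespace-separated columns after a header line, as described above. The function name `changed_refs` and the sample data are hypothetical.

```python
def changed_refs(ref_map_text):
    """Return the refs whose old and new hashes differ, in file order."""
    changed = []
    for line in ref_map_text.splitlines()[1:]:   # skip the "old new ref" header
        fields = line.split()
        if len(fields) == 3 and fields[0] != fields[1]:
            changed.append(fields[2])
    return changed


# Fabricated sample data, for illustration only.
SAMPLE_REF_MAP = "\n".join([
    "old new ref",
    "1" * 40 + " " + "2" * 40 + " refs/heads/main",          # rewritten
    "4" * 40 + " " + "0" * 40 + " refs/heads/only-secrets",  # removed
    "3" * 40 + " " + "3" * 40 + " refs/tags/v1.0",           # unchanged
])

print(changed_refs(SAMPLE_REF_MAP))
```

A removed ref (null SHA in the "new" column) also counts as changed here, since old != new.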
First Changed Commits
~~~~~~~~~~~~~~~~~~~~~
The `$GIT_DIR/filter-repo/first-changed-commits` file contains a list of the
first commit(s) changed by the filtering operation. These are the commits
that got rewritten and which had no parents that were also rewritten.
So, for example if you had commits
A1-B1-C1-D1-E1
before running git-filter-repo, and afterward you had commits
A1-B2-C2-D2-E2
then the First Changed Commits file would contain just one line, which
would be the hash of B2.
In most cases, there will only be one commit listed, but if you had
multiple root commits or a non-linear history where the commits on
those diverging histories were the first ones modified, then there
could be multiple first changed commits and they will each be listed
on separate lines.
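The definition above (rewritten commits with no rewritten parents) can be sketched in a few lines of Python. This is only a model of the rule, not filter-repo's implementation. The function name `first_changed` is hypothetical, and the single-letter labels stand for positions in history rather than hashes.

```python
def first_changed(parents, rewritten):
    """Return rewritten commits none of whose parents were rewritten.

    parents:   {commit: [parent, ...]} describing the history graph
    rewritten: set of commits that the filtering operation changed
    """
    return sorted(
        commit for commit in rewritten
        if not any(parent in rewritten for parent in parents.get(commit, []))
    )


# Linear history A-B-C-D-E where everything from B onward was rewritten.
linear = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["D"]}
print(first_changed(linear, {"B", "C", "D", "E"}))

# Two root commits X and Y, both rewritten, later merged in M.
two_roots = {"X": [], "Y": [], "M": ["X", "Y"]}
print(first_changed(two_roots, {"X", "Y", "M"}))
```

The second call illustrates the multiple-roots case, where more than one first changed commit is reported.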
Already Ran
~~~~~~~~~~~
The `$GIT_DIR/filter-repo/already_ran` file records that
git-filter-repo has been run. When this file is present, future runs will
be treated as an extension of the previous filtering operation.
Concretely, this means:
* The "Fresh Clone" check is bypassed
This is done because past runs would cause the repository to no longer
look like a fresh clone, and thus fail the fresh clone check, but doing
filtering via multiple invocations of git-filter-repo is an intended
and supported use case. You already passed or bypassed the "Fresh Clone"
check on your initial run.
* The commit-map and ref-map files above will be updated rather than
simply rewritten.
In other words, if the first filter-repo invocation rewrote commit
A to commit B, and the second filter-repo invocation rewrote
commit B to commit C, then the second run would have an "A C"
entry rather than a "B C" entry for the changed commit.
* The first changed commit(s) (reported when using the
--sensitive-data-removal option) will be the first original commit
modified, not the first intermediate commit modified.
In more detail, if the repository originally had the following commits:
A1-B1-C1-D1-E1
and the first invocation of filter-repo changed this to
A1-B1-C2-D2-E2
then the first run would report "C1" as the first changed commit. If
a second filter-repo run further changed this to
A1-B1-C2-D3-E3
then it would report "C1" as the first changed commit, not "D2",
because it is comparing to the original commits rather than the
intermediate ones.
However, if the already_ran file exists but is older than 1 day when
git-filter-repo is invoked, the user will be prompted for whether the new run
should be considered a continuation of the previous run. If they do not
answer in the affirmative, then the above three bullets will not apply.
This prompt exists because users might do a history rewrite in a repository,
forget about it and leave the $GIT_DIR/filter-repo directory around, and
then some months or years later need to do another rewrite. If commits
have been made public and shared from the previous rewrite, then the next
filter-repo run should not be considered a continuation of the previous
filtering run.
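The way a continued run updates the commit-map (recording "A C" rather than "B C") amounts to composing two mappings. A minimal Python sketch of that idea follows. The function name `update_commit_map` is hypothetical, and the handling of commits that only appear in the later run is an assumption for illustration.

```python
def update_commit_map(previous, latest):
    """Compose an earlier commit-map with the map from a later run.

    previous: {original: intermediate} from the earlier run(s)
    latest:   {intermediate: final} from the current run
    """
    updated = {}
    for original, intermediate in previous.items():
        # Follow the chain: original -> intermediate -> final.
        updated[original] = latest.get(intermediate, intermediate)
    # Assumption: commits first seen in the latest run are recorded as-is.
    known_intermediates = set(previous.values())
    for old, new in latest.items():
        if old not in known_intermediates:
            updated.setdefault(old, new)
    return updated


first_run = {"A": "B"}
second_run = {"B": "C"}
print(update_commit_map(first_run, second_run))
```

If the user declines to treat the new run as a continuation, no such composition applies and the later map simply describes the later run on its own.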
Original LFS Objects
~~~~~~~~~~~~~~~~~~~~
When running with the --sensitive-data-removal flag, and LFS is in use by the
repository, the `$GIT_DIR/filter-repo/original_lfs_objects` file contains a list of
LFS objects referenced by the repository before the rewrite, in sorted order.
Orphaned LFS Objects
~~~~~~~~~~~~~~~~~~~~
When running with the --sensitive-data-removal flag, and LFS is in use by the
repository, the `$GIT_DIR/filter-repo/orphaned_lfs_objects` file contains a list of
LFS objects that used to be referenced by the repository but no longer are after
git-filter-repo has run. Objects appear in sorted order.
[[FRESHCLONE]]
FRESH CLONE SAFETY CHECK AND --FORCE
------------------------------------
Since filter-repo does irreversible rewriting of history, it is
important to avoid making changes to a repo for which the user doesn't
have a good backup. The primary defense mechanism is to simply
educate users and rely on them to be good stewards of their data; thus
there are several warnings in the documentation about how filter-repo
rewrites history.
However, as a service to users, we would like to provide an additional
safety check beyond the documentation. There isn't a good way to
check if the user has a good backup, but we can ask a related question
that is an imperfect but quite reasonable proxy: "Is this repository a
fresh clone?" Unfortunately, that is also a question we can't get a
perfect answer to; git provides no way to answer that question.
However, there are approximately a dozen things that I found that seem
to always be true of brand new clones (assuming they are either clones
of remote repositories or are made with the `--no-local` flag), and I
check for all of those.
These checks can have both false positives and false negatives.
Someone might have a perfectly good backup of their repo without it
actually being a fresh clone -- but there's no way for filter-repo to
know that. Conversely, someone could look at all things that
filter-repo checks for in its safety checks and then just tweak their
non-backed-up repository to satisfy those conditions (though it would
take a fair amount of effort, and it's astronomically unlikely that a
repo that isn't a fresh clone randomly happens to match all the
criteria). In practice, the safety checks filter-repo uses seem to be
really good at avoiding people accidentally running filter-repo on a
repository that they shouldn't be running it on. It even caught me
once when I did mean to run filter-repo but was in a different
directory than I thought I was.
In short, it's perfectly fine to use `--force` to override the safety
checks as long as you're okay with filter-repo irreversibly rewriting
the contents of the current repository. It is a really bad idea to
get in the habit of always specifying `--force`; if you do, one day
you will run one of your commands in the wrong directory like I did,
and you won't have the safety check anymore to bail you out. Also, it
is definitely NOT okay to recommend `--force` on forums, Q&A sites, or
in emails to other users without first carefully explaining that
`--force` means putting your repositories' data at risk. I am
especially bothered by people who suggest the flag when it clearly is
NOT needed; they are needlessly putting other people's data at risk.
[[VERSATILITY]]
VERSATILITY
-----------
filter-repo has a hierarchy of capabilities on the spectrum from
easy-to-use convenience flags that perform pre-defined types of filtering, to
choices that provide lots of flexibility in controlling how filtering
occurs. This spectrum includes the following:
* Convenience flags making common types of history rewriting simple (e.g.
--path, --strip-blobs-bigger-than, --replace-text, --mailmap)
* Options which are shorthand for others or which provide greater control
than others (e.g. --subdirectory-filter could just be written using
both a path selection (--path) and a path rename (--path-rename)
filter; --paths-from-file can handle all other --path* options and more
such as regex renaming of paths)
* Generic python callbacks for handling a certain type of data (the
filename, message, name, email, and refname callbacks)
* Generic python callbacks for handling fundamental git objects, allowing
greater control over the combination of data types the object holds
(the commit, tag, blob, and reset callbacks)
* The ability to import filter-repo as a module in a python program and
use its classes and functions for even greater control and flexibility
while still leveraging lots of basic capabilities. One can even use
this to write new tools with a completely different interface.
For more information about callbacks, see <<CALLBACKS>>. For examples on
writing python programs that import filter-repo as a module to create new
history rewriting tools, look at the contrib/filter-repo-demos/ directory.
That directory includes, among other examples, a reimplementation of
git-filter-branch which is faster than git-filter-branch, and a
reimplementation of BFG Repo Cleaner with several bug fixes and new
features.
[[DISCUSSION]]
DISCUSSION
----------
Using filter-repo is relatively simple, but rewriting history is part of
a larger discussion in terms of collaboration. When you rewrite
history, the old and new histories are no longer compatible; if you push
this history somewhere for others to view, it will look as though you've
done a rebase of all branches and tags. Make sure you are familiar with
the "RECOVERING FROM UPSTREAM REBASE" section of linkgit:git-rebase[1]
(and in particular, "The hard case") before proceeding, in addition to
this section.
Steps to use git-filter-repo as part of the bigger picture of doing a
history rewrite are roughly as follows:
1. Create a clone of your repository. You may pass `--bare` or
`--mirror` to `git clone`, if you prefer. You should pass
`--no-local` if the repository you are cloning from is on the local
filesystem. Avoid other flags; some might confuse the fresh clone
check, and others could cause parts of the data to be missing that
are needed for the rewrite.
2. (Optional) Run `git filter-repo --analyze`. This will create a
directory of reports mentioning multiple things: (a) paths that have
existed over time in your repo, (b) renames that have occurred in
your repo and (c) sizes of objects aggregated by
path/directory/extension/blob-id. This information may be useful in
choosing how to filter your repo. It can also be useful to re-run
--analyze after filtering to verify the changes look correct.
3. Before rewriting the history of your local copy with git-filter-repo,
determine where you will push the rewritten history to when you are
done. In the special case that you are trying to remove sensitive
data from an existing repository, you will want to push it back where
you cloned from, as well as clean up all other clones/copies of the
repo. If you will be pushing back to the repository you cloned from,
you will want to use the --sensitive-data-removal option and see the
Sensitive Data Removal section below. In most cases not dealing with
sensitive data removal, you will want to push to a new repo, because:
* Even after you rewrite history and push it back, other people who
previously cloned from the original repo will have the old history.
If they simply run `git pull && git push`, it will merge the
unrewritten history with the new, resulting in what looks like two
copies of each commit involved in your rewrite -- a new copy of
each commit which has the cleanups you made, and an old copy of
each commit that has not been cleaned up -- being merged together.
That means everything you carefully worked to remove from the
repository has been pushed back. You're more likely to succeed in
making sure they don't re-push the unclean data if you just give
them a new repository URL and tell them to reclone.
* Rewriting history will rewrite tags; those who have already
downloaded tags will not get the updated tags even if they specify
`--tags` to `git fetch` or `git pull` (see the "On Re-tagging"
section of linkgit:git-tag[1]). Every user trying to use an
existing clone will have to forcibly delete all tags they already
downloaded _before_ re-fetching them; it may be easier for them to
just re-clone, which they are more likely to do with a new clone
URL.
* Rewriting history may delete some refs (e.g. branches that only
had files that you wanted excised from history); unless you run
git push with the `--mirror` or `--prune` options, those refs
will continue to exist on the server. If folks then merge these
branches into others, then people have started mixing old and new
history. If users had already cloned these branches, removing
them from the server isn't enough; you need all users to delete
any local branches based on these refs and run fetch with the
`--prune` option as well. Simply re-cloning from a new URL is
easier.
* The server may not allow you to force push over some refs. For
example, code review systems may have special ref namespaces
(e.g. refs/changes/, refs/pull/, refs/merge-requests/) that they
have locked down, and you'll need to somehow prevent users from
merging those locked-down (and thus not cleaned up) histories
with your cleaned-up history. Every software code review system
handles this differently (see the sensitive data removal section
for some links).
4. Run filter-repo with your desired filtering options. Many examples
are given in the <<EXAMPLES>> section. For more complex cases, note
that doing the filtering in multiple steps (by running multiple
filter-repo invocations in a sequence) is supported. If anything
goes wrong here, simply delete your clone and restart.
5. Push your new repository to its new home (note that
refs/remotes/origin/* will have been moved to refs/heads/* as the
first part of filter-repo, so you can just deal with normal branches
instead of remote tracking branches).
6. (Optional) Some additional considerations
* filter-repo has a --replace-refs option to allow creating replace
refs (see linkgit:git-replace[1]) for each rewritten commit ID,
allowing you to use old (unabbreviated) commit hashes in the git
command line to refer to the newly rewritten commits. If you
want to use these replace refs, manually push them to the
relevant clone URL and tell users to manually fetch them (e.g. by
adjusting their fetch refspec, `git config --add
remote.origin.fetch +refs/replace/*:refs/replace/*`). Sadly,
replace refs are not yet widely understood; projects like jgit
and libgit2 do not support them and existing repository managers
(e.g. Gerrit, GitHub, GitLab) do not yet understand replace refs.
Thus one can't use old commit hashes within the UI of these other
systems. This may change in the future, but replace refs at
least help users locally within the git command line interface.
Also, be aware that commit-graphs are excessively cautious around
replace refs and just turn off entirely if any are present, so
after enough time has passed that old commit IDs become less
relevant, users may want to locally delete the replace refs to
regain the speedups from commit-graphs.
Why is my origin removed?
~~~~~~~~~~~~~~~~~~~~~~~~~
When you rewrite history, all commit IDs (starting with the first one
where changes are made) are modified. Even if you think you didn't
change an intermediate commit, the fact that you changed any of its
ancestors is also a change that counts and will cause a commit's ID to
change as well. It is unfortunately all-too-easy for you or
someone else to accidentally merge the old ugly history you were
trying to rewrite with the new history, resulting in not only the old
ugly history returning but getting you "two copies" of each commit
(both an original commit and a cleaned-up alternative), and thus
doubling the number of commits in your repository. In short, you end
up with an even bigger mess to clean up than you started with.
This happens frequently to people using `git filter-branch` or `BFG
repo cleaner`, and can happen to folks using `git filter-repo` if they
insist on pushing back to the original repo. Example ways you can get
such an even uglier history include:
* at the command line (of another clone of the same repo from before the
cleanup): `git pull && git push`
* in a software forge: "reopen old Pull-Request/Merge-Request/Code-Review
and hit the merge/submit button"
Removing the `origin` remote and suggesting people push to a new repo
(and ensuring they tell others to clone the new repo) is usually a
good forcing function to avoid these problems. But, if people really
want to push to the original repository despite these warnings, it is
trivial to do so; simply run:
* `git remote add origin $ORIGINAL_CLONE_URL`
and then you can push (e.g. `git push --force --branches --tags
--prune`). Since removing the origin url is such a cheap way to
potentially prevent big messes, and it's so easy to work around for
those that really do want to push back over the original history,
removing the origin url is a great safety measure that I employ.
One final warning if you really want to push back to the original repo:
see the next section on sensitive data removals. Those are the steps
needed when pushing back to the original repo; they are so involved that
I assume they are only worth it when sensitive data is involved, but you
can choose to follow them for other kinds of rewrites too.
Sensitive Data Removals
~~~~~~~~~~~~~~~~~~~~~~~
Sensitive data removals are a specialized type of history rewrite.
While it is always very problematic to mix the cleaned-up history with
the non-cleaned-up history, for sensitive data removals it is also bad
to allow others to continue to view/clone/fetch the non-cleaned-up
history at all; users often need to try to expunge the old history as
well.
Note that if the sensitive data under consideration is a
token/password/credential/secret (as is often the case), then it is
important that you revoke and rotate that credential first. Once the
credential is revoked or rotated, it can no longer be used for access.
Revoking/rotating may resolve your problem without resorting to the
heavy-handed action of rewriting and purging history.
For sensitive data removal history rewrites, there are three high-level
steps:
- Rewrite the repository locally, using git-filter-repo
- Make sure other copies are cleaned up, including:
* the server you cloned from
* other clones that exist, such as ones your colleagues made
- Prevent repeats and avoid future sensitive data spills
Each will be discussed in greater detail below.
One important thing to note, though, is that others working on the same
repository should be instructed to stop while you do the cleanup; if
they continue development during your cleanup, you'll likely be forced to
either discard their changes or start over on your cleanup.
Rewrite the repository locally, using git-filter-repo
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The first step is to rewrite a copy of your repository locally using
git-filter-repo. The exact commands to run will differ based on where
in your repository the sensitive data is found, but some general tips:
- Use the --sensitive-data-removal flag. It will provide additional
information useful for the other steps.
- If the sensitive data is the entirety of one or more files, and no
version of those files from history needs to be kept in your
repository, the --invert-paths flag together with one or more --path
arguments may come in handy.
- If the sensitive data is just a string found within one or more
files and you want to replace that sensitive string with something
else while leaving the rest of the file(s) intact, the --replace-text
option may come in handy.
After rewriting the history locally, make sure to inspect it to ensure the
sensitive data has been removed. Some commands that might be handy for
checking are:
----
git log --all --name-status -- ${PROBLEMATIC_FILE1} ${PROBLEMATIC_FILE2}
----
or
----
git log -S"${PROBLEMATIC_STRING}" --all -p --
----
If either of these commands turns up more sensitive data, then run additional
git-filter-repo commands to clean up the necessary data before proceeding.
Make sure other copies are cleaned up: primary server
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Cleaning up the repository you cloned from requires force pushing your
rewritten history over the original. You need to force push all refs,
not just your current branch. You can use the following command to do so
(read the bulleted list right after this command before running it):
----
git push --force --mirror origin
----
Several comments on this command:
* If any of your colleagues have pushed any changes since you
started, this force push command will discard their changes.
* This force push is likely to fail to push some refs, since most
forges (Gerrit, GitHub, GitLab, etc.) prevent you from updating
some refs (e.g. `refs/changes/*`, `refs/pull/*`,
`refs/merge-requests/*`). You will need to follow the directions
from those forges to get the remaining refs updated or deleted,
and a garbage collection to be triggered on their end. Some
examples:
(https://docs.gitlab.com/ee/user/project/repository/reducing_the_repo_size_using_git.html[GitLab's
docs on reducing repository size], or
https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/removing-sensitive-data-from-a-repository#fully-removing-the-data-from-github[the
"Fully removing the data from GitHub" section of GitHub's docs]).
* If you passed the `--no-fetch` option to git-filter-repo (or
implied it with another option), you will either need to (1) drop
the `--mirror` option and figure out which refs or refspecs to
push on your own, or (2) use the `--mirror` option and risk
deleting any refs you didn't fetch. Further, if you lacked some
refs the server had which included the sensitive data in their
history, then your only options at this point to actually clean up
the sensitive data from the server are to either redo your rewrite
from scratch (and make sure to get the relevant refs included this
time) or delete those refs on the server.
* Yes, I know that --mirror implies --force and is unnecessary. I
included --force anyway as a visual reminder to readers that this
is going to overwrite changes on the server.
Also, if any LFS objects were orphaned by your rewrite, those objects
likely contain sensitive data and need to be deleted/purged from the LFS
server. You'll have to ask the maintainer of the LFS server you are
using for how to delete/purge those on the server.
Make sure other copies are cleaned up: clones of colleagues
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
After you have cleaned up the server, the easiest way to clean up other
clones is to make everyone delete their existing clones and reclone.
If that isn't an option, then you will need to proceed carefully because
a simple `git pull && git push` from any other clone will recontaminate
the main repository and make the mess even harder to clean up. To avoid
this, before pushing from any other clone, you'll need to have them clean
up their copy, as detailed below.
First, though, let me note that you should *not* have other developers
try to clean up their clone by running the same `git-filter-repo`
commands that you ran. While that sometimes may happen to work, it is
not reliable in general. Running the same `git-filter-repo` commands,
even if identical, can result in them getting new hashes for commits
that are different than your new hashes, and you'll end up with a mess
involving two or more copies of every commit.
Instead developers with other clones of the repository should run
through the following steps to clean up their copy if they are unwilling
to discard their copy and reclone:
- delete all tags and run `git fetch --prune --tags`. Running the
fetch command without deleting tags first will result in the old
tags being kept, which will keep the sensitive data.
- rebase any changes they have on any branch (or other ref) on top of
the new history. See the "RECOVERING FROM UPSTREAM REBASE" section
of linkgit:git-rebase[1] (and in particular, "The hard case") for
instructions.
- run a few steps to clean out the pre-rebase history (note that the first
step drops all reflogs including all stash entries. That's a high cost,
but needed to clean up the sensitive data):
* git reflog expire --expire=now --all
* git gc --prune=now
Once these steps are complete, you also need to verify that the clone no
longer contains any sensitive data (it is really easy to miss something,
which puts you at risk of recontaminating other repositories with the
sensitive data). You can do so by running:
----
git cat-file -t ${HASH_OF_FIRST_CHANGED_COMMIT}
----
Where `${HASH_OF_FIRST_CHANGED_COMMIT}` was printed by git-filter-repo at
the end of its run (if there was more than one "first changed commit",
run this command multiple times, with each commit hash). If this
command returns a fatal error, then the commit has correctly been
removed from this repository. If it responds with "commit", then the
object still exists and you need to re-delete tags, re-rebase all
necessary branches/refs, and re-expire reflogs and redo the gc. If you
are curious about which branches or refs were the problematic ones
holding on to `${HASH_OF_FIRST_CHANGED_COMMIT}`, then presuming you did
the reflog expire and gc jobs above, the following command should help
you find the problematic branches/refs:
----
git for-each-ref --contains ${HASH_OF_FIRST_CHANGED_COMMIT}
----
Also, remember, the cat-file command needs to come back with a fatal
error for every `${HASH_OF_FIRST_CHANGED_COMMIT}` involved if you have
more than one.
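When there are several first changed commits, the check above can be scripted. The following Python sketch runs `git cat-file -t` once per hash and reports which commits still exist in a clone. The function names and the injectable `run` parameter are assumptions made for this example, with `run` there only so the logic can be exercised without a repository.

```python
import subprocess


def commit_still_present(commit_hash, repo=".", run=subprocess.run):
    """True if `git cat-file -t` succeeds for commit_hash in the clone at repo."""
    result = run(
        ["git", "-C", repo, "cat-file", "-t", commit_hash],
        capture_output=True,
        text=True,
    )
    # A non-zero exit status (the "fatal" error) means the object is gone.
    return result.returncode == 0


def lingering_commits(commit_hashes, repo=".", run=subprocess.run):
    """Return the hashes that are still present and thus need more cleanup."""
    return [h for h in commit_hashes if commit_still_present(h, repo, run)]
```

An empty list from `lingering_commits` corresponds to every `cat-file` invocation failing, which is the desired outcome. Any hash it returns means the tag deletion, rebase, reflog expiry and gc steps need to be repeated.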
After this is all done, then if any LFS objects were orphaned by the
rewrite (which again, you will be told if you use the
--sensitive-data-removal option when you run git-filter-repo), then you
also need to remove those LFS objects. Look for them a couple
directories under .git/lfs/objects/, and delete them.
Prevent repeats and avoid future sensitive data spills
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There are several measures you can take to help avoid repeat problems.
Not all may be applicable for your case, but the more that are, the more
likely you can avoid problems.
For dealing with the existing sensitive data spill:
- Since it is so easy to re-contaminate the repository you cloned from
(it merely takes a colleague to run `git pull && git push` from their
clone that was created before your cleanup), be extra vigilant in
performing the cleanup steps above for other clones to ensure they
have all been cleaned up.
- If you have a central repository everyone pushes to, look into methods
to ban the First Changed Commit(s) from being (re-)pushed to your
repository. Sadly, few repository managers currently have such a
built-in capability (see Gerrit's ban-commit ability for one such
example at
https://gerrit-review.googlesource.com/Documentation/cmd-ban-commit.html),
but a few may allow you to write your own pre-receive hooks that
reject pushes containing these bad commits. (Pro-tip for writing such
a pre-receive hook: use `git cat-file -t ${BAD_COMMIT}` as a cheap
check before checking if any revision range between `<old-oid>` and
`<new-oid>` contains `${BAD_COMMIT}`)
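The pre-receive pro-tip above could be sketched in Python roughly as follows. This is an untested outline, not a hardened hook. The names `git_ok`, `push_reintroduces` and `check_push` are hypothetical, and the list of bad commits is something you would supply from your own first-changed-commits output. It treats a bad commit as part of the pushed range when it is an ancestor of `<new-oid>` but not already an ancestor of `<old-oid>`.

```python
import subprocess
import sys


def git_ok(*args):
    """Run a git command quietly and report whether it exited with status 0."""
    return subprocess.run(
        ["git", *args], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    ).returncode == 0


def push_reintroduces(bad, old, new, check=git_ok):
    """True if updating a ref from old to new would bring back the bad commit."""
    if set(new) == {"0"}:                 # ref deletion: nothing is added
        return False
    if not check("cat-file", "-t", bad):  # cheap check: object not present at all
        return False
    if not check("merge-base", "--is-ancestor", bad, new):
        return False
    if set(old) == {"0"}:                 # newly created ref containing bad
        return True
    # In the old..new range only if it was not already reachable from old.
    return not check("merge-base", "--is-ancestor", bad, old)


def check_push(lines, bad_commits, check=git_ok):
    """Return 1 (reject) if any "<old-oid> <new-oid> <ref>" line is offending."""
    for line in lines:
        old, new, ref = line.split()
        for bad in bad_commits:
            if push_reintroduces(bad, old, new, check):
                print(f"rejected {ref}: contains banned commit {bad}",
                      file=sys.stderr)
                return 1
    return 0
```

A real hook script would finish with something like `sys.exit(check_push(sys.stdin, BAD_COMMITS))`, where `BAD_COMMITS` is your own list of hashes. Whether your repository manager lets you install such a hook at all depends on the forge.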
Steps to help avoid other future sensitive data spills:
* If sensitive data is likely to appear within certain filenames that
should not be tracked in git at all, then add those filenames to
.gitignore to reduce the risk that others accidentally add them.
* Avoid hardcoding secrets in code. Use environment variables,
configuration management tools, or secrets management services like
Azure Key Vault, AWS Secrets Manager, or HashiCorp Vault to manage and
inject secrets at runtime.
* Create a pre-commit hook to check for sensitive data before it is
committed or pushed anywhere, or use a well-known tool in a pre-commit
hook like git-secrets or gitleaks.
[[EXAMPLES]]
EXAMPLES
--------
Path based filtering
~~~~~~~~~~~~~~~~~~~~
To only keep the 'README.md' file plus the directories 'guides' and
'tools/releases/':
--------------------------------------------------
git filter-repo --path README.md --path guides/ --path tools/releases
--------------------------------------------------
Directory names can be given with or without a trailing slash, and all
filenames are relative to the toplevel of the repo. To keep all files
except these paths, just add `--invert-paths`:
--------------------------------------------------
git filter-repo --path README.md --path guides/ --path tools/releases --invert-paths
--------------------------------------------------
If you want to have both an inclusion filter and an exclusion filter, just
run filter-repo multiple times. For example, to keep the src/main
subdirectory but exclude files under src/main named 'data', run:
--------------------------------------------------
git filter-repo --path src/main/
git filter-repo --path-glob 'src/*/data' --invert-paths
--------------------------------------------------
Note that the asterisk (`*`) will match across multiple directories, so the
second command would remove e.g. src/main/org/whatever/data. Also, the
second command by itself would also remove e.g. src/not-main/foo/data, but
since src/not-main/ was removed by the first command, that's not an issue.
Also, the use of quotes around the asterisk is sometimes important to avoid
glob expansion by the shell.
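To see why the asterisk crosses directory boundaries, here is a small Python sketch using the standard `fnmatch` module, where `*` matches any characters including `/`. It assumes that `--path-glob` follows the same matching rules. The example paths are made up.

```python
from fnmatch import fnmatchcase

pattern = "src/*/data"

# Hypothetical paths, for illustration only.
paths = [
    "src/main/data",
    "src/main/org/whatever/data",
    "src/not-main/foo/data",
    "src/main/database",
]

# Keep only the paths that the glob matches in full.
matches = [p for p in paths if fnmatchcase(p, pattern)]
print(matches)
```

The first three paths match because `*` is free to absorb `main/org/whatever` or `not-main/foo`. The last does not, since the pattern must match the whole path and `database` is not `data`.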
You can also select paths by regular expression (see
https://docs.python.org/3/library/re.html#regular-expression-syntax).
For example, to only include files from the repo whose name is in the
format YYYY-MM-DD.txt and is found at least two subdirectories deep:
--------------------------------------------------
git filter-repo --path-regex '^.*/.*/[0-9]{4}-[0-9]{2}-[0-9]{2}\.txt$'
--------------------------------------------------
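A regular expression like this can be tried out with Python's `re` module before it is used to rewrite history. The sketch below uses the YYYY-MM-DD.txt pattern with the dot before `txt` escaped, so that it matches only a literal dot, and applies it to a few made-up paths.

```python
import re

# Date-named .txt files at least two directories deep.
pattern = re.compile(r'^.*/.*/[0-9]{4}-[0-9]{2}-[0-9]{2}\.txt$')

# Hypothetical paths, for illustration only.
candidates = [
    "2020-01-02.txt",              # top level: not deep enough
    "logs/2020-01-02.txt",         # one directory deep: not deep enough
    "logs/2020/2020-01-02.txt",    # two directories deep
    "a/b/c/2020-01-02.txt",        # deeper still
    "a/b/notes-2020-01-02.txt",    # date not at the start of the filename
    "a/b/2020-01-02Xtxt",          # no literal dot before "txt"
]

kept = [path for path in candidates if pattern.search(path)]
print(kept)
```

Only the third and fourth candidates are kept. Checking a regex against a listing of your repository's paths in this way is a cheap safeguard before running the filter.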
If you want two directories to be renamed (and maybe merged if both are
renamed to the same location), use --path-rename; for example, to rename
both 'cmds/' and 'src/scripts/' to 'tools/':
--------------------------------------------------
git filter-repo --path-rename cmds:tools --path-rename src/scripts/:tools/
--------------------------------------------------
As with `--path`, directories can be specified with or without a
trailing slash for `--path-rename`.
If you do a `--path-rename` to something that was already in use, it will
be silently overwritten. However, if you try to rename multiple files to
the same location (e.g. src/scripts/run_release.sh and cmds/run_release.sh
both existed and had different content with the renames above), then you
will be given an error. If you have such a case, you may want to add
another rename command to move one of the paths somewhere else where it
won't collide:
--------------------------------------------------
git filter-repo --path-rename cmds/run_release.sh:tools/do_release.sh \
--path-rename cmds/:tools/ \
--path-rename src/scripts/:tools/
--------------------------------------------------
Also, `--path-rename` brings up ordering issues; all path arguments are
applied in order. Thus, a command like
--------------------------------------------------
git filter-repo --path-rename sources/:src/main/ --path src/main/
--------------------------------------------------
would make sense but reversing the two arguments would not (src/main/ is
created by the rename so reversing the two would give you an empty repo).
Also, note that the rename of cmds/run_release.sh a couple of examples ago was
done before the other renames.
Note that path renaming does not do path filtering, thus the following
command
--------------------------------------------------
git filter-repo --path src/main/ --path-rename tools/:scripts/
--------------------------------------------------
would not result in the tools or scripts directories being present, because
the single filter selected only src/main/. It's likely that you would
instead want to run:
--------------------------------------------------
git filter-repo --path src/main/ --path tools/ --path-rename tools/:scripts/
--------------------------------------------------
If you prefer to filter based solely on basename, use the `--use-base-name`
flag (though this is incompatible with `--path-rename`). For example, to
only include README.md and Makefile files from any directory:
--------------------------------------------------
git filter-repo --use-base-name --path README.md --path Makefile
--------------------------------------------------
If you wanted to delete all .DS_Store files in any directory, you could
either use:
--------------------------------------------------
git filter-repo --invert-paths --path '.DS_Store' --use-base-name
--------------------------------------------------
or
--------------------------------------------------
git filter-repo --invert-paths --path-glob '*/.DS_Store' --path '.DS_Store'
--------------------------------------------------
(the `--path-glob` isn't sufficient by itself as it might miss a toplevel
.DS_Store file; further, while something like `--path-glob '*.DS_Store'`
would work around that problem, it would also grab files named
`foo.DS_Store` or `bar/baz.DS_Store`)
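The difference between the two globs can be checked with a short Python sketch, again assuming `fnmatch`-style matching on bytestring paths (the sample paths and the `matched_by` helper are hypothetical):

```python
import fnmatch

paths = [b".DS_Store", b"docs/.DS_Store", b"foo.DS_Store", b"bar/baz.DS_Store"]

def matched_by(pattern):
    # Return the sample paths that the given glob would select.
    return [p for p in paths if fnmatch.fnmatchcase(p, pattern)]

# "*/.DS_Store" needs a "/" in the path, so the toplevel file is missed.
print(matched_by(b"*/.DS_Store"))
# "*.DS_Store" catches the toplevel file but also unrelated names.
print(matched_by(b"*.DS_Store"))
```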
Finally, see also the `--filename-callback` from <<CALLBACKS>>.
Filtering based on many paths
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you have a long list of files, directories, globs, or regular
expressions to filter on, you can stick them in a file and use
`--paths-from-file`; for example, with a file named stuff-i-want.txt with
contents of
--------------------------------------------------
# Blank lines and comment lines are ignored.
# Examples similar to --path:
README.md
guides/
tools/releases
# An example that is like --path-glob:
glob:*.py
# An example that is like --path-regex:
regex:^.*/.*/[0-9]{4}-[0-9]{2}-[0-9]{2}\.txt$
# An example of renaming a path
tools/==>scripts/
# An example of using a regex to rename a path
regex:(.*)/([^/]*)/([^/]*)\.text$==>\2/\1/\3.txt
--------------------------------------------------
then you could run
--------------------------------------------------
git filter-repo --paths-from-file stuff-i-want.txt
--------------------------------------------------
to get a repo containing only the toplevel README.md file, the guides/
and tools/releases/ directories, all python files, files whose name
was of the form YYYY-MM-DD.txt at least two subdirectories deep, and
would rename tools/ to scripts/ and rename files like foo/bar/baz.text
to bar/foo/baz.txt. Note the special line prefixes of `glob:` and
`regex:` and the special string `==>` denoting renames.
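The line format can be made concrete with a small Python sketch. This is not filter-repo's parser; `parse_paths_line` and `apply_regex_rename` are hypothetical helpers that only mirror the rules described above (comments and blank lines ignored, optional `glob:` or `regex:` prefix, optional `==>` rename). It uses plain strings for brevity.

```python
import re

def parse_paths_line(line):
    """Classify one line of a paths file.

    Returns None for blank or comment lines, otherwise a tuple
    (match_type, pattern, rename_to); rename_to is None without "==>".
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    match_type = "literal"
    for prefix in ("glob", "regex"):
        if line.startswith(prefix + ":"):
            match_type, line = prefix, line[len(prefix) + 1:]
            break
    pattern, sep, rename_to = line.partition("==>")
    return (match_type, pattern, rename_to if sep else None)

def apply_regex_rename(pattern, replacement, path):
    # A regex rename substitutes backreferences such as \1 and \2.
    return re.sub(pattern, replacement, path)

rule = parse_paths_line(r"regex:(.*)/([^/]*)/([^/]*)\.text$==>\2/\1/\3.txt")
print(apply_regex_rename(rule[1], rule[2], "foo/bar/baz.text"))
```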
Sometimes you have a way of easily generating all the files you want.
For example, if you know that none of the currently tracked files have
any newlines or special characters in them (see core.quotePath from
`git config --help`) so that `git ls-files` would print all files
literally one per line, and you knew that you wanted to keep only the
files that are currently tracked (thus deleting from all commits in
history any files that only appear on other branches or that only
appear in older commits), then you could use a pair of commands such
as
--------------------------------------------------
git ls-files >../paths-i-want.txt
git filter-repo --paths-from-file ../paths-i-want.txt
--------------------------------------------------
Similarly, you could use --paths-from-file to delete many files. For
example, you could run `git filter-repo --analyze` to get reports,
look in one such as .git/filter-repo/analysis/path-deleted-sizes.txt
and copy all the filenames into a file such as
/tmp/files-i-dont-want-anymore.txt and then run
--------------------------------------------------
git filter-repo --invert-paths --paths-from-file /tmp/files-i-dont-want-anymore.txt
--------------------------------------------------
to delete them all.
Directory based shortcuts
~~~~~~~~~~~~~~~~~~~~~~~~~
Let's say you had a directory structure like the following:
   module/
      foo.c
      bar.c
   otherDir/
      blah.config
      stuff.txt
   zebra.jpg
If you wanted just the module/ directory and you wanted it to become the
new root so that your new directory structure looked like
   foo.c
   bar.c
then you could run:
--------------------------------------------------
git filter-repo --subdirectory-filter module/
--------------------------------------------------
If you wanted all the files from the original repo, but wanted to move
everything under a subdirectory named my-module/, so that your new
directory structure looked like
   my-module/
      module/
         foo.c
         bar.c
      otherDir/
         blah.config
         stuff.txt
      zebra.jpg
then you would instead run:
--------------------------------------------------
git filter-repo --to-subdirectory-filter my-module/
--------------------------------------------------
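The effect of the two shortcuts on a list of paths can be sketched in Python. This models only the resulting path names, not filter-repo's implementation; both functions are hypothetical, and the sample list assumes `zebra.jpg` sits at the top level of the example layout.

```python
def subdirectory_filter(paths, subdir):
    """Keep only paths under subdir and make subdir the new root."""
    prefix = subdir.rstrip("/") + "/"
    return [p[len(prefix):] for p in paths if p.startswith(prefix)]

def to_subdirectory_filter(paths, subdir):
    """Move every path underneath subdir."""
    prefix = subdir.rstrip("/") + "/"
    return [prefix + p for p in paths]

tree = [
    "module/foo.c",
    "module/bar.c",
    "otherDir/blah.config",
    "otherDir/stuff.txt",
    "zebra.jpg",
]

print(subdirectory_filter(tree, "module/"))
print(to_subdirectory_filter(tree, "my-module/"))
```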
Content based filtering
~~~~~~~~~~~~~~~~~~~~~~~
If you want to filter out all files bigger than a certain size, you can use
`--strip-blobs-bigger-than` with some size (K, M, and G suffixes are
recognized), e.g.:
--------------------------------------------------
git filter-repo --strip-blobs-bigger-than 10M
--------------------------------------------------
If you want to strip out all files with specified git object ids (hashes),
list the hashes in a file and run
--------------------------------------------------
git filter-repo --strip-blobs-with-ids FILE_WITH_GIT_BLOB_IDS
--------------------------------------------------
If you want to modify file contents, you can do so based on a list of
expressions in a file, one per line. For example, with a file named
expressions.txt containing
--------------------------------------------------
p455w0rd
foo==>bar
glob:*666*==>
regex:\bdriver\b==>pilot
literal:MM/DD/YYYY==>YYYY-MM-DD
regex:([0-9]{2})/([0-9]{2})/([0-9]{4})==>\3-\1-\2
--------------------------------------------------
then running
--------------------------------------------------
git filter-repo --replace-text expressions.txt
--------------------------------------------------
will go through and replace `p455w0rd` with `***REMOVED***`, `foo` with
`bar`, any line containing `666` with a blank line, the word `driver` with
`pilot` (but not if it has letters before or after; e.g. `drivers` will be
unmodified), replace the exact text `MM/DD/YYYY` with `YYYY-MM-DD` and
replace date strings of the form MM/DD/YYYY with ones of the form
YYYY-MM-DD. In the expressions file, there are a few things to note:
* Every line has a replacement, given by whatever is on the right of
`==>`. If `==>` does not appear on the line, the default replacement
is `***REMOVED***`.
* Lines can start with `literal:`, `glob:`, or `regex:` to specify
whether to do literal string matches,
globs (see https://docs.python.org/3/library/fnmatch.html), or regular
expressions (see https://docs.python.org/3/library/re.html#regular-expression-syntax).
If none of these are specified, `literal:` is assumed.
* If multiple matches are found, all are replaced.
* globs and regexes are applied to the entire file, but without any
special flags turned on. Some folks may be interested in adding `(?m)`
to the regex to turn on MULTILINE mode, so that `^` and `$` match the
beginning and ends of lines rather than the beginning and end of file.
See https://docs.python.org/3/library/re.html for details.
See also the `--blob-callback` from <<CALLBACKS>>.
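Because the regular expressions follow Python's `re` syntax, the rules can be tried out in isolation first. The following sketch applies three of the rules above to a made-up bytestring and then shows the effect of `(?m)`; it uses `bytes.replace` and `re.sub` directly and is not filter-repo's own code.

```python
import re

sample = b"pw=p455w0rd; the driver saw two drivers on 01/31/2024\n"

# No "==>" on the line: the match is replaced with ***REMOVED***.
step1 = sample.replace(b"p455w0rd", b"***REMOVED***")
# regex:\bdriver\b==>pilot ; the \b anchors keep "drivers" from matching.
step2 = re.sub(rb"\bdriver\b", b"pilot", step1)
# regex:([0-9]{2})/([0-9]{2})/([0-9]{4})==>\3-\1-\2
step3 = re.sub(rb"([0-9]{2})/([0-9]{2})/([0-9]{4})", rb"\3-\1-\2", step2)
print(step3)

# Without (?m), "^" and "$" anchor to the whole file rather than each line.
lines = b"keep\n# drop me\nkeep\n"
whole_file = re.sub(rb"^#.*$", b"", lines)
per_line = re.sub(rb"(?m)^#.*$", b"", lines)
print(whole_file, per_line)
```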
Updating commit/tag messages
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you want to modify commit or tag messages, you can do so with the
same syntax as `--replace-text`, explained above. For example, with a
file named expressions.txt containing
--------------------------------------------------
foo==>bar
--------------------------------------------------
then running
--------------------------------------------------
git filter-repo --replace-message expressions.txt
--------------------------------------------------
will replace `foo` in commit or tag messages with `bar`.
See also the `--message-callback` from <<CALLBACKS>>.
Refname based filtering
~~~~~~~~~~~~~~~~~~~~~~~
To rename tags, use `--tag-rename`, e.g.:
--------------------------------------------------
git filter-repo --tag-rename foo:bar
--------------------------------------------------
This will rename any tags starting with `foo` to now start with `bar`.
Either side of the colon could be blank, e.g.
--------------------------------------------------
git filter-repo --tag-rename '':'my-module-'
--------------------------------------------------
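The prefix semantics can be illustrated with a few lines of Python. `rename_tag` is a hypothetical helper that mirrors the behavior described above (only tags are touched, and only a leading prefix is replaced); it is a sketch, not filter-repo's code.

```python
def rename_tag(refname, old=b"foo", new=b"bar"):
    """Replace a leading `old` in the tag name with `new`."""
    prefix = b"refs/tags/"
    if refname.startswith(prefix + old):
        return prefix + new + refname[len(prefix + old):]
    # Branches and non-matching tags are left alone.
    return refname

print(rename_tag(b"refs/tags/foo-1.0"))
# An empty old prefix matches every tag, so this prepends "my-module-".
print(rename_tag(b"refs/tags/v1", b"", b"my-module-"))
```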
For more general refname modification, see `--refname-callback` from
<<CALLBACKS>>.
User and email based filtering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To modify the names and emails recorded in commits, you can create a mailmap
file in the format accepted by linkgit:git-shortlog[1]. For example,
if you have a file named my-mailmap you can run
--------------------------------------------------
git filter-repo --mailmap my-mailmap
--------------------------------------------------
and if the current contents of that file are as follows (if the
specified mailmap file is version controlled, historical versions of
the file are ignored):
--------------------------------------------------
Name For User <email@addre.ss>
<new@ema.il> <old1@ema.il>
New Name And <new@ema.il> <old2@ema.il>
New Name And <new@ema.il> Old Name And <old3@ema.il>
--------------------------------------------------
then filter-repo will update names and/or emails based on the specified
mapping.
See also the `--name-callback` and `--email-callback` from
<<CALLBACKS>>.
Parent rewriting
~~~~~~~~~~~~~~~~
To replace $commit_A with $commit_B (e.g. make all commits which had
$commit_A as a parent instead have $commit_B for that parent), and
rewrite history to make it permanent:
--------------------------------------------------
git replace $commit_A $commit_B
git filter-repo --proceed
--------------------------------------------------
To create a new commit with the same contents as $commit_A except with
different parent(s) and then replace $commit_A with the new commit,
and rewrite history to make it permanent:
--------------------------------------------------
git replace --graft $commit_A $new_parent_or_parents
git filter-repo --proceed
--------------------------------------------------
The `--proceed` option is needed to avoid failing the "no arguments
specified" check. Note that older versions of git-filter-repo
required `--force` to be passed after creating a graft to avoid
triggering the not-a-fresh-clone check; that check has been modified
to remove this overuse of `--force`.
Partial history rewrites
~~~~~~~~~~~~~~~~~~~~~~~~
To rewrite the history on just one branch (which may cause it to no longer
share any common history with other branches), use `--refs`. For example,
to remove a file named 'extraneous.txt' from the 'master' branch:
--------------------------------------------------
git filter-repo --invert-paths --path extraneous.txt --refs master
--------------------------------------------------
To rewrite just some recent commits:
--------------------------------------------------
git filter-repo --invert-paths --path extraneous.txt --refs master~3..master
--------------------------------------------------
[[CALLBACKS]]
CALLBACKS
---------
For flexibility, filter-repo allows you to specify functions on the
command line to further filter all changes. Please note that there
are some API compatibility caveats associated with these callbacks
that you should be aware of before using them; see the "API BACKWARD
COMPATIBILITY CAVEAT" comment near the top of git-filter-repo source
code.
Most callback functions are of the same general format
(--file-info-callback is an exception which will be noted later). For
a command line argument like
--------------------------------------------------
--foo-callback 'BODY'
--------------------------------------------------
the following code will be compiled and called:
--------------------------------------------------
def foo_callback(foo):
  BODY
--------------------------------------------------
Thus, you just need to make sure your _BODY_ modifies and returns
_foo_ appropriately. One important thing to note for all callbacks is
that filter-repo uses bytestrings (see
https://docs.python.org/3/library/stdtypes.html#bytes) everywhere
instead of strings.
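One way to picture this wrapping is the following Python sketch. `make_callback` is a hypothetical helper, not filter-repo's actual implementation; it simply indents the body, prepends the `def` line, and compiles the result with `exec`.

```python
def make_callback(name, body):
    """Wrap BODY in 'def <name>_callback(<name>):' and return the function."""
    indented = "\n".join("  " + line for line in body.splitlines())
    source = "def %s_callback(%s):\n%s\n" % (name, name, indented)
    namespace = {}
    exec(source, namespace)
    return namespace["%s_callback" % name]

# A one-line body operating on a bytestring.
foo_callback = make_callback("foo", "return foo.upper()")
print(foo_callback(b"abc"))

# A multi-line body; its lines only need to be indented consistently.
strip_x = make_callback("foo", '\nif foo.startswith(b"x"):\n  return foo[1:]\nreturn foo')
print(strip_x(b"xyz"))
```

A helper like this is also a convenient way to exercise a callback body on sample bytestrings before running it against a real repository.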
There are four callbacks that allow you to operate directly on raw
objects that contain data that's easy to write in
linkgit:git-fast-import[1] format:
--------------------------------------------------
--blob-callback
--commit-callback
--tag-callback
--reset-callback
--------------------------------------------------
We'll come back to these later because it is often the case that the
other callbacks are more convenient. The other callbacks operate on a
small piece of the raw objects or operate on pieces across multiple
types of raw object (e.g. author names and committer names and tagger
names across commits and tags, or refnames across commits, tags, and
resets, or messages across commits and tags). The convenience
callbacks are:
--------------------------------------------------
--filename-callback
--message-callback
--name-callback
--email-callback
--refname-callback
--file-info-callback
--------------------------------------------------
in each you are expected to simply return a new value based on the one
passed in. For example,
--------------------------------------------------
git-filter-repo --name-callback 'return name.replace(b"Wiliam", b"William")'
--------------------------------------------------
would result in the following function being called:
--------------------------------------------------
def name_callback(name):
  return name.replace(b"Wiliam", b"William")
--------------------------------------------------
The email callback is quite similar:
--------------------------------------------------
git-filter-repo --email-callback 'return email.replace(b".cm", b".com")'
--------------------------------------------------
The refname callback is also similar, but note that the refname passed in
and returned are expected to be fully qualified (e.g. b"refs/heads/master"
instead of just b"master" and b"refs/tags/v1.0.7" instead of b"1.0.7"):
--------------------------------------------------
git-filter-repo --refname-callback '
# Change e.g. refs/heads/master to refs/heads/prefix-master
rdir,rpath = os.path.split(refname)
return rdir + b"/prefix-" + rpath'
--------------------------------------------------
The message callback is quite similar to the previous three callbacks,
though it operates on a bytestring that is likely more than one line:
--------------------------------------------------
git-filter-repo --message-callback '
    if b"Signed-off-by:" not in message:
      message += b"\nSigned-off-by: Me My <self@and.eye>"
return re.sub(b"[Ee]-?[Mm][Aa][Ii][Ll]", b"email", message)'
--------------------------------------------------
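Written out as an ordinary function, the body above can be tested on sample messages without touching a repository. This is a sketch of the equivalent function, with made-up messages:

```python
import re

def message_callback(message):
    # Append a sign-off only when the message has none yet.
    if b"Signed-off-by:" not in message:
        message += b"\nSigned-off-by: Me My <self@and.eye>"
    # Normalize spellings such as "E-Mail" or "EMAIL" to "email".
    return re.sub(b"[Ee]-?[Mm][Aa][Ii][Ll]", b"email", message)

print(message_callback(b"Fix E-Mail parsing\n"))
print(message_callback(b"Send EMAIL\n\nSigned-off-by: A <a@b.c>\n"))
```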
The filename callback is slightly more interesting. Returning None means
the file should be removed from all commits, returning the filename
unmodified marks the file to be kept, and returning a different name means
the file should be renamed. An example:
--------------------------------------------------
git-filter-repo --filename-callback '
    if b"/src/" in filename:
      # Remove all files with a directory named "src" in their path
      # (except when "src" appears at the toplevel).
      return None
    elif filename.startswith(b"tools/"):
      # Rename tools/ -> scripts/misc/
      return b"scripts/misc/" + filename[6:]
    else:
      # Keep the filename and do not rename it
      return filename
'
--------------------------------------------------
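The three possible outcomes (remove, rename, keep) are easy to check by writing the same body as a plain function and feeding it a few hypothetical paths:

```python
def filename_callback(filename):
    if b"/src/" in filename:
        # Dropped: "src" appears as a non-toplevel directory.
        return None
    elif filename.startswith(b"tools/"):
        # Renamed: tools/ -> scripts/misc/
        return b"scripts/misc/" + filename[6:]
    else:
        # Kept as-is.
        return filename

for path in (b"lib/src/a.c", b"src/a.c", b"tools/x.sh", b"README.md"):
    print(path, "->", filename_callback(path))
```

Note that `src/a.c` is kept: the test looks for `/src/` with a leading slash, which a toplevel `src` directory does not have.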
The file-info callback is more involved. It is designed to be used in
cases where filtering depends on both filename and contents (and maybe
mode). It is called for file changes other than deletions (since
deletions have no file contents to operate on). The file info
callback takes four parameters (filename, mode, blob_id, and value),
and expects three to be returned (filename, mode, blob_id). The
filename is handled similarly to the filename callback; it can be used
to rename the file (or set to None to drop the change). The mode is a
simple bytestring (b"100644" for regular non-executable files,
b"100755" for executable files/scripts, b"120000" for symlinks, and
b"160000" for submodules). The blob_id is most useful in conjunction
with the value parameter. The value parameter is an instance of a
class that has the following functions
value.get_contents_by_identifier(blob_id) -> contents (bytestring)
value.get_size_by_identifier(blob_id) -> size_of_blob (int)
value.insert_file_with_contents(contents) -> blob_id
value.is_binary(contents) -> bool
value.apply_replace_text(contents) -> new_contents (bytestring)
and has the following member data you can write to
value.data (dict)
These functions allow you to get the contents of the file, or its
size, create a new file in the stream whose blob_id you can return,
check whether some given contents are binary (using the heuristic from
the grep(1) command), and apply the replacement rules from --replace-text
(note that --file-info-callback makes the changes from --replace-text not
auto-apply). You could use this for example to only apply the changes
from --replace-text to certain file types and simultaneously rename the
files it applies the changes to:
--------------------------------------------------
git-filter-repo --file-info-callback '
    if not filename.endswith(b".config"):
      # Make no changes to the file; return as-is
      return (filename, mode, blob_id)
    new_filename = filename[0:-7] + b".cfg"
    contents = value.get_contents_by_identifier(blob_id)
    new_contents = value.apply_replace_text(contents)
    new_blob_id = value.insert_file_with_contents(new_contents)
    return (new_filename, mode, new_blob_id)
'
--------------------------------------------------
Note that if history has multiple revisions with the same file
(e.g. it was cherry-picked to multiple branches or there were a number
of reverts), then the --file-info-callback will be called multiple
times. If you want to avoid processing the same file multiple times,
then you can stash transformation results in the value.data dict.
For example, we could modify the above example to make it only apply
transformations on blob_ids we have not seen before:
--------------------------------------------------
git-filter-repo --file-info-callback '
    if not filename.endswith(b".config"):
      # Make no changes to the file; return as-is
      return (filename, mode, blob_id)
    new_filename = filename[0:-7] + b".cfg"
    if blob_id in value.data:
      return (new_filename, mode, value.data[blob_id])
    contents = value.get_contents_by_identifier(blob_id)
    new_contents = value.apply_replace_text(contents)
    new_blob_id = value.insert_file_with_contents(new_contents)
    value.data[blob_id] = new_blob_id
    return (new_filename, mode, new_blob_id)
'
--------------------------------------------------
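To check that the caching really avoids repeated work, the callback can be dry-run against a stand-in for `value`. The `FakeValue` class below is entirely hypothetical: it implements only the three methods this callback uses, hands out made-up blob ids, and pretends that `--replace-text` contains the single rule `p455w0rd==>***REMOVED***`.

```python
class FakeValue:
    """Hypothetical stand-in for the `value` argument; not filter-repo code."""
    def __init__(self, blobs):
        self.blobs = dict(blobs)  # blob_id -> contents
        self.data = {}            # scratch dict, playing the role of value.data
        self.inserted = 0         # number of new blobs created

    def get_contents_by_identifier(self, blob_id):
        return self.blobs[blob_id]

    def apply_replace_text(self, contents):
        return contents.replace(b"p455w0rd", b"***REMOVED***")

    def insert_file_with_contents(self, contents):
        self.inserted += 1
        new_id = b"new-%d" % self.inserted
        self.blobs[new_id] = contents
        return new_id

def file_info_callback(filename, mode, blob_id, value):
    if not filename.endswith(b".config"):
        return (filename, mode, blob_id)
    new_filename = filename[0:-7] + b".cfg"
    if blob_id in value.data:
        # Seen before: reuse the blob created the first time.
        return (new_filename, mode, value.data[blob_id])
    contents = value.get_contents_by_identifier(blob_id)
    new_contents = value.apply_replace_text(contents)
    new_blob_id = value.insert_file_with_contents(new_contents)
    value.data[blob_id] = new_blob_id
    return (new_filename, mode, new_blob_id)

value = FakeValue({b"abc123": b"pass=p455w0rd\n"})
first = file_info_callback(b"app.config", b"100644", b"abc123", value)
second = file_info_callback(b"old/app.config", b"100644", b"abc123", value)
print(first, second, value.inserted)
```

Both calls return the same new blob id, and only one blob is inserted.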
An alternative example for the --file-info-callback is to make all
.sh files executable and add an extra trailing newline to the .sh
files:
--------------------------------------------------
git-filter-repo --file-info-callback '
    if not filename.endswith(b".sh"):
      # Make no changes to the file; return as-is
      return (filename, mode, blob_id)
    # There are only 4 valid modes in git:
    #   - 100644, for regular non-executable files
    #   - 100755, for executable files/scripts
    #   - 120000, for symlinks
    #   - 160000, for submodules
    new_mode = b"100755"
    contents = value.get_contents_by_identifier(blob_id)
    new_contents = contents + b"\n"
    new_blob_id = value.insert_file_with_contents(new_contents)
    return (filename, new_mode, new_blob_id)
'
--------------------------------------------------
In contrast to the previous callback types, the blob, reset, tag, and
commit callbacks are not expected to return a value, but are instead
expected to modify the object passed in. Major fields for these
objects are (subject to API backward compatibility caveats mentioned
previously):
* Blob: `original_id` (original hash) and `data`
* Reset: `ref` (name of reference) and `from_ref` (hash or integer mark)
* Tag: `ref`, `from_ref`, `original_id`, `tagger_name`, `tagger_email`,
`tagger_date`, `message`
* Commit: `branch`, `original_id`, `author_name`, `author_email`,
`author_date`, `committer_name`, `committer_email`,
`committer_date`, `message`, `file_changes` (list of
FileChange objects, each containing a `type`, `filename`,
`mode`, and `blob_id`), `parents` (list of hashes or integer
marks)
An example of each:
--------------------------------------------------
git filter-repo --blob-callback '
    if len(blob.data) > 25:
      # Mark this blob for removal from all commits
      blob.skip()
    else:
      blob.data = blob.data.replace(b"Hello", b"Goodbye")
'
--------------------------------------------------
--------------------------------------------------
git filter-repo --reset-callback 'reset.ref = reset.ref.replace(b"master", b"dev")'
--------------------------------------------------
--------------------------------------------------
git filter-repo --tag-callback '
    if tag.tagger_name == b"Jim Williams":
      # Omit this tag
      tag.skip()
    else:
      tag.message = tag.message + b"\n\nTag of %s by %s on %s" % (tag.ref, tag.tagger_email, tag.tagger_date)'
--------------------------------------------------
--------------------------------------------------
git filter-repo --commit-callback '
    # Remove executable files with three 6s in their name (including
    # from leading directories).
    # Also, undo deletion of sources/foo/bar.txt (change types are
    # either b"D" (deletion) or b"M" (add or modify); renames are
    # handled by deleting the old file and adding a new one)
    commit.file_changes = [
      change for change in commit.file_changes
      if not (change.mode == b"100755" and
              change.filename.count(b"6") == 3) and
         not (change.type == b"D" and
              change.filename == b"sources/foo/bar.txt")]
    # Mark all .sh files as executable; modes in git are always one of
    # 100644 (normal file), 100755 (executable), 120000 (symlink), or
    # 160000 (submodule)
    for change in commit.file_changes:
      if change.filename.endswith(b".sh"):
        change.mode = b"100755"
'
--------------------------------------------------
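The commit callback logic can be exercised outside of a rewrite with stand-in objects. `FakeChange` and `FakeCommit` below are hypothetical and model only the fields listed above (`type`, `filename`, `mode`, and `file_changes`); the real classes carry more state.

```python
class FakeChange:
    """Hypothetical stand-in for a FileChange object."""
    def __init__(self, type, filename, mode=None):
        self.type, self.filename, self.mode = type, filename, mode

class FakeCommit:
    """Hypothetical stand-in for a Commit object."""
    def __init__(self, file_changes):
        self.file_changes = file_changes

def commit_callback(commit):
    # Drop executable files with three 6s in their path, and drop the
    # deletion of sources/foo/bar.txt (which keeps that file around).
    commit.file_changes = [
        change for change in commit.file_changes
        if not (change.mode == b"100755" and
                change.filename.count(b"6") == 3) and
           not (change.type == b"D" and
                change.filename == b"sources/foo/bar.txt")]
    # Mark all .sh files as executable.
    for change in commit.file_changes:
        if change.filename.endswith(b".sh"):
            change.mode = b"100755"

commit = FakeCommit([
    FakeChange(b"M", b"bin/666", b"100755"),
    FakeChange(b"D", b"sources/foo/bar.txt"),
    FakeChange(b"M", b"run.sh", b"100644"),
    FakeChange(b"M", b"a6/b66.txt", b"100644"),
])
commit_callback(commit)
print([(c.filename, c.mode) for c in commit.file_changes])
```

The non-executable `a6/b66.txt` survives even though its path contains three 6s, because the filter requires both conditions.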
[[INTERNALS]]
INTERNALS
---------
You probably don't need to read this section unless you are just very
curious or you are trying to do a very complex history rewrite.
How filter-repo works
~~~~~~~~~~~~~~~~~~~~~
Roughly, filter-repo works by running
--------------------------------------------------
git fast-export <options> | filter | git fast-import <options>
--------------------------------------------------
where filter-repo not only launches the whole pipeline but also serves as
the _filter_ in the middle. However, filter-repo does a few additional
things on top in order to make it into a well-rounded filtering tool. A
sequence that more accurately reflects what filter-repo runs is:
1. Verify we're in a fresh clone
2. `git fetch -u . refs/remotes/origin/*:refs/heads/*`
3. `git remote rm origin`
4. `git fast-export --show-original-ids --reference-excluded-parents --fake-missing-tagger --signed-tags=strip --tag-of-filtered-object=rewrite --use-done-feature --no-data --reencode=yes --mark-tags --all | filter | git -c core.ignorecase=false fast-import --date-format=raw-permissive --force --quiet`
5. `git update-ref --no-deref --stdin`, fed with a list of refs to nuke, and a list of replace refs to delete, create, or update.
6. `git reset --hard`
7. `git reflog expire --expire=now --all`
8. `git gc --prune=now`
Some notes or exceptions on each of the above:
1. If we're not in a fresh clone, users will not be able to recover if
they used the wrong command or ran in the wrong repo. (Though
`--force` overrides this check, and it's also off if you've already
run filter-repo once in this repo.)
2. Technically, we actually use a `git update-ref` command fed with a lot
of input due to the fact that users can use `--force` when local
branches might not match remote branches. But this fetch command
catches the intent rather succinctly.
3. We don't want users accidentally pushing back to the original repo, as
discussed in <<DISCUSSION>>. It also reminds users that since history
has been rewritten, this repo is no longer compatible with the
original. Finally, another minor benefit is this allows users to push
with the `--mirror` option to their new home without accidentally
sending remote tracking branches.
4. Some of these flags are always used but others are actually
conditional. For example, filter-repo's `--replace-text` and
`--blob-callback` options need to work on blobs so `--no-data` cannot
be passed to fast-export. But when we don't need to work on blobs,
passing `--no-data` speeds things up. Other flags may also change
the structure of the pipeline (e.g. `--dry-run` and `--debug`).
5. We use this step to write replace refs for accessing the newly written
commit hashes using their previous names. Also, if refs were renamed
by various steps, we need to delete the old refnames in order to avoid
mixing old and new history.
6. Users also have old versions of files in their working tree and index;
we want those cleaned up to match the rewritten history as well. Note
that this step is skipped in bare repos.
7. Reflogs will hold on to old history, so we need to expire them.
8. We need to gc to avoid mixing new and old history. Also, it shrinks
the repository for users, so they don't have to do extra work. (Odds
are that they've only rewritten trees and commits and maybe a few
blobs, so `--aggressive` isn't needed and would be too slow.)
Information about these steps is printed out when `--debug` is passed
to filter-repo. When doing a `--partial` history rewrite, steps 2, 3,
7, and 8 are unconditionally skipped, step 5 is skipped if
`--replace-refs` is `update-no-add`, and just the nuke-unused-refs
portion of step 5 is skipped if `--replace-refs` is something else.
Limitations
~~~~~~~~~~~
Inherited limitations
^^^^^^^^^^^^^^^^^^^^^
Since git filter-repo calls fast-export and fast-import to do a lot of the
heavy lifting, it inherits limitations from those systems:
* extended commit headers, if any, are stripped
* commits get rewritten meaning they will have new hashes; therefore,
signatures on commits and tags cannot continue to work and instead are
just removed (thus signed tags become annotated tags)
* tags of commits are supported. Prior to git-2.24.0, tags of blobs and
tags of tags are not supported (fast-export would die on such tags).
tags of trees are not supported in any git version (since fast-export
ignores tags of trees with a warning and fast-import provides no way to
import them).
* annotated and signed tags outside of the refs/tags/ namespace are not
supported (their location will be mangled in weird ways)
* fast-import will die on various forms of invalid input, such as a
timezone with more than four digits
* fast-export cannot reencode commit messages into UTF-8 if the commit
message is not valid in its specified encoding (in such cases, it'll
leave the commit message and the encoding header alone).
* commits without an author will be given one matching the committer
* tags without a tagger will be given a fake tagger
* references that include commit cycles in their history (which can be
created with linkgit:git-replace[1]) will not be flagged to the user as
an error but will be silently deleted by fast-export as though the
branch or tag contained no interesting files
There are also some limitations due to the design of these systems:
* Trying to insert additional files into the stream can be tricky; since
fast-export only lists file changes in a merge relative to its first
parent, if you insert additional files into a commit that is in the
second (or third or fourth) parent history of a merge, then you also
need to add it to the merge manually. (Similarly, if you change which
parent is the first parent in a merge commit, you need to manually
update the list of file changes to be relative to the new first
parent.)
* fast-export and fast-import work with exact file contents, not patches.
(e.g. "Whatever the current contents of this file, update them to now
have these contents") Because of this, removing the changes made in a
single commit or inserting additional changes to a file in some commit
and expecting them to propagate forward is not something that can be
done with these tools. Use linkgit:git-rebase[1] for that.
Intrinsic limitations
^^^^^^^^^^^^^^^^^^^^^
Some types of filtering have limitations that would affect any tool
attempting to perform them; the most any tool can do is attempt to notify
the user when it detects an issue:
* When rewriting commit hashes in commit messages, there are a variety
of cases when the hash will not be updated (whenever this happens, a
note is written to `.git/filter-repo/suboptimal-issues`):
** if a commit hash does not correspond to a commit in the old repo
** if a commit hash corresponds to a commit that gets pruned
** if an abbreviated hash is not unique
* Pruning of empty commits can cause a merge commit to lose an entire
ancestry line and become a non-merge. If the merge commit had no
changes then it can be pruned too, but if it still has changes it needs
to be kept. This might cause minor confusion since the commit will
likely have a commit message that makes it sound like a merge commit
even though it's not. (Whenever a merge commit becomes a non-merge
commit, a note is written to `.git/filter-repo/suboptimal-issues`)
Issues specific to filter-repo
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* Multiple repositories in the wild have been observed which use a bogus
timezone (`+051800`); a web search will turn up some reports. The intended
timezone wasn't clear or wasn't always the same, so filter-repo replaces it
with a different bogus timezone that fast-import will accept (`+0261`).
* `--path-rename` can result in pathname collisions; to avoid excessive
memory requirements of tracking which files are in all commits or
looking up what files exist with either every commit or every usage of
--path-rename, we just tell the user that they might clobber other
changes if they aren't careful. We can check if the clobbering comes
from another --path-rename without much overhead. (Perhaps in the
future it's worth adding a slow mode to --path-rename that will do the
more exhaustive checks?)
* There is no mechanism for directly controlling which flags are passed
to fast-export (or fast-import); only pre-defined flags can be turned
on or off as a side-effect of other options. Direct control would make
little sense because some options like `--full-tree` would require
additional code in filter-repo (to parse new directives), and others
such as `-M` or `-C` would break assumptions used in other places of
filter-repo.
* Partial-repo filtering, while supported, runs counter to filter-repo's
"avoid mixing old and new history" design. This support has required
improvements to core git as well (e.g. it depends upon the
`--reference-excluded-parents` option to fast-export that was added
specifically for this usage within filter-repo). The `--partial` and
`--refs` options will continue to be supported since there are people
with usecases for them; however, I am concerned that this inconsistency
about mixing old and new history seems likely to lead to user mistakes.
For now, I just hope that long explanations of caveats in the
documentation of these options suffice to curtail any such problems.
Comments on reversibility
^^^^^^^^^^^^^^^^^^^^^^^^^
Some people are interested in reversibility of a rewrite; e.g. rewrite
history, possibly add some commits, then unrewrite and get the original
history back plus a few new "unrewritten" commits. Obviously this is
impossible if your rewrite involves throwing away information
(e.g. filtering out files or replacing several different strings with
`***REMOVED***`), but may be possible with some rewrites. filter-repo is
likely to be a poor fit for this type of workflow for a few reasons:
* most of the limitations inherited from fast-export and fast-import
are of a type that cause reversibility issues
* grafts and replace refs, if present, are used in the rewrite and made
permanent
* rewriting of commit hashes will probably be reversible, but it is
possible for rewritten abbreviated hashes to not be unique even if the
original abbreviated hashes were.
* filter-repo defaults to several forms of irreversible rewriting that
you may need to turn off (e.g. the last two bullet points above or
reencoding commit messages into UTF-8); it's possible that additional
forms of irreversible rewrites will be added in the future.
* I assume that people use filter-repo for one-shot conversions, not
ongoing data transfers. I explicitly reserve the right to change any
API in filter-repo based on this presumption (and a comment to this
effect is found in multiple places in the code and examples). You
have been warned.
SEE ALSO
--------
linkgit:git-rebase[1], linkgit:git-filter-branch[1]
GIT
---
Part of the linkgit:git[1] suite
================================================
FILE: INSTALL.md
================================================
# Table of Contents
* [Pre-requisites](#pre-requisites)
* [Simple Installation](#simple-installation)
* [Installation via Package Manager](#installation-via-package-manager)
* [Detailed installation explanation for
packagers](#detailed-installation-explanation-for-packagers)
* [Installation as Python Package from
PyPI](#installation-as-python-package-from-pypi)
* [Installation via Makefile](#installation-via-makefile)
* [Notes for Windows Users](#notes-for-windows-users)
# Pre-requisites
Instructions on this page assume you have already installed both
[Git](https://git-scm.com) and [Python](https://www.python.org/)
(though the [Notes for Windows Users](#notes-for-windows-users) has
some tips on Python).
# Simple Installation
All you need to do is download one file: the [git-filter-repo script
in this repository](git-filter-repo) ([direct link to raw
file](https://raw.githubusercontent.com/newren/git-filter-repo/main/git-filter-repo)),
making sure to preserve its name (`git-filter-repo`, with no
extension). **That's it**. You're done.
Then you can run any command you want, such as
$ python3 git-filter-repo --analyze
If you place the git-filter-repo script in your $PATH, then you can
shorten commands by replacing `python3 git-filter-repo` with `git
filter-repo`; the manual assumes this but you can use the longer form.
Optionally, if you also want to use some of the contrib scripts, then
you need to make sure you have a `git_filter_repo.py` file which is
either a link to or copy of `git-filter-repo`, and you need to place
that git_filter_repo.py file in $PYTHONPATH.
If you prefer an "official" installation over the manual installation
explained above, the other sections may have useful tips.
# Installation via Package Manager
If you want to install via some [package
manager](https://alternativeto.net/software/yellowdog-updater-modified/?license=opensource),
you can run
$ PACKAGE_TOOL install git-filter-repo
The following package managers have packaged git-filter-repo:
[Packaging status on Repology](https://repology.org/project/git-filter-repo/versions)
This list covers at least Windows (Scoop), Mac OS X (Homebrew), and
Linux (most of the rest). Note that I do not curate this list (and have
no interest in doing so); https://repology.org tracks who packages
these versions.
# Detailed installation explanation for packagers
filter-repo only consists of a few files that need to be installed:
* git-filter-repo
This is the _only_ thing needed for basic use.
This can be installed in the directory pointed to by `git --exec-path`,
or placed anywhere in $PATH.
If your python3 executable is named "python" instead of "python3"
(this particularly appears to affect a number of Windows users),
then you'll also need to modify the first line of git-filter-repo
to replace "python3" with "python".
* git_filter_repo.py
This is needed if you want to make use of one of the scripts in
contrib/filter-repo-demos/, or want to write your own script making use
of filter-repo as a python library.
You can create this symlink to (or copy of) git-filter-repo named
git_filter_repo.py and place it in your python site packages; `python
-c "import site; print(site.getsitepackages())"` may help you find the
appropriate location for your system. Alternatively, you can place
this file anywhere within $PYTHONPATH.
* git-filter-repo.1
This is needed if you want `git filter-repo --help` to succeed in
displaying the manpage, when help.format is "man" (the default on Linux
and Mac).
This can be installed in the directory pointed to by `$(git
--man-path)/man1/`, or placed anywhere in $MANDIR/man1/ where $MANDIR
is some entry from $MANPATH.
Note that `git filter-repo -h` will show a more limited built-in set of
instructions regardless of whether the manpage is installed.
* git-filter-repo.html
This is needed if you want `git filter-repo --help` to succeed in
displaying the html version of the help, when help.format is set to
"html" (the default on Windows).
This can be installed in the directory pointed to by `git --html-path`.
Note that `git filter-repo -h` will show a more limited built-in set of
instructions regardless of whether the html version of help is
installed.
So, installation might look something like the following:
1. If you don't have the necessary documentation files (because you
are installing from a clone of filter-repo instead of from a
tarball) then you can first run:
`make snag_docs`
(which just copies the generated documentation files from the
`docs` branch)
2. Run the following
```
cp -a git-filter-repo $(git --exec-path)
cp -a git-filter-repo.1 $(git --man-path)/man1 && mandb
cp -a git-filter-repo.html $(git --html-path)
ln -s $(git --exec-path)/git-filter-repo \
$(python -c "import site; print(site.getsitepackages()[-1])")/git_filter_repo.py
```
or you can use the provided Makefile, as noted below.
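If you are unsure where `git_filter_repo.py` could go, the following is a minimal Python sketch that gathers the candidate directories mentioned above: any `$PYTHONPATH` entries plus the site-packages directories reported by the running interpreter. The helper name `candidate_module_dirs` is hypothetical and not part of filter-repo; treat the result as a list of suggestions, not as the location filter-repo requires.

```python
import os
import site


def candidate_module_dirs(environ=None):
    """Return directories where git_filter_repo.py could be placed.

    Hypothetical helper for illustration; not part of filter-repo.
    """
    environ = os.environ if environ is None else environ
    dirs = []
    # Entries from $PYTHONPATH, split on the platform's path separator.
    for entry in environ.get("PYTHONPATH", "").split(os.pathsep):
        if entry and entry not in dirs:
            dirs.append(entry)
    # Site-packages directories of the running interpreter. The guard is
    # there because some virtualenv setups lack site.getsitepackages().
    for entry in getattr(site, "getsitepackages", lambda: [])():
        if entry not in dirs:
            dirs.append(entry)
    return dirs
```

Calling `candidate_module_dirs()` with no arguments inspects the current environment; passing a dictionary lets you check a hypothetical `PYTHONPATH` without changing your shell.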
# Installation as Python Package from PyPI
`git-filter-repo` is also available as a
[PyPI package](https://pypi.org/project/git-filter-repo/).
Therefore, it can be installed with [pipx](https://pypa.github.io/pipx/)
or [uv tool](https://docs.astral.sh/uv/concepts/tools/).
Command example for pipx:
`pipx install git-filter-repo`
# Installation via Makefile
Installing should be doable by hand, but a Makefile is provided for those
that prefer it. However, usage of the Makefile really requires overriding
at least a couple of the directories with sane values, e.g.
$ make prefix=/usr pythondir=/usr/lib64/python3.8/site-packages install
Also, the Makefile will not edit the shebang line (the first line) of
git-filter-repo if your python executable is not named "python3";
you'll still need to do that yourself.
# Notes for Windows Users
git-filter-repo can be installed with multiple tools, such as
[pipx](https://pypa.github.io/pipx/) or a Windows-specific package manager
like Scoop (both of which were covered above).
Sadly, Windows sometimes makes things difficult. Common and historical issues:
* **Non-functional Python stub**: Windows apparently ships with a
[non-functional
python](https://github.com/newren/git-filter-repo/issues/36#issuecomment-568933825).
This can even manifest as [the app
hanging](https://github.com/newren/git-filter-repo/issues/36) or
[the system appearing to
hang](https://github.com/newren/git-filter-repo/issues/312). Try
installing
[Python](https://docs.microsoft.com/en-us/windows/python/beginners)
from the [Microsoft
Store](https://apps.microsoft.com/store/search?publisher=Python%20Software%20Foundation).
* **Modifying PATH, making the script executable**: If modifying your PATH
and/or making scripts executable is difficult for you, you can skip that
step by just using `python3 git-filter-repo` instead of `git filter-repo`
in your commands.
* **Different python executable name**: Some users don't have
a `python3` executable but one named something else like `python`
or `python3.8` or whatever. You may need to edit the first line
of the git-filter-repo script to specify the appropriate path. Or
just don't bother and instead use the long form for executing
filter-repo commands. Namely, replace the `git filter-repo` part
of commands with `PYTHON_EXECUTABLE git-filter-repo`. (Where
`PYTHON_EXECUTABLE` is something like `python` or `python3.8` or
`C:\PATH\TO\INSTALLATION\OF\python3.exe` or whatever).
* **Symlink issues**: git_filter_repo.py is supposed to be a symlink to
git-filter-repo, so that it appears to have identical contents.
If your system messed up the symlink (usually meaning it looks like a
regular file with just one line), then delete git_filter_repo.py and
replace it with a copy of git-filter-repo.
* **Old GitBash limitations**: older versions of GitForWindows had an
unfortunate shebang length limitation (see [git-for-windows issue
#3165](https://github.com/git-for-windows/git/pull/3165)). If
you're affected, just use the long form for invoking filter-repo
commands, i.e. replace the `git filter-repo` part of commands with
`python3 git-filter-repo`.
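For the "different python executable name" case above, here is a small Python sketch of how one might look for a usable interpreter name and build the long-form command. The candidate names (`python3`, `python`, `py`) and the helper names are assumptions for illustration only. Finding an executable on PATH does not prove that it works; the non-functional Windows stub described above would still be found.

```python
import shutil


def find_python(candidates=("python3", "python", "py"), which=shutil.which):
    """Return the path of the first candidate found on PATH, or None.

    The candidate names are assumptions; adjust them for your system.
    """
    for name in candidates:
        path = which(name)
        if path:
            return path
    return None


def long_form_command(python_path, script="git-filter-repo", args=("--analyze",)):
    """Build the argument list for the long invocation form."""
    return [python_path, script, *args]
```

The resulting list can be passed to `subprocess.run` from the directory that contains the `git-filter-repo` script.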
For additional historical context, see:
* [#371](https://github.com/newren/git-filter-repo/issues/371#issuecomment-1267116186)
* [#360](https://github.com/newren/git-filter-repo/issues/360#issuecomment-1276813596)
* [#312](https://github.com/newren/git-filter-repo/issues/312)
* [#307](https://github.com/newren/git-filter-repo/issues/307)
* [#225](https://github.com/newren/git-filter-repo/pull/225)
* [#231](https://github.com/newren/git-filter-repo/pull/231)
* [#124](https://github.com/newren/git-filter-repo/issues/124)
* [#36](https://github.com/newren/git-filter-repo/issues/36)
* [this git mailing list
thread](https://lore.kernel.org/git/nycvar.QRO.7.76.6.2004251610300.18039@tvgsbejvaqbjf.bet/)
================================================
FILE: Makefile
================================================
# A bunch of installation-related paths people can override on the command line
DESTDIR = /
INSTALL = install
prefix = $(HOME)
bindir = $(prefix)/libexec/git-core
localedir = $(prefix)/share/locale
mandir = $(prefix)/share/man
htmldir = $(prefix)/share/doc/git-doc
pythondir = $(prefix)/lib64/python3.6/site-packages
default: build
build:
@echo Nothing to do: filter-repo is a script which needs no compilation.
test:
time t/run_coverage
# fixup_locale might matter once we actually have translations, but right now
# we don't. It might not even matter then, because python has a fallback podir.
fixup_locale:
sed -ie s%@@LOCALEDIR@@%$(localedir)% git-filter-repo
# People installing from tarball will already have man1/git-filter-repo.1 and
# html/git-filter-repo.html. But let's support people installing from a git
# clone too; for them, just cheat and snag a copy of the built docs that I
# record in a different branch.
snag_docs: Documentation/man1/git-filter-repo.1 Documentation/html/git-filter-repo.html
Documentation/man1/git-filter-repo.1:
mkdir -p Documentation/man1
git show origin/docs:man1/git-filter-repo.1 >Documentation/man1/git-filter-repo.1
Documentation/html/git-filter-repo.html:
mkdir -p Documentation/html
git show origin/docs:html/git-filter-repo.html >Documentation/html/git-filter-repo.html
install: snag_docs #fixup_locale
$(INSTALL) -Dm0755 git-filter-repo "$(DESTDIR)/$(bindir)/git-filter-repo"
$(INSTALL) -dm0755 "$(DESTDIR)/$(pythondir)"
ln -sf "$(bindir)/git-filter-repo" "$(DESTDIR)/$(pythondir)/git_filter_repo.py"
$(INSTALL) -Dm0644 Documentation/man1/git-filter-repo.1 "$(DESTDIR)/$(mandir)/man1/git-filter-repo.1"
$(INSTALL) -Dm0644 Documentation/html/git-filter-repo.html "$(DESTDIR)/$(htmldir)/git-filter-repo.html"
if which mandb > /dev/null; then mandb; fi
#
# The remainder of the targets are meant for tasks for the maintainer; if they
# don't work for you, I don't care. These tasks modify branches and upload
# releases and whatnot, and presume a directory layout I have locally.
#
update_docs:
# Set environment variables once
export GIT_WORK_TREE=$(shell mktemp -d) \
export GIT_INDEX_FILE=$(shell mktemp) \
COMMIT=$(shell git rev-parse HEAD) \
&& \
# Sanity check; we'll build docs in a clone of a git repo \
test -d ../git && \
# Sanity check; docs == origin/docs \
test -z "$$(git rev-parse docs origin/docs | uniq -u)" && \
# Avoid spurious errors by forcing index to be well formatted, if empty \
git read-tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904 && # empty tree \
# Symlink git-filter-repo.txt documentation into git and build it \
ln -sf ../../git-filter-repo/Documentation/git-filter-repo.txt ../git/Documentation/ && \
make -C ../git/Documentation -j4 man html && \
# Take the built documentation and lay it out nicely \
mkdir $$GIT_WORK_TREE/html && \
mkdir $$GIT_WORK_TREE/man1 && \
cp -a ../git/Documentation/*.html $$GIT_WORK_TREE/html/ && \
cp -a ../git/Documentation/git-filter-repo.1 $$GIT_WORK_TREE/man1/ && \
dos2unix $$GIT_WORK_TREE/html/* && \
# Add new version of the documentation as a commit, if it differs \
git --work-tree $$GIT_WORK_TREE add . && \
git diff --quiet docs || git write-tree \
| xargs git commit-tree -p docs -m "Update docs to $$COMMIT" \
| xargs git update-ref refs/heads/docs && \
# Remove temporary files \
rm -rf $$GIT_WORK_TREE && \
rm $$GIT_INDEX_FILE && \
# Push the new documentation upstream \
git push origin docs && \
# Notify of completion \
echo && \
echo === filter-repo docs branch updated ===
# Call like this:
# make GITHUB_COM_TOKEN=$KEY TAGNAME=v2.23.0 release
release: github_release pypi_release
# Call like this:
# make GITHUB_COM_TOKEN=$KEY TAGNAME=v2.23.0 github_release
github_release: update_docs
FILEBASE=git-filter-repo-$(shell echo $(TAGNAME) | tail -c +2) \
TMP_INDEX_FILE=$(shell mktemp) \
COMMIT=$(shell git rev-parse HEAD) \
&& \
test -n "$(GITHUB_COM_TOKEN)" && \
test -n "$(TAGNAME)" && \
test -n "$$COMMIT" && \
# Make sure we don't have any staged or unstaged changes \
git diff --quiet --staged HEAD && git diff --quiet HEAD && \
# Make sure 'jq' is installed \
type -p jq && \
# Tag the release, push it to GitHub \
git tag -a -m "filter-repo $(TAGNAME)" $(TAGNAME) $$COMMIT && \
git push origin $(TAGNAME) && \
# Create the tarball \
GIT_INDEX_FILE=$$TMP_INDEX_FILE git read-tree $$COMMIT && \
git ls-tree -r docs | grep filter-repo \
| sed -e 's%\t%\tDocumentation/%' \
| GIT_INDEX_FILE=$$TMP_INDEX_FILE git update-index --index-info && \
GIT_INDEX_FILE=$$TMP_INDEX_FILE git write-tree \
| xargs git archive --prefix=$$FILEBASE/ \
| xz -c >$$FILEBASE.tar.xz && \
rm $$TMP_INDEX_FILE && \
# Make GitHub mark our new tag as an official release \
curl -s -H "Authorization: token $(GITHUB_COM_TOKEN)" -X POST \
https://api.github.com/repos/newren/git-filter-repo/releases \
--data "{ \
\"tag_name\": \"$(TAGNAME)\", \
\"target_commitish\": \"$$COMMIT\", \
\"name\": \"$(TAGNAME)\", \
\"body\": \"filter-repo $(TAGNAME)\" \
}" | jq -r .id >asset_id && \
# Upload our tarball \
cat asset_id | xargs -I ASSET_ID curl -s -H "Authorization: token $(GITHUB_COM_TOKEN)" -H "Content-Type: application/octet-stream" --data-binary @$$FILEBASE.tar.xz https://uploads.github.com/repos/newren/git-filter-repo/releases/ASSET_ID/assets?name=$$FILEBASE.tar.xz && \
# Remove temporary file(s) \
rm asset_id && \
# Notify of completion \
echo && \
echo === filter-repo $(TAGNAME) created and uploaded to GitHub ===
pypi_release: # Has an implicit dependency on github_release because...
# Upload to PyPI, automatically picking tag created by github_release
python3 -m venv venv
venv/bin/pip install --upgrade pip
venv/bin/pip install build twine
venv/bin/pyproject-build
# Note: Retrieve "git-filter-repo releases" token; username is 'newren'
venv/bin/twine upload dist/*
# Remove temporary file(s)
rm -rf dist/ venv/ git_filter_repo.egg-info/
# NOTE TO FUTURE SELF: If you accidentally push a bad release, you can remove
# all but the git-filter-repo-$VERSION.tar.xz asset with
# git push --delete origin $TAGNAME
# To remove the git-filter-repo-$VERSION.tar.xz asset as well:
# curl -s -H "Authorization: token $GITHUB_COM_TOKEN" -X GET \
# https://api.github.com/repos/newren/git-filter-repo/releases
# and look for the "id", then run
# curl -s -H "Authorization: token $GITHUB_COM_TOKEN" -X DELETE \
# https://api.github.com/repos/newren/git-filter-repo/releases/$ID
================================================
FILE: README.md
================================================
git filter-repo is a versatile tool for rewriting history, which includes
[capabilities I have not found anywhere
else](#design-rationale-behind-filter-repo). It roughly falls into the
same space of tool as [git
filter-branch](https://git-scm.com/docs/git-filter-branch) but without the
capitulation-inducing poor
[performance](https://public-inbox.org/git/CABPp-BGOz8nks0+Tdw5GyGqxeYR-3FF6FT5JcgVqZDYVRQ6qog@mail.gmail.com/),
with far more capabilities, and with a design that scales usability-wise
beyond trivial rewriting cases. [git filter-repo is now recommended by the
git project](https://git-scm.com/docs/git-filter-branch#_warning) instead
of git filter-branch.
While most users will probably just use filter-repo as a simple command
line tool (and likely only use a few of its flags), at its core filter-repo
contains a library for creating history rewriting tools. As such, users
with specialized needs can leverage it to quickly create [entirely new
history rewriting tools](contrib/filter-repo-demos).
# Table of Contents
* [Prerequisites](#prerequisites)
* [How do I install it?](#how-do-i-install-it)
* [How do I use it?](#how-do-i-use-it)
* [Why filter-repo instead of other alternatives?](#why-filter-repo-instead-of-other-alternatives)
* [filter-branch](#filter-branch)
* [BFG Repo Cleaner](#bfg-repo-cleaner)
* [Simple example, with comparisons](#simple-example-with-comparisons)
* [Solving this with filter-repo](#solving-this-with-filter-repo)
* [Solving this with BFG Repo Cleaner](#solving-this-with-bfg-repo-cleaner)
* [Solving this with filter-branch](#solving-this-with-filter-branch)
* [Solving this with fast-export/fast-import](#solving-this-with-fast-exportfast-import)
* [Design rationale behind filter-repo](#design-rationale-behind-filter-repo)
* [How do I contribute?](#how-do-i-contribute)
* [Is there a Code of Conduct?](#is-there-a-code-of-conduct)
* [Upstream Improvements](#upstream-improvements)
# Prerequisites
filter-repo requires:
* git >= 2.36.0
* python3 >= 3.6
# How do I install it?
While the `git-filter-repo` repository has many files, the main logic
is all contained in a single-file python script named
`git-filter-repo`, which was done to make installation for basic use
on many systems trivial: just place that one file into your $PATH.
See [INSTALL.md](INSTALL.md) for things beyond basic usage or special
cases. The more involved instructions are only needed if one of the
following apply:
* you do not find the above comment about trivial installation intuitively
obvious
* you are working with a python3 executable named something other than
"python3"
* you want to install documentation (beyond the builtin docs shown with -h)
* you want to run some of the [contrib](contrib/filter-repo-demos/) examples
* you want to create your own python filtering scripts using filter-repo as
a module/library
# How do I use it?
For comprehensive documentation:
* see the [user manual](https://htmlpreview.github.io/?https://github.com/newren/git-filter-repo/blob/docs/html/git-filter-repo.html)
* alternative formatting of the user manual is available on various
external sites
([example](https://www.mankier.com/1/git-filter-repo)), for those
that don't like the htmlpreview.github.io layout, though it may
only be up-to-date as of the latest release
If you prefer learning from examples:
* there is a [cheat sheet for converting filter-branch
commands](Documentation/converting-from-filter-branch.md#cheat-sheet-conversion-of-examples-from-the-filter-branch-manpage),
which covers every example from the filter-branch manual
* there is a [cheat sheet for converting BFG Repo Cleaner
commands](Documentation/converting-from-bfg-repo-cleaner.md#cheat-sheet-conversion-of-examples-from-bfg),
which covers every example from the BFG website
* the [simple example](#simple-example-with-comparisons) below may
be of interest
* the user manual has an extensive [examples
section](https://htmlpreview.github.io/?https://github.com/newren/git-filter-repo/blob/docs/html/git-filter-repo.html#EXAMPLES)
* I have collected a set of [example filterings based on user-filed issues](Documentation/examples-from-user-filed-issues.md)
In either case, you may also find the [Frequently Answered Questions](Documentation/FAQ.md) useful.
# Why filter-repo instead of other alternatives?
This was covered in more detail in a [Git Rev News article on
filter-repo](https://git.github.io/rev_news/2019/08/21/edition-54/#an-introduction-to-git-filter-repo--written-by-elijah-newren),
but some highlights for the main competitors:
## filter-branch
* filter-branch is [extremely to unusably
slow](https://public-inbox.org/git/CABPp-BGOz8nks0+Tdw5GyGqxeYR-3FF6FT5JcgVqZDYVRQ6qog@mail.gmail.com/)
([multiple orders of magnitude slower than it should
be](https://git-scm.com/docs/git-filter-branch#PERFORMANCE))
for non-trivial repositories.
* [filter-branch is riddled with
gotchas](https://git-scm.com/docs/git-filter-branch#SAFETY) that can
silently corrupt your rewrite or at least thwart your "cleanup"
efforts by giving you something more problematic and messy than what
you started with.
* filter-branch is [very onerous](#simple-example-with-comparisons)
[to
use](https://github.com/newren/git-filter-repo/blob/a6a6a1b0f62d365bbe2e76f823e1621857ec4dbd/contrib/filter-repo-demos/filter-lamely#L9-L61)
for any rewrite which is even slightly non-trivial.
* the git project has stated that the above issues with filter-branch
cannot be backward compatibly fixed; they recommend that you [stop
using
filter-branch](https://git-scm.com/docs/git-filter-branch#_warning)
* die-hard fans of filter-branch may be interested in
[filter-lamely](contrib/filter-repo-demos/filter-lamely)
(a.k.a. [filter-branch-ish](contrib/filter-repo-demos/filter-branch-ish)),
a reimplementation of filter-branch based on filter-repo which is
more performant (though not nearly as fast or safe as
filter-repo).
* a [cheat
sheet](Documentation/converting-from-filter-branch.md#cheat-sheet-conversion-of-examples-from-the-filter-branch-manpage)
is available showing how to convert example commands from the manual of
filter-branch into filter-repo commands.
## BFG Repo Cleaner
* great tool for its time, but while it makes some things simple, it
is limited to a few kinds of rewrites.
* its architecture is not amenable to handling more types of
rewrites.
* its architecture presents some shortcomings and bugs even for its
intended usecase.
* fans of bfg may be interested in
[bfg-ish](contrib/filter-repo-demos/bfg-ish), a reimplementation of bfg
based on filter-repo which includes several new features and bugfixes
relative to bfg.
* a [cheat
sheet](Documentation/converting-from-bfg-repo-cleaner.md#cheat-sheet-conversion-of-examples-from-bfg)
is available showing how to convert example commands from the manual of
BFG Repo Cleaner into filter-repo commands.
# Simple example, with comparisons
Let's say that we want to extract a piece of a repository, with the intent
of merging just that piece into some other bigger repo. For extraction, we
want to:
* extract the history of a single directory, src/. This means that only
paths under src/ remain in the repo, and any commits that only touched
paths outside this directory will be removed.
* rename all files to have a new leading directory, my-module/ (e.g. so that
src/foo.c becomes my-module/src/foo.c)
* rename any tags in the extracted repository to have a 'my-module-'
prefix (to avoid any conflicts when we later merge this repo into
something else)
## Solving this with filter-repo
Doing this with filter-repo is as simple as the following command:
```shell
git filter-repo --path src/ --to-subdirectory-filter my-module --tag-rename '':'my-module-'
```
(the single quotes are unnecessary, but make it clearer to a human that we
are replacing the empty string as a prefix with `my-module-`)
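To make the effect of the three options concrete, here is a minimal Python sketch of the name mapping they describe. It only models the naming rules stated above; it is not how filter-repo implements them, it ignores commit pruning, and the function names are hypothetical.

```python
def rewrite_path(path, keep_prefix="src/", new_root="my-module/"):
    """Return the new path, or None if the path is filtered out."""
    if not path.startswith(keep_prefix):
        return None  # outside src/: dropped, as with --path src/
    return new_root + path  # as with --to-subdirectory-filter my-module


def rewrite_tag(refname, old="", new="my-module-"):
    """Rename refs/tags/<old>... to refs/tags/<new>...; leave other refs alone."""
    prefix = "refs/tags/" + old
    if not refname.startswith(prefix):
        return refname
    return "refs/tags/" + new + refname[len(prefix):]
```

With these rules, `src/foo.c` maps to `my-module/src/foo.c`, a path such as `docs/x` is dropped, and `refs/tags/v1.0` becomes `refs/tags/my-module-v1.0`.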
## Solving this with BFG Repo Cleaner
BFG Repo Cleaner is not capable of this kind of rewrite; in fact, all
three types of wanted changes are outside of its capabilities.
## Solving this with filter-branch
filter-branch comes with a pile of caveats (more on that below) even
once you figure out the necessary invocation(s):
```shell
git filter-branch \
--tree-filter 'mkdir -p my-module && \
git ls-files \
| grep -v ^src/ \
| xargs git rm -f -q && \
ls -d * \
| grep -v my-module \
| xargs -I files mv files my-module/' \
--tag-name-filter 'echo "my-module-$(cat)"' \
--prune-empty -- --all
git clone file://$(pwd) newcopy
cd newcopy
git for-each-ref --format="delete %(refname)" refs/tags/ \
| grep -v refs/tags/my-module- \
| git update-ref --stdin
git gc --prune=now
```
Some might notice that the above filter-branch invocation will be really
slow due to using --tree-filter; you could alternatively use the
--index-filter option of filter-branch, changing the above commands to:
```shell
git filter-branch \
--index-filter 'git ls-files \
| grep -v ^src/ \
| xargs git rm -q --cached;
git ls-files -s \
| sed "s%$(printf \\t)%&my-module/%" \
| git update-index --index-info;
git ls-files \
| grep -v ^my-module/ \
| xargs git rm -q --cached' \
--tag-name-filter 'echo "my-module-$(cat)"' \
--prune-empty -- --all
git clone file://$(pwd) newcopy
cd newcopy
git for-each-ref --format="delete %(refname)" refs/tags/ \
| grep -v refs/tags/my-module- \
| git update-ref --stdin
git gc --prune=now
```
However, for either filter-branch command there are a pile of caveats.
First, some may be wondering why I list five commands here for
filter-branch. Despite the use of --all and --tag-name-filter, and
filter-branch's manpage claiming that a clone is enough to get rid of
old objects, the extra steps to delete the other tags and do another
gc are still required to clean out the old objects and avoid mixing
new and old history before pushing somewhere. Other caveats:
* Commit messages are not rewritten; so if some of your commit
messages refer to prior commits by (abbreviated) sha1, after the
rewrite those messages will now refer to commits that are no longer
part of the history. It would be better to rewrite those
(abbreviated) sha1 references to refer to the new commit ids.
* The --prune-empty flag sometimes misses commits that should be
pruned, and it will also prune commits that *started* empty rather
than just ended empty due to filtering. For repositories that
intentionally use empty commits for versioning and publishing
related purposes, this can be detrimental.
* The commands above are OS-specific. GNU vs. BSD issues for sed,
xargs, and other commands often trip up users; I think I failed to
get most folks to use --index-filter since the only example in the
filter-branch manpage that both uses it and shows how to move
everything into a subdirectory is linux-specific, and it is not
obvious to the reader that it has a portability issue since it
silently misbehaves rather than failing loudly.
* The --index-filter version of the filter-branch command may be two to
three times faster than the --tree-filter version, but both
filter-branch commands are going to be multiple orders of magnitude
slower than filter-repo.
* Both commands assume all filenames are composed entirely of ascii
characters (even special ascii characters such as tabs or double
quotes will wreak havoc and likely result in missing files or
misnamed files)
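The first caveat, about commit messages referring to old hashes, can be illustrated with a short Python sketch. It assumes you already have a mapping from full old commit hashes to full new ones (with `None` for pruned commits); the function name, the mapping format, and the minimum abbreviation length of 7 are all assumptions for illustration, not filter-repo's actual code. Unknown, pruned, or ambiguous hashes are left untouched.

```python
import re

# Assumption: treat runs of 7 to 40 hex digits as possible commit hashes.
HASH_RE = re.compile(r"\b[0-9a-f]{7,40}\b")


def rewrite_hashes(message, old_to_new):
    """Replace old commit hashes in `message` using `old_to_new`.

    `old_to_new` maps full old hashes to full new hashes, or to None
    for commits that were pruned.
    """
    def replace(match):
        abbrev = match.group(0)
        candidates = [old for old in old_to_new if old.startswith(abbrev)]
        if len(candidates) != 1:
            return abbrev  # unknown or ambiguous abbreviation: keep as-is
        new = old_to_new[candidates[0]]
        if new is None:
            return abbrev  # commit was pruned: nothing to point at
        return new[:len(abbrev)]  # keep the original abbreviation length
    return HASH_RE.sub(replace, message)
```

Note that truncating the new hash to the old abbreviation length is a simplification; the shortened new hash is not guaranteed to be unique in the rewritten repository.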
## Solving this with fast-export/fast-import
One can kind of hack this together with something like:
```shell
git fast-export --no-data --reencode=yes --mark-tags --fake-missing-tagger \
--signed-tags=strip --tag-of-filtered-object=rewrite --all \
| grep -vP '^M [0-9]+ [0-9a-f]+ (?!src/)' \
| grep -vP '^D (?!src/)' \
| perl -pe 's%^(M [0-9]+ [0-9a-f]+ )(.*)$%\1my-module/\2%' \
| perl -pe 's%^(D )(.*)$%\1my-module/\2%' \
| perl -pe s%refs/tags/%refs/tags/my-module-% \
| git -c core.ignorecase=false fast-import --date-format=raw-permissive \
--force --quiet
git for-each-ref --format="delete %(refname)" refs/tags/ \
| grep -v refs/tags/my-module- \
| git update-ref --stdin
git reset --hard
git reflog expire --expire=now --all
git gc --prune=now
```
But this comes with some nasty caveats and limitations:
* The various greps and regex replacements operate on the entire
fast-export stream and thus might accidentally corrupt unintended
portions of it, such as commit messages. If you needed to edit
file contents and thus dropped the --no-data flag, it could also
end up corrupting file contents.
* This command assumes all filenames in the repository are composed
entirely of ascii characters, and also exclude special characters
such as tabs or double quotes. If such a special filename exists
within the old src/ directory, it will be pruned even though it
was intended to be kept. (In slightly different repository
rewrites, this type of editing also risks corrupting filenames
with special characters by adding extra double quotes near the end
of the filename and in some leading directory name.)
* This command will leave behind huge numbers of useless empty
commits, and has no realistic way of pruning them. (And if you
tried to combine this technique with another tool to prune the
empty commits, then you now have no way to distinguish between
commits which were made empty by the filtering that you want to
remove, and commits which were empty before the filtering process
and which you thus may want to keep.)
* Commit messages which reference other commits by hash will now
reference old commits that no longer exist. Attempting to edit
the commit messages to update them is extraordinarily difficult to
add to this kind of direct rewrite.
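The first two caveats can be reproduced with a few lines of Python. The sketch below applies the same pattern as the first `grep -vP` in the pipeline to a handful of made-up stream lines; the sample lines and hashes are invented for illustration, and the quoted line assumes fast-export's C-style quoting of paths that contain special characters such as tabs.

```python
import re

# Same pattern as the first `grep -vP` in the pipeline above.
DROP_RE = re.compile(r"^M [0-9]+ [0-9a-f]+ (?!src/)")


def grep_v(lines, pattern=DROP_RE):
    """Mimic `grep -vP`: keep only lines that do not match the pattern."""
    return [line for line in lines if not pattern.search(line)]


stream = [
    # A file under src/: intended to be kept.
    "M 100644 1111111111111111111111111111111111111111 src/foo.c",
    # A file outside src/: intended to be dropped.
    "M 100644 2222222222222222222222222222222222222222 docs/readme",
    # A file under src/ whose name contains a tab, so the path is quoted.
    'M 100644 3333333333333333333333333333333333333333 "src/tab\\there.c"',
    # A commit-message line that merely looks like a file change.
    "M 100644 abc is the mode and hash I saw",
]

kept = grep_v(stream)
```

Only the first line survives: the quoted `src/` path is dropped because the pattern sees a leading double quote instead of `src/`, and the commit-message line is dropped because the filter cannot tell it apart from a real file change.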
# Design rationale behind filter-repo
None of the existing repository filtering tools did what I wanted;
they all came up short for my needs. No tool provided any of the
first eight traits below I wanted, and no tool provided more than
two of the last four traits either:
1. [Starting report] Provide user an analysis of their repo to help
them get started on what to prune or rename, instead of expecting
them to guess or find other tools to figure it out. (Triggered, e.g.
by running the first time with a special flag, such as --analyze.)
1. [Keep vs. remove] Instead of just providing a way for users to
easily remove selected paths, also provide flags for users to
only *keep* certain paths. Sure, users could work around this by
specifying to remove all paths other than the ones they want to
keep, but the need to specify all paths that *ever* existed in
**any** version of the repository could sometimes be quite
painful. For filter-branch, using pipelines like `git ls-files |
grep -v ... | xargs -r git rm` might be a reasonable workaround
but can get unwieldy and isn't as straightforward for users; plus
those commands are often operating-system specific (can you spot
the GNUism in the snippet I provided?).
1. [Renaming] It should be easy to rename paths. For example, in
addition to allowing one to treat some subdirectory as the root
of the repository, also provide options for users to make the
root of the repository just become a subdirectory. And more
generally allow files and directories to be easily renamed.
Provide sanity checks if renaming causes multiple files to exist
at the same path. (And add special handling so that if a commit
merely copied oldname->newname without modification, then
filtering oldname->newname doesn't trigger the sanity check and
die on that commit.)
1. [More intelligent safety] Writing copies of the original refs to
a special namespace within the repo does not provide a
user-friendly recovery mechanism. Many would struggle to recover
using that. Almost everyone I've ever seen do a repository
filtering operation has done so with a fresh clone, because
wiping out the clone in case of error is a vastly easier recovery
mechanism. Strongly encourage that workflow by [detecting and
bailing if we're not in a fresh
clone](https://htmlpreview.github.io/?https://github.com/newren/git-filter-repo/blob/docs/html/git-filter-repo.html#FRESHCLONE),
unless the user overrides with --force.
1. [Auto shrink] Automatically remove old cruft and repack the
repository for the user after filtering (unless overridden); this
simplifies things for the user, helps avoid mixing old and new
history together, and avoids problems where the multi-step
process for shrinking the repo documented in the manpage doesn't
actually work in some cases. (I'm looking at you,
filter-branch.)
1. [Clean separation] Avoid confusing users (and prevent accidental
re-pushing of old stuff) due to mixing old repo and rewritten
repo together. (This is particularly a problem with filter-branch
when using the --tag-name-filter option, and sometimes also an
issue when only filtering a subset of
SYMBOL INDEX (19 symbols across 8 files)
FILE: t/t9391/commit_info.py
function change_up_them_commits (line 14) | def change_up_them_commits(commit, metadata):
FILE: t/t9391/erroneous.py
function handle_tag (line 11) | def handle_tag(tag):
FILE: t/t9391/file_filter.py
function drop_file_by_contents (line 12) | def drop_file_by_contents(blob, metadata):
function drop_files_by_name (line 17) | def drop_files_by_name(commit, metadata):
FILE: t/t9391/print_progress.py
function print_progress (line 22) | def print_progress():
function my_blob_callback (line 27) | def my_blob_callback(blob, metadata):
function my_commit_callback (line 32) | def my_commit_callback(commit, metadata):
FILE: t/t9391/rename-master-to-develop.py
function my_commit_callback (line 11) | def my_commit_callback(commit, metadata):
FILE: t/t9391/splice_repos.py
class InterleaveRepositories (line 18) | class InterleaveRepositories:
method __init__ (line 19) | def __init__(self, repo1, repo2, output_dir):
method skip_reset (line 27) | def skip_reset(self, reset, metadata):
method hold_commit (line 30) | def hold_commit(self, commit, metadata):
method weave_commit (line 35) | def weave_commit(self, commit, metadata):
method run (line 58) | def run(self):
FILE: t/t9391/strip-cvs-keywords.py
function strip_cvs_keywords (line 12) | def strip_cvs_keywords(blob, metadata):
FILE: t/t9391/unusual.py
function track_everything (line 24) | def track_everything(obj, *_ignored):
function handle_progress (line 39) | def handle_progress(progress):
function handle_checkpoint (line 43) | def handle_checkpoint(checkpoint_object):
function look_for_reset (line 112) | def look_for_reset(obj, metadata):