Deduplicate an orderly archive
Deduplicate an orderly archive. Deduplicating an orderly archive replaces all files that have the same content with "hard links". This requires hard link support in the underlying operating system, which is available on all Unix-like systems (e.g. macOS and Linux) and on Windows since Vista. However, on Windows systems this might require somewhat elevated privileges. If you use this feature, it is very important that you treat your orderly archive as read-only (as you should anyway), because changing one copy of a linked file changes all the other instances of it: the files are literally the same file.
The path to an orderly root directory, or
NULL (the default) to search for one from the current working directory if
Logical, indicating if the configuration should be searched for. If
config is not given, then orderly looks in the working directory and up through its parents until it finds an
Logical, indicating if the deduplication should be planned but not run
Logical, indicating if the status should not be printed
Invisibly, information about the duplication status of the archive before deduplication was run.
This function will alter your orderly archive. Ordinarily this is not something that should be done, so we try to be careful. In order for this to work, it is very important to treat your orderly archive as read-only generally. If your canonical orderly archive is behind OrderlyWeb this will almost certainly be the case already.
With "hard linking", two files with the same content can be updated so that both point at the same physical bit of data. This is useful because, if the file is large, only one copy needs to be stored. However, it also means that a change made to one copy of the file is immediately reflected in the other, and there is nothing to indicate that the files are linked!
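The behaviour described above can be seen with a minimal sketch in base R. It works in a temporary directory rather than in an orderly archive, and the file names a.txt and b.txt are made up for illustration. It assumes the temporary directory sits on a filesystem that supports hard links.

```r
# Illustrative only: a scratch directory, not an orderly archive
dir <- tempfile("hardlink-demo-")
dir.create(dir)
a <- file.path(dir, "a.txt")  # hypothetical file names
b <- file.path(dir, "b.txt")
writeLines("original", a)

# file.link() creates a hard link: 'b' becomes a second name for the
# same data as 'a'. It returns FALSE if the link could not be made.
linked <- file.link(a, b)

# Append to the file through one name...
cat("changed via a\n", file = a, append = TRUE)

# ...and read it back through the other name
contents_b <- if (linked) readLines(b) else character(0)
print(contents_b)
```

If the link was created, reading b.txt shows the line that was appended through a.txt, even though b.txt was never written to directly. This is why a deduplicated archive needs to be treated as read-only.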
This approach is worth exploring if you have large files that are
outputs of one report and inputs to another, or large inputs
repeatedly used in different reports, or outputs that end up being
the same in multiple reports. If you run the deduplication with
dry_run = TRUE, an indication of the savings will be printed.
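To get a rough feel for how such savings arise, the sketch below hashes every file under a directory and counts the files whose content has already been seen. This is not how orderly computes its report (orderly works from the files it tracks in the archive). The helper name estimate_duplicates and the demo files are hypothetical, and MD5 is used only as a convenient content fingerprint.

```r
# Hypothetical helper: estimate duplicated content under a directory
estimate_duplicates <- function(dir) {
  files <- list.files(dir, recursive = TRUE, full.names = TRUE)
  hash <- unname(tools::md5sum(files))  # content fingerprint per file
  size <- file.size(files)
  dup <- duplicated(hash)               # TRUE for 2nd and later copies
  list(n_files = length(files),
       total_size = sum(size),
       n_duplicates = sum(dup),
       duplicated_size = sum(size[dup]))
}

# Small demonstration in a scratch directory: x.txt and y.txt share
# content, z.txt differs
demo <- tempfile("dedup-demo-")
dir.create(demo)
writeLines("same content", file.path(demo, "x.txt"))
writeLines("same content", file.path(demo, "y.txt"))
writeLines("different content", file.path(demo, "z.txt"))

res <- estimate_duplicates(demo)
str(res)
```

Here duplicated_size is the space taken by the second and later copies of each distinct file, which is roughly what replacing them with hard links would free.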
path <- orderly::orderly_example("demo")
id1 <- orderly::orderly_run("minimal", root = path)
#> [ name ] minimal
#> [ id ] 20221116-140027-9cede1d0
#> [ start ] 2022-11-16 14:00:27
#> [ data ] source => dat: 20 x 2
#>
#> > png("mygraph.png")
#>
#> > par(mar = c(15, 4, 0.5, 0.5))
#>
#> > barplot(setNames(dat$number, dat$name), las = 2)
#>
#> > dev.off()
#> agg_png
#> 2
#> [ end ] 2022-11-16 14:00:27
#> [ elapsed ] Ran report in 0.01797462 secs
#> [ artefact ] mygraph.png: 175369b2bcf4115f343c8ad746c0c072
id2 <- orderly::orderly_run("minimal", root = path)
#> [ name ] minimal
#> [ id ] 20221116-140027-ab5c125e
#> [ start ] 2022-11-16 14:00:27
#> [ data ] source => dat: 20 x 2
#>
#> > png("mygraph.png")
#>
#> > par(mar = c(15, 4, 0.5, 0.5))
#>
#> > barplot(setNames(dat$number, dat$name), las = 2)
#>
#> > dev.off()
#> agg_png
#> 2
#> [ end ] 2022-11-16 14:00:27
#> [ elapsed ] Ran report in 0.01707554 secs
#> [ artefact ] mygraph.png: 175369b2bcf4115f343c8ad746c0c072
orderly_commit(id1, root = path)
#> [ commit ] minimal/20221116-140027-9cede1d0
#> [ copy ]
#> [ import ] minimal:20221116-140027-9cede1d0
#> [ success ] :)
#>  "/tmp/Rtmpj7MjPt/file46606ca6e9/archive/minimal/20221116-140027-9cede1d0"
orderly_commit(id2, root = path)
#> [ commit ] minimal/20221116-140027-ab5c125e
#> [ copy ]
#> [ import ] minimal:20221116-140027-ab5c125e
#> [ success ] :)
#>  "/tmp/Rtmpj7MjPt/file46606ca6e9/archive/minimal/20221116-140027-ab5c125e"
tryCatch(
  orderly::orderly_deduplicate(path, dry_run = TRUE),
  error = function(e) NULL)
#> Deduplication information for
#> /tmp/Rtmpj7MjPt/file46606ca6e9/archive
#> - 6 tracked files
#> - 20.43 kB total size
#> - 3 duplicate files
#> - 10.22 kB duplicated size
#> - 0 deduplicated files
#> - 0 B deduplicated size
#> - 0 untracked files
#> - 0 B untracked size