Detection and removal of duplicate files in enterprise environments is significantly more complicated
and therefore requires more features and capabilities from a potential solution in order to be performed effectively
and accurately. In general, enterprise storage pools may be divided into two broad categories: organized
storage pools and unorganized storage pools. Organized storage pools are intended for well-defined purposes,
and consequently their storage hierarchy and directory structures are strictly defined for those
purposes. Unorganized storage pools are typically used for storing personal user directories
and other unmanaged data.
In an enterprise storage environment, duplicate files may be produced by people, applications and operating
systems running on personal computers and corporate servers. Operating systems and enterprise applications
operate according to their own internal logic, so touching duplicate files located in operating system
directories or application-specific directories may be very dangerous and should be avoided. On the other hand,
duplicate files located in directories managed by people may be accurately detected and removed while preserving
access to the original files at their designated locations.
Detection of duplicate files is, in principle, a relatively simple process: compare the contents of files that have
the same file size and you will know exactly which files are identical. The problem begins when you need to search for
duplicate files among many thousands or even millions of files in an enterprise environment. Only a few duplicate file
finders available today are capable of processing more than 100,000 files, which makes most of them impractical for the
large numbers of files stored in a typical enterprise storage environment. For more information about the expected
performance, refer to the duplicate files search benchmark.
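The size-then-content comparison described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not FlexTk's implementation: the function names are hypothetical, SHA-256 digests stand in for a byte-by-byte comparison, and the input directories are assumed not to overlap.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_sets(root_dirs):
    """Return a list of duplicate sets, each a sorted list of file paths."""
    # Step 1: group files by size, which is cheap and needs no file reads.
    by_size = defaultdict(list)
    for root in root_dirs:
        for folder, _subdirs, names in os.walk(root):
            for name in names:
                path = os.path.join(folder, name)
                if os.path.isfile(path) and not os.path.islink(path):
                    by_size[os.path.getsize(path)].append(path)

    # Step 2: hash only the files that share their size with another file.
    duplicate_sets = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[sha256_of(path)].append(path)
        for group in by_digest.values():
            if len(group) > 1:
                duplicate_sets.append(sorted(group))
    return duplicate_sets
```

Grouping by size first means that most files are never read at all, which is what keeps the approach workable as the number of files grows.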
The large number of files to be processed in enterprise storage environments makes it impossible to
manually review all the detected duplicate file sets and therefore requires some kind of automation that
should be capable of:
- Accurately distinguishing between one or more duplicate files and the original file in each duplicate file set.
- Making an automatic selection of user-defined duplicate removal actions for each specific duplicate file set according to user-controllable rules and policies.
- Automatically executing duplicate removal actions in duplicate file sets with accurately detected original files and user-defined removal actions.
Suppose you have two duplicate files located in the home directories of two different users. In this case,
it is impossible to make any reliable assumption about which file is the original and which is the duplicate. It is possible
to compare the files' modification times and assume that the older file is the original, but in this specific
situation it is better for a human being to make the final decision.
Another situation is when you have two or more duplicate files with one of them located in an organized storage pool.
For example, suppose we have two documents with one of them located in a user's home directory and the second located
in a designated corporate directory intended for business related documents. In this case, it may be assumed quite
accurately that the file located in the designated directory is the original and the file located in the user's
home directory is a duplicate.
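The two situations above can be expressed as a small location-based rule. The sketch below is illustrative only: the pool list reuses the tutorial's K:\data example directory, and the function names are hypothetical rather than part of FlexTk.

```python
from pathlib import PureWindowsPath

# Assumption: K:\data is the organized pool, as in this tutorial's example.
ORGANIZED_POOLS = [PureWindowsPath(r"K:\data")]


def in_organized_pool(path):
    """True when the file lives under one of the organized storage pools."""
    parents = PureWindowsPath(path).parents
    return any(pool in parents for pool in ORGANIZED_POOLS)


def detect_original(duplicate_set):
    """Return the original file, or None when a person should decide."""
    candidates = [path for path in duplicate_set if in_organized_pool(path)]
    # Exactly one copy in an organized pool: treat it as the original.
    if len(candidates) == 1:
        return candidates[0]
    # Zero or several candidates: no reliable assumption can be made.
    return None
```

Returning None for two copies in home directories mirrors the advice above to leave that decision to a human being.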
For additional accuracy, the original detection process may be performed using multiple rules such as the file type,
location, size, owner, etc. Once we have detected the original file in each duplicate file set, we can assign specific
duplicate removal actions for each specific duplicate file type. For example, duplicate documents may be
linked to the original, duplicate reports older than 1 year moved to an archive directory and duplicate media
files (music, videos and images) deleted.
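As a rough illustration of such a per-type policy, the sketch below maps a duplicate file to one of the three actions from the example. The extension sets, the '.rpt' report extension and the action names are assumptions made for this sketch only; they are not FlexTk settings.

```python
from datetime import datetime, timedelta
from pathlib import PureWindowsPath

# Hypothetical extension groups; adjust them to the file types in use.
DOCUMENT_EXTS = {".doc", ".docx", ".pdf", ".txt"}
REPORT_EXTS = {".rpt"}
MEDIA_EXTS = {".mp3", ".wav", ".mp4", ".avi", ".jpg", ".png"}


def choose_action(path, modified, now):
    """Pick a removal action for a duplicate (never for the original)."""
    ext = PureWindowsPath(path).suffix.lower()
    if ext in MEDIA_EXTS:
        return "delete"
    if ext in REPORT_EXTS:
        # Only reports older than one year are moved to the archive.
        if now - modified > timedelta(days=365):
            return "move_to_archive"
        return "keep"
    if ext in DOCUMENT_EXTS:
        return "replace_with_link"
    return "keep"  # unknown types are left untouched
```

Defaulting to "keep" for anything unrecognized keeps the policy conservative, which matters when actions are executed without manual review.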
The FlexTk file management toolkit allows one to search for duplicate files, accurately detect original files in each
specific duplicate files set and automatically execute user-defined duplicates removal actions (FlexTk Ultimate only).
Now let's define an example duplicate files search command showing how to use all the mentioned features and capabilities.
In order to do that, start FlexTk's main GUI application, select the user-defined commands tool pane and select the
'Add New - Duplicates Search Command' menu item.
On the 'Inputs' dialog, add all the input directories that should be processed. For this specific tutorial we have prepared
two directories: the first one (K:\home) containing all users' personal directories and the second one (K:\data) containing
an organized directory structure with purpose-specific directories. After adding the input directories, press the 'Next' button.
The 'General' tab allows one to control the signature type, the file scanning mode, the maximum number of displayed duplicate
file sets and the file scanning filter. The signature type parameter controls the type of the file signature algorithm used
to detect duplicate files. The SHA256 algorithm is the most reliable one and is used by default. In the sequential
file scanning mode, FlexTk scans all input directories one after another in the order in which they were specified on the 'Inputs' dialog.
This is the most effective way to scan files located on a single physical disk. If you need to process multiple input directories
located on multiple physical disks, an enterprise storage system or a disk array (RAID), use the parallel file scanning mode,
which delivers better performance when processing a large number of files.
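The difference between the two scanning modes can be sketched as follows. This is a generic illustration using one worker thread per input directory, not a description of how FlexTk schedules its scans; the function names are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def list_files(root):
    """Walk one input directory and return the paths of all files in it."""
    found = []
    for folder, _subdirs, names in os.walk(root):
        found.extend(os.path.join(folder, name) for name in names)
    return found


def scan_directories(input_dirs, parallel=False):
    """Scan the input directories sequentially or in parallel."""
    if not parallel:
        # Sequential mode: one directory after another, in the given order.
        per_dir = [list_files(root) for root in input_dirs]
    else:
        # Parallel mode: one worker per directory; map() keeps input order.
        workers = max(1, len(input_dirs))
        with ThreadPoolExecutor(max_workers=workers) as pool:
            per_dir = list(pool.map(list_files, input_dirs))
    return [path for paths in per_dir for path in paths]
```

On a single physical disk the parallel variant mostly adds contention for the same device, which is consistent with the recommendation above to prefer the sequential mode there.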
The maximum number of duplicate file sets controls the number of duplicate file sets displayed on the results dialog.
After finishing the search process, FlexTk sorts all the detected duplicate file sets by the amount of the wasted storage space
and displays the top X file sets as specified by this parameter. The file filter provides the user with the ability to limit
the duplicates search process to a specific file type or a custom file set matching the specified file scanning filter.
For example, in order to search for duplicate PDF documents only, set the file scanning filter to '*.pdf'. This file scanning filter
will match all files with the extension PDF (PDF Documents) and skip all other files.
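The filter and the top-X ranking can be illustrated with a short sketch. It assumes that the wasted space of a duplicate file set is the file size multiplied by the number of redundant copies, and that the filter is matched against the file name without regard to case; both are assumptions of this example, and all names are hypothetical.

```python
import fnmatch


def matches_filter(path, pattern="*.pdf"):
    """Case-insensitive wildcard match against the file name only."""
    name = path.replace("\\", "/").rsplit("/", 1)[-1]
    return fnmatch.fnmatchcase(name.lower(), pattern.lower())


def wasted_space(file_size, copies):
    """Bytes that would be freed if only one copy of the file were kept."""
    return file_size * (copies - 1)


def top_sets(duplicate_sets, limit):
    """Return the `limit` sets wasting the most space, largest first.

    Each duplicate set is a (file_size, [paths]) tuple.
    """
    ranked = sorted(
        duplicate_sets,
        key=lambda item: wasted_space(item[0], len(item[1])),
        reverse=True,
    )
    return ranked[:limit]
```

Ranking by wasted space rather than by file size means that a set of many small copies can appear above a set of two larger ones.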
The 'Rules' tab allows one to specify multiple file matching rules that should be used during the duplicates search process.
If there are no file matching rules defined in the 'Rules' tab, FlexTk will process all file types. Otherwise, FlexTk will
process files matching the specified rules only. For detailed information about how to use file matching rules refer to the
advanced, rule-based search tutorial.
The 'Performance' tab provides the user with the ability to customize the duplicates search process for user-specific storage
configurations and performance requirements. FlexTk is optimized for multi-core/multi-CPU computers and advanced RAID storage
systems and is capable of scanning multiple file systems in parallel. In order to speed up the duplicates search process, use multiple
processing threads when searching through input directories located on multiple physical hard disks or a RAID disk array.
In addition, in order to minimize the potential performance impact on running production systems, FlexTk allows one to intentionally
slow down the duplicates search process. According to your specific needs, select the 'Full Speed', 'Medium Speed', 'Low Speed'
or 'Manual Control' performance mode.
The 'Exclude' tab allows one to specify a list of directories that should be excluded from the duplicates search process.
Directories containing operating system files may have a large number of duplicate files that should not be removed.
Duplicates located in the Windows system directories may be critical to the proper operation of the operating system
and it is highly recommended to avoid touching any files in these directories. By default, FlexTk populates the list of exclude
directories from the global list of exclude directories, which may be modified on the FlexTk options dialog's 'Exclude' tab.
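Conceptually, an exclude list is a prefix test applied to every path before it is considered. The sketch below shows one way to write such a test; the two directories in the list are examples chosen for illustration and are not FlexTk's actual defaults.

```python
from pathlib import PureWindowsPath

# Hypothetical exclude list; the real list is configured in FlexTk itself.
EXCLUDED_DIRS = [
    PureWindowsPath(r"C:\Windows"),
    PureWindowsPath(r"C:\Program Files"),
]


def is_excluded(path, excluded=EXCLUDED_DIRS):
    """True when the path is an excluded directory or lies inside one."""
    candidate = PureWindowsPath(path)
    return any(
        candidate == directory or directory in candidate.parents
        for directory in excluded
    )
```

Comparing path objects rather than raw strings avoids false matches such as C:\WindowsOld and keeps the check case-insensitive, as Windows paths are.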
The 'Actions' tab is the place where the user can define original file detection rules and automatic duplicates removal policies.
FlexTk allows one to specify multiple actions intended for detection and removal of different types of duplicate files. In order
to add an action, press the 'Add' button. The 'Duplicate Files Action' dialog provides the 'Action' combo box, a list of rules
and the original detection type combo box. Set the action type to 'Replace with Links', add one or more original detection rules
and set the original detection mode to 'Detected by Rules'. After finishing adding all the required duplicate removal actions,
set the actions mode to 'Auto-Select' and press the 'Save' button.
In the 'Auto-Select' actions mode, FlexTk evaluates duplicate files and tries to detect the original file in each set of duplicate
files according to the specified original detection rules and policies. Actions containing original file detection rules are
evaluated one after another in the order in which they are specified in the actions list. If a file in a duplicate set matches the rules
defined in an action, that file is set as the original and the matching action is set as the active action for the whole duplicate set.
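The first-match evaluation order described above can be sketched as follows. The Action class, the under() helper and the example directories are hypothetical and exist only to illustrate the ordering; they do not reflect FlexTk's internal data model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    name: str                           # e.g. 'Replace with Links'
    is_original: Callable[[str], bool]  # original detection rule


def under(prefix):
    """Build a rule matching files below a directory (case-insensitive)."""
    normalized = prefix.lower().rstrip("\\") + "\\"
    return lambda path: path.lower().startswith(normalized)


def auto_select(duplicate_set: List[str],
                actions: List[Action]) -> Optional[Tuple[str, str]]:
    """Return (original, active action name), or None if nothing matches."""
    for action in actions:          # actions are tried in list order
        for path in duplicate_set:
            if action.is_original(path):
                return path, action.name  # first match wins for the set
    return None                     # no original detected: leave the set alone


ACTIONS = [
    Action("Replace with Links", under(r"K:\data\docs")),
    Action("Delete", under(r"K:\data\media")),
]
```

Because the first matching action wins, the order of the actions list acts as a priority order: more specific rules belong above more general ones.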
Now you have a user-defined duplicates search command that is capable of automatically detecting original files and assigning your
specific duplicates removal actions to the detected duplicate file sets. In order to execute the newly created command,
click on the command item in the user-defined commands tool pane. After finishing the search process, FlexTk will display
the duplicates results dialog showing all the detected duplicate file sets.
All duplicate files in sets with detected originals will be automatically selected and the duplicates removal action will be set
to the user-specified action. Press the 'Preview' button to see the final list of actions that are going to be executed. Once you
have finished tuning the user-defined duplicates search command and verified that original files are detected accurately, you can set the
actions mode, located on the 'Actions' tab, to 'Execute'. In the 'Execute' mode, FlexTk will automatically execute duplicates
removal actions for all duplicate file sets with detected original files.
Once configured and tuned, a user-defined duplicates search command may be executed automatically at specific time intervals
using a general purpose command scheduler such as the Windows Task Scheduler. For example, by using FlexTk's command line tools
in conjunction with user-defined commands, the user may configure FlexTk to fully automatically search and remove duplicate files
from specific directories, servers or enterprise storage systems once a week or month.