May 15, 2022
The Psychology of Money: Timeless Lessons on Wealth, Greed, and Happiness by Morgan Housel
- Wealth is what you don't see.
- Richness is what you can see, and it may well be funded by high leverage (debt).
- A high savings rate will leave you quite wealthy over time.
- The amount you can save is the difference between your income and your ego.
- Invest in the way YOU are comfortable with and to meet YOUR goals. Do not blindly copy or envy others.
- Trying to keep up with the Joneses? Think again.
- The highest return that money provides is the freedom to do what you want, when you want, with whom you want, for as long as you want. In other words, TIME is the highest return that money can provide.
October 8, 2016
x86 vs. x64 assembly language
2. In 64bit mode, instructions use the REX prefix byte to specify 64bit operands. The REX byte is placed immediately before the first opcode byte and after all legacy prefixes.
3. 64bit mode has more registers available, so more arguments are passed via registers rather than on the stack. The call sequence for printf is therefore shorter in 64bit mode.
4. The REX prefix occupies the range 40h - 4Fh, so the single-byte opcode forms of the INC/DEC instructions, which occupy the same range, are not available in 64bit mode. The ModR/M forms of these instructions (FFh) are still available.
[REX byte]
This is a byte-sized prefix used in 64bit mode in order to specify 64bit operands and register sizes other than the default 32 bits. The format of the REX byte is -
Bit position - 7 6 5 4 3 2 1 0
Bit value - 0 1 0 0 W R X B
The high nibble is always 0100 which is 4h. So the REX prefix range is 40h – 4Fh.
W bit : 1 – 64bit operand; 0 – operand size determined by the CS.D bit (code segment descriptor; refer to the Intel system programming manual).
R bit : Adds a 4th bit to the reg field in the ModR/M byte, enabling addressing of the new registers (R8-R15).
X bit : Adds a 4th bit to the Index field in the SIB byte.
B bit : Adds a 4th bit to the r/m field in the ModR/M byte, the Base field in the SIB byte, or the opcode's reg field.
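To make the bit layout concrete, here is a minimal C sketch (my illustration, not from the original post) that pulls the four flags out of a prefix byte that is already known to be in the REX range:

#include <stdio.h>

int main(void)
{
    unsigned char rex = 0x4C; /* the REX prefix from the second MOV dissected below */

    if ((rex & 0xF0) != 0x40) {
        printf("Not a REX prefix\n");
        return 1;
    }

    printf("REX.W = %u (1 => 64bit operand)\n", (rex >> 3) & 1u);
    printf("REX.R = %u (extends the ModR/M reg field)\n", (rex >> 2) & 1u);
    printf("REX.X = %u (extends the SIB Index field)\n", (rex >> 1) & 1u);
    printf("REX.B = %u (extends r/m, SIB Base, or opcode reg)\n", rex & 1u);
    return 0;
}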
[Dissecting two instructions]
00007FF780AE17F5 48 C7 45 28 00 00 00 00 mov qword ptr [f0], 0
REX = 48 : which means only the W bit is set, indicating 64bit operand (the qword ptr [f0] operand).
00007FF780AE1876 4C 8B 45 68 mov r8, qword ptr [fn]
REX = 4C : which means both W and R bits are set. W indicates 64bit operand and the R bit adds a 4th bit to the reg field in ModRM byte.
MOV = 8B : this is 64bit register/memory to 64bit register MOV instruction
ModRM = 45 : Mod = 01, R/M = 101 – this is [RBP]+disp so this is the effective address of 'fn'. Register R8 is identified by the 4bit reg field 1000.
Mod | (REX.R) Reg | R/M
 01 | (1) 000     | 101
(the R bit from REX adds a 4th bit to the reg field, making it 1000b = R8)
Disp = 68 : This is the 8bit displacement added to RBP to get the effective address of 'fn'.
For reference, here is the C program that produced the instructions above:

int main()
{
    UINT n = 5;
    size_t f0 = size_t(0);
    size_t f1 = size_t(1);
    size_t fn = f0;

    if (n == 1)
    {
        fn = f0 + f1;
    }
    else if (n > 1)
    {
        for (UINT itr = 2; itr <= n; ++itr)
        {
            fn = f0 + f1;
            f0 = f1;
            f1 = fn;
        }
    }

    printf("Fibonacci(%u) = %Iu\n", n, fn);
    return 0;
}
October 23, 2015
Using CRITICAL_SECTION internals to debug
For the sake of this post, I'm going to call the class BadClass and the access-violating member function funcAV(). BadClass has a critical section data member that I'll call m_csBadClass. funcAV() entered the critical section as its very first line of code, then checked whether the m_fShutdown boolean data member was set, and left the critical section just before returning. The function called several other functions while inside the critical section. The class also had a shutdown() function that was always invoked before deletion; it entered the same critical section, set the m_fShutdown flag and returned.
void BadClass::funcAV() {
    EnterCriticalSection(&m_csBadClass);
    if (!m_fShutdown) {
        // do some work
        // call other functions
        // use data member, say, m_intA <-- access violation reading this
    }
    LeaveCriticalSection(&m_csBadClass);
}

void BadClass::shutdown() {
    EnterCriticalSection(&m_csBadClass);
    m_fShutdown = true;
    LeaveCriticalSection(&m_csBadClass);
}
Since the shutdown() function was always called before object deletion, and knowing that shutdown() entered the m_csBadClass critical section, I concluded that the object was being deleted by the same thread that was executing funcAV(): the object was deleted while funcAV() was inside the critical section, and any other thread calling shutdown() would have blocked on EnterCriticalSection. That is, some callee-of-callee-of-funcAV was deleting the object. Remember that critical sections can be entered recursively by the same thread. You must be wondering now, "Dude, why can't you just place a breakpoint/assertion in the class destructor to see who was calling the destructor?". I could, but not in this case, because there were legitimate cases where the destructor is called and everything is fine. I had to break only when the object was being deleted while funcAV() was still on the callstack and inside the critical section.
Knowing that the only way for the BadClass object to get deleted was with the critical section being entered recursively, I placed an assertion in BadClass::shutdown() like this:
void BadClass::shutdown() {
    EnterCriticalSection(&m_csBadClass);
    NT_ASSERT(m_csBadClass.RecursionCount < 2);
    m_fShutdown = true;
    LeaveCriticalSection(&m_csBadClass);
}
Then I let the code run in our test environment overnight, and I had a couple of crash dumps the next morning. These new crashes, to my relief, were from the assertion! Now I knew the exact code path that was causing the object to be deleted, and the fix was easy once the root cause was known. The code path was something like this:
ntdll!breakpoint <-- for the assertion
myDll!BadClass::shutdown
... couple of other functions
myDll!ClassC::func2
myDll!ClassB::func1
myDll!BadClass::funcAV
ntdll!thread-start-function
So knowing the internals of the CRITICAL_SECTION structure certainly made figuring out this bug easy. Other fields in the structure also offer a lot of insight that may one day be useful -
struct _RTL_CRITICAL_SECTION {
    PRTL_CRITICAL_SECTION_DEBUG DebugInfo;
    LONG LockCount;
    LONG RecursionCount;
    HANDLE OwningThread;
    HANDLE LockSemaphore;
    ULONG_PTR SpinCount;
};
DebugInfo - Pointer to an RTL_CRITICAL_SECTION_DEBUG structure which gives more information about this critical section. This structure is actually part of a linked list whose elements are the debug structures of all critical sections used in the current process. This helps in detecting deadlocks at runtime and serves other debugging purposes.
LockCount - This field name is a little misleading because it actually packs more than one datum:
- Bit0 - Whether the critical section is unlocked (1) or locked (0)
- Bit1 - Whether the critical section has woken up a waiting thread (0) or not (1)
- Bit2 thru 31 - The number of threads waiting on this critical section (stored in complemented form, which is why the macro in the sample program below inverts the bits)
OwningThread - ThreadID of the thread that currently owns this critical section. Zero if the critical section is unlocked, i.e., not owned by any thread.
LockSemaphore - Misleading field name. Usually zero, but in cases where it isn't, it is a HANDLE to an event used by the kernel to signal a waiting thread when another thread leaves this critical section. That is, the kernel internally uses this event to wake up one of the waiting threads when the contended critical section becomes free. I've observed that this is always -1 (see example below) after introducing a contention situation, so I will dig up some more info on how exactly this can be an event handle.
SpinCount - Putting a thread into a waiting state during contention for the critical section is quite an expensive operation because it involves transitioning into kernel mode. If we know that a particular critical section will be held for very short periods at a time, but multiple threads use it, the SpinCount can be given a positive number, say 1000000. This indicates that when there is contention, the thread must loop until the critical section becomes free or the number of loop iterations reaches the SpinCount value. This is essentially a user-mode busy-wait to see if the contention gets resolved, and it avoids transitioning into kernel mode for a critical section that could be released the next moment.
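For example, the spin count can be set at initialization time with the documented InitializeCriticalSectionAndSpinCount API (or later with SetCriticalSectionSpinCount). A minimal sketch; the 4000 below is just an illustrative value (the MSDN docs note the heap manager uses a spin count of roughly 4000):

#include <Windows.h>

int main()
{
    CRITICAL_SECTION csFast;

    // Ask for up to 4000 user-mode spin iterations on contention before
    // the thread falls back to a kernel wait.
    InitializeCriticalSectionAndSpinCount(&csFast, 4000);

    EnterCriticalSection(&csFast);
    // ... a very short critical region ...
    LeaveCriticalSection(&csFast);

    DeleteCriticalSection(&csFast);
    return 0;
}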
Sample program to see some of these fields in action:
#include <Windows.h>
#include <stdio.h>
#define NUM_CONTENDING_THREADS 3
#define CS_LOCKCOUNT_UNLOCKED(lc) ((lc) & 0x00000001) // bit0
#define CS_LOCKCOUNT_TH_WOKEN(lc) (((lc) & 0x00000002) >> 1) // bit1
#define CS_LOCKCOUNT_CONTENTION(lc) (((~(lc)) & 0xfffffffc) >> 2) // bit2 to 31
void DumpCSInfo(PCWSTR pszMsg, PCRITICAL_SECTION pcs)
{
    wprintf(L"[%5u] %-16s contention = %u, locked = %-3s, woken = %-3s, lockSem = 0x%p\n",
        GetCurrentThreadId(),
        pszMsg,
        CS_LOCKCOUNT_CONTENTION(pcs->LockCount),
        CS_LOCKCOUNT_UNLOCKED(pcs->LockCount) ? L"no" : L"yes",
        CS_LOCKCOUNT_TH_WOKEN(pcs->LockCount) ? L"no" : L"yes",
        pcs->LockSemaphore);
}

DWORD WINAPI GrabCriticalSection(LPVOID lpParameter)
{
    PCRITICAL_SECTION pcs = (PCRITICAL_SECTION)lpParameter;

    DumpCSInfo(L"Waiting", pcs);
    EnterCriticalSection(pcs);
    DumpCSInfo(L"Inside", pcs);
    Sleep(5000);
    LeaveCriticalSection(pcs);
    DumpCSInfo(L"Left", pcs);
    return ERROR_SUCCESS;
}

int main()
{
    DWORD tid;
    HANDLE hThreads[NUM_CONTENDING_THREADS];
    CRITICAL_SECTION cs;
    WCHAR szMsg[16];

    InitializeCriticalSection(&cs);
    DumpCSInfo(L"Init", &cs);
    EnterCriticalSection(&cs);
    DumpCSInfo(L"Inside", &cs);

    // Create the contending threads suspended so that they start waiting
    // on the critical section only when resumed one by one below.
    for (int i = 0; i < ARRAYSIZE(hThreads); ++i)
    {
        hThreads[i] = CreateThread(NULL, 0, GrabCriticalSection, &cs, CREATE_SUSPENDED, &tid);
    }

    for (int i = 0; i < ARRAYSIZE(hThreads); ++i)
    {
        ResumeThread(hThreads[i]);
        Sleep(1000);
        swprintf_s(szMsg, ARRAYSIZE(szMsg), L"R th %d", i);
        DumpCSInfo(szMsg, &cs);
    }

    Sleep(2000);
    LeaveCriticalSection(&cs);
    DumpCSInfo(L"Left", &cs);

    WaitForMultipleObjects(ARRAYSIZE(hThreads), hThreads, TRUE, INFINITE);
    DumpCSInfo(L"Exiting", &cs);

    // Clean up the thread handles and the critical section.
    for (int i = 0; i < ARRAYSIZE(hThreads); ++i)
    {
        CloseHandle(hThreads[i]);
    }
    DeleteCriticalSection(&cs);
    return 0;
}
Output:
[10192] Init contention = 0, locked = no , woken = no , lockSem = 0x0000000000000000
[10192] Inside contention = 0, locked = yes, woken = no , lockSem = 0x0000000000000000
[ 5676] Waiting contention = 0, locked = yes, woken = no , lockSem = 0x0000000000000000
[10192] R th 0 contention = 1, locked = yes, woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[ 9744] Waiting contention = 1, locked = yes, woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[10192] R th 1 contention = 2, locked = yes, woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[10188] Waiting contention = 2, locked = yes, woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[10192] R th 2 contention = 3, locked = yes, woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[10192] Left contention = 2, locked = no , woken = yes, lockSem = 0xFFFFFFFFFFFFFFFF
[ 5676] Inside contention = 2, locked = yes, woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[ 5676] Left contention = 1, locked = no , woken = yes, lockSem = 0xFFFFFFFFFFFFFFFF
[ 9744] Inside contention = 1, locked = yes, woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[ 9744] Left contention = 0, locked = no , woken = yes, lockSem = 0xFFFFFFFFFFFFFFFF
[10188] Inside contention = 0, locked = yes, woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[10188] Left contention = 0, locked = no , woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
[10192] Exiting contention = 0, locked = no , woken = no , lockSem = 0xFFFFFFFFFFFFFFFF
October 27, 2014
How Big Are Your Functions?
Disassembler to the rescue!
The first thing that came to my mind was to use dumpbin to disassemble the binary and find the code size of a function by subtracting the function's start address from its end address. This works fine, but is tedious if you want to measure the size of multiple (or all) functions within a program.

?DestroyDirInfo_NoHash@@YAXPAU_DirectoryInfo@@@Z:
00413560: 55 push ebp
00413561: 8B EC mov ebp,esp
00413563: 81 EC D8 00 00 00 sub esp,0D8h
...
...
00413662: 8B E5 mov esp,ebp
00413664: 5D pop ebp
00413665: C3 ret
This is a snippet from the disassembly of the function DestroyDirInfo_NoHash, one of the functions in the FDiffDelete program. So we can calculate the code size as:
0x00413665 - 0x00413560 + 1 = 0x106 bytes (or 262 bytes)
DIA, I've got my eye on you
The DIA (Debug Interface Access) SDK can enumerate symbols straight out of a PDB, so the tool needs just one of these as input:
- Path to the PDB file corresponding to the exe/dll file whose functions you want to analyze.
- (OR) Path to the exe/dll file, and optionally a search path in which to look for the corresponding PDB file.
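For the curious, here is a hedged sketch of the DIA calls at the heart of such a tool. This is the general shape, not necessarily CodeSizer's exact code; it assumes the DIA SDK that ships with Visual Studio (link against diaguids.lib, with a registered msdia DLL; if SymTagFunction is undeclared, also include cvconst.h from the DIA SDK), and error handling is trimmed:

#include <Windows.h>
#include <atlbase.h>
#include <dia2.h>
#include <cstdio>

int wmain(int argc, wchar_t **argv)
{
    if (argc < 2) return 1;      // usage: tool.exe <path-to-pdb>
    CoInitialize(NULL);

    // Load the PDB and open a session on it.
    CComPtr<IDiaDataSource> pSource;
    if (FAILED(CoCreateInstance(CLSID_DiaSource, NULL, CLSCTX_INPROC_SERVER,
                                __uuidof(IDiaDataSource), (void**)&pSource)) ||
        FAILED(pSource->loadDataFromPdb(argv[1])))
        return 1;

    CComPtr<IDiaSession> pSession;
    CComPtr<IDiaSymbol> pGlobal;
    pSource->openSession(&pSession);
    pSession->get_globalScope(&pGlobal);

    // Enumerate every function symbol and print its name and code size.
    CComPtr<IDiaEnumSymbols> pEnumSyms;
    pGlobal->findChildren(SymTagFunction, NULL, nsNone, &pEnumSyms);

    CComPtr<IDiaSymbol> pFunc;
    ULONG celt = 0;
    while (SUCCEEDED(pEnumSyms->Next(1, &pFunc, &celt)) && celt == 1)
    {
        CComBSTR bstrName;
        ULONGLONG cbCode = 0;
        pFunc->get_name(&bstrName);
        pFunc->get_length(&cbCode);  // size of the function's code in bytes
        wprintf(L"%-48s %llu\n", (BSTR)bstrName, cbCode);
        pFunc.Release();
    }

    CoUninitialize();
    return 0;
}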
I've uploaded this code to github here: CodeSizer. It shows the undecorated function names and their sizes in bytes, in descending order of size. It also has options to show all functions, specific functions, or functions that match a specified sub-string.
Going back to our earlier example of calculating the size of the function DestroyDirInfo_NoHash, the output from CodeSizer agrees: the size is 262 bytes:
> CodeSizer.exe
> /f ..\..\FDiffDelete\Source\Debug\FDiffDelete.exe
> /s DestroyDirInfo_NoHash
Function Name Size In Bytes
------------- -------------
DestroyDirInfo_NoHash 262
Enumerated #functions = 1
Sizes of all functions that have 'Build' as a sub-string:
> CodeSizer.exe
> /f ..\..\FDiffDelete\Source\Debug\FDiffDelete.exe
> /r Build
Function Name Size In Bytes
------------- -------------
BuildFilesInDir_NoHash 2620
BuildFilesInDir_Hash 2350
BuildDirTree_Hash 938
BuildDirTree_NoHash 938
BuildDirInfo 889
BuildFilesInDir 107
BuildDirTree 99
Enumerated #functions = 7
Thus, a manual task becomes an easy automation that can be used frequently in the future. This is what we programmers are good at, isn't it? We are lazy, and that's a good thing, in a way!
Peeking Into cryptbase.dll
What's Next?
October 16, 2014
Interlocked Instructions
I first came across InterlockedIncrement when going through a COM application's source code. Every COM class is derived from the IUnknown interface and this interface has AddRef and Release functions. As specified in the MSDN article, these functions are together used to keep track of a COM object's lifetime by the reference counting mechanism. A COM server is designed such that a single server object can be used by more than one client. Whenever a COM client obtains a pointer to a COM server object, a call to AddRef must be made and when the object is no longer needed, a call to Release must be made. This way, the server can keep track of how many references to its object are present and can clean itself up when the reference count goes down to zero. The below snippet of pseudo-code should give you an idea of how this is implemented:
class ComServer : public IUnknown {
    LONG nRefs;

    ULONG AddRef()  { return InterlockedIncrement(&nRefs); } // atomic ++nRefs
    ULONG Release() { return InterlockedDecrement(&nRefs); } // atomic --nRefs

    HRESULT QueryInterface(clsid, **ppObject)
    {
        ...
        AddRef();
        // return a pointer to the ComServer object via *ppObject
    }
};
Since COM objects are designed to be used by multiple clients in a multi-threaded (and multi-processor) environment, the incrementing and decrementing of the reference counter (nRefs in the above example) must be synchronized. This is where the InterlockedIncrement and InterlockedDecrement functions come into the picture.
Implementation
According to MSDN, InterlockedIncrement "Increments (increases by one) the value of the specified 32-bit variable as an atomic operation," and the InterlockedDecrement function decreases the value of its argument by one. Let us see how these two functions ensure that operations on their arguments are atomic/synchronized, by looking at the assembly code of the function call and of the implementation of InterlockedIncrement:

[Assembly listings: function call | function implementation]
The function call is a regular call to a system API. The function implementation is just 5 instructions. The increment is done by a single instruction - the xadd instruction, which has a lock prefix. The Intel Developer's Manual Vol. 2 says this about xadd: "This instruction exchanges the first operand (destination) with the second operand (source), then loads the sum of the two values into the destination operand." So ecx contains the address of the memory variable to be incremented, and eax contains 1 before the xadd instruction. The xadd instruction does this:
Assert LOCK signal
temp = eax + *ecx
eax = *ecx
*ecx = temp
We can see that even though it is a single assembly instruction, it involves three steps. The LOCK signal assertion is the one that ensures the instruction executes as an atomic operation.
Looking at the same document for the lock prefix: "Asserts LOCK# signal for duration of the accompanying instruction. In a multiprocessor environment, the LOCK# signal insures that the processor has exclusive use of any shared memory while the signal is asserted".
What this means is that when the LOCK signal is asserted, no other processor or bus agent (which could be a memory-mapped IO device) can read from or write to the shared memory. This ensures that the read-modify-write of the variable being incremented is atomic. All other processors and bus agents must wait until this instruction is complete before accessing that memory. The very simplified diagram of the system architecture should make it easy to understand:
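A quick way to see the LOCK prefix earn its keep is a classic two-thread counter demo (my sketch, not from the original post). With InterlockedIncrement the final count is always 2000000; replace it with a plain ++g_nCounter and the count will usually fall short due to lost updates:

#include <Windows.h>
#include <stdio.h>

static volatile LONG g_nCounter = 0;

static DWORD WINAPI IncrementLoop(LPVOID lpParameter)
{
    UNREFERENCED_PARAMETER(lpParameter);
    for (int i = 0; i < 1000000; ++i)
    {
        InterlockedIncrement(&g_nCounter); // lock xadd under the hood
    }
    return 0;
}

int main()
{
    HANDLE hThreads[2];

    hThreads[0] = CreateThread(NULL, 0, IncrementLoop, NULL, 0, NULL);
    hThreads[1] = CreateThread(NULL, 0, IncrementLoop, NULL, 0, NULL);
    WaitForMultipleObjects(2, hThreads, TRUE, INFINITE);

    printf("Counter = %ld (expected 2000000)\n", g_nCounter);
    CloseHandle(hThreads[0]);
    CloseHandle(hThreads[1]);
    return 0;
}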
What If The Variable is Cached?
A Look At The 64bit Implementation
long tmp = InterlockedIncrement(&nRefs);
This time I saw an implementation similar to the 32bit version, but it is inlined:
How Does Decrement Work?
October 7, 2014
So You Want to Write An API?
The topic of this post is the design of an API. These are things I learnt in the course of writing the CHelpLib library. As most programmers know, API stands for Application Programming Interface. An API consists of a set of functions and data structures that provide some functionality to the user of the API. The most common example is the interface offered by an OS: all the functions that the OS exposes, to kernel mode components like device drivers and to user mode applications such as MS Word, make up the API of the OS. We all know the general guidelines around writing functions and designing data structures - clear functionality, keeping it simple, following naming conventions and so on. Many good programmers do follow these in developing applications, but the guidelines are even more important in the design and implementation of an API, because an API is exposed to the public and so has the potential to be used much more widely. Compare this to a module within a particular product, which is used only within the product itself and needs to be understood only by the developers working on that product.
Some points that I incorporated into the CHelpLib library's APIs:
- Consistent experience in the usage of all the APIs within the library. This has two aspects: a consistent usage experience across all functions within the library, and consistency with industry or platform specific conventions.
- Consistency with the industry/platform makes the API easier for the programmer to pick up. In my case this meant the Windows platform programming conventions, since I was developing for that platform. Consistency of the API includes:
- Object(data structures) creation, usage and disposal semantics must be similar across the library.
- Function signatures and naming must be consistent and follow the Windows API model.
- All functions must be consistent in the way they handle memory allocation/deallocation. Responsibility can be left entirely to the caller, or, more usually, shared between the caller and the callee, but the convention must be uniform across the library.
- All functions within the API must follow the same method of error/failure reporting to the caller.
- A public API must be very well documented and the documentation must be up to date. This means the documentation must include what exactly each function does, how it is supposed to be used, what are the failure scenarios and so on.
- All known bugs must be documented and made visible to the users.
- An API/library is meant to provide functionality with as little usage complexity as possible. To this end, the API must expose very few implementation details (the details must be documented, however).
- If the API consists of functions that provide functionality such as algorithms and data structures: the algorithm design, memory requirements and expected performance should be specified so that the users can make better choices and use the API in a better way.
The most important changes I made to the CHelpLib library in order to incorporate the above points:
- Simplifying the naming of all functions and avoiding naming collisions with system functions. All function names follow this pattern: "CHL_{module_name}{function_name}". For example, CHL_DsCreateHT - 'Ds' indicates DataStructures, 'CreateHT' indicates that the function creates a hashtable.
- Standardizing the error/failure reporting using HRESULT as the return type of most functions. This type is the recommended return type on the Windows platform.
- Since the return type is used only to indicate function success or failure, programmatic return values are passed back to the caller via 'out' parameters, i.e., using pointer parameters. For example: CHL_DsCreateHT(**pHTableOut, nEstEntries, keyType, valType, fValInHeapMem). This function creates a hashtable and returns a pointer to the hashtable object in the first parameter pHTableOut (see the usage sketch after this list).
- Each module has its own header file. API users only need to include the used module's header file. Earlier, all exported functions and data structures were specified within a single header file.
- Usage of Microsoft's SAL2(Source Annotation Language) to annotate the function parameters. This enables better static code checking and also conveys the meaning of the parameters better.
- The data structures - hashtable, linkedlist and queue - all have identical support for storage and retrieval of values (and keys).
- More test code was written to test the new changes and older tests provided the necessary regressions.
- Writing documentation for the full API. This is still a work in progress.
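To illustrate the HRESULT-plus-out-parameter pattern from the list above, here is a hedged usage sketch. The header name and the key/value type constants are stand-ins I made up for illustration; consult the CHelpLib sources for the real identifiers:

#include <Windows.h>
#include "CHL_DataStructures.h"   // hypothetical module header name

HRESULT CreateMyTable(CHL_HTABLE **ppHTableOut)
{
    // CHL_KT_WSTRING / CHL_VT_PVOID are hypothetical constants.
    HRESULT hr = CHL_DsCreateHT(ppHTableOut, 64 /* nEstEntries */,
                                CHL_KT_WSTRING, CHL_VT_PVOID, TRUE /* fValInHeapMem */);
    if (FAILED(hr))
    {
        return hr;   // failure is reported via the HRESULT alone
    }

    // ... use *ppHTableOut, then free it with the library's matching destroy function ...
    return hr;
}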
September 7, 2014
Learnings From A Diffing Tool
TL;DR - Source code and screenshots.
Motivation
This post is about the things I learned in the course of developing a folder diffing tool I wrote recently as a pastime project. A couple of weeks back, I was wondering if there was any tool that I could use to diff two folders, find duplicate files, and delete those from either folder. I wanted to see if I had any duplicates among the files that I copied from my friend. So I started thinking about a diffing tool that offers these capabilities. I could think of BeyondCompare and a few other diffing tools, but off the top of my head I couldn't think of any tool that could diff folders and help me delete the duplicates from either of the folders. I said to myself that this was the perfect opportunity for a project! So I went home that day and immediately listed down my requirements for the diffing tool. I decided to call it 'FDiffDelete' - Folder Diff and Delete. The initial set of requirements that I came up with was:
- Compare two folders (a left folder and a right folder), highlighting the common/duplicate files (name and size match - later, use hash).
- Give the user the choice to delete the common/duplicate files from either folder.
- Ver1.0: Only single-level files in a folder. Later, recursive compare.
- UI stuff: Has a dialog box, split into two halves. Each half has a text area to specify a folder path and a list view to display the file list of the specified folder. There is also a delete button in each half that enables the user to delete selected files (or all duplicate files) from either of the two folders.
- Duplicate files determined based on name, size and last-modified-date (later, or by using the file hash (SHA1)).
Nitty gritty design details
With the requirements set, I started designing the tool. I needed a data structure to hold information about the files in a folder and to allow comparisons between files from the two folders. The very first thing that came to mind was a hash table to store the file information of all files in the folder. Pros and cons of a hash table:
- + Fast inserts and lookups.
- + Pretty straight-forward implementation in C. There is no need for complicated stuff like rebalancing in case of a binary tree.
- + I already had a hash table implementation in my CHelpLib project :-) .
- - Not very space efficient when the size of the hash table is much larger than the actual number of items in the table, i.e., when it is sparse.
- - Iterating through the hash table wastes CPU cycles when it is sparse. You have to traverse all slots in the table to find the ones that have valid entries.
- - Efficiency of a hash table is only as good as its hash function. A bad one will introduce too many key conflicts and lead to bad performance.
The other data structure to consider was a tree. A balanced binary tree offers roughly logarithmic insert and lookup times and is more space efficient than a hash table. However, there is the implementation complexity of re-balancing the tree on inserts, and I did not have a binary-tree implementation in C either. So I finally decided that the hash table was the way to go: (key, value) = (filename, fileinfo-struct). This hash table would be associated with a folder object (DIRINFO) to hold all files within the folder.
The folder info (DIRINFO) and file info (FILEINFO) structures used in the code:

// Structure to hold the files that have the same name within
// the same directory tree. This is required because the hashtable
// in DIRINFO uses filename as key, which means there will be name
// conflicts.
typedef struct _dupWithin {
    int nCurFiles;
    int nCurSize;
    PFILEINFO *apFiles;
} DUPFILES_WITHIN, *PDUPFILES_WITHIN;

typedef struct _DirectoryInfo {
    WCHAR pszPath[MAX_PATH];
    int nDirs;
    int nFiles;

    // Was hash compare used when this DIRINFO was built?
    BOOL fHashCompare;

    // Use a hashtable to store the file list.
    // Key is the filename, value is a FILEINFO structure - if hash compare is turned OFF
    // Key is the hash string, value is a PCHL_LLIST that is the
    // linked list of files with the same hash string - if hash compare is turned ON
    CHL_HTABLE *phtFiles;

    // List of FILEINFO of files that have the same name in
    // the same dir tree - if hash compare is turned OFF
    DUPFILES_WITHIN stDupFilesInTree;
} DIRINFO, *PDIRINFO;

// Structure to hold information about a file
typedef struct _FileInfo {
    BOOL fIsDirectory;
    BOOL fAccessDenied;

    // This structure must know about the duplicacy of a file
    // because otherwise the directory must hold an additional
    // list of duplicate files.
    BYTE bDupInfo;

    LARGE_INTEGER llFilesize;
    BYTE abHash[HASHLEN_SHA1];
    SYSTEMTIME stModifiedTime;
    FILETIME ftModifiedTime;
    WCHAR szPath[MAX_PATH];
    WCHAR szFilename[MAX_PATH];
} FILEINFO, *PFILEINFO;
And so I started to write the code and reached a good first version, which was able to find duplicates based on a file's name, size and last-modified-date. I was even able to use it to find duplicates and delete them. Great! The tool served its initial purpose. Next, I wanted to implement recursive diff, i.e., diff a full folder tree and not just one level. Traversing the folder tree is easy: a BFS is quite trivial once you have a queue data structure (see the sketch after the list below). I did not have a queue implementation, but I did have a linked-list implementation, so I added a queue to my CHelpLib project using the linked list as the underlying data structure. Now I had to tackle the problem of how to store the files' information when recursively traversing the folder tree. Two options came to my mind:
- Use a tree of folder-info structs to hold information about all files in each folder of the tree, starting from the specified folder as the root. This brings the complication of finding duplicate files between two trees. I vaguely remembered a sub-graph comparison problem from my algorithms class. But wait, this wasn't even a sub-graph comparison problem; it was more trivial than that: how the heck do I search for a file by name across the two folder trees with an algorithm better than O(n*n)? Remember that a duplicate file could be in any folder in the tree rooted at the specified folder.
- Use only one hash table, as described earlier, to store the files' information from all folders in the tree. This offers the same fast inserts and is trivial to find duplicate files since the file name is a key.
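As mentioned above, the traversal itself is a simple BFS. Here is a self-contained sketch of the idea (my illustration; FDiffDelete uses the CHelpLib queue rather than the fixed-size array below, and production code needs long-path and error handling):

#include <Windows.h>
#include <stdio.h>

#define MAX_QUEUED 1024

int wmain(void)
{
    // A fixed-size array standing in for a real queue: iHead is the
    // dequeue position, iTail the enqueue position.
    static WCHAR aszQueue[MAX_QUEUED][MAX_PATH];
    int iHead = 0, iTail = 0;

    wcscpy_s(aszQueue[iTail++], MAX_PATH, L"C:\\Temp");   // root folder

    while (iHead < iTail)
    {
        PCWSTR pszDir = aszQueue[iHead++];

        WCHAR szSearch[MAX_PATH];
        swprintf_s(szSearch, MAX_PATH, L"%s\\*", pszDir);

        WIN32_FIND_DATAW fd;
        HANDLE hFind = FindFirstFileW(szSearch, &fd);
        if (hFind == INVALID_HANDLE_VALUE)
            continue;

        do
        {
            if (wcscmp(fd.cFileName, L".") == 0 || wcscmp(fd.cFileName, L"..") == 0)
                continue;

            if (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
            {
                if (iTail < MAX_QUEUED)  // enqueue sub-folder for a later visit
                    swprintf_s(aszQueue[iTail++], MAX_PATH, L"%s\\%s", pszDir, fd.cFileName);
            }
            else
            {
                wprintf(L"%s\\%s\n", pszDir, fd.cFileName);  // a file: record it / insert into the table
            }
        } while (FindNextFileW(hFind, &fd));

        FindClose(hFind);
    }
    return 0;
}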
Can I Haz Moar Features?
The next feature to add was file comparison using a hash of the file content (file digest). I decided to use SHA1 as my hash function since MD5 has a higher probability of hash collisions. Although SHA1's collision resistance has also been shown to be weaker than designed, so I should probably use SHA2? I wanted to balance speed and collision resistance, and SHA1 is slightly faster than SHA2, however small the difference. I know, I know, I didn't put too much effort into thinking out this part; I can change the hash function easily enough. Anyway, coming back to the feature. In order to find duplicate files using their hashes, the file names no longer matter; only the file digest matters. The file name cannot be used as the hash table key any more; the key has to be the 160-bit (20-byte) hash value of a file, represented as a string. This brings up another issue: files with the same hash (duplicate files) within the same folder tree. I solved this by having a list of file-info structs as the value of a hash table entry. So, the hash table for the hash-comparison feature was (key, value) = (hash-value-as-string, list-of-file-info-structs).
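Here is a hedged sketch of how a file's SHA1 digest can be computed with the classic CryptoAPI; the post doesn't show FDiffDelete's actual helper, so the function below is my illustration:

#include <Windows.h>
#include <wincrypt.h>

#define HASHLEN_SHA1 20

// Reads the file in chunks and produces its 20-byte SHA1 digest.
BOOL HashFileSha1(PCWSTR pszPath, BYTE abHash[HASHLEN_SHA1])
{
    BOOL fOk = FALSE;
    HCRYPTPROV hProv = 0;
    HCRYPTHASH hHash = 0;
    BYTE abBuf[4096];
    DWORD cbRead = 0, cbHash = HASHLEN_SHA1;

    HANDLE hFile = CreateFileW(pszPath, GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE)
        return FALSE;

    if (CryptAcquireContextW(&hProv, NULL, NULL, PROV_RSA_FULL, CRYPT_VERIFYCONTEXT) &&
        CryptCreateHash(hProv, CALG_SHA1, 0, 0, &hHash))
    {
        while (ReadFile(hFile, abBuf, sizeof(abBuf), &cbRead, NULL) && cbRead > 0)
        {
            if (!CryptHashData(hHash, abBuf, cbRead, 0))
                goto done;
        }
        fOk = CryptGetHashParam(hHash, HP_HASHVAL, abHash, &cbHash, 0);
    }

done:
    if (hHash) CryptDestroyHash(hHash);
    if (hProv) CryptReleaseContext(hProv, 0);
    CloseHandle(hFile);
    return fOk;
}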
By this time, the code had gotten quite complicated, mainly due to the two very different ways of using the hash table to store the files' information: one used the file name as the key and the other used the file's hash value. So I decided to separate the folder traversal for the name-based comparison and the hash comparison features. This made it much easier to maintain and read the code.
Lessons learned
- The data structures used can make or break a program. They can make a program fast or painfully slow, a memory hog or space efficient, easy to maintain vs. PITA spaghetti code. Choose wisely.
- Start with an initial set of requirements and design accordingly. As you add more requirements, remember to go back to the design board, since you may have to rethink the whole design (worst case) when new requirements are added.
- Object-oriented concepts are your friend in making a challenging problem easier to solve and implement in code. Encapsulation and abstraction are very powerful and quite sufficient; you do not have to use polymorphism and inheritance all the time.
Source code is on github at: https://github.com/shishir993/fdiffdelete. Built using Microsoft VS2010 Professional and Windows SDK 8.100.26837. Dependent on my other C library CHelpLib.
Some screenshots of the tool at work:
- Menu options and left folder loaded
- Comparing System32 and SysWOW64 folders (seems that both have identical files?)