Monday, June 27, 2011

Nested tar archives extractor (.tar/ .tgz/ .tbz/ .tb2/ .tar.gz/ .tar.bz2) written in Python

I recently wrote a small Python script that recursively extracts nested tar archives. This post discusses it in a quick question-and-answer format.

For what audience is this post intended?
I wrote this post mainly to share the program, so it is aimed at readers with some background in Python programming. However, if you just want to use it as a utility, you can simply skip to the download part.
(This post assumes that the underlying operating system is Linux. However, it should apply to Windows too.)

So, let's begin...

What is a nested tar archive?
It is a tar archive containing other tar archives which may further contain many more tar archives (and so on...)

So what does this program do?
It extracts tar archives recursively. (The examples below show what that means.)

What's different in this?
Ordinary extractors normally extract a tar archive only once, i.e. they won't extract any other tar archives (if any) that are present in it. If it contains more tar archives and you want to extract them too, you have to extract each of them yourself. This can be a real headache if there are many tar archives nested many levels deep. I have tried to make this easy with this tool.
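To see the problem concretely, here is a minimal sketch. It is written for Python 3 (unlike my script below, which targets Python 2) and is not part of the tool. It builds a small archive shaped like Example #1 below in a temporary folder and extracts it once with the standard tarfile module. The helper names make_tar and build_and_extract_once, and the scratch folders inner_src and outer_src, are made up for this illustration.

```python
import os
import tarfile
import tempfile


def make_tar(tar_path, source_dir):
    """Pack every entry of source_dir into tar_path (hypothetical helper)."""
    with tarfile.open(tar_path, "w") as tar:
        for name in sorted(os.listdir(source_dir)):
            tar.add(os.path.join(source_dir, name), arcname=name)


def build_and_extract_once(work_dir):
    """Build xyz.tar (which contains d.tar) and extract it a single time."""
    # Inner archive d.tar holding the empty files x, y and z.
    inner_src = os.path.join(work_dir, "inner_src")
    os.makedirs(inner_src)
    for name in ("x", "y", "z"):
        open(os.path.join(inner_src, name), "w").close()

    # Outer archive xyz.tar holding a, b, c and d.tar.
    outer_src = os.path.join(work_dir, "outer_src")
    os.makedirs(outer_src)
    for name in ("a", "b", "c"):
        open(os.path.join(outer_src, name), "w").close()
    make_tar(os.path.join(outer_src, "d.tar"), inner_src)

    xyz_tar = os.path.join(work_dir, "xyz.tar")
    make_tar(xyz_tar, outer_src)

    # One ordinary extraction pass, as a regular extractor would do.
    target = os.path.join(work_dir, "xyz")
    with tarfile.open(xyz_tar) as tar:
        tar.extractall(target)
    return sorted(os.listdir(target))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work_dir:
        print(build_and_extract_once(work_dir))
```

Run as a script, this should print ['a', 'b', 'c', 'd.tar']: the inner archive d.tar is still packed after the single pass.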

Can you show me some examples?
Yes, why not.

Example #1
If this is the directory structure of a tar archive -
parent/
    xyz.tar/
        a
        b
        c
        d.tar/
            x
            y
            z

After extracting using a regular extractor -
parent/
    xyz.tar
    xyz/
        a
        b
        c
        d.tar/       # this was not extracted.
            x
            y
            z

After extracting using my extractor -
parent/
    xyz.tar
    xyz/
        a
        b
        c
        d/
            x
            y
            z
(Note that the regular extractor leaves 'd.tar' unextracted, while my extractor extracts it to the folder 'd'.)

Example #2
While extracting, my extractor also takes care not to replace/overwrite folders that already exist.
So if this is the directory structure of a .tar archive -
parent/
    xyz.tar/
        a
        b
        c
        d.tar/
            x
            y
            z
        a.tar/       # note if I extract this directly, it will
                     # replace/overwrite the contents of folder 'a'.
            m
            n
            o
            p

After extracting using my extractor -
parent/
    xyz.tar
    xyz/
        a
        b
        c
        d/
            x
            y
            z
        a 1/          # extracted 'a.tar' to the folder 'a 1' as
                      # folder 'a' already exists in the same folder.
            m
            n
            o
            p
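The renaming rule from Example #2 can be sketched on its own. This is a minimal Python 3 illustration, not the code from my script: it uses a loop where the script uses recursion, and the name unique_folder_path is made up for this example.

```python
import os
import re
import tempfile


def unique_folder_path(folder_path):
    """Return folder_path, or a numbered variant of it that does not exist yet."""
    parent, name = os.path.split(folder_path)
    candidate = folder_path
    while os.path.exists(candidate):
        match = re.match(r"^(?P<name>.*) (?P<num>\d+)$", name)
        if match:
            # 'a 1' already exists, so try 'a 2', and so on.
            name = "%s %d" % (match.group("name"), int(match.group("num")) + 1)
        else:
            # 'a' already exists, so start numbering at 'a 1'.
            name = "%s 1" % name
        candidate = os.path.join(parent, name)
    return candidate


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as parent:
        os.mkdir(os.path.join(parent, "a"))
        # 'a' exists, so the next free name is 'a 1'.
        print(os.path.basename(unique_folder_path(os.path.join(parent, "a"))))
        os.mkdir(os.path.join(parent, "a 1"))
        # Both 'a' and 'a 1' exist, so the next free name is 'a 2'.
        print(os.path.basename(unique_folder_path(os.path.join(parent, "a"))))
```

The check and the later creation of the folder are two separate steps, so this is only safe when nothing else is creating folders in the same place at the same time.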

What do I need on my PC to run this program?
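Python 2.7. I have written this program in Python 2.7 and it requires no extra packages to be installed there. Note that it imports argparse, which entered the standard library only in Python 2.7, so on Python 2.6 you will also need to install the argparse package from PyPI. I have not tested it on any earlier Python 2.x versions.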

Is the program easy to understand?
Yes, it is very simple. Also, I have provided plenty of documentation (comments) at every step in the code below, so understanding it won't be difficult at all. In fact, you can easily customize it later to suit your needs.

OK, so where is the code?
Here it is -
#! /usr/bin/env python
# -*- coding: UTF-8 -*-

"""A command line utility for recursively extracting nested tar archives."""

__author__ = "Pushpak Dagade (पुष्पक दगड़े)"
__date__   = "$4 July, 2011 3:00:00 PM$"

import os
import sys
import re
import tarfile
from argparse import ArgumentParser

major_version = 1
minor_version = 1
error_count = 0

file_extensions = ('tar', 'tgz', 'tbz', 'tb2', 'tar.gz', 'tar.bz2')
# Edit this according to the archive types you want to extract. Keep in
# mind that these should be extractable by the tarfile module.

__all__ = ['ExtractNested', 'WalkTreeAndExtract']

def FileExtension(file_name):
    """Return the file extension of file_name (in lower case).

    file_name should be a string. It can be either the full path of
    the file or just its name (or any string as long as it contains
    the file extension). The compound extensions 'tar.gz' and
    'tar.bz2' are returned whole. Return '' if there is no extension.

    Example #1:
    input (file_name) -->  'abc.tar.gz'
    return value      -->  'tar.gz'

    Example #2:
    input (file_name) -->  'abc.tar'
    return value      -->  'tar'

    """
    match = re.compile(r"^.*?[.](?P<ext>tar[.]gz|tar[.]bz2|\w+)$",
      re.IGNORECASE).match(file_name)

    if match:
        # Lower-case the result so that 'ABC.TAR' also compares equal
        # to the (lower case) entries of file_extensions.
        return match.group('ext').lower()
    else:
        return ''       # there is no file extension to file_name

def AppropriateFolderName(folder_fullpath):
    """Return a folder path that can be safely created without
    replacing any existing folder.

    Check if the folder folder_fullpath exists. If no, return
    folder_fullpath unchanged (it can be safely created). If yes,
    append an appropriate number to folder_fullpath such that the
    new path can be safely created.

    Examples:
    folder_fullpath = '/a/b/untitled folder'
    return value    = '/a/b/untitled folder'   (no such folder already exists.)

    folder_fullpath = '/a/b/untitled folder'
    return value    = '/a/b/untitled folder 1' (the folder '/a/b/untitled folder'
                                               already exists but no folder named
                                               '/a/b/untitled folder 1' exists.)

    folder_fullpath = '/a/b/untitled folder'
    return value    = '/a/b/untitled folder 2' (the folders '/a/b/untitled folder'
                                               and '/a/b/untitled folder 1' both
                                               already exist but no folder
                                               '/a/b/untitled folder 2' exists.)

    """
    if os.path.exists(folder_fullpath):
        folder_name = os.path.basename(folder_fullpath)
        parent_fullpath = os.path.dirname(folder_fullpath)
        match = re.compile(r'^(?P<name>.*)[ ](?P<num>\d+)$').match(folder_name)
        if match:
            # The name already ends in a number, so increment it.
            name = match.group('name')
            number = match.group('num')
            new_folder_name = '%s %d' %(name, int(number)+1)
        else:
            # The name has no number yet, so start at 1.
            new_folder_name = '%s 1' %folder_name
        new_folder_fullpath = os.path.join(parent_fullpath, new_folder_name)
        # Recursively call itself to check whether a folder with
        # path new_folder_fullpath already exists or not.
        return AppropriateFolderName(new_folder_fullpath)
    else:
        return folder_fullpath

def Extract(tarfile_fullpath, delete_tar_file=True):
    """Extract tarfile_fullpath to a folder of the same name as the
    tar file (without an extension) and return the name of this folder.

    The folder name is chosen by AppropriateFolderName, so an existing
    folder is never overwritten. Return None if the extraction fails.

    If delete_tar_file is True, delete the tar file after its
    extraction; if False, don't. The default value is True as you
    would normally want to delete the (nested) tar files after
    extraction. Pass False if you don't want to delete the tar file
    you are passing.

    """
    global error_count
    try:
        print "Extracting '%s'" %tarfile_fullpath,
        tar = tarfile.open(tarfile_fullpath)
        try:
            # Strip the extension and the dot before it to get the
            # folder path, e.g. '/a/xyz.tar.gz' --> '/a/xyz'.
            ext_length = len(FileExtension(tarfile_fullpath)) + 1
            extract_folder_fullpath = AppropriateFolderName(
              tarfile_fullpath[:-ext_length])
            extract_folder_name = os.path.basename(extract_folder_fullpath)
            print "to '%s'..." %extract_folder_name,
            tar.extractall(extract_folder_fullpath)
            print "Done!"
        finally:
            # Close the archive even if the extraction failed.
            tar.close()
        if delete_tar_file: os.remove(tarfile_fullpath)
        return extract_folder_name

    except Exception:
        # Exceptions can occur while opening a damaged tar file.
        print '(Error)\n(%s)' %str(sys.exc_info()[1]).capitalize()
        error_count += 1

def WalkTreeAndExtract(parent_dir):
    """Recursively descend the directory tree rooted at parent_dir
    and extract each tar file on the way down (recursively)."""
    try:
        dir_contents = os.listdir(parent_dir)
    except OSError:
        # This can occur when the program does not have permission
        # to open the folder.
        print 'Error occurred. Could not open folder %s\n%s'\
          %( parent_dir, str(sys.exc_info()[1]).capitalize())
        global error_count
        error_count += 1
        return

    for content in dir_contents:
        content_fullpath = os.path.join(parent_dir, content)
        if os.path.isdir(content_fullpath):
            # If content is a folder, walk down it completely.
            WalkTreeAndExtract(content_fullpath)
        elif os.path.isfile(content_fullpath):
            # If content is a file, check if it is a tar file.
            if FileExtension(content_fullpath) in file_extensions:
                # If yes, extract its contents to a new folder.
                extract_folder_name = Extract(content_fullpath)
                if extract_folder_name:     # if extract_folder_name != None:
                    dir_contents.append(extract_folder_name)
                    # Append the newly extracted folder to dir_contents
                    # so that it can be later searched for more tar files
                    # to extract.
        else:
            # Unknown file type.
            print 'Skipping %s. <Neither file nor folder>' % content_fullpath

def ExtractNested(tarfile_fullpath):
    """Extract tarfile_fullpath, then recursively extract every tar
    archive found inside it. The given tar file itself is not deleted."""
    extract_folder_name = Extract(tarfile_fullpath, False)
    if extract_folder_name:         # None if the extraction failed
        # The given tar file is now extracted to extract_folder_name.
        # Descend down its directory structure and extract all other
        # tar files (recursively).
        extract_folder_fullpath = os.path.join(os.path.dirname(
          tarfile_fullpath), extract_folder_name)
        WalkTreeAndExtract(extract_folder_fullpath)
        
if __name__ == '__main__':
    # Use a parser for parsing command line arguments
    parser = ArgumentParser(description='Nested tar archive extractor %d.%d'\
      %(major_version,minor_version))
    parser.add_argument('tar_paths', metavar='path', type=str, nargs='+',
      help='Path of the tar file to be extracted.')
    extraction_paths = parser.parse_args().tar_paths

    # Consider each argument passed as a file path and extract it.
    for argument in extraction_paths:
        if os.path.exists(argument):
            ExtractNested(argument)
        else:
            print 'Not a valid path: %s' %argument
            error_count += 1
    if error_count != 0: print '%d error(s) occurred.' %error_count
(download source files from PyPI)
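
The script above is Python 2 only (it uses print statements). For readers on Python 3, here is a compact sketch of the same idea. It is an illustration, not a port of my script: it leaves out the safe folder renaming and the error counting, and the names TAR_SUFFIXES, strip_tar_suffix, extract_nested and write_demo_archive are made up for this example. Like the script, it keeps the archive you pass in and deletes the nested archives once they are extracted.

```python
import io
import os
import tarfile
import tempfile

TAR_SUFFIXES = (".tar", ".tgz", ".tbz", ".tb2", ".tar.gz", ".tar.bz2")


def strip_tar_suffix(path):
    """Return path without its tar suffix, or None if it has none."""
    lowered = path.lower()
    for suffix in TAR_SUFFIXES:
        if lowered.endswith(suffix):
            return path[:-len(suffix)]
    return None


def extract_nested(tar_path, delete_tar_file=False):
    """Extract tar_path next to itself, then extract any archives inside it."""
    target = strip_tar_suffix(tar_path)
    if target is None:
        return None
    with tarfile.open(tar_path) as tar:
        # Only use extractall on archives you trust.
        tar.extractall(target)
    if delete_tar_file:
        os.remove(tar_path)
    # Walk the freshly extracted tree. Each nested archive is handled by a
    # recursive call, which walks the folder it creates by itself.
    for dirpath, _dirnames, filenames in os.walk(target):
        for filename in filenames:
            nested = os.path.join(dirpath, filename)
            if strip_tar_suffix(nested) is not None:
                extract_nested(nested, delete_tar_file=True)
    return target


def write_demo_archive(path):
    """Write a tar holding 'a' and d.tar, which holds 'x' (demo data only)."""
    inner = io.BytesIO()
    with tarfile.open(fileobj=inner, mode="w") as tar:
        tar.addfile(tarfile.TarInfo("x"))       # empty file 'x'
    data = inner.getvalue()
    with tarfile.open(path, "w") as tar:
        tar.addfile(tarfile.TarInfo("a"))       # empty file 'a'
        info = tarfile.TarInfo("d.tar")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work_dir:
        archive = os.path.join(work_dir, "xyz.tar")
        write_demo_archive(archive)
        root = extract_nested(archive)
        print(sorted(os.listdir(root)))
        print(os.listdir(os.path.join(root, "d")))
```

Run as a script, this should print ['a', 'd'] and then ['x']: 'd.tar' has been replaced by the folder 'd'.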

What coding conventions does the program follow?
I have followed the PEP 8 coding style and the PEP 257 docstring conventions while writing the program (with one small exception: I have used a maximum line width of 80 characters, while PEP 8 suggests a maximum of 79).

OK, now I have downloaded the source files.
How do I use it to extract 'nested tar archives'?     OR
How do I access it from the terminal?
Simply follow these steps -
  1. Ensure you have Python 2.7 installed (or Python 2.6 plus the argparse package from PyPI).
  2. Extract the downloaded file "nested.tar.archives.extractor-1.1 [python 2.x].tar.gz"
  3. Copy the file "extractnested.py" to one of the folders in your PATH environment variable
    (Note: For Linux users, as you might know, to run extractnested.py directly from the shell, you need to make it executable - $ chmod ugo+rx extractnested.py)
  4. Now you can extract any tar archive from the terminal -
    $ extractnested.py path [path ...]
    where path is the path of the tar archive you want to extract.
For example, to extract the tar archive 'xyz.tar' -
    $ extractnested.py xyz.tar
You can also extract more than one archive in one command. For example, to extract 'xyz.tar' and 'abc.tar' together -
    $ extractnested.py xyz.tar abc.tar

Any future plans for development?
Well... yes. I plan to extend this utility to extract other archive formats such as zip, 7z and rar, so that it can become an 'all in one extraction utility'. I also intend to add it to context menus so that one can simply right-click on any compressed archive and extract it with a single click!
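As a rough idea of where this could go, and not something the current tool does, Python 3's standard library already offers shutil.unpack_archive, which picks the format from the file extension and handles zip as well as the common tar variants. 7z and rar are not covered by the standard library and would need third-party packages. In the sketch below, the name extract_any and the demo file names are made up for illustration.

```python
import os
import shutil
import tempfile
import zipfile


def extract_any(archive_path, target_dir):
    """Extract archive_path into target_dir.

    Return True on success and False if shutil cannot read the file
    (for example, because its extension is not a registered format).
    """
    try:
        shutil.unpack_archive(archive_path, target_dir)
    except shutil.ReadError:
        return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work_dir:
        # Build a tiny zip archive to extract (demo data only).
        zip_path = os.path.join(work_dir, "abc.zip")
        with zipfile.ZipFile(zip_path, "w") as archive:
            archive.writestr("hello.txt", "hi")

        target = os.path.join(work_dir, "abc")
        print(extract_any(zip_path, target), os.listdir(target))
```

Run as a script, this should print True ['hello.txt'].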
____________________________________________________


I actually wrote this program in response to this question asked on StackOverflow, which then had an active bounty of 50 rep. points (StackOverflow users will understand what I am saying!), hoping to win the bounty. I worked on the program for 2 days and then posted it on StackOverflow. Unfortunately, even after a few days the user had not replied, and I had not received the bounty or even a single rep. point. So, upset, I decided to upload the program (with slight changes) to PyPI and also post it here on the blog.

Now, coincidentally or not, I don't know, the night after I uploaded it to PyPI and posted it here, the user replied and rewarded me with the bounty! I got 75 rep. points! Believe me, there was no way the user knew I would be posting this here on the blog! When I saw the next morning that I had won the bounty, my joy knew no bounds!
That day, I did not do any work in the office (@internship) as I was busy editing this post!