Project 3: Huffman Coding

Huffman coding is a scheme that assigns variable-length bit-codes to characters, such that the lengths of the codes depend on the frequencies of the characters in a typical message. As a result, encoded messages take less space (as compared to fixed-length encoding) since the letters that appear more frequently are assigned shorter codes. This is performed by first building a Huffman coding tree based on a given set of frequencies. From the tree, bit-codes for each character are determined and then used to encode a message. The tree is also used to decode an encoded message as it provides a way to determine which bit sequences translate back to a character.

 

Assignment:

For this assignment, you will write three programs:

buildtree. Given a file that contains the frequency distribution for all 26 letters of the alpha- bet plus four special characters (space, period, new-line, and end-of-message), buildtree will generate two files: a huffman coding tree file and a file containing bit-code assign- ments.

encode. Given the bit-code file generated by buildtree, and a text file containing a message,

encode will produce a binary file that contains the encoded message.

decode. Given the huffman coding tree file generated by buildtree, and an encoded message generated by encode, decode will reproduce the original message and output that message to a text file.

You may assume that the messages are written in lower-case letters. The frequency file is a 30-line text file, where each line contains a letter (or a special character) followed by a space and a positive integer (string of digits). For the purposes of this frequency file, the special characters have the following codes: ‘-’ for space, ‘.’ for period, ‘!’ for new line, and ‘+’ for end-of-message.

The tree file is a sequential representation of the tree built in buildtree to be read and recreated in decode.

The bit-code file is a 30-line text file. Each line in this file contains a letter (or a special character) followed by a space and then a string of 1’s and 0’s. The order of the characters in this file should be as follows: ‘a’ through ‘z’, space, period, new-line, end-of-message.

The message file consists of lines containing lower case letters, actual spaces (not ‘-’), and periods. When interpreting the message file, a new-line character is implied at the end of each line, and an end-of-message character is implied at the end of the file.

The encoded file is a stream of bytes that encodes the corresponding bits. In the event that the resulting number of bits is not a multiple of 8, add enough bits (with value 0) to form the final byte.

When runing decode, note that the end-of-message character is important since it signifies when to stop decoding, particularly if there are remaining bits in the last byte of the encoded message.

 

 

Sample files are presented later in this document.

Data Structures:

You will use a dynamic node implementation of a binary tree to represent the Huffman coding tree. Note that internal nodes and leaf nodes in the tree contain different data. For simplicity, use the same tree data structure for buildtree and decode.

Building a Huffman tree involves iteratively combining two minimum-valued nodes or sub- trees. In case of ties for minimum values, disambiguate using the following rules:

• Multiple-node subtrees take precedence over single nodes.

• The most recently-created subtree takes precedence over other subtrees.

A single node that occurs earlier in an alphabetized listing of the letters take precedence over other single nodes. Space, period, new-line, and end-of-message (in this order) come after ‘z’ in this alphabetized listing.

The simplest way to implement the rules given above is to maintain a list of trees sorted by weight. The list begins as a list of single nodes ordered by weight, using their alphabetical order if the weights are tied. You will then repeatedly remove the first two items in the list and combine them to form a subtree. The resulting subtree is re-inserted in the list according to its resulting weight. In case there are elements in the list with the same weight as the subtree, insert the subtree right before the first item that has that weight.

Sequential representations of a binary tree are discussed in section 6.5 (pp. 208-210) of the text. You are allowed to use either of the methods suggested and invent your own file format as long as the file you generate is a readable text file that can be interpreted by the TAs.

As usual, you may not use STL but you may pattern your code after the code available in the textbook.

 

Program Invocation:

The programs are to be invoked as follows:

buildtree <frequencyfile>  <treefile>  <codefile> encode <codefile> <messagefile> <encodedmessagefile> decode <treefile> <encodedmessagefile> <messagefile>

If the command line contains an insufficient number of arguments or if any of the specified input files does not exist, the program should print an appropriate error message and exit.

 

 

Sample Input and Output Files:

Refer to the website for more complete versions of these files.

 

frequency.txt

 

c 4

d 2

- 15

a 11

b 7

e 20

! 4

+ 1

f 5

. 0

z 0

w 0

...

(all other letters have frequency 0)

 

codes.txt a 101

b 001

c 0001

d 00001

e 11

f 1001 g ...

h ...

...

z ...

- 01

. ...

! 1000

+ 000001

 

message.txt a bad

face

 

 

encodedmessage.bin

(this is a six-byte binary file)

 

10101001 10100001 10001001 10100011 11000000 00100000

Need a custom answer at your budget?

This assignment has been answered 5 times in private sessions.

© 2024 Codify Tutor. All rights reserved