%--------------------------------------------------%
% vim: ft=mercury ts=4 sw=4 et
%--------------------------------------------------%
% Copyright (C) 2019 The Mercury team.
% This file is distributed under the terms specified in COPYING.LIB.
%--------------------------------------------------%
%
% File: edit_seq.m.
% Stability: medium.
%
% This module finds an edit sequence, which means that given two sequences
% of items, it finds the shortest sequence of edit operations (deletes,
% inserts and/or replaces) that will transform the first sequence
% into the second.
%
% The code is a naive implementation of the Wagner-Fischer algorithm,
% which is documented on its own wikipedia page.
%
% Given two lists of length M and N, its time complexity is O(MN),
% so it is suitable for use only on reasonably short lists.
%
%--------------------------------------------------%
%--------------------------------------------------%
:- module edit_seq.
:- interface.
:- import_module list.
%--------------------------------------------------%
%--------------------------------------------------%
% Given two item sequences A and B, the edit sequence is the sequence
% of edit operations that transforms sequence A into sequence B.
%
% Item numbers start at 1. The item numbers in edit operations reflect
% the *original* position of the relevant item, i.e. they are not affected
% by any edit operations that take place before that position.
%
:- type edit_seq(T) == list(edit(T)).
:- type edit(T)
---> delete(int)
% Delete item #N in sequence A.
; insert(int, T)
% Insert the given item from sequence B
% after item #N in sequence A.
; replace(int, T).
% Replace item #N in sequence A with the given item
% from sequence B.
:- type edit_params
---> edit_params(
% The cost of delete, insert and replace operations
% respectively. Only the *relative* values of the costs matter;
% if these are fixed, their *absolute* values are irrelevant
% (unless they are so high that they cause arithmetic
% overflows).
cost_of_delete :: int,
cost_of_insert :: int,
cost_of_replace :: int
).
% find_shortest_edit_seq(Params, SeqA, SeqB, Edits):
%
% Compute Edits as the cheapest sequence of edit operations
% that will transform SeqA into SeqB, where the cost of each kind of
% edit operation is specified by Params.
%
:- pred find_shortest_edit_seq(edit_params::in, list(T)::in, list(T)::in,
edit_seq(T)::out) is det.
%--------------------------------------------------%
% A diff_seq represents a unified diff with unlimited context,
% such as the output of "diff -u --context=MAXINT".
%
% Each line (or in general, one item) in it can be an item from SeqA
% that is left unchanged, an item from SeqA that is to be deleted, or
% an item (from SeqB) that is to be inserted.
:- type diff_seq(T) == list(diff(T)).
:- type diff(T)
---> unchanged(T)
; deleted(T)
; inserted(T).
% Given an edit sequence computed by find_shortest_edit_seq, return
% the unified diff representing that edit sequence.
%
% The main difference between the edit sequence and the diff sequence
% is that given several consecutive replace edits, a naive representation
% of those edit operations would output interleaved pairs of items
% to be deleted and inserted, while the diff sequence would output
% *all* of the items to be deleted by those replace operations *before*
% printing the insertions of their replacements.
%
:- pred find_diff_seq(list(T)::in, edit_seq(T)::in, diff_seq(T)::out) is det.
%--------------------------------------------------%
% This type and its fields are documented below.
:- type change_hunk(T)
---> change_hunk(
ch_seq_a_start :: int,
ch_seq_a_length :: int,
ch_seq_b_start :: int,
ch_seq_b_length :: int,
ch_diff :: diff_seq(T)
).
% find_change_hunks(ContextSize, DiffSeq, ChangeHunks):
%
% A diff_seq may contain long sequences of unchanged items, which are
% often not of interest. This predicate computes from a diff sequence
% a list of its *change hunks*, which are its interesting parts,
% the parts that contain insertions and/or deletions.
%
% A change hunk looks like this, using the syntax of "diff -u".
% The ContextSize of this example is 3.
%
% @@ -25,6 +25,7 @@
% Roosevelt
% Taft
% Wilson
% +Pershing
% Harding
% Coolidge
% Hoover
%
% This change hunk shows the insertion of one line containing "Pershing"
% into a list of US presidents. The "-25,6" part of the header shows that
% the part of the original sequence (sequence A) covered by this change
% hunk contains six lines, starting at line 25. The "+25,7" part shows that
% the part of the updated sequence (sequence B) contains seven lines,
% starting at line at 25 in that sequence as well. The first four fields
% of the change_hunk type contain these two pairs of numbers.
%
% A change hunk consists of three parts, of which the first and/or last
% may be empty.
%
% - The first part is a sequence of up to ContextSize unchanged items
% (the initial context).
% - The second part is a sequence of unchanged, insertion or deletion
% items that
% * starts with an insertion or deletion item,
% * ends with an insertion or deletion item, and
% * contains at most 2 * ContextSize consecutive unchanged items.
% The start and end item may be the same, as in the example above.
% - The third part is a sequence of up to ContextSize unchanged items
% (the trailing context).
%
% The idea is to surround regions of changes with ContextSize unchanged
% items to provide context (hence the name ContextSize). The first and
% third parts will always contain *exactly* ContextSize unchanged items,
% unless the changed region occurs so close to the start or to the end
% of the item sequence that there are fewer than ContextSize unchanged
% items there.
%
% The reason why there may be up to 2 * ContextSize consecutive unchanged
% items in the middle of a change hunk is that if the limit were any lower,
% then some of those unchanged items would end up *both* in the trailing
% context of one change hunk and the initial context of the next change
% hunk.
%
% To make sense, ContextSize must be least one. This predicate throws
% an exception if ContextSize is zero or negative.
%
:- pred find_change_hunks(int::in, diff_seq(T)::in,
list(change_hunk(T))::out) is det.
%--------------------------------------------------%
%--------------------------------------------------%