Joris Dormans (2004)
In the last chapter we saw that visual language exists. The central question of this chapter is: what should a visual grammar look like? To answer this question I will elaborate on the architecture of the brain's faculty for language in general and visual language in particular. This will allow me to investigate how visual language is processed by the brain and to model visual grammar accordingly. Central to this chapter is Ray Jackendoff's Tripartite Parallel Architecture for the language faculty. Jackendoff's architecture ties in with Chomsky's theory of language and incorporates the deep and surface structure as well as discrete infinity. I will use his Tripartite Parallel Architecture as a template to formulate a model for the visual language faculty and to sketch the outlines of a visual grammar. This approach to language stems from the cognitive tradition of which Jackendoff is part. The cognitive sciences are interested in the functioning of the human brain; their ultimate aim is to model the human brain accurately and to explain the human behaviour that stems from it. Language is one of the focuses of the cognitive sciences, as language is one of the most important human behaviours.
Chomsky stated that dictionaries and traditional grammar textbooks together do not form a complete or accurate description of a language; these cannot account for the infinite possibilities of a language, for its discrete infinity (Chomsky 2000: 12). A theory of the structure of language, a grammar, should not merely try to describe a set of utterances; it should be able to generate the set of all possible utterances. This is because “the mind is obviously finite but there is an infinite number of expressions that every person can master and use” (ibid.: 11), and therefore it is very unlikely that a person who has mastered a language has somehow memorized all dictionaries and grammar textbooks.
According to Chomsky, language is not learned as such. He speaks of language acquisition, which is more like growing than learning. The acquisition of language is not solely the result of experience, because experience is not rich enough to account for the scope and depth of the language acquired by, for example, a ten-year-old child (Chomsky 2000: 6-7). This potential for acquisition must somehow be incorporated within the part of the mind-brain that is responsible for language; the mind-brain must incorporate a fertile soil for language to grow on. Almost all new-born children are endowed with a language faculty that provides the basis to acquire any language. Young adopted children learn the language of their new country with as much ease as native children: “people aren’t genetically adapted to one language or another” (Chomsky 2000: 13). This natural disposition to learn language has led to the assumption that at least some part of language is shared among all people; some part of language is universal. The system that describes this part of language is known as Universal Grammar (Chomsky 1992: 4).
The assumption that a Universal Grammar underlies the grammar of any language is sometimes problematic. Today many people are inclined to stress the differences between languages, and even to deny the existence of universal roots. In this light it is important to discuss the extent of the influence a universal root of language can have on the many languages in use. This influence is both large and small. It is large because Universal Grammar ultimately defines the possibilities for language. The human ability to use (verbal) signs to express thoughts and to understand one's surroundings is incorporated within Universal Grammar. Without this disposition language (as we know it) would not be possible. Likewise, the distinction between nouns and verbs can be found in all languages. These boundaries are quite fundamental; it is hard to imagine a language that does not share this trait.
The influence of Universal Grammar is small because it does allow for much variance and differentiation among languages, especially when the more recent ‘principles-and-parameters’ approach is taken into account. This approach does not assume that there exist deep, fixed rules on which every grammar must be based. Rather, there exists a fixed system of parameters whose possible values are finite. These parameters are embedded within a fixed system of principles. Fundamental differences between languages can be traced to different values of the same parameter (Chomsky 1992: 5). The grammar of a particular language is based on some realized state of Universal Grammar but also consists of additional grammatical rules and principles that are particular to that language.
The point I would like to stress is that from the theory of Universal Grammar it follows that the mind-brain is equipped with specialized regions that handle language processing. These regions (or modules, as they are often called) are more or less equal among all humans by biological necessity, and the nature of these regions and their connections must somehow have an influence on the nature of language. Since we concluded in chapter 1 that we can also speak of visual language, I assume that something similar must exist for visual language. Visual language is also incorporated within the mind-brain, and therefore the biological disposition of the mind-brain affects the nature and potential of visual language. This universal component of (visual) language, its lower threshold so to say, is for a large part determined by how the mind-brain processes and produces (visual) language; it is determined by the architecture of the (visual) language faculty.
Since Chomsky started to theorize about language in the 1950s, the cognitive sciences, that study the human mind and its functions, have progressed enormously. The results of these studies have influenced and improved Chomsky's initial ideas. One of the more recent versions can be found in the work of Ray Jackendoff, whose theories I will use as a springboard for my own model of the architecture of the visual language faculty.
Although Jackendoff admits he has the ‘privilege to stand on the shoulders of Chomsky’, he elaborates on two weak spots in Chomsky’s Minimalist Program. The first weak spot is Chomsky’s syntactocentrism: the assumption, underlying all versions of Chomsky's generative grammars, that the most fundamental generative component of grammar is syntax. In this view phonology and semantics are merely descriptive (Jackendoff 1997: 15). Chomsky's syntactocentrism is most explicit when he describes discrete infinity: the main (and often only) source of discrete infinity seems to be syntax. Jackendoff claims that there is no conceptual necessity for a unique source of discrete infinity in verbal language. In fact, Jackendoff's linguistic architecture allows discrete infinity to stem from syntax, semantics and phonology. This brings generative grammar better in line with the results of cognitive studies of other parts of the mind-brain, especially the visual system, in which multiple components contribute to what can be called the 'discrete infinity of visual language' (ibid. 16-17).
The second weak spot is Chomsky’s claim that “the language faculty is nonredundant, in that particular phenomena are not “overdetermined” by language” (Chomsky 1992: 2). Again Jackendoff argues that this is not on a par with other mental faculties. Depth perception, for example, is the result of parallel-processed and redundant information from eye convergence, lens focus, stereopsis, occlusion, and texture gradients (Jackendoff 1997: 14-15). Redundancy will indeed prove to be a very important concept within visual grammar, as the different parts of the brain that deal with syntax, semantics and phonology often contribute to the same interpretation, and thus their contributions are sometimes redundant (see section 2.5).
This leads Jackendoff to construct a model of the part of the mind-brain that deals with language, which he calls the 'Tripartite Parallel Architecture' (TPA, see figure 2.1). The TPA-model consists of three modules and two interfaces between these modules. The first two modules (the phonological and the syntactic module) and the two interfaces are part of the language faculty; the last, the conceptual module, is not. It is included in the TPA because the intrinsic characteristics of the conceptual faculty are projected onto the syntactic and phonological modules (ibid. 33-39). In the TPA-model of the language faculty, the phonological structure of a representation is analyzed first by the phonological module, and the phonological-syntactic interface (IPS) maps the result to the syntactic module. This module analyzes the syntactic structure of the representation, and the syntactic-conceptual interface (ISC) maps it to the conceptual module. In the case of sentence production this process is reversed.
Figure 2.1 - Jackendoff's tripartite parallel architecture
The interfaces play a vital role in Jackendoff’s TPA-model. The interfaces contain rules of correspondence by which structural information from one module can be mapped to another. The modules analyse the phonological and the syntactic structure of the representation, and this information is passed from one module to another. These rules of correspondence allow certain syntactic structures to be meaningful; with these rules syntactic structures can be mapped to meaningful conceptual structures.
Jackendoff distinguishes three types of rules of correspondence (Jackendoff 1997: 23):
1 Determinative rules: A must correspond to B
2 Default rules: A preferably corresponds to B
3 Permissive rules: A may correspond to B
The distinction between these three types of rules is important. Without this distinction one might be inclined to think of grammatical rules only as determinative rules. As we will see later, grammatical rules invoked in a representation can often contradict each other. The 'strength' of a rule influences which rule will prevail over another, conflicting rule.
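To make the interplay between rule types concrete, the three kinds of correspondence rules can be modelled as data with an associated strength. The numeric strengths and the `resolve` function below are my own illustrative assumptions, not part of Jackendoff's formalism; the sample rules paraphrase R2, R3 and R5 from later in this chapter.

```python
from dataclasses import dataclass

# Illustrative assumption: determinative rules always win, default rules
# beat permissive ones. The exact numeric values carry no theoretical weight.
STRENGTH = {"determinative": 3, "default": 2, "permissive": 1}

@dataclass
class CorrespondenceRule:
    name: str
    kind: str      # 'determinative' | 'default' | 'permissive'
    mapping: str   # informal statement of the correspondence

    @property
    def strength(self) -> int:
        return STRENGTH[self.kind]

def resolve(rules):
    """Return the rule that prevails when several conflicting rules
    are invoked by the same representation."""
    return max(rules, key=lambda r: r.strength)

r_positioning = CorrespondenceRule("R2 positioning", "determinative",
                                   "left-of in image -> left-of in concept")
r_value = CorrespondenceRule("R3 relative value", "default",
                             "larger sign -> larger value")
r_size = CorrespondenceRule("R5 size-volume", "permissive",
                            "larger sign -> louder sound")

# A default rule prevails over a conflicting permissive rule:
print(resolve([r_value, r_size]).name)  # R3 relative value
```

The point of the sketch is only that rule strength, not rule order, decides conflicts; a determinative rule such as R2 would override both of the others.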
Jackendoff identifies perhaps the most basic principle of correspondence in the ISC as follows (ibid. 36):
R1 Rule of embedding – if a phrase P1 is syntactically embedded within phrase P2, then concept C1, expressed by P1, is preferably embedded within concept C2, expressed by P2.
An instance of R1 can be found in the following sentence:
S1 The cat sat on the mat.
In S1 the phrase 'sat on the mat' is syntactically embedded within the phrase 'the cat sat on the mat'. Conceptually 'sat on the mat' is also embedded within 'the cat that sat on the mat' (or attributed to 'the cat'); it tells us to look for a cat (that sits on a mat) and not a mat (that lies under a cat).
Traditional generative grammar would have left out 'preferably' in R1 and replaced it with ‘must’; traditional grammar would define R1 as a determinative rule instead of a default rule. But Jackendoff provides some interesting counterexamples of sentences to which R1 cannot be applied and which show it cannot be a determinative rule. Consider the following sentences:
S2 A man walked in who was from Philadelphia.
S3 A man from Philadelphia walked in.
These two sentences have the same meaning. Their deep structures can thus be said to be the same. However, their surface structures are different. Sentence S2 makes perfect sense but does not follow R1: the phrase 'who was from Philadelphia' should preferably be linked directly to the words 'a man'. In S2 these are separated by the verb 'walked'. S3 does follow R1 and thus its surface structure resembles the deep structure more closely. This also explains why S3 requires less cognitive effort to comprehend than S2 does.
Another example of differently structured sentences that have the same meaning is:
S4 An occasional sailor walked by.
S5 Occasionally, a sailor walked by.
In S4 the word 'occasional' is embedded within the phrase 'an occasional sailor', suggesting that 'occasionally' is an attribute of 'a sailor'. This is obviously not the case: 'occasionally' is an attribute of 'sailors walking by'. Therefore the surface structure of S5 again resembles the shared deep structure of S4 and S5 more closely, even though in this case the cognitive effort involved in understanding S4 is not higher than that for S5. This is probably the result of S4 drawing on a strong conventional construction that is encountered frequently; it is an occasional convention, so to speak (ibid. 36-38).
In chapter 1 three types of correspondence were mentioned: direct, conventional and intuitive correspondence. These types of correspondence can be linked to types of rules. Direct correspondence is always based on a determinative rule, such as R2 below. Conventional correspondence is never based on a determinative rule: it is usually based on a default rule, such as R3 below. However, not all conventions are equally strong, and some weaker conventional correspondences can arguably be based on permissive rules. Intuitive correspondence can also be based on permissive and default rules (but not determinative rules). Rules such as R4 and the derived rules R5 and R6 below are clearly permissive, but I am inclined to call the default rule R1 intuitive rather than conventional.
R2 Rule of positioning – if in a natural image object A is depicted to the left of object B, then in the conceptual representation of the scene object A must be positioned to the left of object B.
R3 Rule of relative value – if the sign used to represent concept A is larger than the sign used to represent concept B, then the (financial, economic, moral, etc.) value of concept A is preferably larger than the value of concept B.
R4 General rule of intuitive correspondence – if the sign used to represent concept A is articulated in a way that differs from the sign used to represent concept B, then this difference in articulation may correspond to a meaningful difference between A and B within the conceptual domain of which both A and B are part.
R5 Rule of intuitive size-volume correspondence – if the sign used to represent sound A is larger than the sign used to represent sound B, then sound A may be conceptually represented as being louder than sound B.
R6 Rule of intuitive y-position-frequency correspondence – if the sign used to represent sound A is located higher than the sign used to represent sound B, then sound A may be conceptually represented as being higher in frequency than sound B.
Please note that these rules are examples, constructed intuitively and designed to illustrate the points made in the text above. All rules of correspondence presented so far are rules that can or must be applied to visual representations. This is of course intentional as it is the aim of this thesis to construct a visual grammar.
Jackendoff's TPA can only serve as a template for the purposes of this thesis. It deals with verbal language, not with visual language. I will use it to create a model of my own that is geared to visual language. However, first I must address a flaw: although Jackendoff calls the TPA-model parallel, the flow of information passes through three serially linked modules; it does not contain any parallel paths. It is not the goal of this thesis to correct the TPA, but before using it as a template this flaw needs to be corrected.
Neurological evidence suggests there are many areas in the brain involved in the processing of visual stimuli, and each of these areas seems to have its own function. These areas are dedicated to depth perception, feature detection, motion, etc. From the primary visual cortex there are two major pathways through the brain. One path leads through the inferotemporal cortex and is often referred to as the 'what system' because it plays a major role in visual identification. The other path leads through the posterior parietal cortex; its function is to locate the visual stimuli, and it is often called the 'where system'. Information from both paths flows into the parts of the mind-brain that are responsible for the higher cognitive functions (Reisberg 2001: 44-46).
This evidence calls for a model that consists of at least four parts. There is an articulation module dedicated to the analysis of the basic features of the visual stimuli. Its role is similar to that of Jackendoff's phonological module, but for obvious reasons its visual counterpart cannot be called phonological. The 'where system' is equivalent to the syntactic module, or is at least part of an equivalent module in the visual faculty. The 'what system' has no clear equivalent in Jackendoff's TPA; it will be the lexical module in my model. The articulation module feeds information into both the syntactic and the lexical modules, which both feed information into the conceptual module. In chapter 1 we saw that there exist meaningful means of articulation. These are analysed in the articulation module, and therefore I will assume that the articulation module also feeds information directly into the conceptual module. This leads to a model with four modules and five interfaces (IAS, IAL, IAC, ISC and ILC). I will call this model the Modular Parallel Architecture (MPA) of the visual faculty (see figure 2.2).
Figure 2.2 - Modular parallel architecture (MPA)
The flow of information from the eye through the MPA-model proceeds along different paths, often simultaneously. When looking at an image such as the image of a tree in chapter 1, the information is first analysed in the articulation module. The articulation module maps all information to the syntactic, lexical and conceptual modules. The information flows along various paths to converge in the conceptual module. This is an important point that needs to be stressed. As we will see in section 2.5, it is in the conceptual module that the results of the analyses by the modules of the language faculty are put together and a coherent 'picture' is formed, while all the other modules fulfil only a specialized role.
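The flow of information through the MPA can be sketched in toy computational form. This is a sketch under my own assumptions: each module is reduced to a placeholder function, the interfaces (IAS, IAL, IAC, ISC, ILC) to function calls, and it illustrates only how the paths diverge at the articulation module and converge in the conceptual module.

```python
# Toy MPA: modules are placeholder functions, interfaces are function calls.

def articulation_module(stimulus):
    # basic feature analysis of the visual input
    return {"forms": stimulus["forms"]}

def syntactic_module(articulation):
    # the 'where system': spatial relations between the recognized forms
    return {"layout": [(f, i) for i, f in enumerate(articulation["forms"])]}

def lexical_module(articulation):
    # the 'what system': identification of the recognized forms
    return {"identified": {f: f.upper() for f in articulation["forms"]}}

def conceptual_module(*analyses):
    # all paths converge here and a coherent 'picture' is assembled
    merged = {}
    for analysis in analyses:
        merged.update(analysis)
    return merged

stimulus = {"forms": ["f1", "f2"]}
art = articulation_module(stimulus)
concept = conceptual_module(art,                      # directly, via IAC
                            syntactic_module(art),    # via IAS, then ISC
                            lexical_module(art))      # via IAL, then ILC
print(sorted(concept))  # ['forms', 'identified', 'layout']
```

The design point is the fan-out/fan-in shape: one articulation result feeds three parallel paths, and only the conceptual module ever sees all three partial analyses together.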
The MPA has, in my opinion, one other advantage over the TPA. In the TPA-model it is tempting to falsely attribute too much 'logic' to the syntactic module because it is 'closer' to the conceptual module, whereas in the MPA all modules have a direct interface to the conceptual module. This reduces the risk of syntactocentrism.
Jackendoff states that the conceptual module in his TPA supports ‘verbal’ logic only. He explicitly states that some conceptual information cannot be represented visually and some conceptual information cannot be represented verbally. Thus, according to Jackendoff, there need to exist separate 'modules' for verbal and visual conceptual information (ibid. 43). I do not agree with this view. Unlike Jackendoff, I am inclined to argue that some conceptual information is merely best represented visually or verbally. One example of such a representation would be a map. It might be possible to describe a map verbally, but this would be very hard indeed; it is simply more efficient to draw a map. This efficiency is due to the direct correspondence between the conceptual information and a means of representation that allows for articulation in the X and Y dimensions. When the same level of detail is put into a verbal representation, that representation tends to become very complex. A set of directions, on the other hand, can be, and often is, represented verbally; the conceptual information represented by a set of directions has a sequential character and is therefore quite suitable for verbal representation, which is also sequential in character.
The visual faculty is limited by the intrinsic characteristics of the visual medium, just as it is limited by the intrinsic characteristics of the conceptual faculty of the mind-brain. Therefore it is only logical that the modules and interfaces of the visual faculty are geared differently than the modules and interfaces of the language faculty. The limitation of the language faculty by the intrinsic characteristics of the conceptual faculty is often called projection. In the MPA-model I acknowledge this conceptual projection (C-Projection) but also introduce a form of projection that stems from the visual medium (articulation projection or A-Projection). Thus the need for 'objects' and 'subjects' in verbal and, to a lesser extent, visual representations is a form of C-Projection. On the other hand, the predominance of spatial ordering in visual representation is the result of the characteristics of the visual medium and therefore a form of A-Projection. The introduction of A-Projection in the visual faculty is important because I feel that the characteristics of the medium play a more explicit role in visual representations than in verbal representations. In order to examine this role A-Projection needs to be distinguished from C-Projection.
I think that the most fundamental differences between the visual language faculty and the verbal language faculty can be traced back to differences in A-Projection rather than differences in C-Projection. The means of articulation differ in visual and verbal representations, causing a necessary difference in the nature of A-Projection for the visual and verbal language faculties. Attributing the difference between the visual and verbal faculties to a different C-Projection caused by different conceptual modules would be much more speculative. Therefore I will argue that, formally, there is no need to distinguish between separate conceptual modules for the visual and verbal faculties.
Jackendoff formulates another argument for this view. He argues that thought is independent from language. The 'voice in the head' that is commonly associated with thinking is in fact only the conscious result of an unconscious process. Sometimes this conscious result can also be a mental picture, if that is more convenient. In order for a thought to be expressed consciously it needs to be (partially) processed by the language faculty (Jackendoff 1997: 186-193). The conceptual module is invoked subconsciously, and the differences between verbal and visual thought are due to the likewise subconscious processing by either the visual or the verbal language faculty before the result becomes conscious.
When working with a modular model for the architecture of the mind it is very important to clearly indicate what is, and what is not, the function of each module. When analyzing an image it is far too easy to jump to an interpretation that is the result of delicate construction while failing to see its complex nature. Jackendoff stresses this point by pointing out that relational roles in language use, such as actors and goals, are not part of the syntactic representation but of its conceptual counterpart (ibid. 185). For the sake of formal clarity a grammar needs to be very specific about which module can contribute which element in the construction of meaning. When a module is considered to be a repertoire of possible constellations for its specific domain of representation, then the most important grammatical rules are established by the interfaces that allow the mapping of these representations between the modules, especially the IAC, ISC and ILC because these map directly into the conceptual module.
Figure 2.3 - The British used guns
Figure 2.4 - Abstraction of it
Figure 2.5 - Archetypical action process
For example, figure 2.3 is called an 'action process' by Gunther Kress and Theo Van Leeuwen (1996: 58) on the basis of the abstraction featured in figure 2.4. An action process consists of two participants (see chapter 1) and a vector. A vector is an implicit or explicit line, often diagonal, that connects one or two participants. A vector represents an action; according to Kress & Van Leeuwen it is the visual equivalent of a verb. One participant is called an actor and the other a goal, based on the function each has in this particular constellation. An 'action process' serves almost as an elementary grammatical unit, and there are several other such processes that function in a similar way. The basis on which Kress & Van Leeuwen classify figure 2.3 is a typical constellation, made more explicit in figure 2.4 and further abstracted in figure 2.5, which can be considered the archetypical action process.
One can suspect that the structure shown in figure 2.5 is somehow similar to the conceptual structure that is constructed when looking at figure 2.3. One needs only to replace the labels 'actor', 'vector' and 'goal' with 'the British', 'stalk' and 'aborigines'. This is a very complex construction that can conceivably be realized in the conceptual module, but how did it get there? Or, to put it differently, what articulation, syntactic and lexical structures can be found in figure 2.3 and how can these be mapped to a conceptual structure that is likely to resemble figure 2.5?
When looking at the level of the articulation module, the articulation structure might include nothing more than the recognition that there are two important forms (F1 and F2) distinguishable from a background, which are articulated in particular ways (A1 and A2). Let us also assume for the sake of simplicity that at this level a vector and its articulation ([V3 + A3]) can be recognized. The articulation structure can then be transcribed as follows:
Articulation structure: [[F1 + A1] + [F2 + A2] + [V3 + A3]]
This information is then mapped to the other modules in the MPA. The syntactic module adds syntactic information to the forms, while the lexical module adds lexical information:
Syntactic structure: [[F1 = left] + [F2 = right] + [V3 connects F1 to F2]]
Lexical structure: [[F1 = the British] + [F2 = aborigines] + [V3 = Ø]]
All information converges in the conceptual module, where a structure is constructed which might be formally notated as follows:
Conceptual structure: [[F1 = the British, left] + [F2 = aborigines, right] + [V3 connects F1 to F2]]
From this structure inferences are drawn that flesh out the meaning of the structure. Thus F1 can be seen as the 'actor' in this structure because an action vector connected to it points away from it. The vector itself can be identified as 'to stalk' because its articulation is (partially) realized by F1, whose posture can be identified as stalking, etc.
This example is simplified; more processes contribute to the construction of the meaning attributed to figure 2.3. Notably, the identification of 'the British' and 'aborigines' is helped by the context of the image (which is taken from an Australian history textbook). The participants themselves can be analysed into various identifiable parts, some of which, such as the guns, play an important role in the construction, but which I leave out for now. Also, the fact that the role of actor can be attributed to F1 on the basis of the syntactic information is complemented by the identification of 'the British' given the socio-historical context (or at least this seems to be preferred by the authors of the textbook, cf. Kress & Van Leeuwen 1996: 44-45).
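For illustration, the convergence of the three partial structures from the figure 2.3 example might be rendered as follows. The data structures and the single inference step are hypothetical simplifications of the analysis above, not a claim about actual mental representations.

```python
# Hypothetical transcriptions of the three partial structures for figure 2.3.
articulation = {"F1": "A1", "F2": "A2", "V3": "A3"}
syntactic = {"F1": "left", "F2": "right", "V3": ("F1", "F2")}  # V3 points F1 -> F2
lexical = {"F1": "the British", "F2": "aborigines", "V3": None}  # vector has no lexical item

def conceptual_structure(articulation, syntactic, lexical):
    """Converge the partial structures and infer the relational roles."""
    source, target = syntactic["V3"]
    return {
        # inferred: the action vector points away from F1, so F1 is the actor
        "actor": lexical[source],
        "goal": lexical[target],
        # the process itself must be inferred from the articulation (A3, posture)
        "process": lexical["V3"] or "<inferred from articulation>",
    }

print(conceptual_structure(articulation, syntactic, lexical)["actor"])  # the British
```

Note that the 'actor' role appears only in the output of the convergence step, never in the syntactic or lexical structure by itself, in line with Jackendoff's point that relational roles belong to the conceptual representation.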
Figure 2.6 - Weak actor
Figure 2.7 - Strong actor
The means of articulation do not play a large role in this representation, but compare figures 2.6 and 2.7. In these schematised representations of two action processes, the first can be said to have a 'weak' actor where the second has a 'strong' actor. This modification of the meaning is realized by the means of articulation of the actor and goal participants.
I would like to stress again the importance of convergence. Every module contributes to the final analysis in the conceptual module. In the examples discussed above these contributions all directed the construction of the conceptual structure in the same direction; they were highly convergent. This is not necessarily the case. Consider the following sentences:
S6 Bleak is weak.
S7 I like Ike.
S8 A beak is not weak.
In S6 the convergence is strong: the words 'bleak' and 'weak' rhyme and a conceptual similarity suggests itself from this similarity in articulation; this is further strengthened by the lexical similarity between the signifieds 'bleak' and 'weak'. One can say S6 is redundantly coded. In S7 the convergence (not the effect) is less strong because, although the words rhyme, they are not lexically related. In S8 the contributions of the different modules actually contradict each other: 'beak' rhymes with 'weak', again suggesting a conceptual relation, but this is contradicted by the syntactic structure ('is not') and by the conceptual identification of the signifieds 'weak' and 'beak', in which 'hard' is attributed to 'beak'. Contradictory or convergent contributions to the conceptual module are an important factor in the construction of meaning, causing a representation to be 'strong', 'well-formed' or 'ambiguous'.
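One way to make the notion of convergence explicit is to let each module cast a 'vote' for or against a suggested conceptual relation. The vote values and thresholds below are my own illustrative assumptions, not a claim about how the mind-brain actually weighs contributions.

```python
def convergence(phonological, lexical, syntactic):
    """Classify a representation from three module 'votes':
    +1 supports the suggested conceptual relation, 0 is neutral,
    -1 contradicts it. Thresholds are illustrative assumptions."""
    votes = [phonological, lexical, syntactic]
    if min(votes) < 0 < max(votes):
        return "contradictory"       # modules pull in opposite directions
    if sum(votes) == 3:
        return "redundantly coded"   # every module supports the relation
    return "convergent"              # support without full redundancy

# S6 'Bleak is weak': rhyme (+1), lexical similarity (+1), 'is' links them (+1)
print(convergence(+1, +1, +1))  # redundantly coded
# S7 'I like Ike': rhyme (+1), no lexical relation (0), 'like' links them (+1)
print(convergence(+1, 0, +1))   # convergent
# S8 'A beak is not weak': rhyme (+1), 'beak' is hard (-1), 'is not' (-1)
print(convergence(+1, -1, -1))  # contradictory
```

The sketch captures the ordering argued for in the text: S6 is more strongly coded than S7, while S8 is the only one in which the modules genuinely conflict.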
Something similar can be done visually. Consider the strong actor and weak goal configuration in figure 2.7. If the actor in this case were something conventionally known to be strong and the goal something conventionally known to be weak (a train on a collision course with a paper wall, for example), then these lexical meanings would converge with their strong and weak articulation respectively.
So, what should a visual grammar look like? We are now in a position to answer this question. A visual grammar should formally resemble the visual language architecture of the brain, or at least be compatible with it; a grammar thus constructed can give real insight into how meaning is construed from visual representations.
The MPA suggests there should be several groups of grammatical rules that operate on different levels of the representation and are finally superimposed. These different levels should not be confused with embedding as formalized in R1. They rather function like Roland Barthes' connotational procedures (such as the use of soft focus in photographs, which alters the meaning of the photograph as a whole) that operate on a level independent of syntactic structures. In fact, many of the connotational procedures described by Barthes can arguably be said to be meaningful ways of articulation that are mapped directly to the conceptual module (Barthes 1977: 34-38). When applied to Kress & Van Leeuwen's visual grammar this helps us to distinguish between the various grammatical procedures that operate in their meaningful grammatical constellations. For in the light of the discussion so far, Kress & Van Leeuwen only seem to describe typical grammatical constellations (such as the 'action process' discussed in section 2.5) in visual representation and attribute a meaning to them. From their work it does not become clear how all the different types of constellations can be combined and interact in complex representations. Worse, their constellations seem to draw from different grammatical modules without formally recognizing them, making it difficult to apply their theory to cases that do not fall clearly into one of their grammatical categories. To put it in different words: their grammar is descriptive, not generative; Kress & Van Leeuwen describe three constellations that can be made with the elements 'actor', 'goal' and 'action verb', where formally one can imagine or generate many more. This and more will be discussed in detail in chapter 4.
A grammar based on the MPA can generate an infinite set of grammatical configurations with a relatively simple and small (or at least finite) set of grammatical mechanisms, because different types of interaction are formally embedded within its structure. This quality of the MPA carries over to the grammatical configurations that can be derived from it; the model formally acknowledges discrete infinity.
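A small sketch can illustrate how a finite set of mechanisms yields discrete infinity: if each participant in an action process may itself be an embedded action process, recursion generates an unbounded set of constellations. The element names follow Kress & Van Leeuwen's terminology; the combinatorics are my own illustration, not their analysis.

```python
from itertools import product

participants = ["actor", "goal"]
articulations = ["weak", "strong"]  # meaningful means of articulation

def constellations(depth):
    """Enumerate action-process constellations in which each participant
    may itself be an embedded action process; recursion over a finite
    vocabulary produces an unbounded set (discrete infinity)."""
    if depth == 0:
        # simple participants: an articulation combined with a role
        return [f"{a} {p}" for a, p in product(articulations, participants)]
    inner = constellations(depth - 1)
    # an action vector connects any two (possibly embedded) participants
    return [f"[{x} -> {y}]" for x, y in product(inner, inner)]

print(len(constellations(0)))  # 4 simple participants
print(len(constellations(1)))  # 16 one-level action processes
print(len(constellations(2)))  # 256: the set grows without bound
```

The finite ingredients (two roles, two articulations, one vector, one embedding mechanism) never change; only recursion depth does, which is exactly the contrast with a descriptive list of three fixed constellations.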
One should not expect a visual grammar to give a clear and concise account of a particular visual representation. As Jackendoff stated: “meaning comes with inherent uncertainty around the edges” (Jackendoff 1994: 202). This holds especially true for visual representations, which in general are more subjective and less accurate than verbal representations. How does one verify a grammar if one cannot use it to make entirely accurate claims about visual representations? This is problematic indeed. I will try to answer this question in the next chapter.
 The reverse 'the mat lies under the cat' sounds rather odd; this is due to an asymmetry caused by what Jackendoff calls figure-ground organization (Jackendoff 1994: 191-193). Also see below.
 I do not wish to rule out the possibility that some particular form of information cannot be represented verbally or visually. Such information is, however, in my opinion very rare.
 The use of only one conceptual module that is connected to the verbal and visual language faculty is also very convenient when one is primarily interested in the language faculty. This is of course a gross abstraction, as it is very likely that the actual conceptual module in the brain consists of several highly specialized brain functions that can be found at various locations in the brain.
 It may sound odd to speak of lexical structure: in the context of this example the lexical structure is nothing more than a set of lexical items invoked by the identification of the various forms in the representation.