
Looking for one in the bnc

Nicholas LaCara – February 2020 – Boston, ma

I've been interested in the distribution of anaphoric one in English for a while. In this blog post, I discuss some of the recent corpus work I've been doing as I try to learn more about how it interacts with possessive structures, focusing on some searches I've done in the bnc using the nltk and some issues I've had to deal with while doing this work. This approach is new to me, so part of the goal here is to document the steps I took while learning how to do it, along with the reasoning behind the decisions I made.

Part II and Part III are now available!

Anaphoric one

As a linguist, one of the phenomena I'm most interested in is English anaphoric one. (In linguistics, anaphora refers to phenomena where the meaning of some element or expression is determined by reference to some other element or expression. The element that determines the meaning of an anaphoric expression is usually referred to as the antecedent, even when it follows the anaphoric expression that refers to it.) One is used to refer back to a noun or noun phrase that is used elsewhere in a discourse. A simple example is given in (1), where the word ones refers back to book.

  1. Mary bought a long book, and I bought three short ones.

Notice, here, that ones can't really be a numeral (such as two or three). It isn't possible to replace ones with another numeral here, so it really has to be something other than a numeral or number in this context.

The use of a word like one, while not limited to English, is fairly unusual compared to what many other languages do in similar contexts. For example, in Spanish, instead of using a word like one, it is possible to leave the noun out entirely in a phenomenon known as noun phrase ellipsis, or npe. In the Spanish example in (2), I've represented npe with an underscore (_), which I'll also do elsewhere.

  2. María compró un libro largo, y   yo compré tres  _ cortos.
     Maria bought a  book  long   and I  bought three _ short
     ‘Maria bought a long book, and I bought three short ones.’

Even within English, one is not used everywhere. Like Spanish, English uses noun phrase ellipsis, but it uses it in different places from where one is typically used. One of those places is after possessives:

  3. I didn't read Mary's book, but I read Sally's _.

What interests me is that there is actually some variation with regard to cases like (3). As I just mentioned, usually npe is used in places where one cannot be, and vice versa. Now, as far as I am aware, all speakers accept (3) as grammatical. Consequently, it was very surprising to me to find that some speakers accept (4) as grammatical as well. (I find (4) to be ungrammatical. Here, the % symbol means that some but not all speakers find the sentence acceptable.)

  4. %I didn't read Mary's book, but I read Sally's one.

I have some older work arguing that ellipsis and one cannot be classified as the same sort of anaphoric phenomenon. The project described here fits into a more recent project where I try to better understand the distribution of the two phenomena. For some of that work, see this handout.

Since I'm interested in the factors that determine when one is used instead of npe, I decided to take a look through some corpora to see if I could gain any insights into when one is used with possessives. Part of the difficulty in doing this, though, is that the word one is used in a number of different ways. I'm interested in the anaphoric use I showed in (1), which generally stands in for a noun in a broader noun phrase, but one can also be used as a numeral (I bought the one good book), as an indefinite pronoun replacing an entire noun phrase (I didn't buy a Dickens novel because I didn't want one), and as a so-called impersonal pronoun (One must always respect one's elders). After possessives, the distribution of these is more limited in general (the impersonal pronoun and the indefinite pronoun should not appear in this context), but it is possible for the numeral one to appear immediately after possessives. Furthermore, the numeral one does appear to allow for ellipsis:

  5. Sally bought two books, but I only bought one _.

Since I want to know more about these elements, I need (a lot) more data. In what follows, I detail some ongoing corpus work where I look for how one behaves in possessives in English. Right now, I'm using the British National Corpus (bnc). The bnc is tagged, which is helpful, but the tags are not sufficiently detailed for the distributional questions I'm interested in, and the texts were tagged automatically, so I am processing some of the data manually, retagging the word one when it occurs after possessive -'s and possessive determiners. Before I talk about that, though, I want to discuss why I chose the bnc and what the distribution of one looks like in that corpus.

Why the bnc?

tl;dr: The main reasons for using the bnc are that it's big, it's free, it works with the nltk, and previous work has also utilized it.

I'm using the Natural Language Toolkit (or nltk) to do the search and started off using the Brown Corpus. Unfortunately, the Brown Corpus wasn't giving me the results I needed (there were no tokens of one following possessives in Brown), so I figured I needed a bigger corpus.

I was interested in using the Corpus of Contemporary American English (coca) since I'm American, but coca cannot be downloaded for free and is not set up for use with the nltk, as far as I know.

I chose the bnc for a few reasons. First, it can be set up to work with the nltk, which meant I could use it in a workflow that I'm comfortable with. Second, I was aware that previous corpus work on one had used the bnc (Payne et al. 2013), and so it was a good jumping-off place for me. Furthermore, it is much larger than the Brown Corpus. While the Brown Corpus totals some one million words, the bnc is a one-hundred-million-word corpus. Additionally, the content of the bnc is much more recent, having been compiled in the late 20th century rather than the 1960s and 1970s.

Finally, one of the main advantages of using the bnc is that it is tagged, which helps with some aspects of the searches I want to carry out. However, the tagging of the bnc is not without its drawbacks. One of the main issues here is that, owing to its huge size, the bnc was automatically tagged and has not been checked by humans. The tags on one, in particular, are not very reliable. For example, the tag PNI (indefinite pronoun) is used both for anaphoric one and for impersonal one, so using the tags in the corpus as-is would not allow us to get a fully accurate picture of how anaphoric one is distributed.

However, the tags in the corpus do allow for a general picture of the distribution of numeral one and can help us find clear cases where anaphoric one follows possessives.

The starting place: An overview of one in the bnc

Anaphoric one and possessives

As an initial search, I looked for tokens of one tagged PNI (indefinite pronoun) where the preceding token was a possessive (including the possessive -'s and possessive determiners such as my, your, her, and others; tags POS or DPS), and collected sentences containing such tokens. This yielded 49 results. Since the results were few, I then read through them to see if there were any cases of anaphoric one immediately following a possessive.
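A search of this kind might be sketched as follows. This is a minimal sketch, not the script used for this post: the function name `one_after_possessive` and the toy sentences are made up for illustration, and the input is assumed to be sentences represented as lists of (word, C5 tag) pairs, with possessive -'s as its own token tagged POS.

```python
from typing import Iterable, List, Tuple

TaggedSent = List[Tuple[str, str]]

# C5 tags for possessive -'s (POS) and possessive determiners (DPS).
POSSESSIVE_TAGS = {"POS", "DPS"}


def one_after_possessive(sents: Iterable[TaggedSent],
                         one_tag: str = "PNI") -> List[TaggedSent]:
    """Collect sentences where 'one' tagged `one_tag` directly follows a possessive."""
    hits = []
    for sent in sents:
        # Walk over adjacent (previous token, current token) pairs.
        for (_, prev_tag), (word, tag) in zip(sent, sent[1:]):
            if (word.lower() == "one" and tag == one_tag
                    and prev_tag in POSSESSIVE_TAGS):
                hits.append(sent)
                break  # collect each sentence only once
    return hits


# Toy sentences standing in for corpus data (hypothetical, not bnc output).
toy = [
    [("my", "DPS"), ("one", "PNI"), ("died", "VVD")],
    [("Dad", "NP0"), ("'s", "POS"), ("one", "PNI"), ("was", "VBD")],
    [("one", "PNI"), ("must", "VM0"), ("try", "VVI")],
    [("their", "DPS"), ("one", "CRD"), ("serve", "NN1")],
]
print(len(one_after_possessive(toy)))  # 2
```

With the nltk's `BNCCorpusReader`, calling `tagged_sents(c5=True)` should yield sentences in roughly this shape; the corpus root and file pattern depend on the local installation and are left out here.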

There are, in fact, several clear cases of anaphoric one immediately following possessives. The following examples, at least to my eye, cannot be anything other than tokens of anaphoric one and not the numeral:

  6. (KSU 417) That car <pause> better condition than what my o-- , my one was.
  7. (KPG 3747) [In a conversation about spiders –NL] They were gonna, my Dad said yeah well my one, no, my Dad's one was called Rhino and the other one was called Elephant and their one died.

However, there are also some examples that are almost certainly mis-tagged. In the following examples, one is most likely a numeral, contrasting the number of serves (three serves to one serve) or hands (both versus one); yet both tokens of one are tagged PNI (indefinite pronoun) rather than CRD (cardinal numeral):

  8. (B03 2629) They were unbeaten all afternoon, despite six of their opponents having three serves, to their one!
  9. (FPH 1551) Both his hands now enclosing my one.

So there are tokens of anaphoric one following possessors in this corpus, confirming that this does happen (even if I personally consider these cases less than acceptable). The cases I have found so far look as though they come from spontaneous speech and not written texts. However, the tags, as I discuss immediately below, are not particularly reliable.

Distribution of tags

To get an idea of how one behaves in the corpus more broadly, I ran another search to see (i) how tokens of one are tagged throughout the corpus and (ii) how tokens of one are tagged immediately after possessive elements. To do this, I simply searched for all tokens of one or ones in the corpus and kept a tally of how each token was tagged. Furthermore, if the particular token of one or ones followed a word tagged POS or DPS, I kept a separate tally. Overall, there are 306,139 tokens of one or ones in the corpus, of which 579 occur immediately after an element tagged POS or DPS. The results of this search are in the table below.
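The two tallies described above could be kept with a pair of counters. The following is a minimal sketch under the same assumption as before (sentences as lists of (word, C5 tag) pairs); `tally_one_tags` and the toy data are hypothetical names and values, not the code or counts behind the table.

```python
from collections import Counter

POSSESSIVE_TAGS = {"POS", "DPS"}


def tally_one_tags(sents):
    """Count tags on 'one'/'ones' overall and directly after a possessive."""
    overall, after_poss = Counter(), Counter()
    for sent in sents:
        prev_tag = None  # reset so a possessive never carries over between sentences
        for word, tag in sent:
            if word.lower() in {"one", "ones"}:
                overall[tag] += 1
                if prev_tag in POSSESSIVE_TAGS:
                    after_poss[tag] += 1
            prev_tag = tag
    return overall, after_poss


# Toy sentences (hypothetical, not bnc output).
toy = [
    [("my", "DPS"), ("one", "PNI"), ("died", "VVD")],
    [("their", "DPS"), ("one", "CRD"), ("serve", "NN1")],
    [("three", "CRD"), ("short", "AJ0"), ("ones", "NN2")],
    [("one", "PNI-CRD"), ("of", "PRF"), ("them", "PNP")],
]
overall, after_poss = tally_one_tags(toy)
print(overall)
print(after_poss)
```

Working sentence by sentence, rather than over one flat stream of words, avoids counting a possessive at the end of one sentence as preceding a token of one at the start of the next.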

A couple of things to note about the tags: The automatic tagging system used by the bnc uses the C5 tag set. The relevant tags for the discussion here are PNI (indefinite pronoun), CRD (cardinal numeral), and NN2 (plural noun). The tagger assigns so-called ambiguity tags when it isn't confident of the tag. In the results here, these tags are PNI-CRD and CRD-PNI; the more likely tag is the first of the two.

                  All tokens            After possessives
Tag               Number    Percent     Number    Percent
PNI                78525     25.65%         49      8.46%
PNI-CRD            21966      7.18%         54      9.33%
CRD               189299     61.83%        454     78.41%
CRD-PNI             4814      1.57%          4      0.69%
NN2                11503      3.76%         18      3.11%
All other tags        32      0.01%          0      0.00%
Total             306139                   579

Counts of the different tags used for tokens of one and ones in the bnc, throughout the corpus and specifically after possessives (tags DPS and POS).

The first thing to see is that the frequency of tokens tagged PNI and PNI-CRD is much lower after possessives than in the corpus as a whole. All else being equal, one might expect the frequency of tags after possessives to match the frequency of tags elsewhere in the corpus, but this doesn't seem to be true. Taken together, PNI and PNI-CRD make up a total of 32.83% of tokens in the whole corpus, but only 17.79% of tokens when immediately following a possessive. Since anaphoric one is tagged PNI in the corpus, this might suggest that it is, in fact, less common after possessives. However, as I mention above, impersonal one is also tagged PNI, and since this element, as a pronoun, cannot occur after a possessive under normal circumstances, this could also lead to a lower number of PNI tags after possessives.
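The combined percentages quoted above can be rechecked directly from the counts in the table. This short sketch only redoes that arithmetic; the dictionary and function names are made up for illustration.

```python
# Counts copied from the table above.
counts_all = {"PNI": 78525, "PNI-CRD": 21966, "CRD": 189299,
              "CRD-PNI": 4814, "NN2": 11503, "other": 32}
counts_poss = {"PNI": 49, "PNI-CRD": 54, "CRD": 454,
               "CRD-PNI": 4, "NN2": 18, "other": 0}


def pni_share(counts):
    """Percentage of tokens tagged PNI or PNI-CRD."""
    total = sum(counts.values())
    return 100 * (counts["PNI"] + counts["PNI-CRD"]) / total


print(f"{pni_share(counts_all):.2f}%")   # 32.83%
print(f"{pni_share(counts_poss):.2f}%")  # 17.79%
```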

A further concern is the ambiguous tag PNI-CRD. As the corpus documentation points out, specifically in reference to tagging tokens of one, ‘the reliability of the ambiguity tag PNI-CRD (in which the pronoun is rated more likely) is somewhat low’. Overall, it appears there is just a 3-in-8 chance that the PNI tag is the right one for elements tagged PNI-CRD. However, this figure extends to all tokens tagged this way, not just tokens of one.

There are a few other things worth pointing out here (for instance, as discussed above, some tokens tagged PNI should probably be CRD), but the main takeaway is that the information provided by the tags is simply not reliable enough as-is to draw any clear conclusions about the distribution of anaphoric one (or at least, it is not robust enough for me to be comfortable making any empirical claims based on these results). Because of this, I've decided to go through the examples and re-tag them to get a better picture of what the distribution looks like.

Re-tagging one

So to get better, more accurate data about the distribution of anaphoric one in the corpus, I've decided to go back and re-tag tokens of one with more descriptive tags that better let me see the phenomena I'm interested in. This seems appropriate to me since, as mentioned above, the bnc tags have undergone minimal post-editing, and since there are errors and other inaccuracies in the tags for the tokens I'm interested in. (According to the reference guide, ‘some manual tagging was undertaken to correct some particularly blatant errors, mainly foreign or classical words embedded in English text’.) What needs to be done at this point is: (a) decide what information needs to be represented in the new tags, (b) establish some criteria for how to assign those tags, (c) develop a way of writing those tags to the data, and (d) come up with a representative sample to compare the possessive data to and re-tag that.

How to tag the data?

First, I need a tag set that adequately distinguishes differences in the types of one that could feasibly be found in the corpus so I can get a more accurate view of what each token of one is. This will be a custom tag set just for looking at the distribution of one, and it will mix part-of-speech, morphological, and semantic information (since there are two different pronouns one that ought to be distinguished). The tags I'm envisioning right now are:

  1. 1A1 - Singular anaphoric one that stands in for a noun.
  2. 1A2 - Plural anaphoric one that stands in for a noun.
  3. 1PA - Pronominal anaphoric one that stands in for a full DP.
  4. 1PI - Impersonal pronoun one
  5. 1CN - Cardinal numeral one
  6. 1UA - Unclassified token of one (ambiguous in context)
  7. 1UC - Unclassified token of one (not enough context)
  8. 1UU - Unclassified token of one (context unclear)

This is probably a more fine-grained system than I need, given that after a possessive only anaphoric and numeral one are expected to appear in any large number, but it is better to have more detail in this instance than not enough.
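For use in a script, the tag set above could be stored as a simple lookup table with a check on whatever tag gets typed in. This is a sketch only; `ONE_TAGS` and `validate_tag` are hypothetical names, and the descriptions are copied from the list above.

```python
# Custom tag set for 'one', keyed by tag code.
ONE_TAGS = {
    "1A1": "Singular anaphoric one that stands in for a noun",
    "1A2": "Plural anaphoric one that stands in for a noun",
    "1PA": "Pronominal anaphoric one that stands in for a full DP",
    "1PI": "Impersonal pronoun one",
    "1CN": "Cardinal numeral one",
    "1UA": "Unclassified token of one (ambiguous in context)",
    "1UC": "Unclassified token of one (not enough context)",
    "1UU": "Unclassified token of one (context unclear)",
}


def validate_tag(tag: str) -> str:
    """Normalize a typed tag and reject anything outside the tag set."""
    tag = tag.strip().upper()
    if tag not in ONE_TAGS:
        raise ValueError(f"Unknown tag {tag!r}; expected one of {sorted(ONE_TAGS)}")
    return tag


print(validate_tag(" 1a1 "), "-", ONE_TAGS["1A1"])
```

Rejecting unknown codes up front keeps typos made during a long tagging session from ending up in the saved data.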

I'm also considering a tag for one when it appears in a title. There are several sentences I have found where one is the first word of a title that immediately follows a possessive. For example, in (10), one (correctly tagged CRD) follows the possessive Simpson's, but one is part of the title of Simpson's work, not part of a possessive noun phrase.

  10. (A06 65) Of the British playwrights, Pinter is often thought to be on the edges of ‘absurdism’ and you could also read N.F. Simpson's One Way Pendulum and Cresta Run.

The other thing I need is a set of criteria by which to assign these tags. Although I will be tagging, in part, based on my personal intuition as a native English speaker, I've been trying to find a few criteria that distinguish between the various varieties of one, especially in cases where the correct tag may not be immediately clear:

  a. Is it plural?
    If one is plural ones, then it is anaphoric one (1A2). (Criteria a. and b. are also used by Payne et al. (2013).)
  b. Does one appear before an adjective?
    Anaphoric one does not appear before adjectives, so any token of one before an adjective should be counted as a numeral (1CN).
  c. Does the antecedent contain a numeral?
    If there is a numeral in the antecedent with which one appears to contrast, then one is most likely a numeral (1CN). After an adjective, one is most likely anaphoric one (1A1 or 1A2).
  d. Does the token of one stand on its own?
    If one occurs without determiners, modifying adjectives, or other adjuncts, it is usually not anaphoric one (1A1 or 1A2), but one of the pronominal variants (1PA or 1PI).
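To make the criteria above concrete, here is one way they might be turned into a rough first-pass suggestion. This is a sketch under several assumptions, not the procedure used in this project: the function `suggest_tags`, the order in which the checks run, and the choice of C5 tags counted as determiner-like are all illustrative, and whether the antecedent contains a contrasting numeral is passed in as a human judgment rather than computed. The final decision would still rest on reading the example.

```python
from typing import List, Optional

ADJ_TAGS = {"AJ0", "AJC", "AJS"}          # C5 adjective tags
DET_TAGS = {"AT0", "DT0", "DPS", "POS"}   # articles, determiners, possessives


def suggest_tags(word: str, prev_tag: Optional[str], next_tag: Optional[str],
                 antecedent_has_numeral: bool = False) -> List[str]:
    """Return candidate tags for a token of one, most likely first."""
    # (a) plural ones is anaphoric
    if word.lower() == "ones":
        return ["1A2"]
    # (b) one directly before an adjective counts as a numeral
    if next_tag in ADJ_TAGS:
        return ["1CN"]
    # (c) one contrasting with a numeral in the antecedent is most likely a numeral
    if antecedent_has_numeral:
        return ["1CN"]
    # (c, second part) one directly after an adjective is most likely anaphoric
    if prev_tag in ADJ_TAGS:
        return ["1A1"]
    # (d) no determiner or adjective before it: one of the pronominal variants
    #     (following adjuncts are not checked in this sketch)
    if prev_tag not in DET_TAGS:
        return ["1PA", "1PI"]
    # After a determiner or possessive with no other cue: leave it to the annotator
    return ["1A1", "1CN"]


print(suggest_tags("one", "DPS", "AJ0"))   # ['1CN']
print(suggest_tags("one", "AJ0", "PUN"))   # ['1A1']
print(suggest_tags("one", "DPS", "VBD"))   # ['1A1', '1CN']
```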

The data

As discussed here, I've already extracted from the corpus the sentences containing tokens of one that I'm interested in (tokens following possessives). Along with these sentences, I've also extracted the immediately preceding sentence, since anaphoric one can refer to nouns in previous utterances and it is not always possible to tell if a token of one is a numeral or an anaphoric case without this additional context. I'm hoping one sentence of previous material will be enough to make a determination, but I can always go back and extract more data if necessary.

After looking through some of the examples of spontaneous speech, however, I'm no longer sure one sentence will be enough. More context will probably be necessary for these cases, and I need to think about how I want to deal with that. Fortunately, the script I've developed for doing this (described below) is easy enough to adapt to get a greater number of previous sentences.
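Pulling out each matching sentence together with an adjustable number of preceding sentences could look something like the sketch below. The names `with_context`, `is_hit`, and `n_prev` are hypothetical, and this is not the extraction script mentioned above; it only illustrates how the amount of context can be made a parameter.

```python
from collections import deque


def with_context(sents, is_hit, n_prev=1):
    """Yield (previous sentences, sentence) for every sentence where is_hit is true."""
    window = deque(maxlen=n_prev)  # holds at most the last n_prev sentences
    for sent in sents:
        if is_hit(sent):
            yield list(window), sent
        window.append(sent)


# Toy "sentences" as plain strings (hypothetical data).
toy = ["I like spiders.", "We had two.", "My one died.", "It was sad."]
hit = lambda s: "My one" in s

print(list(with_context(toy, hit, n_prev=1)))
print(list(with_context(toy, hit, n_prev=2)))
```

Running this over one corpus file at a time would keep the context window from reaching back into an unrelated text.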

I'll also need to sample the corpus to get sentences to compare these results to. This is an area that is totally new to me (which is part of why I'm working on this project!), so it's probably one of the last things I will do. Once I get this sample, I will work through re-tagging instances of one in it, too, so that distributional information about one can be compared to those tokens appearing after possessives.

The process

The main goal is to create new tagged corpus data I can further manipulate with the nltk. I've written a Python script that takes the tagged data I've extracted from the corpus, finds individual tokens of one and asks me how I'd like to tag it. After it presents each token of one found in the data, it writes the new tags to those tokens and, when I quit the script, writes all of the re-tagged sentences to a text file using json. The script uses the saved data in the json file to figure out how many sentences I've tagged so far so that I can start re-tagging from where I left off. I'm hoping this will speed up the task of re-tagging hundreds of tokens (fortunately I have a lot of free time right now).
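The save-and-resume behaviour described above might be sketched like this. It is an illustration under stated assumptions, not the actual script: the function names, the file layout (a json list of re-tagged sentences), and the `ask` callback (which returns a new tag, or `None` to quit) are all made up here. In interactive use, `ask` could wrap Python's built-in `input()`.

```python
import json
import tempfile
from pathlib import Path


def load_progress(path):
    """Return the sentences re-tagged so far (empty list if no file yet)."""
    p = Path(path)
    return json.loads(p.read_text(encoding="utf-8")) if p.exists() else []


def save_progress(path, retagged):
    Path(path).write_text(json.dumps(retagged, ensure_ascii=False), encoding="utf-8")


def retag(sents, path, ask):
    """Re-tag tokens of one, resuming after the sentences already saved."""
    done = load_progress(path)
    for sent in sents[len(done):]:        # skip sentences finished earlier
        new_sent = []
        for word, tag in sent:
            if word.lower() in {"one", "ones"}:
                tag = ask(sent, word, tag)
                if tag is None:           # quit: drop the unfinished sentence
                    save_progress(path, done)
                    return done
            new_sent.append([word, tag])  # lists, since json has no tuples
        done.append(new_sent)
    save_progress(path, done)
    return done


# Demo with scripted answers in place of keyboard input (toy data).
toy = [[("my", "DPS"), ("one", "PNI")], [("their", "DPS"), ("one", "PNI")]]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "retagged.json"
    answers = iter(["1A1", None])         # tag the first token, then quit
    first = retag(toy, path, lambda s, w, t: next(answers))
    second = retag(toy, path, lambda s, w, t: "1CN")  # picks up at sentence two
print(len(first), len(second))  # 1 2
```

Counting the saved sentences to find the restart point only works if the input sentences are always presented in the same order.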

That's all I have for now. Once I have some preliminary results, I'll write about them on this blog and let you know how it went!