Okay, time to look at using combinations of the three sampling methods. I am going to start with temperature followed by one of the other two sampling techniques: top-k or top-p. In every combination I test, temperature will always be applied first, and top-k will always be applied before top-p. (The apply_sampling function is, hopefully, already written that way, though I have not yet tested it with combinations of sampling constraints.)
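
For reference, here is a simplified sketch of that ordering. This is not the actual apply_sampling from the previous post, just my shorthand for how such a function might chain the three constraints, masking rejected tokens to -Inf (which is what the counting code further down relies on).

    import torch

    # sketch only: temperature first, then top-k, then top-p
    def apply_sampling(logits, temp=1.0, topk=0, topp=0.0):
      logits = logits / temp
      if topk and topk < len(logits):
        # keep the topk largest logits, mask the rest to -Inf
        kth_val = torch.topk(logits, topk).values[-1]
        logits = logits.masked_fill(logits < kth_val, -float('Inf'))
      if topp:
        # nucleus filtering: mask tokens outside the smallest set whose
        # cumulative probability reaches topp
        probs = torch.softmax(logits, dim=0)
        srt_p, srt_i = torch.sort(probs, descending=True)
        cum_p = torch.cumsum(srt_p, dim=0)
        logits[srt_i[cum_p - srt_p > topp]] = -float('Inf')
      return logits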

Two Technique Combinations

Let’s start with the three combinations using two techniques each.

Temperature and Top-k

This is likely the most straightforward combination. However, at this point I am not sure how many combinations of values to try out. I would like to use a few different values of top-k for each of a few values of temperature, but I also don’t want to get too carried away. That said, I will start by getting somewhat carried away.

As soon as I started trying to code this, things definitely went south. There must be a better approach than the one I am using, but for now it will have to do, no matter how messy it is. I am guessing this is a case where using classes might have been more efficient, but I am not quite ready to refactor this code.

I have moved my previous code, which worked with one technique at a time, into an if block. The new code will go in the related else block. I am hoping there will be no need for more blocks (i.e. elif). I did the same for the code creating the outs variable: a list for the previous experiments, but a dictionary for the current experiments. I needed some way to account for the various combinations of values.
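
(As an aside, and a possible tidy-up for those nested loops, the same keyed dictionary could be built with itertools.product. Just a sketch using the s_vals and p_txt names from the code below, not what I actually ran.)

    from itertools import product

    # sketch: one "temp value" key per combination, each starting from the prompt text
    outs = {f"{t} {k}": p_txt[:] for t, k in product(s_vals[0], s_vals[1])}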

In the processing loop, I split the textual representation of each value combination into its two values, convert them to the appropriate numeric types and use those in the call to apply_sampling(). I also use the textual key to record the number of available choices for each combination, which is displayed in the experimental output in the terminal. Otherwise the code is pretty much the same as in the previous post, where only one sampling technique was used at a time.

    use_what = "temperature top-k"
... ...
      case "temperature top-k":
        s_vals.append(t_temps[2:6])
        s_vals.append(t_topk[0:4])
... ...
    if " " not in use_what:
      outs = [p_txt[:] for _ in range(len(s_vals))]
    else:
      outs, nbr_k = {}, {}
      if len(s_vals) == 2:
        for t in s_vals[0]:
          for k in s_vals[1]:
            outs[f"{t} {k}"] = p_txt[:]
... ...
    with torch.no_grad():
      if len(s_vals) not in [2, 3]:
        while len(outs[0]) < max_wds:
... ...
      else:
        print(f"use_what: {use_what}\nouts.keys(): {list(outs.keys())}")
        k_out1 = list(outs.keys())[0]
        while len(outs[k_out1]) < max_wds:
          for cmb_key, out in outs.items():
            v_temp, v_topk = map(float, cmb_key.split(" "))
            # topk value must be int
            v_topk = int(v_topk)
    
            # encode the text generated so far and run it through the LSTM
            inp = torch.tensor([[corpus.vocab.word2idx[w] for w in out]])
            inps = inp.to(cfg.device)
            outp, (hh, hc) = lstm(inps, (hh, hc))
            # logits for the next token come from the last position of the output
            logits = outp[0][-1]

            if use_what == "temperature top-k":
              nw_logits = apply_sampling(logits, temp=v_temp, topk=v_topk)
              # l_lens.append(len(nw_logits))
              # count of logits not masked to -Inf, i.e. tokens available for selection
              nbr_k[cmb_key] = nw_logits.gt(-float('Inf')).sum()

            # convert the constrained logits to probabilities and sample the next token
            p = nn.functional.softmax(nw_logits, dim=0).detach().cpu().numpy()
            nxt_tk_idx = cfg.rng.choice(len(nw_logits), p=p)
            outs[cmb_key].append(corpus.vocab.idx2word[nxt_tk_idx])

        for cmb_key, outp in outs.items():
          o_txt = tidy_output(" ".join(outp))
          print(f"{use_what}({cmb_key}) ({nbr_k[cmb_key]}) -> {o_txt}")

And here’s the output of one experimental run.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python nlp.py -rn rk1 -bs 32 -se 50
... ...
        loading runs\rk1_sv\lstm_50.pt
initial input: ['the', 'prince', 'was']
use_what: temperature top-k
outs.keys(): ['0.5 25', '0.5 50', '0.5 100', '0.5 500', '0.75 25', '0.75 50', '0.75 100', '0.75 500', '1.0 25', '1.0 50', '1.0 100', '1.0 500', '1.25 25', '1.25 50', '1.25 100', '1.25 500']
temperature top-k(0.5 25) (25) -> the prince was going to see him. anna went up to her, and
temperature top-k(0.5 50) (50) -> the prince was looking at him with a scared eyes. she was standing with
temperature top-k(0.5 100) (100) -> the prince was experiencing. but he was not merely getting up. her eyes
temperature top-k(0.5 500) (500) -> the prince was on a long while, and had never forgotten his own conversation
temperature top-k(0.75 25) (25) -> the prince was in a corner of the drawing room. and the doctor had
temperature top-k(0.75 50) (50) -> the prince was in petersburg. anna’s face was telling her as he
temperature top-k(0.75 100) (100) -> the prince was in petersburg, not to moscow, and had just to get
temperature top-k(0.75 500) (500) -> the prince was not at home. he’s very low spirited,”
temperature top-k(1.0 25) (25) -> the prince was in one transient room, and was a big beauty in her
temperature top-k(1.0 50) (50) -> the prince was not well talking of the problem. “it’s not only
temperature top-k(1.0 100) (100) -> the prince was in her house. when on the day. kitty did not
temperature top-k(1.0 500) (500) -> the prince was sitting in shelter, while she had reached the theater to go
temperature top-k(1.25 25) (25) -> the prince was very well. “you know this inequality is that one would give
temperature top-k(1.25 50) (50) -> the prince was the feeling. and the more, and i have found his
temperature top-k(1.25 100) (100) -> the prince was being in a smile. he looked at him when she hid
temperature top-k(1.25 500) (500) -> the prince was in english use to a german countess after it. there was

Temperature and Top-p

Okay, let’s move on to the next combination. It should be pretty straightforward; I don’t see it being significantly different from the preceding one. And I expect that will be the case for the last two-technique combination as well.

Going to be some code repetition. But I will try to eliminate as much of that as possible.

Well, the coding was rather easy. But I didn’t account for the fact that the temperature affects the probabilities top-p ends up working with, so the number of viable choices changes accordingly. See the following terminal output. The single number in the second set of brackets is the number of tokens available from which to select the next token, at least for that particular iteration.

... ...
initial input: ['the', 'prince', 'was']
use_what: temperature top-p
... ...
temperature top-p(0.5 0.25) (2) -> the prince was saying
temperature top-p(0.5 0.5) (3) -> the prince was a
temperature top-p(0.5 0.75) (13) -> the prince was a
temperature top-p(0.5 0.9) (38) -> the prince was in
temperature top-p(0.75 0.25) (3) -> the prince was in
temperature top-p(0.75 0.5) (12) -> the prince was at
temperature top-p(0.75 0.75) (46) -> the prince was the
temperature top-p(0.75 0.9) (119) -> the prince was sent
temperature top-p(1.0 0.25) (7) -> the prince was a
temperature top-p(1.0 0.5) (33) -> the prince was on
temperature top-p(1.0 0.75) (114) -> the prince was known
temperature top-p(1.0 0.9) (276) -> the prince was who
temperature top-p(1.25 0.25) (16) -> the prince was so
temperature top-p(1.25 0.5) (70) -> the prince was without
temperature top-p(1.25 0.75) (212) -> the prince was to
temperature top-p(1.25 0.9) (486) -> the prince was very

Some pretty low numbers for the available options in early iterations. I don’t expect that is a particularly good thing.
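
To see why the counts come out this way, here’s a small, self-contained example, with made-up logits rather than anything from the model, showing how the number of tokens inside a fixed top-p cutoff grows (or at least never shrinks) as the temperature goes up.

    import torch

    logits = torch.arange(10.0, 0.0, -1.0)  # made-up logits, not from the model
    for temp in [0.5, 0.75, 1.0, 1.25]:
      probs = torch.softmax(logits / temp, dim=0)
      srt_p, _ = torch.sort(probs, descending=True)
      cum_p = torch.cumsum(srt_p, dim=0)
      # size of the smallest set whose cumulative probability reaches 0.75
      n_in = int((cum_p - srt_p <= 0.75).sum())
      print(f"temp={temp}: {n_in} tokens inside top-p=0.75")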

So, I need to do some thinking before continuing. I expect I am going to have to sort out some function to produce a suitable top-p value based on the temperature value. And that might only work for the current model; I am not sure, at this time, whether there is a way to generalize it.

Well, after a night’s rest, I am thinking I will write a function. It will take a number of parameters specifying things like the minimum and maximum number of token choices, the current logits, and a temperature value. It may need others, but for the moment I think those will be enough. After some arithmetic, it will return the requested number of top-p values that satisfy the passed arguments.
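
For what it’s worth, here is a rough, untested sketch of what I have in mind. The name get_topp_range and its parameters are placeholders of my own, not code that exists in the project. The idea: after temperature scaling, the sorted cumulative probabilities tell you which top-p values would keep between n_min and n_max tokens, and the function hands back nbr values from that range.

    import torch

    # hypothetical helper, sketched but never actually written
    def get_topp_range(logits, temp, n_min, n_max, nbr=4):
      probs = torch.softmax(logits / temp, dim=0)
      srt_p, _ = torch.sort(probs, descending=True)
      cum_p = torch.cumsum(srt_p, dim=0)
      # any top-p in [p_low, p_high) keeps between n_min and n_max tokens
      p_low = cum_p[n_min - 2].item() if n_min > 1 else 0.0
      p_high = cum_p[n_max - 1].item()
      return torch.linspace(p_low, p_high, nbr + 1)[:-1].tolist()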

Note: a few articles I looked at suggested that combining temperature and top-p was not a good idea.

Generally speaking, one does not modify both Temperature and Top P at the same time. This is mostly because you destroy any hope of intuition if things don’t work. They both heavily influence the outcome, and they could easily cancel each other or amplify each other’s impact to the point where neither is meaningful.

How to Tune LLM Parameters for Top Performance, Erik Hyrkas

Apparently, I ran into just such a situation. But, even bad experiences can lead to knowledge.

Get Range of Top-p Values

I am for now going to ignore the warning above and continue with the current combination experiment.

Additionally, I have decided against writing that function. I am simply going to select a much higher set of top-p values, more or less evenly distributed between \(.75\) and \(.99\). I am also going to refactor the code to save the number of available tokens at each iteration for each combination. Curiosity.

... ...
    # generate a list of nbr values, roughly evenly distributed between pmin and pmax, inclusive
    def get_topps(nbr=4, pmin=.75, pmax=.99):
      topps = [pmin]
      adj = int(((pmax * 100) - (pmin * 100)) / nbr) / 100
      print(adj)
      for i in range(1, nbr - 1):
        topps.append(round(topps[i-1] + adj, 2))
      topps.append(pmax)
      return topps
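    # e.g. get_topps(nbr=4, pmin=.75, pmax=.99) works out adj = 0.06 and
    # returns [0.75, 0.81, 0.87, 0.99] (the final step up to pmax is larger than adj)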
... ...
      case "temperature top-k" | "temperature top-p":
        s_vals.append(t_temps[2:6])
        if use_what == "temperature top-k":
          s_vals.append(t_topk[0:4])
        else:
          topps = get_topps(nbr=4, pmin=.75, pmax=.99)
          s_vals.append(topps)
... ...
    # new line of code at bottom of block
    if " " not in use_what:
      outs = [p_txt[:] for _ in range(len(s_vals))]
    else:
      outs, nbr_k = {}, {}
      if len(s_vals) == 2:
        for t in s_vals[0]:
          for k in s_vals[1]:
            outs[f"{t} {k}"] = p_txt[:]
            nbr_k[f"{t} {k}"] = []
... ...
            if use_what == "temperature top-k":
              nw_logits = apply_sampling(logits, temp=v_smp1, topk=v_smp2)
              # l_lens.append(len(nw_logits))
              nbr_k[cmb_key].append(nw_logits.gt(-float('Inf')).sum().cpu().item())
            elif use_what == "temperature top-p":
              nw_logits = apply_sampling(logits, temp=v_smp1, topp=v_smp2)
              nbr_k[cmb_key].append(nw_logits.gt(-float('Inf')).sum().cpu().item())

            p = nn.functional.softmax(nw_logits, dim=0).detach().cpu().numpy()
            nxt_tk_idx = cfg.rng.choice(len(nw_logits), p=p)
            outs[cmb_key].append(corpus.vocab.idx2word[nxt_tk_idx])
          # breaks

        for cmb_key, outp in outs.items():
          o_txt = tidy_output(" ".join(outp))
          print(f"{use_what}({cmb_key}) -> {o_txt}\n\t{nbr_k[cmb_key]}")

And the terminal output, which really does seem to reinforce the warning above.

(mclp-3.12) PS F:\learn\mcl_pytorch\proj8> python nlp.py -rn rk1 -bs 32 -se 50
... ...
        loading runs\rk1_sv\lstm_50.pt
initial input: ['the', 'prince', 'was']
use_what: temperature top-p
... ...
temperature top-p(0.5 0.75) -> the prince was saying to him. “i’ve not been going to the
        [17, 3, 1, 1, 6, 2, 2, 4, 3, 11, 2, 3]
temperature top-p(0.5 0.81) -> the prince was going to see him.” “so i’m going to
        [15, 3, 3, 3, 2, 6, 12, 2, 2, 1, 2, 1]
temperature top-p(0.5 0.87) -> the prince was in the house. the doctor had gone to the bedroom,
        [31, 2, 8, 1, 12, 10, 6, 6, 5, 1, 7, 2]
temperature top-p(0.5 0.99) -> the prince was brought to the house. the princess, with a friendly smile
        [150, 17, 6, 62, 8, 35, 64, 25, 36, 5, 76, 11]
temperature top-p(0.75 0.75) -> the prince was, and the princess said to the mother. she had long
        [52, 3, 14, 30, 13, 4, 3, 24, 4, 27, 6, 13]
temperature top-p(0.75 0.81) -> the prince was in the doorway, and vronsky had heard that the doctor had
        [68, 2, 12, 2, 1, 26, 13, 33, 14, 9, 56, 5]
temperature top-p(0.75 0.87) -> the prince was very well, and, after the first time, was kept
        [97, 27, 8, 1, 28, 28, 27, 27, 5, 6, 17, 109]
temperature top-p(0.75 0.99) -> the prince was in an old room. “come, i’m very glad
        [451, 96, 512, 220, 5, 145, 11, 147, 47, 3, 163, 27]
temperature top-p(1.0 0.75) -> the prince was in love with the prince. stepan arkadyevitch’s voice was
        [115, 19, 3, 11, 51, 2, 43, 1, 19, 1, 136, 33]
temperature top-p(1.0 0.81) -> the prince was received. when she went upstairs, anna had already got out
        [155, 17, 37, 15, 17, 9, 3, 14, 34, 64, 95, 7]
temperature top-p(1.0 0.87) -> the prince was as a princess, kitty and her sister, grisha had not
        [224, 84, 142, 21, 35, 49, 48, 24, 33, 36, 17, 109]
temperature top-p(1.0 0.99) -> the prince was softened he was experiencing friends, and from the scene of excitement
        [921, 85, 273, 798, 379, 40, 204, 502, 256, 956, 53, 772]
temperature top-p(1.25 0.75) -> the prince was perfectly wrong, so that all that was he was talking of
        [212, 249, 5, 14, 28, 17, 21, 42, 184, 26, 180, 15]
temperature top-p(1.25 0.81) -> the prince was quite drunk. the colonel made his letter to his grandfather.
        [285, 313, 5, 67, 276, 78, 55, 270, 21, 30, 115, 5]
temperature top-p(1.25 0.87) -> the prince was often there, and that some new after the most important elder
        [396, 367, 68, 81, 136, 140, 544, 463, 517, 339, 261, 387]
temperature top-p(1.25 0.99) -> the prince was heard and standing in those head came into the karenins’.
        [1433, 639, 1488, 508, 634, 1317, 365, 222, 294, 1875, 254, 412]

Still some pretty low available token counts for some iterations in each of the value combinations.

Done I Believe

I am beginning to believe there is no value in looking at any other combinations.

The effect of combining top-k and top-p can pretty much be had by using a smaller top-k value on its own, or top-p with a suitable value. And combining all three makes no sense to me at all at the moment, though that may change with a different algorithm.
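
If it helps to see why, here’s one last sketch (made-up logits again, and top-k/top-p values picked arbitrarily). Both candidate sets are prefixes of the same probability-sorted ordering, so one always contains the other and the tighter of the two does essentially all the work.

    import torch

    torch.manual_seed(0)
    logits = torch.randn(1000)          # made-up logits over a 1000-token vocab
    probs = torch.softmax(logits, dim=0)
    srt_p, srt_i = torch.sort(probs, descending=True)

    k_set = set(srt_i[:50].tolist())    # top-k=50 candidates
    n_p = int((torch.cumsum(srt_p, dim=0) - srt_p <= 0.9).sum())
    p_set = set(srt_i[:n_p].tolist())   # top-p=0.9 candidates
    # one set contains the other, so the intersection is just the smaller set
    print(len(k_set), len(p_set), len(k_set & p_set))

In other words, applying both just means the smaller of the two sets survives, which is why a single, suitably chosen cutoff gives much the same behaviour.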

And, without a fair bit more training of the model I see no point in generating longer outputs.

So, until next time, may your experiments prove more fruitful than this one. Though I have to admit, this last bit of effort really helped me get a feel for how these three techniques affected the generated text.

Resources