The amount of news and thought leadership on artificial intelligence (AI) is overwhelming. But in the deluge of information, some important nuances are being lost. As you consider your AI strategy, it’s important to understand a bit about the technical side of how AI models are built.

Here’s one thing to understand:

How Data Is Divided To Build And Test Your AI Model

When starting an AI initiative, you of course need data. You also need that data for multiple purposes: you want to build your model using some of it, and you also want to be able to check whether that model will still work when it is fed new data. The traditional approach is, at the very start of the project, to divide the data into 3 chunks. The first and second are used to train and test your model; there are a number of ways to use these chunks to achieve this. The third chunk is set aside at the start and is never used in the model-building process. This is important, as it is how you will check how well your model works on “new” data, data it has never been exposed to before.

Each chunk is given a name, though different teams use different names. For example, one of us was taught using the labels “train”, “test”, and “holdout.” But Wikipedia’s article on this topic uses the labels “train”, “validation”, and “test.” There is no single, agreed-upon standard for the labels used to describe these 3 chunks of data.
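To make this concrete, here is a minimal sketch of a 3-way split in Python. It assumes scikit-learn is available, uses a built-in toy dataset, and picks illustrative 60/20/20 proportions; the names and numbers are our own assumptions, not a prescription for your project.

# Minimal sketch of splitting a dataset into 3 chunks (train / test / holdout).
# The 60/20/20 proportions and variable names are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Toy dataset standing in for your project's data.
X, y = load_iris(return_X_y=True)

# First split: carve off the 20% holdout chunk at the very start,
# and never touch it again until the final check on "new" data.
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Second split: divide the remaining 80% into train (60% of the total)
# and test (20% of the total) chunks used during model building.
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_test), len(X_holdout))  # roughly 60% / 20% / 20%

However your team labels the chunks, the key point is the same: the last chunk is set aside before any model building begins.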

Why Is This Important?

This is important to understand because it may affect how your personal data and your company’s data are being used by AI tools. Here are two examples from high-profile companies that have been in the news recently.

Example 1: OpenAI and ChatGPT

We all learned in the spring of 2023 that we shouldn’t use proprietary data as inputs to ChatGPT. At that time, OpenAI’s Terms of Use stated that all inputs to ChatGPT could be used to further train OpenAI models. Since then, OpenAI has changed its Terms of Use. As of this writing, section 3c) states “We do not use Content that you provide to or receive from our API (“API Content”) to develop or improve our Services. We may use Content from Services other than our API (“Non-API Content”) to help develop and improve our Services.”

Great – if we use ChatGPT via an API, we don’t have to worry: our inputs won’t be used to further “develop or improve” OpenAI models. Presumably that means neither building AI models nor updating them.

What if we don’t use an API and we just use ChatGPT directly? Those same Terms of Use state “If you do not want your Non-API Content used to improve Services, you can opt out by filling out this form.” Also great – we just need to fill out a form and then we’ll be good. Right?

Almost. When you go to the form, the language is different. Rather than excluding your data from being used to “develop and improve” the Services, the form only states “You can opt out of having your data used to improve our models by filling out this form.” Presumably, that means your data can still be used to build new models, just not to improve existing ones. Now we’re in a rabbit hole of how OpenAI defines “develop” in contrast to “improve” and whether we as users are comfortable with how our data could be used.

Example 2: Zoom

Very recently, Zoom started making headlines when Alex Ivanovs and others raised concerns about Zoom’s Terms of Service, particularly the explicit call-out that Zoom had the right to use Customer Content for uses “including AI and ML training and testing.” After updating its Terms of Service to caveat that this would be done only with customer consent, Zoom updated its terms a third time on August 11, 2023 to state “Zoom does not use any of your audio, video, chat, screen sharing, attachments or other communications-like Customer Content (such as poll results, whiteboard and reactions) to train Zoom or third-party artificial intelligence models” (emphasis ours).

Zoom won’t train AI models with Customer Content. Presumably Zoom can still test AI models on this data and run existing models on this data. This means your Customer Content can still be used by Zoom in ways that are valuable to Zoom. This a) is a bit different from how OpenAI and ChatGPT could use your data, and b) may be a concern if your Customer Content in Zoom contains confidential or proprietary information, personally identifiable information (PII), etc.

Summary

Understanding a bit about the ways AI models are built can be helpful in identifying potential risks of third-party AI models and tools. As always, include perspectives from your privacy, compliance, information security, and regulatory experts.