Large language models (LLMs), such as ChatGPT, are changing the way that software is developed and evaluated. There is considerable hype about replacing programmers with these AI systems, yet recent research has shown that many of the anecdotal stories promoting the use of LLMs in software development are overstated. This talk describes the basic technology behind these models and how the systems gain the ability to generate and evaluate software. One key driver is the training data used to build the foundation models: generated and evaluated code tends to reflect the state of practice in the development projects on which the LLM was trained, and that training data is filled with insecure programming practices that LLMs faithfully repeat. In this presentation, we use data collected over a decade of program evaluation, covering more than 100 million lines of code, to create experiments for assessing the performance of LLMs. Specifically, we analyzed thousands of programs written in C, C++, and Java and had them evaluated by ChatGPT 3.5, ChatGPT 4, Copilot, and ChatGPT 4o. Both secure and insecure programs were evaluated. From these experiments, common themes emerged regarding the behavior of LLMs when evaluating software. We share these results to help practitioners understand when the technology can provide benefits, what risks these systems introduce, and how those risks might be mitigated. We also illustrate how the technology has matured over time and discuss future applications of LLMs for creating and evaluating secure software.
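
As a purely illustrative sketch (not drawn from the dataset or experiments described in this talk), the hypothetical C fragment below shows the kind of insecure idiom that is pervasive in public code, an unbounded strcpy into a fixed-size buffer, next to a bounded alternative; an LLM trained on such code may reproduce the unsafe form.

#include <stdio.h>
#include <string.h>

/* Insecure idiom often seen in training corpora: no bounds check,
 * so a long input overflows the 16-byte buffer. */
void greet(const char *name) {
    char buf[16];
    strcpy(buf, name);
    printf("Hello, %s\n", buf);
}

/* Safer variant: snprintf bounds the copy to the buffer size. */
void greet_safely(const char *name) {
    char buf[16];
    snprintf(buf, sizeof buf, "%s", name);
    printf("Hello, %s\n", buf);
}

int main(void) {
    greet("world");        /* safe only because the input happens to fit */
    greet_safely("world"); /* safe regardless of input length */
    return 0;
}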