Adapting Reality - Building an agent to modify point clouds

Learn how to build an agent that modifies 3D point clouds using natural language. Discover techniques for LLM spatial reasoning, structured scene graphs, and systematic evaluation for reliable performance.

Overview

I’m presenting an agent and natural-language interface that understands and modifies 3D point clouds. Users can ask it to complete various tasks - identifying objects, segmenting regions, even moving elements - and it executes them directly on the point cloud data.

The core challenge is getting an LLM to reason meaningfully over spatial 3D structures, which requires careful prompt engineering, tooling, and a structured scene graph as the agent’s “world model” rather than raw point data. Getting reliable behavior meant running systematic evals: testing the agent across varied phrasings, ambiguous spatial queries, and edge cases like occluded or overlapping objects - then iterating on the tool definitions and system prompt until performance was consistent.

The builder takeaway: treat your data representation as a first-class design decision - the right abstraction layer between the LLM and your domain data is what makes or breaks agent reliability.

Transcript

Generated 3 months ago

Speaker 0: First up, we have Johannes. As for the format, we don't do much of an introduction. We jump straight into the code. Then we have the networking time: you can enjoy the pizza, talk to the presenters, and get to know each other, which I guess you have already started doing. So I'll give the stage to Johannes to come up and give us his presentation. Speaker 1: Okay. Hi, everybody. I get a mic. Okay. Hi, everybody. Speaker 1: Let me set the stage. We're in a rolling mill. It's a steel processing facility where you have cylindrical rolls, furnaces, crop shears for cutting, and steel slabs that need to be processed into their final shape. And all of this is fitted with sensors, so we record everything that goes on here, for purposes of predictive maintenance, for example. And then you have your maintenance engineers or process engineers, and they can supervise this facility with their time series data, some events, and big signal trees. Speaker 1: So these are all the signals that we're recording, quite a lot of them. So the idea was: let's build an agent for them so that they can interact with it in natural language. So we used LangGraph. Speaker 2: Yeah. What was the use case again? Like, why is natural language the right interface for this? Speaker 1: Well, you have this data. And as a process engineer, you have to decide which signal you even want to look at. You have to find it here in the signal tree. You have to find the time range. Maybe you want to compare two signals or correlate between events in the signal. Speaker 1: So there's a lot going into this. You really need to know the system very intricately and, yeah, perform calculations by hand or in code. And it's just much easier to have an LLM supporting you in that. So I can give a quick demo of what this could look like. Is this somewhat large enough? Speaker 1: Yeah. I think so.
So we always start with setting a time range, and then we can ask something like: what's the average difference between the current drawn by rolling stand 1 and rolling stand 2? So, two systems in this facility. And this is quite a complex request. Speaker 1: You know? You need to find two time series and compare them. So it wouldn't make sense to pass this raw data into the LLM, and anyone who's worked with the APIs before knows that it's not really advisable to pass millions of raw data points into an LLM if you want to be able to pay your rent at the end of the month. So what we do is, we have a node that caches all the raw output data from the tools once and then injects it into a Python environment, and then the agent will just write code, read the data only in the Python environment, and then read only the results from it. So it never actually sees the data. Speaker 1: And that way, we can enable complex calculations and, at the same time, also save a lot of token costs. So let's have a look at the traces. So we use LangGraph, and for tracing, we use Langfuse. Basically, it's a web UI to see what happens in the background. So for instance, here you have your agentic loop. Speaker 1: So you have an agent. It can call tools, and you have your node that checks each tool output for raw data. And if there is any, it caches it and loads it into the Python environment. And then the agent will call some tools, for example, here to find out what data sources there are, to read the schema, so what signals there are, and then read the data. Okay. Speaker 1: This is not complete. Speaker 3: So in this example, how many data points did it ingest? Speaker 1: So it's a resolution of about 20 milliseconds, so that would be huge amounts of data. And all the agent ingests is what signals are there, and [inaudible]. I don't know why this trace isn't working. Sorry. Okay. Speaker 1: Yes. Something is not... oh, yeah. There we go.
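The caching pattern described in this part of the talk could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the speaker's implementation: the names `REGISTRY`, `cache_tool_output`, and `run_agent_code`, the signal keys, and the sample values are all made up for the example. Raw tool output goes into an in-memory dict, the model only receives a short description, and agent-written code runs against the dict and hands back a small result.

```python
from statistics import mean

# Hypothetical in-memory registry: raw tool outputs are stored here
# instead of being returned to the model.
REGISTRY: dict[str, list[float]] = {}


def cache_tool_output(key: str, raw_values: list[float]) -> str:
    """Store raw data and return only a short description for the LLM."""
    REGISTRY[key] = raw_values
    return f"cached {len(raw_values)} values under key '{key}'"


def run_agent_code(code: str) -> dict:
    """Run agent-written code that reads from the registry.

    The code is expected to assign its answer to a variable named `result`.
    Plain exec() is NOT a sandbox; it is used here only to keep the
    sketch short.
    """
    namespace = {"registry": REGISTRY, "mean": mean}
    exec(code, namespace)
    return namespace["result"]


# Made-up sample data standing in for two current signals.
cache_tool_output("stand_1_current", [100.0, 102.0, 104.0])
cache_tool_output("stand_2_current", [95.0, 97.0, 99.0])

# A string of Python code, as an agent might write it.
agent_code = """
a = registry["stand_1_current"]
b = registry["stand_2_current"]
result = {
    "mean_difference": mean(x - y for x, y in zip(a, b)),
    "signals": ["stand_1_current", "stand_2_current"],
    "method": "mean of pointwise differences",
}
"""

answer = run_agent_code(agent_code)
print(answer)
```

Only the small `answer` dict would be passed back to the model, together with the signals used and the method, so the raw values never enter the prompt.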
And then this is the final, most important tool call. So, sorry. Speaker 1: The UI doesn't format it very nicely. But, basically, this is a string of Python code that the agent wrote, and it understands what the data looks like. And it can read it from the registry and then basically output a dict here, which contains the answer to the question. So, for example, this Speaker 2: It's Speaker 1: an in-memory cache. It's just an in-memory dict right now, but, yeah, it's a prototype. And this is the answer we get to our initial question, which is 5.4 ampere. And, yeah, as you can see, it's sort of a black box. You know, the agent does something. Speaker 1: It calculates something in code. We can't really know whether it's correct or not. So I put an emphasis on having it tell us what the data basis was, so what signals were used, and also what the method was, so that as a process engineer, you can check whether this makes sense and spot any possible mistakes. Yeah. Speaker 1: So this works quite nicely, and we can do really complicated data analysis requests. The main failure mode that we had was that it wouldn't know which signals to pick. So as I showed before, there's a lot of them. They usually have weird names like RPM, FM4 act. You know? Speaker 1: You don't really know what to do with it. So the solution to this was to add memory, so the agent can write memory facts about specific channels or subsets of channels. So, basically, once it picks the wrong channel and you detect it as a user, or it experiences the failure itself, it will write a correction and then never repeat the mistake. So that was how that was solved. Speaker 1: And another important thing during development was that you will iterate on lots of things. You will change your system prompt. You will change the model you use. You will add capabilities, take abilities away.
So what I would really recommend if you want to build something like this is to have an evaluation pipeline, and this could look something like this here in Langfuse. Speaker 1: So you have a question, and then you define a correct answer and then have an LLM as a judge score it on correctness or conciseness or whatever you want. And we also added some custom evaluators. So for example, here we said, okay, you need at least 4 tool calls to answer this request, and then you can score it on how many more it needs. So you want a minimum number of tool calls. Speaker 1: So this would be this evaluator. Just to give you a quick look: it would read this minimum number and then count the actual number and give you a score on that, and you can do many more evaluators. For example, checking if tool calls were successful or if the code that was written ran successfully and stuff like that. So, yeah, that's how we built this thing. Yeah. Please. Speaker 4: Given that the LLM produces Speaker 5: This Speaker 4: arbitrary code and runs it on a machine which has access to the production system, how do you ensure that the code is not malicious, and do you have some guardrails against this generated code? Speaker 1: Yeah. Definitely important considerations. You can never be sure that you're not prompt injected somewhere. So the solution is to sandbox it. So, like, on a virtual machine or in a cloud, so that the code never runs on your actual system. So in this case, it's not sandboxed, but it's also still a prototype. Speaker 1: But yeah. Don't prompt inject me, please. Yeah. So, yeah, another question. Speaker 3: Yeah. Perhaps I should ask you offline for a few more details, especially around the topic of processing the time series data. Say you were to ask a question like, please find out when, in the time frame of last week, my mill, my plant was not working properly or was close to overload or close to creating garbage or something.
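The custom evaluator mentioned above (read a minimum number of tool calls, count the actual number, produce a score) might look something like the sketch below. The trace format, the function names, and the scoring formula `min_calls / actual_calls` are assumptions made for illustration; the talk does not say how the score is computed, and this does not use the Langfuse API.

```python
def count_tool_calls(trace: list[dict]) -> int:
    """Count the steps in a trace that are tool calls.

    The trace format (a list of dicts with a "type" key) is hypothetical.
    """
    return sum(1 for step in trace if step.get("type") == "tool_call")


def tool_call_efficiency(trace: list[dict], min_calls: int) -> float:
    """Score 1.0 when the agent needs exactly the minimum number of calls.

    The score shrinks as more calls are used. Fewer calls than the stated
    minimum score 0.0, since the request cannot have been answered properly.
    """
    if min_calls <= 0:
        raise ValueError("min_calls must be positive")
    actual = count_tool_calls(trace)
    if actual < min_calls:
        return 0.0
    return min_calls / actual


# Made-up traces: one uses exactly 4 tool calls, the other uses 8.
tight_trace = [{"type": "tool_call"}] * 4 + [{"type": "llm_message"}]
loose_trace = [{"type": "tool_call"}] * 8 + [{"type": "llm_message"}]

print(tool_call_efficiency(tight_trace, min_calls=4))
print(tool_call_efficiency(loose_trace, min_calls=4))
```

A deterministic evaluator like this complements the LLM-as-a-judge score: it is cheap to run on every change to the system prompt, model, or tool set.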
How would that then work? Because if you have 20 millisecond spacing and you have, I don't know, a couple of hundred channels or so Yeah. Speaker 3: That's a lot of material that the system would need to ingest, compute, and Yeah. make sense of. Speaker 1: Yeah. So this is kind of an open question because it could refer to any of the channels. And in this example, it's, like, 700 signals, so it would have to search through all of them. Some of them are very high resolution. So one solution we built is that there's also the ability to fetch not raw data but aggregated data, like the mean over a certain time span. Speaker 1: So But I wouldn't want to do that myself. Yeah. The agent can do that itself. So it has access to raw data or aggregated data. But, yeah, probably on, like, a very open question, it's hard. Speaker 1: You'd have to define a little bit what an anomaly could look like. It would have to know. Speaker 3: You should find out automatically Speaker 1: what an anomaly is. Yeah. I think it's a big topic, what an anomaly actually is. Yeah. Yeah. Speaker 1: Yeah. So you can't use it for, like, Speaker 0: yeah. One last question, and I guess you can talk more in the networking time. Speaker 1: Yeah. But I would say, to that point, you can't do everything with it. Like, this is sort of suited towards helping navigate what signals there are and towards more constrained questions. And if you wanted to use it for anomaly detection, for example, which is a big topic, you would definitely need to expand the tools. And maybe I'd say it's best not to use an LLM for that, but to build your own model to detect anomalies and then make it available to the LLM as a tool. Speaker 6: Can I use this agentically, in that I define a question? I say I have a 5 sigma event coming, but before that, there's a small ripple.
Please write me a filter or, like, some function that continuously checks the 20 millisecond data set and alerts me to the small ripple before my plant burns down 20 minutes down the line. Speaker 1: Yeah. So this couldn't, I think, because it doesn't run automatically and you cannot schedule it to run, like, every 20 minutes, which is what I got from your question. Speaker 6: So I cannot put in a filter like a cron job or something? Speaker 1: No. No. Okay. No. You couldn't. Speaker 1: But it's a good idea. Yeah. Speaker 0: Thanks. Thank you so much. And okay. Next up, we have Armin. And as Johannes didn't have the right setup, we went a little over, so he could use the time. Speaker 0: And please keep your questions for the end of the talk. We will have some time for you to ask your questions. Thank you. Speaker 2: Alright. Hi, everybody. Just a quick question. How much time do I have? I'm totally unprepared. Speaker 2: 5 to 7. So Speaker 0: we can get a better look. You could have also used the mic for hands-free stuff Speaker 1: if you Speaker 7: want that too. Speaker 2: Before we start, just let me know if you would like the other mic. Speaker 7: Great. Speaker 1: I don't Speaker 2: care about the mic. I care more about showing you guys the right screen. Speaker 0: It usually acts as a secondary screen. If you mirror, then it would show what you already Or Speaker 2: I can maybe Speaker 0: Yeah. Okay. Speaker 2: Move it. Okay. Yeah. Because we don't have so much time, I'll just quickly... okay. I don't see anything. Speaker 2: So yeah. So what we do, we do, oh, how do I take it? Alright. So my name is Armin. I've been working in AI for nearly 8 years now. Speaker 2: My last stint was as head of AI at IU Group.
And one of the things I learned about making AI work in enterprises is that there are unique challenges, and maybe I'd like to focus on that so you guys get a sense of the difference between building an AI agent for a small prototype and building an AI platform that actually works in an enterprise context. And besides all these boring topics that come on top, like filtering of personal information or authentication or security or logging that's GDPR compliant and stuff like that, there are, I think, some important aspects to consider. Typically, when we looked at how companies approach AI prototypes, they start with some agent that can do some tiny tasks, some basic RAG on some existing data. Speaker 2: And that's actually where the problem starts. Right? Existing data. It doesn't exist. And very often it is also not very clean. Speaker 2: So the result is that companies build, let's say, a chatbot that knows about the policies, and users may have use for that one day in a year or so. So people don't really use it and adopt it. And if companies want to actually build agents that are much more potent, for example for service identification or IT service desk automation, they often get into the situation of: okay, we don't have the data ready. So, actually, we'll have a very big data overhead project running. So what we thought is that for an AI agent to work in existing companies, the optimal pipeline is, first, that it needs to be able to work with existing data no matter how fuzzy it is and no matter how unprepared it is for AI. Speaker 2: And the second thing is that what it needs is kind of like a RAG on humans. Right? So most knowledge is not available in data, but is stored in the minds of human beings. So for an AI agent to really work, it also needs to be able to ask humans and learn from humans or to delegate tasks to humans.
And what we did, just as a short summary, is we built an AI enterprise platform. Speaker 2: Here you see the integration in Microsoft Teams. It's kind of like the interface. I'm sure I have enough time to show you some examples, but what we did here is... oh, that's a terrible way to work if you don't mirror it. It doesn't really support zooming out here. Damn. Speaker 2: Maybe I can drag at least. Hey. I can. So, sorry for the... what you see here is what we do: we take existing data. Here it's our own Microsoft 365 environment, and we build a big knowledge graph. Speaker 2: We do graph RAG in that sense, but it's different from how other people build graphs based on existing data. The important part, really, to make wild enterprise data work without preparing it, is that you need to understand that in a classical agent, if you set up an agent, you give it some data that is specifically selected for this agent. So when the agent finds data based on a question, it kind of always assumes, well, I found it. It's relevant. Right? Speaker 2: And that's very true in a very narrow case. But in an enterprise setting, you find tons of data that is actually not relevant to the question. Right? So what you need is for an AI to be able to judge the data that it has. And information is not encoded in the raw data in documents, but in the context the documents live in. Speaker 2: Right? So this is actually what we built. We built kind of a context graph, where you see here that this group is a Microsoft Teams group. I'm a member of it. The group has multiple channels. Speaker 2: The group is, by the way, our go-to-market group. This channel is, I can't read it, Competitor Research. Yeah. This channel has folders, and these folders have files.
Speaker 2: And so what happens is that if the AI, for example, finds any statements in files here, then it will understand that they are actually not statements made by us, or that the content of these files is not relevant to questions that concern internals of our company, because it understands the context. Right? It understands, oh, these are actually documents that are probably from competitors and not from the employee asking about it. So this is one way we structure our architecture to deal with the wild way that data appears in enterprises. And the second thing is that we use these graphs to also be able to find the right person to address with questions or with tasks, depending on the data. Speaker 2: Right? So just to give you an example, we are also building these kinds of knowledge graphs for a customer currently. It's an IT service desk case where you have tons of historical tickets, you have comments about tickets, and, of course, you have people associated with the tickets. And these accounts are associated with teams, right, IT teams. So what we offer there is a Teams-integrated chatbot that will be able to automate, like, 50% of the typical IT questions that come in, like, I can't log in, my password needs to be reset, or whatever. Speaker 2: It knows about the self-services that exist. It knows about the knowledge that exists from wikis, which is basically nothing. And it has all this knowledge from historical tickets. And because it has this structure, it can not only use this knowledge to directly answer the user, but it can also identify the right team to assign a ticket to when it creates a new ticket for the user, which is also one thing it does. Yeah. Speaker 2: So this is the second reason why we chose this architecture, building these contextual knowledge graphs: because you can use that in multiple setups. I already gave you one.
We also do it quite generally, where we enable the AI agent to start a group chat, on behalf of the user who is asking something, together with the right participants from the company, so that not only does the user get an answer, but the whole solution also learns by growing this knowledge graph, basically. And this is what we believe is a very nice way, because you can basically start out without preparing any data. You can just, yeah, have an AI agent that will automate whatever it is capable of automating and then forward everything else to humans. Speaker 2: And then we continuously learn from what comes in. So you continuously and strategically build your own knowledge base for your RAG agent. I think before I go deeper, I'll just see whether you have any questions. Speaker 8: Yeah. Speaker 0: Okay. Okay. Speaker 7: Can I think of this as an agent that has access to tools, where the number of tools, or the range of tools, depends on its position in the graph? Speaker 2: The tools don't depend on the graph. Actually, one of its tools is a knowledge graph tool, basically, which enables it to search the knowledge graph. And how search works is that we start out with semantic similarity, basic RAG. Right? So you need to find initial nodes in the graph, and then we traverse along the nodes by certain routes. Speaker 2: Right? So if we find a file, and a file is, of course, also hierarchically structured into big chunks and small chunks, and small chunks are part of big chunks. So if it finds one of the pieces, depending on what it finds, it will aggregate all the other nodes around it so it really understands: okay, I found this excerpt from the file that belongs to this chapter of the file. The file is summarized in the summary node, and this is in the folder that has a summary node. The file was created by user x y z 3 years ago, and the user is actually team lead of whatever team.
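The search described here (find an initial node by similarity, then walk the graph to gather its surrounding context) can be illustrated with a toy example. Everything in the sketch is an assumption made for illustration: the node IDs, the texts, the parent-only edge structure, and especially the word-overlap score, which stands in for the embedding-based semantic similarity a real system would use.

```python
import re

# Toy context graph: each node has a short text, and each child points
# to its parent (chunk -> file -> folder -> channel -> group).
NODES = {
    "group:gtm": "Go-to-market group",
    "channel:competitors": "Competitor Research channel",
    "folder:reports": "Folder with collected reports",
    "file:pricing": "File summary: pricing overview",
    "chunk:pricing-1": "Excerpt: enterprise plan pricing table",
}
PARENT = {
    "chunk:pricing-1": "file:pricing",
    "file:pricing": "folder:reports",
    "folder:reports": "channel:competitors",
    "channel:competitors": "group:gtm",
}


def tokens(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def find_seed(query: str) -> str:
    """Return the node whose text shares the most words with the query."""
    query_tokens = tokens(query)
    return max(NODES, key=lambda node: len(query_tokens & tokens(NODES[node])))


def collect_context(seed: str) -> list[str]:
    """Walk from the seed node up through its parents to the root."""
    path = [seed]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


seed = find_seed("enterprise plan pricing")
for node in collect_context(seed):
    print(node, "->", NODES[node])
```

With the path attached, a model can see that the matching excerpt sits under a competitor research channel and weigh it accordingly, which is the point made about judging data by its context.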
Speaker 2: And then it will be able to holistically evaluate whatever it found there. Alright. Got it. Speaker 5: Thank you. Speaker 2: But there are also other tools, of course, depending on the context. For example, we do have this ticket creation thing that we did in Jira already, and then Matrix42. So there are some tools that come with that. And, yeah. Speaker 9: And, so just taking the view from the previous question. So this graph, which has been created, is an input to the agent, and then the agent responds to the question based on this graph. So you start with creating this graph at an enterprise level, but do you have an automated way to build this up, because this is dynamically changing every day? Speaker 2: Yeah. Yeah. We are syncing systems. Right? So out of the box, everything in Microsoft now works, of course, SharePoint, Jira, Confluence, like, a little bit. Speaker 2: So we would still have to put some work into it. Matrix42, as I said, is already implemented. Yeah. So it's basically just a syncing job that runs every x minutes or hours, depending on the configuration. Speaker 0: Any more questions? Okay. Then, I guess we are finished or Speaker 2: Yeah. I mean, we don't have so much time, right? Speaker 0: Yeah. That's Speaker 2: So, yeah. If you have any questions, just tell us. Please. Feel free. Speaker 0: The networking time is for such things. Thank you, Armin. Next up, we have Alex. Speaker 5: Does that work? Yeah. Oh, it does work. Perfect. Yeah. Speaker 5: I'm just trying to do that now. Yeah. Alright. Who of you guys knows what a point cloud is? Point cloud? Speaker 5: Not everybody, right? So it looks like this. Here's a bit of a better view of the whole thing. So this is a point cloud. It's made from a laser scan. Speaker 5: So you know what a lidar is, maybe? Like cars, cars have that, and robots and everything. So you can use these.
It's like a small laser emitter that emits laser beams and measures reflections. You can measure distances with that. Speaker 5: But you can also use that to scan a room like this. So you take this laser scanner, you basically wave it around like so, and then out comes a point cloud that looks like this. So you have quite an accurate representation of the environment around you. And you can use that for all kinds of use cases, for example in architecture or in construction. They use that a lot. Speaker 5: So you have a representation of a building that you are building, then you document progress and everything. And I was thinking a while ago. So I was working on a research project where I had to deal with point clouds. And I ended up visualizing them here in this tool called CloudCompare. So this is not what I built. Speaker 5: Yeah. This talk is not about CloudCompare. I had my console open next to me, and I was using Claude Code there. So usually, how you deal with this kind of data is that you have a lot of proprietary software around. It's usually Windows based, and it allows you to modify point clouds like this. Speaker 5: It can do all kinds of operations, like cutting things out and doing analysis and stuff. It's usually quite tricky. You can use tools like CloudCompare, or there are Python libraries for that. So there's a big ecosystem in Python and also C and C++ to deal with that. It's basically all 3D. Speaker 5: It's also very much related to 3D data. And I'm not an expert in that. But I had to work with these and I had to modify them. So I was using Claude Code to write little scripts to modify them. Is that okay with the mic? Speaker 5: Because it's a bit weird. Okay. I'm beeping a bit. Let me know if not. And I was basically sitting here like this, with Claude Code on the side.
Speaker 5: And doing little adjustments there. So I thought, why not write an agent that can do the same thing? Right? So I write a little agent that can modify point clouds for me. But there's one problem. Speaker 5: One big problem there, which is that LLMs cannot deal with point clouds. Right? Because the rule of thumb for what LLMs can deal with is usually: if a human can read it, then an LLM also understands it. So it's mostly text, or also audio and video. But point clouds are just a bunch of lines of numbers. Speaker 5: Right? If you look at a point cloud, and I'm going to show you in a second what it looks like, you're not going to know what this actually is. So this is how a point cloud usually looks, you know. You just have coordinates. These are the coordinates for each point. Speaker 5: This is a coloring. And this is a semantic label here. Yeah. So this tells it, like, this is a chair. This is a table. Speaker 5: You don't need that. Yeah. This is additional. So, yeah, at least you need these coordinates. So when you look at this, you have no idea that this is a room. Speaker 5: Right? It's just lines of stuff. And the same goes for an LLM. So this is a bit tricky. You cannot just throw that into ChatGPT and have it tell you something. Speaker 5: So this was a challenge I had to solve. Like, how can I make an LLM, and therefore an agent, work with point clouds? And what I ended up with is this here. So this is a bit inspired by other tools, like Barb maybe, or Cursor, where you have your artifact on the one side and then a chatbot on the other side. And you can see I've already loaded this point cloud in here. Speaker 5: And you can see this is a bit upside down. So when I try to work with it, it's a bit wobbly. So the reason is, you see down here, it shows a coordinate system. The Z axis is up.
So this was one problem I had. Speaker 5: And this is one little use case, so I can show you what it can do. So first thing is we're going to ask it, just as a little warm-up here: What do you see? And I'm going to show you one or two use cases of what it can do, and then we go into the code, so we can look at the code and a few key problems that are solved there. And if we have some more time, then we can also look at some other use cases. Speaker 5: Because this is a pretty easy task already, what do you see? Obviously. And you see it does some stuff. And this toggle here shows me all the steps it takes. It does some. Speaker 5: Right? It's mostly Python code, so it is very much based on writing Python code, like the first speaker also showed, I think. It reads the file here and does stuff. Then it comes back with a nice analysis. So it says, okay, I have this point cloud here. Speaker 5: And usually, depending on what model you use, it already comes up with some idea of what this could be. So I think this... it's a large indoor point cloud. So it understood, for some reason, that this is a room, you know, which I find pretty cool, actually, because there's no information in there that this is a room. This is really just these lines of points. Speaker 5: So it's not so bad. It already understood that there are walls, floors, ceilings, furniture in there, which I think is also pretty cool. And then we can also ask it to modify this point cloud. So we can say, modify the point cloud so that the y axis is up. Yeah. Speaker 5: So I'm using this for you. I'm too lazy to type. And then it should come back and write some Python code and modify this point cloud. So it understood that. And in the meantime, while it works, we can look in LangSmith. Speaker 5: So we saw Langfuse, and this is LangSmith.
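The request "modify the point cloud so that the y axis is up" boils down to a fixed rotation of every point. A minimal sketch, assuming the cloud is Z-up and right-handed and that points are plain `(x, y, z)` tuples (the function name and the sample values are made up for illustration; the code the agent actually wrote in the demo is not shown in the talk):

```python
Point = tuple[float, float, float]


def z_up_to_y_up(points: list[Point]) -> list[Point]:
    """Rotate by -90 degrees about the X axis: (x, y, z) -> (x, z, -y).

    The old Z (height) becomes the new Y. Negating the old Y keeps the
    coordinate system right-handed instead of mirroring the scene.
    """
    return [(x, z, -y) for x, y, z in points]


# Made-up sample: one floor point and one ceiling point of a Z-up room.
room = [(1.0, 2.0, 0.0), (1.0, 2.0, 2.5)]
rotated = z_up_to_y_up(room)

# After the rotation, height is carried by the second coordinate.
floor_height = rotated[0][1]
ceiling_height = rotated[1][1]
print(floor_height, ceiling_height)
```

The rotation itself is simple; the harder part is knowing which direction was really up in the source data, since the coordinates alone do not say which end is the floor. Rotating by +90 degrees instead would leave the cloud upside down.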
So it's a similar tool to Langfuse, which I happen to be using. I have an opinion about it. You can ask me if you want to. But it works pretty well. Speaker 5: So we see exactly what it's doing right now. And you see Python REPL calls Speaker 7: here. And, Speaker 5: yeah, it's been surprisingly slow. Oh, it should be okay now. And it does versioning here, so each time it needs to write a new version of that cloud, and it loads it now. And you see, this is one of the major problems that I have here: point clouds are quite large. So this little room is, like, 70 megabytes already. Speaker 5: And that is considered a very, very small cloud. So usually it's easily 8 gigabytes. You know? So now we have our point cloud here, somehow upside down. Don't know why that happened. Speaker 5: Okay. Weird. But I think it flipped it on its head, you know, because it probably has no idea which of these are the floor and the ceiling. So it doesn't know which way is up and which is down. You know? Speaker 5: So, well, way to go here, I guess. It's also the first time that this didn't work out, obviously. So after doing it 20 times, the 21st time, showing it to you people, it's now freaking out. And we can also do some other stuff, like: remove the ceiling and color the floor darker. So let's see if it understands what the floor is. Speaker 5: Okay. While we wait, we can have a look at the code to see how this works. So as I told you, the major challenge is that you cannot just, as with a text file, throw it into your LLM and have it do something with it. So the whole application is built like this. The application layer is a Next app, which I have 99% vibe coded, to be honest. Speaker 5: So there's not so much code writing in that area for me. But the agent here is mostly written by me. So this is the center part here. And I'm using, I can scroll down a bit.
I'm using LangGraph's create agent. Speaker 5: Which one is it? That one. This is the standard agent that LangGraph provides. I don't know how many of you are working with LangGraph or LangChain. Probably a few. Speaker 5: So luckily, since, like, half a year ago or the end of last year, you don't need to write your ReAct agents yourself. So this comes out of the box. And you see that we are providing a couple of arguments here. We are providing some middleware: the file system middleware for file reading and the to-do list, which are also built in from LangGraph itself. And then we have this dynamic model middleware. Speaker 5: And the way this middleware works in LangGraph is that the agent has a lot of hooks at certain points in its life cycle. It just calls back into these middlewares. In this case, each time before it calls the model, it calls this function and sets the exact model it wants to call, because it then goes to OpenRouter, and OpenRouter routes it to the model provider. So we can get Claude models and ChatGPT models and whatever we want. And then you see it has a bunch of tools that it can use. Speaker 5: So it has a Python REPL, which is like a Python console for writing code. It has these calculate clearance and apply point cloud transformation tools, which are fixed scripts for doing specific operations on the point cloud, like moving stuff or calculating distances. Because this is code that's, like, one and a half pages or one page sometimes. And it's very easy to just give it this as a tool. And this is a highlight location tool, which can highlight stuff on the point cloud for users to inspect. Speaker 5: And I see that Riza is already getting here, pushing a little bit. If you want, we can go over it in detail a bit later because it's a bit comprehensive. But do you see what it has done here? Uh-huh. You see?
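As an illustration of what a fixed-script tool such as the calculate clearance tool mentioned above might compute, here is a brute-force sketch that returns the smallest distance between two groups of points. The definition of clearance used here, the function signature, and the sample coordinates are assumptions; the speaker's actual tool is not shown in the talk.

```python
import math

Point = tuple[float, float, float]


def calculate_clearance(object_a: list[Point], object_b: list[Point]) -> float:
    """Smallest Euclidean distance between any point of A and any point of B.

    Brute force over all pairs, so the cost grows with len(A) * len(B).
    """
    if not object_a or not object_b:
        raise ValueError("both point sets must be non-empty")
    return min(math.dist(p, q) for p in object_a for q in object_b)


# Made-up sample objects: two points of a "chair" and two of a "table".
chair = [(0.0, 0.0, 0.0), (0.0, 0.5, 0.0)]
table = [(3.0, 4.0, 0.0), (10.0, 10.0, 10.0)]

print(calculate_clearance(chair, table))
```

Wrapping a routine like this as a tool saves the agent from rewriting a page of geometry code on every request. For clouds with millions of points, the pairwise loop would need to be replaced by a spatial index such as a k-d tree.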
Speaker 5: It has removed the ceiling, and the ceiling, it thinks... yeah. It has flipped the whole thing upside down. So it believes this is the ceiling here and this is the floor. Otherwise, it would have worked. Too bad, but that's how it happens. Speaker 5: Good learning, actually. Maybe next time, try with Opus. Maybe that's better. Speaker 0: Thank you so much. Speaker 5: It can do much more stuff. It can also move objects around and everything, but maybe that's for next time or for later. Yeah. Semantic understanding in point clouds is super fascinating. It's not so easy. Speaker 5: You know? Speaker 0: You can also use the mic. So, questions? Speaker 10: I'm not too deep into point clouds myself, but isn't there software for generating this that also kind of figures out what is the ceiling and what is the floor, also to kind of remove noise, and also for these tables? So couldn't this information also be used, from, like, classical ML, I would say, that kind of created this? Isn't that information already available, and wouldn't it then improve how the agent works with it? Speaker 5: Yeah. So the second part of that demo would have been putting in a point cloud that is already completely labeled, and then you can do a lot more stuff. Speaker 1: Mhmm. Speaker 5: Yeah. That's right. But, actually, if you have a perfectly segmented and labeled point cloud, then this becomes way easier. But this is also the hard part. And also you have to see, this point cloud here is super tidy. Speaker 5: So it's very nice. So Speaker 2: it's just Speaker 0: it has Speaker 10: using the coordinates and then figuring out where the stuff is, and some Python code to group stuff, probably. Speaker 5: That's what this does. Yeah. Speaker 10: Yeah. Okay. Yeah. Yeah. Yeah. Speaker 10: So, impressive. Speaker 0: We can do one thing. 
If you don't have any more questions, maybe you can, after the questions, briefly show us what it would be like when it's labeled. Yeah. I'm curious. I don't know. Speaker 0: Do you wanna see it? Speaker 1: Yeah. Just a quick question on the same thing, on the understanding of the point cloud. Have you considered sending a screenshot of the workspace alongside every prompt so the model can also... Yeah. Look at the Speaker 5: That's actually something I'm working on right now. Yeah. Okay. And this works incredibly well. I've tried this a couple of times, and this improves the understanding so much. Speaker 5: You can actually extend this a little bit. I can take multiple pictures from multiple sides. It's not so hard for it, and it has a much better understanding of what it is. And the next thing I would like to try is to use something like SAM, Segment Anything, for segmentation, because you could do that. So this is actually a common workflow: to take a picture, segment it, and then project it back into 3D. Speaker 5: Something you can do. Yeah. It'd be more tricky. Speaker 0: So still not... Please quickly show us where we would be. Speaker 5: Yeah. Let me delete all that stuff because there's no data upload yet. It's, like, not so convenient. We delete that stuff and load a point cloud that is already labeled. And you see this has labels. Speaker 5: It has a semantic label and an object ID. So this knows that each point belongs to a door, for example, or to a table or to a chair. And now we can see what happens if you open that. If anyone knows how to write a desktop app that does the same thing, especially packing LangGraph into a desktop app, then let me know. Claude will know. Speaker 5: Claude will know. Yeah. That's true. But this is like, yeah. There are a couple of considerations there. 
Speaker 5: So what do you do with instrumentation, for example? How do you get Python code running on this? I don't know. In this industry, actually, in this scene, a lot of applications use CUDA as well. And this is all very desktop-heavy for that reason. Speaker 5: Because if you have, like, hundreds of megabytes or gigabytes of data, then it's very hard to work on the web. They're just inventing streaming of these huge point clouds. Okay. So now, there it is. We can ask it, what do you see? Speaker 2: Where's the interface? Speaker 5: Which interface? That's from Claude Code. Okay. Yeah. I think in the beginning there might have been some... LangGraph has some little chatbot. Speaker 5: I think that's where it started, but then it changed all around. Why do you ask? Is there any Speaker 2: Just curious. Looks good. Speaker 5: Yeah. Okay. Now it says indoor room, and it could count all the objects in there. 8 tables, four pieces of furniture, 1 door. Obviously, because it can read all the labels. Speaker 5: Yeah. So it's much easier. But it still probably had to write Python code to analyze that. And then we can also say, get the distances between all tables and find those that have less than 1.2 meters Speaker 0: clearance. Speaker 5: Clearance. So okay. That is actually the use case where it started because, as I told you, this is from a research project. And the task in this project was to ensure accessibility compliance in these rooms: make sure that there is enough spacing between all the tables. That's why I'm doing this 1.2 meters. Speaker 5: And it already highlighted violations here. It says, like, okay, these tables are too close together. Yeah. Could be okay. And then it can also, yeah. 
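The clearance query from the demo (distances between all tables, flagging pairs under 1.2 meters) can be sketched in a few lines. This is a minimal illustration under assumptions, not the talk's actual calculate clearance tool: `min_gap` and `clearance_violations` are hypothetical names, each table is reduced to a handful of 2D floor-plan points, and the brute-force pairwise search would be far too slow for real clouds, where a spatial index would be the usual choice.

```python
from itertools import combinations
from math import dist


def min_gap(points_a, points_b):
    """Smallest point-to-point distance between two objects (brute force)."""
    return min(dist(p, q) for p in points_a for q in points_b)


def clearance_violations(objects, required=1.2):
    """Return (name_a, name_b, gap) for every pair closer than `required` meters."""
    violations = []
    for (name_a, pts_a), (name_b, pts_b) in combinations(sorted(objects.items()), 2):
        gap = min_gap(pts_a, pts_b)
        if gap < required:
            violations.append((name_a, name_b, gap))
    return violations


# Toy floor plan: each table is given by its four corner points, in meters.
tables = {
    "table_1": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    "table_2": [(1.8, 0.0), (2.8, 0.0), (2.8, 1.0), (1.8, 1.0)],
    "table_3": [(5.0, 0.0), (6.0, 0.0), (6.0, 1.0), (5.0, 1.0)],
}
violations = clearance_violations(tables)
```

In this toy layout only `table_1` and `table_2` are flagged, since their closest corners are 0.8 meters apart.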
Speaker 5: It already suggests that it applies fixes. And I think it does that because it has a tool that can move the objects. So it infers, I guess, from the tool prompts it has, that this is something it might do. I don't know if it would suggest that if it didn't have that tool. Let me try it. Speaker 5: So it has now applied the transformations, and it should come up here. It has created a new point cloud. And it's loading that. I need to wait for it to load. Cannot click this. Speaker 5: Now I can inspect the differences between the 2 clouds, and I can say accept or reject if I want to do that. See? So the blue one is the old one and, yeah, the brown ones are the new locations there. Now I could say, like, oh, this is cool. I accept it. Speaker 5: Otherwise, reject and make it new. Speaker 0: So thank you, Alex. I know we stretched a little here. You might be a little tired. We have 2 more talks, and then you're free to ask more questions, network, and enjoy some pizza. So next up, we have Felix. Speaker 7: Test, test. Nice. Oh, we are looking at this. Is it... yeah, give me a second. It's nice to be here. I have one question: who of you has never had a Pokemon card in his or her hands? Okay. I have some, I can pass them around just in one... So they're cheap, so don't bother taking them with you. Okay. Speaker 7: So, the following story. I had this box of trading cards, Pokemon cards, in the attic of my parents' house. And I wasn't sure if that's actually worth anything or not nowadays. It's from my youth. And so I figured I will just vibe-code an app on the weekend to kind of help me figure out if they're worth anything. Speaker 7: It was more than a weekend. And it had some requirements. So I don't wanna pay API costs. So, yeah, ideally, the solution should run on my Mac. Should run fast. 
Speaker 7: I don't wanna wait, like, 4 minutes to get an identification and pricing of my card. I want it to be less than 30 seconds per card, so I can put a card under my camera and continue. So I have a human-in-the-loop component built in. Just if the AI makes an error, we can correct that. But it shouldn't be too much work, or else Speaker 7: there's no gain in efficiency. Yeah. I want the approximate price for each card. And I don't wanna first collect training data and train a model because, I don't know, I thought nowadays, with all the open source vision language models, that should be possible. Speaker 7: However, it's actually not super easy. The thing is, it's hard to figure out how much a card is worth. So first, there are a lot of Pokemon cards. It's like 18,000 in the English language alone. They come in different languages, like in sports. Speaker 7: There were German, English, and Japanese cards. And they have a rarity symbol at the bottom here, in the bottom right corner. So it's a circle if it's very common, a diamond if it's uncommon, and a star if it's rare. But these are not indicative of the value of the card. Speaker 7: So in this example here, the right card is a rare card. It has a star at the bottom, but it's actually worth, like, a euro maybe. The left card here, there the illustrator forgot to paint clothing on this young lady. So this card is worth much more in the collector scene than this right card. There are cards that look the same, but they have a small difference. Speaker 7: So here, it has a small stamp. It's called edition 1, a first edition card. And this also usually means it's worth much more than a card that doesn't have the stamp. So OCR alone doesn't do it, because you also have to figure out these small stamps that are on the card. And then, most importantly, in school, I played with these cards, so they are all of different quality. 
Speaker 7: And this also has a huge effect on the value of a card. And there's, like, a scale. It goes from mint to, like, poor. And here's one example of a poor card that has, like, a wedge here and has very white borders. And then, here's a very good quality card. Speaker 7: So here you also see it's important to have scans of the front side and the back side, because you wanna figure out the quality. And then there are also special cards and promo cards. So, yeah, it's not an easy task. However, I coded a small solution I wanna show you. It basically uses a Continuity Camera. Speaker 7: That's an Apple feature where you can use the camera of your phone from your Mac if it's close. I've got a small stand of LEGO bricks. And then here, yeah, you see the camera from the phone. I put a card underneath it, and I press space here. Speaker 7: You go to the other side. Press space. And then let's jump back to the terminal. It does some processing. I dug a bit deep into this topic here of, like, vision language model prompt caching and so on. Speaker 7: So it does multiple steps. At first, there's some computer vision preprocessing using OpenCV. It uses an open-world object localizer to find the card in the picture so we can cut it out. And then we do OCR. I use Qwen 3.5 to get all the text on the card. Speaker 7: And then, depending on the text, I do a lookup in the database, and then I do some subsequent steps. I will show you in a second. So here we see the results. It's at the bottom here. It finds Nidorino. Speaker 7: This is actually a German card, the original. And it gives you some detections. And then you can say edit or save. Here, it seems correct. So let's press save. Speaker 7: And then it uses some API to get a price. Here, that's, like, a 2-cent card. Alright. Let me jump. Here, I wanna say a few words on the solution, if I may. 
So, as I said, we have the front and back scans. Speaker 7: We do some preprocessing with the open-world object localizer. We crop, rotate, do some contrast correction. If it's blurry, we directly say, well, we can't deal with that. And then comes the interesting part. So, we do a lot of steps given the same front image. Speaker 7: And if you structure the prompt nicely, you save a lot of time in the processing, because what you can do is first pass the front image and then your different subsequent prompts. And then you can cache the KVs in the model, and the follow-up prompts will be much faster. This saves you a lot of time. And then only here, in stage 4, where we do the grading, you pass the back image as well. So first you have OCR, parsing, database lookup. Speaker 7: If it's a promo card, then we have another call. If anything fails, we do a correction call, and then we check for this first edition stamp on the card, then do the grading, and then, as you saw, human-in-the-loop, price lookup, and then the database. I wanna show you 2 numbers before I finish. So the prompt caching here, I can actually also run it. So, as I told you, you can first pass the image, create the KV cache of the image alone, and then you do the subsequent calls against this card image. Speaker 7: And hopefully, it just shows now that it's much faster than running it without the cache. On the first run, it takes a bit of time to load the model into memory, but you see here, it's below a second to answer different questions on the card. And then if you don't do the caching, if you re-encode the image every time, you have, like, 4 seconds per question. And similarly, there's optimized inference code, or machine learning code, for Apple Silicon. It's called MLX. Speaker 7: If you use MLX versus, say, Ollama, it also gives you a speed-up, and I can show you here just Speaker 2: the Speaker 7: results. 
So for MLX, if you run that natively, you get, like, 4 seconds without prompt caching. And if you use Ollama, with the same model quantization, basically, you get, like, yeah, 6 seconds. And that's what I wanted to share. Thanks. Speaker 7: Thanks for the attention. Speaker 0: Questions? Seems like someone's tired. Okay. You Speaker 1: So you're doing caching for the sequential calls to the model. Why did you do this and not just put all the questions into one call to the model? Speaker 7: Yeah. That's a very good one. So I use here Qwen 3.5, the 4 billion parameter model, with a 4-bit quantization. And if you wanna do all this stuff, like the identification and the grading, which are kind of orthogonal tasks, in one go, it messes up stuff sometimes. Also, that's why I do the OCR. Speaker 7: If you prompt it to directly give you the card name, release year, card number, and so on, it sometimes confuses things. And so I find it works much better if you first do OCR. It's a very simple task. You just have to read the text, and then do the multi-step thing. So that's really a limit of the local models, I would say. Speaker 7: And Speaker 5: I saw Speaker 0: some hand over there. Okay. Speaker 11: Thanks a lot, Felix. I was curious whether there is any way to improve on the grading itself. So currently, it's mostly from the model, right, if I'm not wrong. And then is there any way that you can fine-tune this and get better accuracy on the grading? Speaker 7: Yeah. It's a good point. Cool that you know that it's not good. Like, I assumed. Yeah. Speaker 7: So the identification works really well. I scanned, I think, like, 400 cards or so, and I had, like, 3 misses. So it was quite good at the detection. But the grading, that's really a difficult thing. It's also not, like, between humans. Speaker 7: They don't agree every time. 
And then there, it's the smallest features, like small whitening on the edge, that can change the grade. And also, usually, people kind of tilt the card to see if there are scratches on the surface, and you usually don't see that if you just have 2 scans of the card. So I think, yeah, there you would need videos, probably. Especially if you have, like, holo cards. Speaker 7: It's like shiny cards. There, scratches are very important to recognize. Sure. Yeah. Should I put it away? Speaker 7: I can't. Yes. I have to slit my belly. Speaker 0: Something physical to share. Thank you. If there are no more questions, we can go to our last talk. Any questions? Okay. Speaker 0: Great. We have Michael here. He already knows how to set this up because he's from CodeCentric. Speaker 8: Alright. I'm Michael, and in my free time, I produce music. And I started producing music with samples, because I cannot play any instrument. I pick a song that is already there, I transform it, and I use it how I need it. Speaker 8: But with time, you start to want to play something, because it's easier. You have ideas and you want to put them into the music. So the first step for me was, how can I find a note that fits the sample? And for that, I used Claude Code to create a find-note feature. I can record something, and it will tell me which note this is, and this would then fit somehow to the sample. Speaker 8: If you have one note, then you want to play more notes. So you want to play chords. You want to play chord progressions, so chords that fit together and create some mood. And so I let the AI create an inventory of chord progressions that are known. Speaker 8: I just said, give me a list of chord progressions and put them in there. It created a description for each. So I can scroll through this. I can filter by mood. So what do I want to achieve with the chord progressions? 
Speaker 8: And, yeah, that was the start. And if you have this, you still want more, and I started to take online classes, piano classes. And after the piano class, the teacher gave me homework. And for me, because I never learned an instrument, it was hard in the beginning. So I started with the homework, and it didn't feel quite right. Speaker 8: So I had the idea to create a training center for this. So something that is exactly what the homework was telling me, and I can train this with my application. So Speaker 1: the Speaker 8: first homework was chord inversions. So you have a chord, like these 3 notes. You can play them in different orders and it's still the same chord. And I tried the homework, and the problem was I never learned to read the notes on the keyboard. So I was very slow at this. Speaker 8: And so I said, I need another training that could help me do the real homework. So I said to Claude, please create another training, which is this one. And if I start the training, it will show me different notes, and I have to press them very fast on the keyboard. And it's gamified, so I see how much I can achieve in one minute. So I want to get better. Speaker 8: And that's how I started this whole thing, and it grows bigger with every lesson I have. And AI is helping me very much to achieve what I need in this very moment. So if I went to another platform where training is already there, maybe it leaves stuff out, I do not know yet, or, yeah. But here, I can decide my own tempo for learning this. Also, I have a wiki. Speaker 8: So whenever I have a topic I want to save here, I just tell Claude Code, write a wiki entry about this and put it there. So it's somehow also a knowledge base for me. Yeah. And one requirement was also, I don't want to have a subscription with any cloud provider. I don't want to run a server somewhere. Speaker 8: So this is all HTML, JavaScript, and CSS. 
And I never saw the code, but Claude Code did very well. And it's deployed via GitHub Actions on GitHub Pages, so everything is free for now. And also, at least the iOS browser has enough APIs to give me everything I need at the moment. Yeah. Speaker 8: Do we have any questions? Yeah. Speaker 2: I like it. I also used to produce music 10 years ago. Actually, what I did, I put stickers with the notes on my MIDI piano, so that's also very helpful. Speaker 8: I have the tool. Yeah. Yeah. Speaker 2: But I was immediately thinking, did you think about having, like, a visual of the piano? Because when I saw the chord progressions, you had, like, these... yeah. It's only... ah, you're okay. Alright. Speaker 8: I have this. And it plays sound too. So I didn't know if Speaker 2: it travels show you what you have to hit. Yeah. And now it would also be nice if it could show you what you're actually currently hitting. Okay. Speaker 8: The problem with chords is it's hard to distinguish between 3 notes. So one note is easy to recognize for this thing. But I'd, Speaker 2: I was thinking more about just plugging in the MIDI signal directly. Speaker 8: Okay. Yeah. I'm not there yet. Speaker 2: Okay. But Speaker 8: a good idea. I don't know how well it works with the browser application then. Speaker 2: I mean, you just need to read it and then basically, yeah. I mean, you mean because of, like, it being real time? Is that your concern? Speaker 8: Yeah. Okay. But I didn't think of this. It's a good idea. Interesting. Interesting. Speaker 8: Thank you. And I learned that there's a JavaScript library for being a synthesizer. So if I can read the MIDI signal, then I think this would work. Speaker 0: Any more questions? Also, my mind went blank with a flipped way you can access. I was curious to play around with this later. If you don't have any more questions, thank you, Michael. 
Then I would like to invite Ali to give a little bit of an intro about CodeCentric and so on. Speaker 12: Hi, everyone. Welcome. Today, the agenda was a little bit flipped, so I start in the middle of the party with introducing CodeCentric. My name is Alireza, Alireza Rothbauer. I'm the location manager here in Nuremberg. Speaker 12: We have been in Nuremberg since 2017. Speaker 2: When you started? Speaker 12: When I started. Yeah. So, we are an IT consultancy. We have 40 people here in Nuremberg. In all of Germany, we are, like, 500, 600 people. Speaker 12: Our headquarters is in Solingen, and, yeah, we are here to support the community in Nuremberg. So it's not the only meetup happening here. There's also the AWS, the Java user group, the data and KI user group. A lot of meetups are happening here. And, yeah, I don't want to stand between you and the pizza. Speaker 12: So if you have any questions for CodeCentric, just ask me or my colleague, Michael. We will be happy to answer all your questions. And, yeah, now let's enjoy the pizza. And afterwards, we have some more projects. Speaker 0: No. That was it. Speaker 12: That was it? Yes. Okay. Then stay and have a drink, have pizza, and enjoy. Thank you everybody for coming. Speaker 0: Thank you. One last thing I forgot to mention. All the speakers... oh, you will receive an email afterwards, or you can just go through the website. We'd appreciate your feedback. Please give us feedback about the talks individually and also about the whole theme. Speaker 0: We appreciate your feedback. Please enjoy. Thank you.

Links

Tech stack