Toyota Says Filled Disk Storage Halted Japan-Based Factories (bleepingcomputer.com) 67
An anonymous reader quotes a report from BleepingComputer: Toyota says a recent disruption of operations in Japan-based production plants was caused by its database servers running out of storage space. On August 29th, it was reported that Toyota had to halt operations on 12 of its 14 Japan-based car assembly plants due to an undefined system malfunction. As one of the largest automakers in the world, the situation caused production output losses of roughly 13,000 cars daily, threatening to impact exports to the global market.
In a statement released today on Toyota's Japanese news portal, the company explains that the malfunction occurred during a planned IT systems maintenance event on August 27th, 2023. The planned maintenance was meant to organize the data and delete fragmented data in a database. However, the storage filled to capacity before the tasks were complete, causing an error that shut the system down. This shutdown directly impacted the company's production ordering system, so no production tasks could be planned or executed.
Toyota explains that its main servers and backup machines operate on the same system. Due to this, both systems faced the same failure, making a switchover impossible, inevitably leading to a halt in factory operations. The restoration came on August 29th, 2023, when Toyota's IT team had prepared a higher-capacity server to accept the data that had been partially transferred two days earlier. This allowed Toyota's engineers to restore the production ordering system and let the plants resume operations.
Absolute Incompetence! (Score:1)
Heads will roll... after the culprits slice their own guts out.
Re:Absolute Incompetence! (Score:5, Funny)
They should have called Little Bobby Tables; I hear he's an expert at making database space quickly available!
Re: Absolute Incompetence! (Score:2)
No, the reason was to save some money by putting all data in the same basket.
The guilty manager already got the bonus and a promotion while inexperienced IT staff had to clean up the disaster.
Can some IT/CE person weigh in here? (Score:2)
Seems like a “you had one job” moment to me but I could be wrong.
Re: Can some IT/CE person weigh in here? (Score:3)
Re: Can some IT/CE person weigh in here? (Score:2, Insightful)
Re: (Score:1)
They will never succeed in Capitalism.
Re: (Score:3, Informative)
Rationing of disk space is pretty common.
Is it? The last job where I had a disk quota was 30 years ago.
Rationing disk space makes as much sense as rationing toilet paper, except the toilet paper I use costs more than the disk space.
Re: (Score:2)
I may be working on a different scale than you...but I am given VMs to work with. They almost always under-spec the VMs, and I need to ask for more disk, and more memory.
The admins don't understand why I keep 6 versions of a database during development, and I don't understand why they can't just give me 10 times more storage. We usually have a back and forth for a couple of months before things get sorted out.
This has been going since we moved to VMs, rather than me just having the entire disk capacity av
Re: (Score:2)
> The admins don't understand why I keep 6 versions of a database during development
Use linked clones.
Re: (Score:2)
If you're talk'n petabytes, then you're on a different scale than me.
If you're talk'n terabytes, then you work for idiots. A terabyte costs, like, $10.
For comparison, the average business spends $60 per employee annually on toilet paper.
Re: (Score:2)
If you're talk'n terabytes, then you work for idiots. A terabyte costs, like, $10.
The going rate for 1TB of enterprise storage is in the $1000 range[1] (not inclusive of maintenance, operating, and backup costs). With that said, skimping on storage is, indeed, stupid, because that $1k (which you'd amortize over 5-7 years) is only 1% of the annual cost of an engineer using the storage to get his job done.
[1] Assumption is you're at the 10s to low 100s of TB scale for that number.
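The amortization arithmetic above is easy to sketch out. All figures below are the commenter's rough estimates (enterprise price per TB, amortization period, engineer cost), not authoritative pricing:

```python
# Rough cost comparison: enterprise storage vs. the engineer using it.
# All numbers are illustrative assumptions from the comment above.
storage_cost_per_tb = 1000.0     # USD, enterprise-grade, excl. maintenance/backup
amortization_years = 5           # low end of the 5-7 year range
engineer_annual_cost = 100_000.0 # assumed fully loaded annual cost

annual_storage_cost = storage_cost_per_tb / amortization_years
fraction_of_engineer = storage_cost_per_tb / engineer_annual_cost

print(f"Amortized storage: ${annual_storage_cost:.0f}/yr")
print(f"Up-front cost as a fraction of one engineer-year: {fraction_of_engineer:.1%}")
```

Even taking the full $1k up front (no amortization), it comes out to the "only 1%" figure the comment cites.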
Re: (Score:2)
Major Morris: Our clerk says you want an incubator. No dice.
Hawkeye: Yeah, but you've got three.
Major Morris: That's right. If I give one away, I'll only have two.
Trapper: What's wrong with two?
Major Morris: Two is not as good as three.
https://www.imdb.com/title/tt0... [imdb.com]
Re: (Score:2)
M.A.S.H. reference.....check
Low 6 figure UID.........check
Obscure dad joke sig..check
Alright, here's your card stamp, move along
Re: (Score:3)
Re: (Score:2)
they have a policy that they don't create workstations in their VM environment, so they spun up a Windows server VM
Explanation for this policy is pretty simple: they have Windows Datacenter licensing, and the incremental cost per Server VM is $0 compared to Windows 10/11 which is a PITA to license for virtualization at small volumes (it's basically impossible to license a VM, you're typically licensing everyone who touches virtual client OSEs).
Re: (Score:3)
More akin to a major food processor and distributor running low on frozen storage space and, in the process of reshuffling, having a traffic jam out in the non-climate-controlled hallway in front of the freezers, preventing moving anything to where it belongs as new orders pile up on the loading dock.
Imperfect analogy but since only those performing warehousing operations would see it (ie not corporate officers, or truck drivers, or suppliers, or delivery drivers), few to pay it any mind until it's to
Re: (Score:1)
Isn't this kinda like the operator of a coal-fired power plant letting their coal pile run down to nothing? Where no manager at any level bothers to look out a window?
A big pile of coal looks very different from a small pile of coal.
A full HDD looks the same as an empty HDD.
It isn't management's job to micromanage disk space. The management screwup was hiring incompetent IT staff.
Re: (Score:2)
It isn't management's job to micromanage disk space. The management screwup was hiring incompetent IT staff.
Possibly. Or it was management penny-pinching and refusing to buy more storage, then pushing IT to complete the task with limited resources.
Re: (Score:2)
My money says that IT, as stated, had a planned maintenance window to avoid this exact issue, but management insisted they delay because "production." Happens all the time. Management reaps what it sows.
Re: (Score:2)
Probably both. As a general rule of thumb, at 80% utilization you plan for more. If it's a single VM then you review available disk space and allocate more as budget permits. If it's cloud hosted then you are budgeting a larger dollar spend on that resource, which requires approvals that are quite easy to get during an outage.
This sounds like Toyota is using outmoded methods of data storage such as using physical servers for database servers because they still believe the virtualization performance penalty is a t
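The 80% rule of thumb above is easy to automate. A minimal sketch using only Python's standard library (the path and threshold are assumptions; real monitoring would feed an alerting system rather than print):

```python
import shutil

def disk_usage_alert(path: str = "/", threshold: float = 0.80) -> bool:
    """Return True if the filesystem holding `path` is above `threshold` full."""
    usage = shutil.disk_usage(path)
    fraction_used = usage.used / usage.total
    if fraction_used >= threshold:
        print(f"WARNING: {path} is {fraction_used:.0%} full -- plan for more storage")
        return True
    return False

# Run the check against the root filesystem.
disk_usage_alert("/")
```

Scheduled via cron or a monitoring agent, a check like this is the "plan for more at 80%" rule in executable form.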
Re: (Score:2)
This sounds like Toyota is using outmoded methods of data storage such as using physical servers for database servers because they still believe the virtualization performance penalty is a thing to even consider.
Funnily enough this is a thing that I was butting heads with a sales rep over not so long ago. Thanks to all the virtualisation extensions and all the effort put into the hypervisors and also the OS and software itself being VM friendly I wouldn't be concerned about the overhead...
But in order to demonstrate cost savings VM proponents wind up using core heavy CPUs running at substantially lower clocks than the old physical servers and it makes a noticeable difference. Of course if you're doing a bespoke VM
Re: (Score:2)
It isn't management's job to micromanage disk space. The management screwup was hiring incompetent IT staff.
Or probably more likely it still was a management screwup because they didn't approve the purchase of a fully redundant storage platform with proper excess capacity. Very likely had a conversation similar to the following:
IT employee: "We need to expand our storage system."
Manager: "We just did that last year."
IT employee: "Yes, I know, but we have been using it at a rate faster than originally projected."
Manager: "We don't have the budget, make do with what you have, we bought twice as much as originally
Re: (Score:2)
We don't know what went on, but the fact that it happened during a planned activity sounds as if it is nothing like letting your coal pile run down to nothing. More likely they had free space, the maintenance activity did *something*, software updates, copying, moving data to another database or something, and likely duplicated the data without realising they didn't have enough space for this temporary data.
Kind of like a coal-fired power plant suddenly having some excavator come in and remove all the coal
Lack of Software Competence (Score:2)
Kaizen (Score:2)
Re: (Score:1)
Kaizen in software has no upper limit.
Kudos to Toyota ... (Score:5, Insightful)
for admitting what looks like a simple but stupid mistake and not trying to wrap this up in some complicated verbiage. We all screw up on occasion, the brave confess so that everyone else can learn.
Re: (Score:2)
Too bad about the faulty throttle by wire software [embeddedgurus.com]. They did admit to "sticky pedals" [go.com] though...
Well they are a customer of Oracle... (Score:3)
Re: (Score:2)
This is *exactly* what I hoped to read in this thread, and now I am not disappointed. At this point it sure does seem like we might just have another Oracle consulting/support 'success' story to poke fun at, perhaps even costlier than Oregon's botched healthcare website [slashdot.org], although $6 billion is a high bar to cross.
One Really Rich Asshole Called Larry Ellison (ORACLE) has maintained MySQL since Sun was purchased, which is why I use either MariaDB or PostgreSQL for the websites I develop.
Re: (Score:2)
If you want more Oracle success stories,
https://www.theregister.com/20... [theregister.com]
Birmingham City in the UK pretty much declared bankruptcy, and although Oracle wasn't the main cause, it is surely a contributing factor at the least.
An Oracle project just went up in cost fourfold to 100 million pounds. And it's not even completed, so expect it will go up a lot more.
Re: (Score:2)
The situation may have changed in the last 10 years, but I don't think so.
The production side, which controls the factories and just-in-time systems, runs on IBM mainframes and DB2. They have back end systems that run on Oracle (and Linux), but those do not control the production systems.
Not a backup (Score:2)
Re: Not a backup (Score:2)
Different virtual machines in the same host.
A hardware server is considered to be too expensive.
Re: (Score:2)
Happens more often than you would think (Score:2)
Re: (Score:2)
That was my thought.
If they are going through tons of different processes during an upgrade, they are making tons and tons of changes, and the logs will fill up.
I know I've had a server stop working due to runaway log files at least once...until you find the 'Delete log files when they get X big' feature.
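The 'delete log files when they get X big' feature mentioned above corresponds to size-based log rotation. A minimal sketch using Python's standard library (the file name and size caps are illustrative):

```python
import logging
from logging.handlers import RotatingFileHandler

# Cap each log file at ~1 MB and keep only 3 old copies, so runaway
# logging can never consume more than ~4 MB of disk.
handler = RotatingFileHandler("app.log", maxBytes=1_000_000, backupCount=3)
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("service started")  # rotation happens automatically at the size cap
```

The same idea exists as `logrotate` on Linux; the point is that the bound on disk consumption is enforced by the tooling, not by someone remembering to clean up.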
Would you be shocked if floppies were involved? (Score:2)
Re: (Score:3)
Toyota's whole corporate strategy of modernizing their drivetrains is a study in the same thing - they're not the most aggressive in switching to EV's. Some think they are falling behind and will wake up doomed one day. Yet for now they are raking in money - just broke their own all-time quarterly record for profits. Their customers are conservative and expect to get a couple hundred thousand miles or more from their Toyotas.
Re: (Score:2)
Japanese technology looks
Re: (Score:2)
Re: (Score:2)
Prius batteries wearing out is real, and only stress testing them will reveal their condition. The vehicle doesn't report what it knows about the battery condition on any of the displays. They last a good long time, but not forever, and it's a $3500 bill to replace one. For this reason I bought a used Versa instead of a used Prius, even though I would have really enjoyed the additional MPG. I plan to get another car in five years-ish anyway. The Prius just had too many miles to gamble on (~160k).
One has to
Database maintenance issues are now IT? (Score:5, Insightful)
Granted, structure is different everywhere I've worked, but where I am, an applications team handles database maintenance. The IT infrastructure team makes sure the underlying infrastructure is working and that no problems show up in alert metrics, but when you run maintenance, disk utilization in a database can expand greatly over what it is day to day, i.e. the overwhelming majority of the time.
Depending on what commands were run and what developers may be doing to an actual production DB (you'd be surprised how poor the controls are for things like this until it's a problem in a manufacturing company...), said commands could take up 1.5x typical storage, 2x typical storage, or 10x typical storage. So their 500GB database with a 2TB disk may not be enough.
This is greatly oversimplified and I'm just making up an example, because we don't know what happened. I do know that I've seen poor data architecture everywhere I've ever worked, from tiny companies up to multi-billion dollar businesses. Things that never have downtime because production is 24x7, like a manufacturing plant, can have some of the worst tech debt anywhere.
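The scenario above suggests a pre-flight check: refuse to start maintenance unless free space covers a worst-case expansion factor. A sketch under assumed numbers (the function name, paths, and 2x default are illustrative, not anything Toyota described):

```python
import shutil

def safe_to_run_maintenance(data_path: str, db_size_bytes: int,
                            expansion_factor: float = 2.0) -> bool:
    """Conservative pre-flight check: rebuild/defrag jobs can temporarily
    duplicate data, so require free space >= db_size * expansion_factor."""
    free = shutil.disk_usage(data_path).free
    needed = db_size_bytes * expansion_factor
    if free < needed:
        print(f"Abort: need {needed / 2**30:.1f} GiB free, "
              f"have {free / 2**30:.1f} GiB")
        return False
    return True
```

With the 500GB database from the example and a 2x factor, this refuses to start unless a full terabyte is free, rather than discovering mid-job that it isn't.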
Re: (Score:2)
Yeah.... I just wanted to say I tend to agree. It's easy for the armchair Slashdot reader/tech geek to poke fun at this, and throw around claims that it was pure incompetence, stupidity, etc. etc.
Maybe so? But the more likely reality is that the people hired to manage a production database of the size and scope required for a major auto manufacturer have a clue what they're doing.
I.T. can provide disk storage and allocate it based on anticipated needs, and even set alerts so they're aware when it starts get
Re:Database maintenance issues are now IT? (Score:5, Informative)
The official statement doesn't mention "IT"...
https://global.toyota/en/newsroom/corporate/39732568.html
Re: (Score:2)
Interesting. I suspect we'll see more companies falling all over themselves to admit incompetence instead of a breach.
Re: (Score:2)
It also doesn't mention "disk".
Missing unit test (Score:2)
Nah, we don't need to test for full hard drive, that will never happen!
Re: (Score:2)
Re: (Score:2)
Nah, we don't need to test for full hard drive, that will never happen!
Why do you think they didn't test for what happens when the HDD is full?
Test: Fill HDD
Result: Entire system breaks
Solution: Not fixable in software, since we can't magic new free space into existence; the best you can do is unit-test against /dev/null (which you can write to but not read from) and implement procedural measures to prevent the disk being full.
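You can't conjure free space in software, but code can at least handle a full disk gracefully instead of crashing, and that path is testable by simulating ENOSPC with a mock rather than actually filling a disk. A sketch with illustrative names:

```python
import errno
from unittest import mock

def append_record(path: str, record: bytes) -> bool:
    """Append a record; return False (instead of crashing) if the disk is full."""
    try:
        with open(path, "ab") as f:
            f.write(record)
        return True
    except OSError as e:
        if e.errno == errno.ENOSPC:
            print("disk full: dropping record and raising an alert")
            return False
        raise  # any other I/O error is still unexpected

# Unit test: fake a full disk without actually filling one.
with mock.patch("builtins.open",
                side_effect=OSError(errno.ENOSPC, "No space left on device")):
    assert append_record("orders.db", b"row") is False
```

The mock makes every `open` raise ENOSPC, so the degraded path runs in the test suite on every commit, rather than being exercised for the first time in production.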
What? (Score:2)
Toyota explains that its main servers and backup machines operate on the same system. Due to this, both systems faced the same failure, making a switchover impossible, inevitably leading to a halt in factory operations.
That design is inexcusable.
Re: What? (Score:2)
Unfortunately it's also the normal state in most companies.
A single server with no redundancy can serve multiple sites. This to save money.
The downside is that you lose money due to latency issues, just a few minutes per employee every day.
Where I work there's a central domain controller+DNS+DHCP server in another country, and logins take 3 minutes.
Re: (Score:2)
That design is inexcusable.
The design itself is fine, provided that same system is equipped for the job (redundant CPUs, tons of memory, etc.). Having separate physical machines may not prevent this exact type of failure, as disks can fill up under a variety of circumstances.
It's a continuous full time job to monitor and maintain critical services, and it's all too easy for a runaway subservice to spew incredible amounts of data to storage. I've seen incredibly competent system designers discover incredibly obscure bugs only after t
Archaic Computer Systems Rife in Japan (Score:2)
EPA (Score:2)
Likely policy change. (Score:2)
So then, no more archiving all meetings with multiple views in 4k because "storage is cheap"?
Sounds like they tried to sugar-coat a bitter pill (Score:2)
They probably had been putting off the migration because they knew it was likely to impact the production lines. Someone convinced them there might be a way to slide in the change without shutting down production. That person's fate depends a whole lot on their certainty in the ability to switch over painlessly. If the need was really pressing, they may have opted for a half-baked transition plan to prevent downtime, as that is better than no plan and the ensuing inevitable downtime. Aside from resources
Surprise, surprise, surprise! (Score:2)
Perhaps the disk space was filled by all that driver AND passenger data which car companies including Toyota are now collecting [slashdot.org].
Re: (Score:2)
Re: (Score:2)
Unfortunately, saying that the companies effectively ARE the regulating bodies in many cases doesn't stray far from the truth.
Oops (Score:2)
It's good of Toyota to actually admit what happened, though I'm sure they didn't have much choice, as word had likely spread throughout the company as to why everything had to shut down.
Though this bit in particular made me wince a bit:
Toyota explains that its main servers and backup machines operate on the same system. Due to this, both systems faced the same failure, making a switchover impossible, inevitably leading to a halt in factory operations.
So, Toyota has learned the hard way that their back
Bad plan (Score:1)