Data Storage Japan Transportation

Toyota Says Filled Disk Storage Halted Japan-Based Factories (bleepingcomputer.com) 67

An anonymous reader quotes a report from BleepingComputer: Toyota says a recent disruption of operations in Japan-based production plants was caused by its database servers running out of storage space. On August 29th, it was reported that Toyota had to halt operations on 12 of its 14 Japan-based car assembly plants due to an undefined system malfunction. As one of the largest automakers in the world, the situation caused production output losses of roughly 13,000 cars daily, threatening to impact exports to the global market.

In a statement released today on Toyota's Japanese news portal, the company explains that the malfunction occurred during a planned IT systems maintenance event on August 27th, 2023. The planned maintenance was to organize the data and delete fragmented data in a database. However, because the storage filled to capacity before the tasks were completed, an error occurred, causing the system to shut down. This shutdown directly impacted the company's production ordering system, so no production tasks could be planned or executed.

Toyota explains that its main servers and backup machines operate on the same system. Due to this, both systems faced the same failure, making a switchover impossible, inevitably leading to a halt in factory operations. The restoration came on August 29th, 2023, when Toyota's IT team had prepared a larger-capacity server to accept the data that had been partially transferred two days earlier. This allowed Toyota's engineers to restore the production ordering system and the plants to resume operations.

This discussion has been archived. No new comments can be posted.


Comments Filter:
  • Heads will roll... after the culprits slice their own guts out.

  • I’m a physical engineer. Isn’t this kinda like the operator of a coal-fired power plant letting their coal pile run down to nothing? Where no manager at any level bothers to look out a window and notice that the big pile of energy is getting low?

    Seems like a “you had one job” moment to me but I could be wrong.
    • Rationing of disk space is pretty common. Usually someone with no technical competence sets the budget and engineers have to fill out laborious forms to get resources. Penny pinching plus unpredictable storage utilization adds risk when storage costs are misestimated.
      • Let me add: 99% of these kinds of dumb-looking failures can be prevented by installing engineers in management positions. It's always bad management that causes these failures.
      • Re: (Score:3, Informative)

        Rationing of disk space is pretty common.

        Is it? The last job where I had a disk quota was 30 years ago.

        Rationing disk space makes as much sense as rationing toilet paper, except the toilet paper I use costs more than the disk space.

        • I may be working on a different scale than you...but I am given VMs to work with. They almost always under-spec the VMs, and I need to ask for more disk, and more memory.

          The admins don't understand why I keep 6 versions of a database during development, and I don't understand why they can't just give me 10 times more storage. We usually have a back and forth for a couple of months before things get sorted out.

          This has been going on since we moved to VMs, rather than me just having the entire disk capacity av

          • by NFN_NLN ( 633283 )

            > The admins don't understand why I keep 6 versions of a database during development

            Use linked clones.
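            For what it's worth, a minimal sketch of the linked-clone idea using copy-on-write overlay images is below; the use of qemu-img/qcow2, the image names, and the base image path are assumptions for illustration, not anything from the poster's actual setup.

```python
# Hypothetical sketch: give each dev database "version" a copy-on-write overlay
# of one shared base image, so six clones cost far less disk than six full copies.
# Image names and the qemu-img/qcow2 choice are illustrative assumptions.
import subprocess

BASE_IMAGE = "db-base.qcow2"  # golden image containing the full database

def make_linked_clone(clone_name: str) -> None:
    """Create a qcow2 overlay that stores only the blocks that diverge from the base."""
    subprocess.run(
        ["qemu-img", "create",
         "-f", "qcow2",     # format of the new overlay image
         "-b", BASE_IMAGE,  # backing (base) image shared by all clones
         "-F", "qcow2",     # format of the backing image
         clone_name],
        check=True,
    )

if __name__ == "__main__":
    # Six development "versions" sharing one base image.
    for i in range(1, 7):
        make_linked_clone(f"db-dev-v{i}.qcow2")
```

            The trade-off is that the base image has to stay read-only while overlays depend on it.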

          • If you're talk'n petabytes, then you're on a different scale than me.

            If you're talk'n terabytes, then you work for idiots. A terabyte costs, like, $10.

            For comparison, the average business spends $60 per employee annually on toilet paper.

            • by Zak3056 ( 69287 )

              If you're talk'n terabytes, then you work for idiots. A terabyte costs, like, $10.

              The going rate for 1TB of enterprise storage is in the $1000 range[1] (not inclusive of maintenance, operating, and backup costs). With that said, skimping on storage is, indeed, stupid, because that $1k (which you'd amortize over 5-7 years) is only 1% of the annual cost of an engineer using the storage to get his job done.

              [1] Assumption is you're at the 10s to low 100s of TB scale for that number.

          • Major Morris: Our clerk says you want an incubator. No dice.
            Hawkeye: Yeah, but you've got three.
            Major Morris: That's right. If I give one away, I'll only have two.
            Trapper: What's wrong with two?
            Major Morris: Two is not as good as three.

            https://www.imdb.com/title/tt0... [imdb.com]

            • M.A.S.H. reference.....check
              Low 6 figure UID.........check
              Obscure dad joke sig..check

              Alright, here's your card stamp, move along

        • by lsllll ( 830002 )
          It's because organizations would rather put hundreds of thousands or millions of dollars into SANs and VM infrastructure and then throw whatever needs they have at that setup. It takes a competent administrator to not only slice and dice the CPUs and the storage, but also manage and monitor the usage to catch rogue processes on VMs that suck up disk bandwidth, compromising other services like database servers that are highly disk-speed dependent (which should have been physical machines/clusters with their own di
          • by Zak3056 ( 69287 )

            they have a policy that they don't create workstations in their VM environment, so they spun up a Windows server VM

            Explanation for this policy is pretty simple: they have Windows Datacenter licensing, and the incremental cost per Server VM is $0 compared to Windows 10/11 which is a PITA to license for virtualization at small volumes (it's basically impossible to license a VM, you're typically licensing everyone who touches virtual client OSEs).

    • by TWX ( 665546 )

      More akin to a major food processor and distributor running low on frozen storage space and, in the process of reshuffling, having a traffic jam out in the non-climate-controlled hallway in front of the freezers preventing moving anything to where it belongs, as new orders pile up on the loading dock.

      Imperfect analogy, but since only those performing warehousing operations would see it (i.e. not corporate officers, or truck drivers, or suppliers, or delivery drivers), few pay it any mind until it's to

    • Isn't this kinda like the operator of a coal-fired power plant letting their coal pile run down to nothing? Where no manager at any level bothers to look out a window?

      A big pile of coal looks very different from a small pile of coal.

      A full HDD looks the same as an empty HDD.

      It isn't management's job to micromanage disk space. The management screwup was hiring incompetent IT staff.

      • It isn't management's job to micromanage disk space. The management screwup was hiring incompetent IT staff.

        Possibly. Or it was management penny-pinching and refusing to buy more storage, then pushing IT to complete the task with limited resources.

        • My money says that IT, as stated, had a planned maintenance window to avoid this exact issue, but management insisted they delay because "production." Happens all the time. Management reaps what it sows.

      • Probably both. General rule of thumb: at 80% utilization you plan for more. If it's a single VM, then you review available disk space and allocate more as budget permits. If it's cloud hosted, then you are budgeting a larger dollar spend on that resource, which requires approvals that are quite easy to get during an outage. (A rough sketch of such a check follows below.)

        This sounds like Toyota is using outmoded methods of data storage such as using physical servers for database servers because they still believe the virtualization performance penalty is a t
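        Picking up the 80% rule of thumb from the start of this comment, here is a minimal sketch of that kind of utilization check, assuming a plain Linux host; the mount points and threshold are made-up example values, and a real deployment would feed this into whatever monitoring stack is already in place.

```python
# Rough sketch of the "plan for more at 80% utilization" rule of thumb.
# Mount points and the threshold are illustrative assumptions.
import shutil

THRESHOLD = 0.80
MOUNTS = ["/", "/var/lib/mysql"]  # hypothetical filesystems to watch

def check(mount: str) -> None:
    usage = shutil.disk_usage(mount)
    used_fraction = usage.used / usage.total
    if used_fraction >= THRESHOLD:
        print(f"WARNING: {mount} is {used_fraction:.0%} full -- time to plan for more storage")
    else:
        print(f"OK: {mount} at {used_fraction:.0%}")

if __name__ == "__main__":
    for mount in MOUNTS:
        check(mount)
```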

        • This sounds like Toyota is using outmoded methods of data storage such as using physical servers for database servers because they still believe the virtualization performance penalty is a thing to even consider.

          Funnily enough this is a thing that I was butting heads with a sales rep over not so long ago. Thanks to all the virtualisation extensions and all the effort put into the hypervisors and also the OS and software itself being VM friendly I wouldn't be concerned about the overhead...

          But in order to demonstrate cost savings VM proponents wind up using core heavy CPUs running at substantially lower clocks than the old physical servers and it makes a noticeable difference. Of course if you're doing a bespoke VM

      • It isn't management's job to micromanage disk space. The management screwup was hiring incompetent IT staff.

        Or probably more likely it still was a management screwup because they didn't approve the purchase of a fully redundant storage platform with proper excess capacity. Very likely had a conversation similar to the following:

        IT employee: "We need to expand our storage system."
        Manager: "We just did that last year."
        IT employee: "Yes, I know, but we have been using it at a rate faster than originally projected."
        Manager: "We don't have the budget, make due with what you have, we bought twice as much as originally

    • We don't know what went on, but the fact that it happened during a planned activity sounds as if it is nothing like letting your coal pile run down to nothing. More likely they had free space, the maintenance activity did *something*, software updates, copying, moving data to another database or something, and likely duplicated the data without realising they didn't have enough space for this temporary data.

      Kind of like a coal-fired power plant suddenly having some excavator come in and remove all the coal

  • The obvious lack of software competence at Toyota explains a lot about their reluctance to embrace new technology.
  • Seems like the IT department needs to use "Kaizen" methods.
  • by Alain Williams ( 2972 ) <addw@phcomp.co.uk> on Wednesday September 06, 2023 @05:12PM (#63828604) Homepage

    for admitting what looks like a simple but stupid mistake and not trying to wrap this up in some complicated verbiage. We all screw up on occasion, the brave confess so that everyone else can learn.

  • by stabiesoft ( 733417 ) on Wednesday September 06, 2023 @05:15PM (#63828610) Homepage
    https://www.oracle.com/custome... [oracle.com] and it was a database migration. Related?
    • This is *exactly* what I hoped to read in this thread, and now I am not disappointed. At this point it sure does seem like we might just have another Oracle consulting/support 'success' story to poke fun at, perhaps even costlier than Oregon's botched healthcare website [slashdot.org], although $6 billion is a high bar to cross.

      One Really Rich Asshole Called Larry Ellison (ORACLE), maintainer of MySQL since Sun was purchased, which is why I use either MariaDB or PostgreSQL for the websites I develop.

      • If you want more Oracle success stories,

        https://www.theregister.com/20... [theregister.com]

        Birmingham City in the UK pretty much declared bankruptcy and, although Oracle wasn't the main cause, it is surely a contributing factor at the least.

        An Oracle project just went up in cost fourfold to 100 million pounds. And it's not even completed, so expect it will go up a lot more.

    • by bungo ( 50628 )

      The situation may have changed in the last 10 years, but I don't think so.

      The production side, which controls the factories and just-in-time systems, runs on IBM mainframes and DB2. They have back-end systems that run on Oracle (and Linux), but those do not control the production systems.

  • > Toyota explains that its main servers and backup machines operate on the same system

    So... It's not a backup?
    • Different virtual machines in the same host.

      A hardware server is considered to be too expensive.

    • Reminds me of a telco I worked for. Their data center was fully redundant: double everything, servers, storage, network equipment, down to two separate diesel backup generators. Which were fed from the same tank. Which was empty when the power went out...
  • Disk space, that is. Although the more usual one I've encountered is mysterious, out-of-control log files.
    • That was my thought.

      If they are going through tons of different processes during an upgrade, they are making tons and tons of changes, and the logs will fill up.

      I know I've had a server stop working due to runaway log files at least once... until you find the 'Delete log files when they get X big' feature.
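      As a rough illustration of the "delete log files when they get X big" idea, here is a sketch using Python's standard log rotation; the file name, size cap, and backup count are arbitrary assumptions rather than anything from the poster's server.

```python
# Sketch: cap log growth by rotating at a fixed size and keeping a bounded
# number of old files. Name, size, and count are made-up example values.
import logging
from logging.handlers import RotatingFileHandler

handler = RotatingFileHandler(
    "app.log",
    maxBytes=50 * 1024 * 1024,  # roll over at ~50 MB
    backupCount=3,              # keep at most 3 rotated files, so total disk use is bounded
)
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.info("Log output is now capped at roughly 200 MB total.")
```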

  • Japan's ironic techno-conservatism over the past 15 years is becoming epic.
    • Either too much or too little change is bad. It's not an easy thing to optimize.

      Toyota's whole corporate strategy of modernizing their drivetrains is a study in the same thing - they're not the most aggressive in switching to EV's. Some think they are falling behind and will wake up doomed one day. Yet for now they are raking in money - just broke their own all-time quarterly record for profits. Their customers are conservative and expect to get a couple hundred thousand miles or more from their Toyotas.

      • Trust me, the Prius was never mocked as "too progressive" by anyone with even half a brain. It was settled-for out of frustration with the literally criminal intransigence of the rest of the auto industry. And today, Toyota is not even a leading tech player among legacy OEMs, let alone the industry in general. In fact, their "solutions" aren't merely a case of slow adoption, but in many ways represent backward steps that seek to rewind the clock and undo progress already made.

        Japanese technology looks
      • Toyota's also betting on hydrogen, which, while having its own issues, doesn't have the annoying problem of lugging around a heavy battery pack that catches on fire and can't be put out. Hydrogen in closed spaces may be explosive, but it does dissipate better than a fully charged battery that's been wrapped around a tree.
      • Prius batteries wearing out is real, and only stress testing them will reveal their condition. The vehicle doesn't report what it knows about the battery condition on any of the displays. They last a good long time, but not forever, and it's a $3500 bill to replace one. For this reason I bought a used Versa instead of a used Prius, even though I would have really enjoyed the additional MPG. I plan to get another car in five years-ish anyway. The Prius just had too many miles to gamble on (~160k).

        One has to

  • by Jeslijar ( 1412729 ) on Wednesday September 06, 2023 @05:47PM (#63828692) Homepage

    Granted, structure is different everywhere I've worked, but where I am, an applications team handles database maintenance. The IT infrastructure team makes sure the underlying infrastructure is working and that no problems show up in the alert metrics, but when you run maintenance, disk utilization in a database can expand greatly over what it is day to day, i.e. the overwhelming majority of the time.

    Depending on what commands were run and what developers may be doing to an actual production DB (you'd be surprised how poor the controls are for things like this until it's a problem in a manufacturing company...) said commands could take up 1.5x typical storage, 2x typical storage, or 10x typical storage. So their 500GB database with a 2TB disk may not be enough.

    This is greatly oversimplified and I'm just making up an example because we don't know what happened. I do know that I've seen poor data architecture everywhere I've ever worked, from tiny companies up to multi-billion-dollar businesses. Things that never have downtime because production runs 24x7, like a manufacturing plant, can have some of the worst tech debt anywhere.
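    To make the parent's point concrete, here is a back-of-the-envelope pre-flight check along the lines of the 1.5x to 10x expansion described above; the data directory, database size, and multiplier are invented for the example, not taken from any real system.

```python
# Illustrative pre-flight check before a space-hungry maintenance job: compare
# an assumed temporary-space need (here 2x the database size) with free disk.
# The path, size, and multiplier are hypothetical example values.
import shutil

DATA_DIR = "/var/lib/dbdata"      # hypothetical database data directory
DB_SIZE_BYTES = 500 * 1024**3     # assume a 500 GB database, as in the example above
SPACE_MULTIPLIER = 2.0            # assume maintenance briefly needs ~2x the data size

def safe_to_run_maintenance() -> bool:
    free = shutil.disk_usage(DATA_DIR).free
    needed = DB_SIZE_BYTES * SPACE_MULTIPLIER
    if free < needed:
        print(f"ABORT: need ~{needed / 1024**3:.0f} GiB free, only {free / 1024**3:.0f} GiB available")
        return False
    return True

if __name__ == "__main__":
    if safe_to_run_maintenance():
        print("OK to start the maintenance window.")
```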

    • by King_TJ ( 85913 )

      Yeah.... I just wanted to say I tend to agree. It's easy for the armchair Slashdot reader/tech geek to poke fun at this, and throw around claims that it was pure incompetence, stupidity, etc. etc.

      Maybe so? But the more likely reality is that the people hired to manage a production database of the size and scope required for a major auto manufacturer have a clue what they're doing.

      I.T. can provide disk storage and allocate it based on anticipated needs, and even set alerts so they're aware when it starts get

    • by leenks ( 906881 ) on Wednesday September 06, 2023 @06:10PM (#63828764)

      The official statement doesn't mention "IT"...

      We would like to apologize once again to our customers, suppliers, and related parties for any inconvenience caused by the suspension of our domestic plants as a result of the malfunction in our production order system at the end of last month.

      The system malfunction was caused by the unavailability of some of the multiple servers that process parts orders. As for the circumstances, regular maintenance work was performed on August 27, the day before the malfunction occurred. During the maintenance procedure, data that had accumulated in the database was deleted and organized, and an error occurred due to insufficient disk space, causing the system to stop. Since these servers were running on the same system, a similar failure occurred in the backup function, and a switchover could not be made. This led to the suspension of domestic plant operations. The system was restored after the data was transferred to a server with a larger capacity on August 29, and the plants resumed operation on the following day. We would like to report that we have identified the above as the true cause. Countermeasures have also been put in place by replicating and verifying the situation.

      We would also like to reaffirm that the system malfunction was not caused by a cyberattack, and apologize to all parties for any concern this may have caused.

      Going forward, we will review our maintenance procedures and strengthen our efforts to prevent a recurrence, so that we can deliver as many vehicles to our customers as soon as possible.

      https://global.toyota/en/newsroom/corporate/39732568.html

  • Nah, we don't need to test for full hard drive, that will never happen!

    • How would a unit test help with that? When the drive is full, a database will stop working regardless. The mitigation is maintenance and capacity planning, not unit tests.
    • Nah, we don't need to test for full hard drive, that will never happen!

      Why do you think they didn't test for what happens when the HDD is full?

      Unit test:
      Test: Fill HDD
      Result: Entire system breaks
      Solution: Not fixable in software, since we can't magic new free space into existence and /dev/null can be written to but not read from; implement procedural measures to prevent the disk from filling up.
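      For what a "test the disk-full path" check could even look like, here is a small sketch that fakes ENOSPC with a mock instead of actually filling a disk; save_order() and the file path are hypothetical stand-ins, not anything from Toyota's system.

```python
# Sketch: exercise the "no space left on device" path by patching open() to
# raise ENOSPC, then assert the caller degrades gracefully instead of crashing.
# save_order() is a made-up stand-in for whatever actually writes the data.
import errno
import unittest
from unittest.mock import patch

def save_order(path: str, payload: str) -> bool:
    """Toy example: report failure instead of crashing when the disk is full."""
    try:
        with open(path, "w") as f:
            f.write(payload)
        return True
    except OSError as e:
        if e.errno == errno.ENOSPC:
            return False  # caller can alert or queue instead of taking the system down
        raise

class DiskFullTest(unittest.TestCase):
    def test_enospc_is_handled(self):
        with patch("builtins.open", side_effect=OSError(errno.ENOSPC, "No space left on device")):
            self.assertFalse(save_order("orders/1.json", "{}"))

if __name__ == "__main__":
    unittest.main()
```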

  • Toyota explains that its main servers and backup machines operate on the same system. Due to this, both systems faced the same failure, making a switchover impossible, inevitably leading to a halt in factory operations.

    That design is inexcusable.

    • Unfortunately it's also the normal state in most companies.
      A single server with no redundancy can serve multiple sites. This is to save money.
      The downside is that you lose money due to latency issues, just a few minutes per employee every day.
      Where I work there's a central domain controller+dns+dhcp server in another country, and logins take 3 minutes.

    • That design is inexcusable.

      The design itself is fine, provided that the same system is equipped for the job (redundant CPUs, tons of memory, etc.). Having separate physical machines may not prevent this exact type of failure, as disks can fill up under a variety of circumstances.

      It's a continuous full time job to monitor and maintain critical services, and it's all too easy for a runaway subservice to spew incredible amounts of data to storage. I've seen incredibly competent system designers discover incredibly obscure bugs only after t

  • Japan has been behind the 8 ball for a long time in IT system technology. More of this stupid stuff will happen in the future. That IT pros in Japan are treated as 2nd class citizens doesn't help matters.
  • The EPA estimated disk capacity was 10PB but the real-world capacity turned out to be much lower
  • So then, no more archiving all meetings with multiple views in 4k because "storage is cheap"?

  • They probably had been putting off the migration because they knew it was likely to impact the production lines. Someone convinced them there might be a way to slide in the change without shutting down production. That person's fate depends a whole lot on their certainty in the ability to switch over painlessly. If the need was really pressing, they may have opted for a half-baked transition plan to prevent downtime, as that is better than no plan and the ensuing inevitable downtime. Aside from resources

    • by valley ( 240947 )
      Exactly. The assumption by companies that they can just spy on everything we do is ridiculous and should be slapped down by every regulating body.
      • Unfortunately, saying that the companies effectively ARE the regulating bodies in many cases doesn't stray far from the truth.

  • It's good of Toyota to actually admit what happened, though I'm sure they didn't have too much choice, as word had likely spread throughout the company as to why everything had to shut down.

    Though this bit in particular made me wince a bit:

    Toyota explains that its main servers and backup machines operate on the same system. Due to this, both systems faced the same failure, making a switchover impossible, inevitably leading to a halt in factory operations.

    So, Toyota has learned the hard way that their back

  • "Toyota explains that its main servers and backup machines operate on the same system" seems like a bad plan. Is it really a backup if you can't switch to it and keep it isolated from production?
